Site Reliability Engineer

Fabric Health

43d ago

$135k - $160k | DevOps | New York, NY, US | Remote

Job Description

**Site Reliability Engineer**

New York City • Remote | Infrastructure & Security | Remote | Full-time

About Fabric Health

At Fabric Health, we are powering boundless care by solving healthcare’s biggest challenge: clinical capacity. We aren’t here to disrupt healthcare; we’re here to fix it. We unify the care journey from intake to treatment, using intelligent automation to remove administrative burdens and make care delivery 2-10x more efficient. Our technology empowers clinicians to move faster and focus on what matters most: the patient.

We are a mission-driven team of brilliant minds trusted by leading organizations including Intermountain Health, OSF HealthCare, SSM Health, and MUSC Health. Our vision is backed by premier investors such as Thrive Capital, GV (Google Ventures), General Catalyst, and Salesforce Ventures. We move quickly for good reason, listen deeply to solve big challenges, and build products with the same care and quality we’d want for our own loved ones.

**Learn more:** About Us | News & Press | LinkedIn | Careers

About the Role

As a Site Reliability Engineer, you will own and evolve the infrastructure powering healthcare experiences for millions of patients. This role bridges the gap between traditional infrastructure excellence and the future of AI-driven operations. You will act as a primary architect for our AWS and Kubernetes (EKS) environment, ensuring the platform is resilient, scalable, and compliant while exploring how agentic workflows can modernize SRE practices.

What You'll Do

As a Site Reliability Engineer, you will be a steward of Fabric’s production integrity, leading the strategy for infrastructure automation, observability, and system resilience. Your primary responsibilities include:

* **Infrastructure & Kubernetes Orchestration**
  + Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users.
  + Eliminating manual configuration by building and managing a scalable infrastructure state entirely through Terraform.
  + Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance high performance with cost-efficiency and reliability.
* **AI-Assisted Operations & Automation**
  + Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks.
  + Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe.
  + Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems.
* **Observability & Incident Management**
  + Driving the evolution of the observability stack in Datadog by implementing the sophisticated metrics, traces, and logs needed to meet SLOs.
  + Leading incident response efforts and facilitating the blameless postmortems that help systematically reduce recovery time (MTTR).
  + Defining and monitoring the SLIs and
