Site Reliability Engineer

General Dynamics Mission Systems


$143k - $158k | DevOps | United States

Site-Reliability-Engineering, SRE, DevOps, Systems-Engineering, Engineering-Software, Senior-Site-Reliability-Engineer, Principal-Site-Reliability-Engineer, Site-Reliability-Engineering-Lead, Senior


Job Description

Basic Qualifications

Bachelor's degree in Software Engineering, or a related Science, Technology, Engineering or Mathematics field, plus a minimum of 8 years of relevant experience; or Master's degree plus 6 years of relevant experience.

CLEARANCE REQUIREMENTS: Department of Defense Secret security clearance is required at time of hire. Applicants selected will be subject to a U.S. Government security investigation and must meet eligibility requirements for access to classified information. Due to the nature of work performed within our facilities, U.S. citizenship is required.

Responsibilities for this Position

What You'll Own

SLOs and reliability metrics. Define service level objectives for every AI service that goes to production. Establish error budgets and use them to drive engineering decisions, not just to measure uptime.

Monitoring and observability. Build and maintain monitoring, logging, and alerting infrastructure for AI services. You will know when something is degrading before users do.

Incident response. Establish incident management procedures, lead post-incident reviews, and drive corrective actions. When something breaks, you coordinate the response and ensure it doesn't break the same way again.

Operational readiness reviews. Before any AI service goes live, you validate that it meets reliability, security, and operational standards. You are the gate between "it works in dev" and "it's ready for production."

Capacity planning and cost monitoring. Track resource consumption, forecast capacity needs, and monitor costs: tokens, compute, storage. You ensure the platform scales without surprises.

Toil elimination. Identify and automate repetitive operational tasks. If a human is doing something a script could do, you fix that.

What You Won't Own

- Application development or AI model building: you ensure what they build is operable; you don't build it.
- Infrastructure provisioning: IT provides the infrastructure; you define what's needed and validate that it works.
- Business process decisions or backlog prioritization.

What Makes This Role Different

- AI services have failure modes that traditional applications don't: model drift, token budget exhaustion, prompt injection, upstream data quality degradation. You will build monitoring for problems that most SRE teams have never encountered.
- You are applying SRE principles from scratch. There is no existing SRE practice to inherit; you will define it for the platform.
- Your operational readiness reviews directly determine whether AI services go live. You have real authority to say "not ready."

Required Qualifications

- Bachelor's degree in Computer Science, Software Engineering, or a related field, plus 5 years of experience; or Master's degree plus 3 years of experience.
- Production SRE or DevOps experience: you have owned the reliability of systems that real users depended on, not just built CI/CD pipelines.
- Hands-on experience with monitoring and observability tools: Prometheus, Grafana, Datadog, ELK, CloudWatch, or similar. You have built dashboards and alerts that caught real problems.
- Strong scripting and automation skills: Python, Bash, infrastructure-as-code (Terraform, CloudFormation, or similar).
- Experience with containerized environments: Docker, Kubernetes, container orchestration at scale.
- Experience defining and managing SLOs, error budgets, and incident response procedures in production.
- U.S. citizenship required. Department of Defense Secret security clearance is required at time of hire.

Preferred Qualifications

- Experience with AI/ML production systems: model serving, inference monitoring, token cost tracking, or similar.
- Multi-cloud experience (AWS, Azure, GCP), including cloud-native monitoring and logging services.
- Experience building operational readiness review processes or production launch checklists.
- Familiarity with Google SRE principles: you have read the book and applied the concepts, not just referenced them in interviews.
- Experience in environments where reliability has compliance or safety implications: defense, healthcare, finance, or critical infrastructure.

What Sets You Apart

- You think about failure before you think about features. Your first question about any new system is "how does this break?"
- You automate yourself out of toil. If you're doing the same thing twice, you write a script.
- You have said "not ready" to a team that wanted to ship, and you were right.
- You build monitoring that tells you what's wrong, not just that something is wrong.
- You write post-incident reviews that actually change how systems are built, not just how incidents are documented.

Details

- Remote, 100% telework
- 9/80 schedule
- Defense industry experience is not required

Salary Note

This estimate represents the typical salary range for this position based on experience and other factors (geographic location, etc.). Actual pay may vary. This job posting will remain open until the position is filled.

Combined Salary Range

USD $142,696.00 - USD $158,303.00 /Yr.

Company Overview

General Dynamics Mission System
