Senior Site Reliability Engineer
Granicus
DevOps
Canada, Germany, Netherlands +2 more
Job Description
The Company
Serving the People Who Serve the People

Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the Govtech industry by bringing governments and their constituents together. We are on a mission to support our customers in meeting the needs of their communities and implementing our technology in ways that are equitable and inclusive. Granicus has consistently appeared on the GovTech 100 list over the past 5 years and has been recognized as one of the best companies to work for on BuiltIn.

Over the last 25 years, we have served 5,500 federal, state, and local government agencies, and more than 300 million citizen subscribers power an unmatched Subscriber Network that uses our digital solutions to make the world a better place. With comprehensive cloud-based solutions for communications, government website design, meeting and agenda management software, records management, and digital services, Granicus empowers stronger relationships between government and residents across the U.S., U.K., Australia, New Zealand, and Canada. By simplifying interactions with residents while disseminating critical information, Granicus brings governments closer to the people they serve, driving meaningful change for communities around the globe.

Job Summary

Granicus is seeking a Senior Site Reliability Engineer (SRE) with strong AIOps, automation, and AI proficiency to modernize reliability engineering through observability, intelligent incident response, and responsible AI-assisted operations. In this role, you will improve service reliability, reduce operational toil, accelerate incident response, and help build scalable, resilient platforms supporting traditional, cloud-native, and AI/ML-powered workloads.
The role will also help operationalize AI-enabled SRE practices such as alert intelligence, assisted root-cause analysis, runbook automation, telemetry summarization, and governed self-healing workflows with appropriate human approval and audit controls.

What Your Impact Will Look Like

- Provide on-call production support using data-driven triage and AI-assisted insights to improve response speed and quality.
- Investigate customer and internal issues, support high-priority escalations, and drive rapid service restoration.
- Build and improve AI-assisted workflows for alert correlation, anomaly detection, telemetry summarization, noise reduction, incident enrichment, and controlled automated remediation.
- Design and maintain observability across logs, metrics, traces, and events using platforms such as ELK/OpenSearch and cloud-native monitoring tools.
- Lead incident troubleshooting by using telemetry, event correlation, deployment context, AI-generated summaries, and historical incident patterns to accelerate root cause identification and preventive fixes.
- Develop automation, runbooks, ChatOps workflows, and self-healing capabilities with human-in-the-loop approval, confidence thresholds, rollback plans, and audit trails.
- Drive system improvements that strengthen reliability, scalability, performance, and operational resilience.
- Partner with engineering, platform, and product teams to improve deployment safety, operational readiness, and service reliability.
- Maintain runbooks, troubleshooting guides, knowledge bases, and post-incident documentation that can be consumed by AI assistants and on-call engineers to improve readiness and knowledge sharing.
- Support capacity planning, performance tuning, SLO-based reliability practices, and proactive risk reduction.
- Apply security, privacy, access control, data protection, prompt safety, and operational guardrails across systems, automation, and AI-enabled services.

You Will Love This Job If You Have

- 5+ years in SRE, DevOps, system
administration, or a similar role supporting large-scale, high-availability cloud environments.
- Strong expertise in Linux/Unix, networking, distributed systems, and cloud platforms such as AWS, Azure, or Google Cloud.
- Hands-on experience with observability platforms such as ELK, OpenSearch, Prometheus, Grafana, or similar tools.
- Strong understanding of AIOps concepts including anomaly detection, alert correlation, event deduplication, intelligent alerting, AI-assisted RCA, incident summarization, predictive signals, and automated or guided remediation.
- Strong scripting and automation skills using Python, Bash, Go, Java, or similar languages.
- Experience with infrastructure automation and configuration tools such as Terraform, Ansible, Chef, or Puppet.
- PreExperience supporting AI/ML or GenAI-enabled platforms, including model deployment, inference reliability, observability, latency, capacity, cost controls, and services such as AWS Bedrock, SageMaker, Azure AI, Google Vertex AI, or equivalent platforms.
- Familiarity with LLMOps or MLOps practices, including prompt/version management, evaluation, monitoring, drift detection, retrieval or knowledge-base integration, cost awareness, and model
