Deimos

Senior Site Reliability Engineer

21h ago

0, Devops, Kenya, himalayas
Site-Reliability-Engineering, DevOps-Engineering, Cloud-Engineering, Platform-Engineering, Professional-Services, Senior

Chill analysis

0/10

How async / no-phone this role is, scored from the listing text:

  • on-call (-2)

Job Description

Deimos is a Cloud-native Developer and Security Operations technology services company. We help companies of all sizes adopt the Cloud for improved service delivery to their clients. We’re a fully remote African-based team of engineers who are passionate about implementing engineering best practices. We leverage the latest technologies while building globally competitive solutions for our clients. With Deimos being one of the two moons of Mars, we refer to ourselves as “Martians” who are on a mission to Mars, together. Our teams value the ability to learn and adapt to technology changes while appreciating solid foundational design and the craft of software engineering. As such, our engineers enjoy working with various clients who have different problems to solve. If this sounds like you, then you would be an ideal fit for our environment. However, you must be based in one of the countries we currently hire in: Kenya, Ghana, Nigeria, South Africa, and Senegal.

Role Overview

We are looking for an experienced Senior Site Reliability Engineer to join our Professional Services team and deliver Software and DevSecOps projects. You will report to a Site Reliability Engineering Manager.

SRE / DevOps is one of our core competencies. You will be part of a highly skilled team that continuously innovates and delivers high-value solutions to clients across various industries on all public clouds (AWS, Azure, GCP, etc.). Technologies we work with daily include Kubernetes, Helm, Terraform, and GitOps, to name a few.

What you will be doing

Enablement & RelOps Culture

  • Implement the Observability Ladder: Guide teams from basic monitoring to high-signal metric tracking. Work with product teams to define SLAs, SLIs, and SLOs, and build dashboards that track specific error budgets.
  • Empower Product Teams: Build frameworks and deployment tooling (e.g., CI/CD, internal tooling integrations) that allow teams to make data-driven decisions on deployment safety and automate rollbacks when error budgets are depleted.
  • Champion Reliability: Drive a blameless post-mortem culture focused on actionable takeaways, system improvements, and measurable metrics (MTBF, MTTR).

Frameworks & Automation

  • Standardised Alerting & On-Call: Continuously improve company-wide alerting and on-call frameworks to reduce alert fatigue, ensuring alerts are highly actionable and symptom-based.
  • Disaster Recovery: Drive the evolution of DR strategies from manual processes into fully automated runbooks-as-code, allowing teams to prove and improve service recoverability through autonomous, evidence-based testing.
  • Eliminate Toil: Develop systems, automations, and tooling in Python (or similar) for pre- and post-deployment verification, so that our hands-off reliability vision becomes a production reality.
  • Reliability-as-Code: Lead the drive to manage our entire reliability suite through IaC. Use Terraform to architect, deploy, and configure our observability stack, including ELK, Grafana, Loki, Prometheus, and Tracing.

What you must have

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 5+ years of experience in Software Engineering, SRE, DevOps, or Platform Engineering, with demonstrable ownership of reliability standards at a team or company level.
  • Strong coding fluency: Proficiency in Python (or similar) with the ability to read, understand, reason about, and write production-grade automation code.
  • Cloud & IaC: Hands-on experience with AWS, and a solid understanding of Infrastructure as Code (Terraform or CloudFormation).
  • Deep Observability Knowledge: Demonstrable experience with monitoring tools (DataDog, Prometheus, ELK stack). Strong understanding of SRE concepts including Golden Signals, high-cardinality data handling, and error budget mathematics.
  • Systems Thinking: Strong grasp of designing for scale and resilience, including graceful failure, circuit breaking, connection pooling, and multi-AZ deployments.
  • Proven ability to define and drive reliability standards across multiple teams and drive a blameless post-mortem culture.

Qualities & Behaviours

  • Exceptional interpersonal and communication skills.
  • A zest for automation.
  • Comfortable working as a remote team member.
  • Ability to keep up to date with DevOps/SRE best practices, trends and innovation.
  • Passionate about mentoring and growing technical skills within the team.

Expected Output for the role

  • Automate Azure infrastructure provisioning and configuration using PowerShell, YAML and Bicep.
  • Monitor and troubleshoot issues in the Azure environment, including network, storage, and compute resources.
  • Deploy and manage Azure Databricks infrastructure for data processing and analytics.
  • Attend to support tickets, which may arise due to product components not functioning as expected.
  • Develop and maintain technical support documentation of the product.
  • Promote innovations to support business requirements through activities that test, pilot and implement innovative concepts.
  • Responsible for suppor