Senior Manager Site Reliability Engineering
Akamai Technologies
5d ago
DevOps · Poland · Himalayas
Site-Reliability-Engineering, Engineering-Management, Infrastructure-Engineering, DevOps, Cloud-Computing, Senior
Job Description
Do you thrive on building reliability into AI infrastructure from the ground up? Prepare to lead SRE initiatives for AI solutions, optimizing GPU clusters and serverless inference, while ensuring global-scale performance and reliability.

Join the Akamai AI Team!

Akamai's Cloud Technology Group delivers AI infrastructure at global scale. Our GPU compute platform provides customers with dedicated GPU resources, from single GPUs to full clusters, for training, simulation, inference, and any workload they choose to run. Site Reliability Engineering is embedded from the start to ensure production-grade reliability and performance.

Partner with the best

As Senior Manager, you will lead the team responsible for reliability across Akamai's AI compute and platform services. You will also build the team, owning hiring strategy, candidate evaluation, and interview coordination for AI SRE roles. This is a hands-on leadership role that requires partnering with product engineering teams to embed reliability into products that are moving fast.

As a Senior Manager of SRE, you will be accountable for:

- Fostering and growing the AI SRE team by recruiting, guiding, and supporting career development, elevating SRE expertise throughout the organization.
- Defining and implementing SRE practices for Akamai's AI compute and platform services, encompassing SLOs, error budgets, capacity planning, and fault management.
- Ensuring operational readiness for AI products by establishing quality gates, on-call rotations, runbooks, and escalation paths for AI infrastructure failure modes.
- Partnering with product engineering teams to embed reliability into the development lifecycle, influencing architecture and deployment decisions.
- Scaling operations through software and automation, reducing toil and driving the team toward programmatic solutions over manual intervention.
- Owning incident management integration for AI workloads, including post-incident analysis and driving systemic improvements that prevent recurrence.

Do what you love

To be successful in this role you will:

- Have extensive experience in SRE, infrastructure, or platform engineering, with expertise in leading SRE teams.
- Have a track record of building SRE teams and practices, ideally in an environment where SRE was new or being established.
- Demonstrate expertise in SLOs/SLIs, observability tools, and large-scale incident management while ensuring operational efficiency.
- Demonstrate expertise with Kubernetes and containerization in large-scale environments.
- Demonstrate expertise in Python or Go automation and tooling, while possessing knowledge of Linux systems and networking fundamentals.
- Manage CI/CD pipelines, implement deployment safety measures, and utilize infrastructure-as-code tools like Terraform or similar alternatives.
- Build relationships with product engineering teams while effectively communicating SRE value in terms relevant to engineering partners.

Work in a way that works for you

FlexBase, Akamai's Global Flexible Working Program, is based on the principles that are helping us create the best workplace in the world. When our colleagues said that flexible working was important to them, we listened. We also know flexible working is important to many of the incredible people considering joining Akamai. FlexBase gives 95% of employees the choice to work from their home, their office, or both (in the country advertised). This permanent workplace flexibility program is consistent and fair globally, to help us find incredible talent, virtually anywhere. We are happy to discuss working options for this role and encourage you to speak with your recruiter in more detail when you apply.
Learn what makes Akamai a great place to work

Connect with us on social and see what life at Akamai is like!

We power and protect life online by solving the toughest challenges, together.

At Akamai, we're curious, innovative, collaborative and tenacious. We celebrate diversity of thought and we hold an unwavering belief that we can make a meaningful difference. Our teams use their global perspectives to put customers at the forefront of everything they do, so if you are people-centric, you'll thrive here.

Working for you

At Akamai, we will provide you with opportunities to grow, flourish, and achieve great things. Our benefit options are designed to meet your individual needs for today and in the future. We provide benefits surrounding all aspects of your life:

- Your health
- Your finances
- Your family
- Your time at work
- Your time pursuing other endeavors

Our benefit plan options are designed to meet your individual needs and budget, both today and in the future.

About us

Akamai powers and protects life online. Leading companies worldwide choose Akamai to build, deliver, and secure their digital experiences, helping billions of people live, work, and play every day. With the world's most distributed compute platform, from cloud to edge, we make it easy for customers to develop and run applications, while we keep experiences closer to users
