Tyk

Site Reliability Engineer

4h ago

DevOps, Canada, himalayas
Site Reliability Engineering, DevOps, Cloud Engineering, Infrastructure Engineering, Kubernetes Administration, Mid-level

Job Description

Who are Tyk, and what do we do?

The Tyk API Management platform is helping to drive the connected world and power new products and services. We’re changing the way that organisations connect any number of their systems and services. Whether internal, external, public or highly encrypted systems, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare, or media industries (to name just a few!).

If you’ve banked online, used an app to check the news, or perhaps even driven a connected car, APIs, and by extension Tyk, make that possible.

Founded in 2015 with offices in London (UK), London (Ontario), Atlanta and Singapore, we have many thousands of users of our B2B platform across the globe. Brands using Tyk range from Lotte, Bell and T-Mobile to RBS, Capital One and Vinci. We have a varied user base hailing from every continent, even Antarctica.

Our Mission

Tyk is on a mission to connect every system in the world. We’ve started by building an API Management platform.

Total flexibility, default remote, radical responsibility

We offer unlimited paid holidays and remote working from anywhere in the world, for everyone. Why? Tyk was founded on the principle of offering flexibility and autonomy to our employees; we believe this allows them to achieve their best results. It also means we can build the best possible team, because location and working hours are no barrier. If this sounds like an environment that could work for you, read on to find out more.

The role:

We’re looking for a Site Reliability Engineer to manage, maintain, improve and provide support on our platform. You will be curious by nature and always looking for ways to improve, as we will look to you for new ideas, solutions and metrics on how we can improve the platform. You will also be our first line of incident management for our clients and will help define our response going forward.
This is a great opportunity to become an integral part of Tyk as we continue on our journey.

As a remote-first company, we offer you the opportunity to work with an industry-leading distributed team. Having access to expertise from across the globe will give you both the support and the opportunity to help shape not only Tyk’s Cloud platform but also Tyk as a whole as we continue to grow.

Requirements

Here’s what you’ll be responsible for:
- Maintaining global Tyk Cloud within SL(A/I/O)s you will help to define
- Identifying reliability issues and working together with your squad to solve them
- Identifying and introducing new metrics and building relevant dashboards
- Participating in the on-call rotation
- Working with your squad to expand multi-region and multi-cloud reach of the platform
- Documenting operational knowledge
- Conducting post-incident analysis
- Automating common tasks
- Being a key shaper of and contributor to our continuous improvement agenda, be it the clarity of our user stories, how we estimate, or how we communicate with other teams or customers; we expect this role to be an advocate of continuous improvement
- Reliability of our new global Tyk Cloud platform
- Automation of operations and support
- Writing and maintaining documentation on SRE processes and policies
- Recommending and implementing ways of driving operational efficiency and driving down our cost to run, without impacting service
- Assisting in penetration testing for Cloud through liaising with our provider, providing technical details, and environment setup
- Incident management

Here’s what we’re looking for:

Experience:
- Strong collaboration skills
- Launching and operating production-scale Kubernetes clusters
- Designing and operating infrastructure on AWS and other providers
- Operating MongoDB (or other document database) clusters
- Operating Redis (or other key-value storage) clusters
- Administering Linux servers
- Maintaining distributed software
- Operating Prometheus and Grafana
- Operating logging collection and analysis systems
- Participating in the on-call rotation (16:00 – 04:00 UTC)

Skills:
- Kubernetes & containers (advanced)
- AWS / EKS (advanced)
- Linux (advanced)
- Terraform and IaC in general (proficient)
- Helm (proficient)
- Go and/or Python (familiar)
- MongoDB (or similar)
- Redis (or similar)
- Monitoring: Prometheus, Grafana, Thanos (familiar)
- Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
- Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP)
- Proactive, energetic, innovative and change oriented

Nice to have:
- GCP or Azure
- Bare metal infrastructure engineering
- API management experience
- Large scale distributed storage management
- Familiarity with Rancher
- CKA/CKAD/CKS
- Creating and delivering production software in Go

Benefits

Here’s why you should join us:
- Everyone has unlimited paid holiday. We have total flexibility in hours, as we believe creativity flows better when our people are given freedom to decide when they are most productive. Everyone is unique after all.
- Employee share scheme
- Generous maternity and paternity leave
- Company retreats

We all share the same vision – we value authenticity, respect, re