Senior AI DevOps / LLMOps
TechBiz Global
2h ago
DevOps
Australia, Canada, France +4 more
Himalayas
AI-DevOps, LLMOps, MLOps, Site-Reliability-Engineering, Cloud-Engineering, Senior-AI-ML-Operations-Engineer, Senior-AI-LLM-Engineer, Senior-AI-ML-Developer, Senior-AI-ML-Engineer, Senior
Job Description
At TechBiz Global, we provide recruitment services to the top clients in our portfolio. We are currently seeking a Senior AI DevOps / LLMOps specialist to join one of our clients' teams. If you're looking for an exciting opportunity to grow in an innovative environment, this could be the perfect fit for you.

Key Responsibilities

1. Automation of Build-to-Production
- Design and implement robust CI/CD pipelines tailored for AI, covering model weights, dataset versioning, and application code.
- Develop specialized workflows for PromptOps, ensuring that system prompts are version-controlled, tested for regressions, and deployed with the same rigor as traditional code.
- Automate the deployment of agentic workflows, managing the complexities of stateful AI interactions and multi-agent handoffs.

2. AI Infrastructure as Code (IaC)
- Provision and manage high-performance compute environments (GPU clusters, TPU pods) using Terraform, Pulumi, or Ansible.
- Define and enforce Policy-as-Code for AI endpoints to ensure compliance with security, cost-usage limits, and data residency requirements.
- Maintain a consistent environment across hybrid infrastructure, ensuring seamless parity between on-premises development and cloud production.

3. Safe Experimentation & Controlled Releases
- Architect Progressive Delivery strategies for AI, including canary releases, blue-green deployments, and shadowing (where new models run in parallel with production to compare outputs).
- Build “Evaluation-in-the-Loop” gates within the pipeline to automatically test for bias, hallucination, and performance degradation before a release.
- Implement A/B testing frameworks specifically designed for LLM outputs and agentic behavior.

4. Monitoring & Observability
- Establish deep observability into inference endpoints, tracking metrics like tokens per second, latency, and drift in model accuracy.
- Integrate feedback loops that capture production “edge cases” to feed back into the training and fine-tuning pipelines.

Requirements

Must-Have Technical Skills:
- Orchestration: Advanced Kubernetes (K8s) skills, specifically with Kubeflow, Ray, or NVIDIA Triton.
- CI/CD & IaC: Expertise in GitHub Actions/GitLab CI, and Terraform or Pulumi.
- AI Tooling: Experience with Weights & Biases, MLflow, LangSmith, or Arize Phoenix.
- Hardware: Understanding of GPU virtualization, CUDA drivers, and on-premises hardware management.
- Security: Familiarity with Open Policy Agent (OPA) and secret management (Vault).

Experience:
- 10+ years in DevOps, SRE, or Cloud Engineering.
- 2+ years of hands-on experience in MLOps or LLMOps, specifically moving LLMs from notebook to production.
- Proven experience managing hybrid cloud environments (e.g., AWS/Azure + private data center).

Highlights
- Full-time, remote job
- Fluent English is needed

Originally posted on Himalayas
