Machine Learning Engineer — Inference Optimization
Featherless AI
3h ago
Data | Australia, Canada, Germany +3 more | himalayas
AI-Inference-Engineer, AI-Optimization-Engineer, Mid-Level-AI-Inference-Engineer, Machine-Learning-Engineer, ML-Inference-Engineering, Mid-level
Job Description
About the Role

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production, turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.

This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

What You’ll Do

- Optimize inference latency, throughput, and cost for large-scale ML models in production
- Profile GPU/CPU inference pipelines and identify bottlenecks (memory, kernels, batching, IO)
- Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization & reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
- Collaborate with research engineers to productionize new model architectures
- Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
- Improve system reliability, observability, and cost efficiency under real workloads

What We’re Looking For

- Strong experience in ML inference optimization or high-performance ML systems
- Solid understanding of deep learning internals (attention, memory layout, compute graphs)
- Hands-on experience with PyTorch (or similar) and model deployment
- Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
- Experience scaling inference for real users (not just research benchmarks)
- Comfortable working in fast-moving startup environments with ownership and ambiguity

Nice to Have

- Experience with LLM or long-context model inference
- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
- Experience optimizing across different hardware vendors
- Open-source contributions in ML systems or inference tooling
- Background in distributed systems or low-latency services

Why Join Us

- Real ownership over performance-critical systems
- Direct impact on product reliability and unit economics
- Close collaboration with research, infra, and product
- Competitive compensation + meaningful equity at Series A
- A team that cares about engineering quality, not hype

Originally posted on Himalayas
