Principal Applied ML Researcher (Agentic Systems & Applied AI Platform)
Red Cell Partners
4d ago
$230k - $300k | Other | United States | himalayas
Machine-Learning-Research, AI-Research, Applied-AI, LLM-Engineering, MLOps, Senior
Job Description
About Us
Red Cell Partners is an incubation firm building and investing in rapidly scalable technology-led companies that are bringing revolutionary advancements to market in three distinct practice areas: healthcare, cyber, and national security. United by a shared sense of duty and deep belief in the power of innovation, Red Cell is developing powerful tools and solutions to address our Nation’s most pressing problems.

About Trase
Co-founded in 2023 by Joe Laws and Grant Verstandig, Trase Systems is AI, Uncomplicated. Trase empowers enterprise leaders to harness the full potential of AI without the associated complexity and risks. We are an end-to-end solution for deploying, managing, and optimizing AI in the enterprise. Our platform specializes in bridging the “last mile” of AI adoption, unlocking AI's full potential while driving efficiency and significant cost savings.

Trase is at the forefront of AI Agent innovation, topping the Hugging Face GAIA Leaderboard for Generalized AI Assistants, ahead of industry giants such as Google, Meta, Microsoft, and OpenAI. We are leveraging our cutting-edge technologies to develop mission-critical agentic applications in complex industries such as Healthcare, Oil & Gas, and National Security.

About the Role
As a Principal Applied ML Researcher, you will define and drive the ML and LLM strategy for Trase OS, the agentic execution platform powering deployments in regulated environments.

You are responsible for how models behave inside real production systems - including agent workflows, tool use, and long-lived execution - not just offline model performance.

This is a hands-on technical leadership role operating at the intersection of research, systems, and product.
You will drive technical breakthroughs in agentic infrastructure and applied AI systems, own the end-to-end research-to-production lifecycle, and set the standard for how ML systems are designed, evaluated, and deployed across Trase.

Why This Role Exists
Trase OS coordinates long-lived agents, multi-step workflows, tool-augmented LLMs, and execution in regulated environments.

As the system scales, the core challenge shifts from model capability to system correctness and reliability: models may succeed offline but fail in real workflows, agent behavior can become unpredictable or unsafe, evaluation can drift from real outcomes, and ML decisions can introduce system-level instability.

This role defines how ML systems are integrated into execution systems, evaluated end-to-end, and operated reliably in production.

Responsibilities

Technical Leadership & Innovation
Drive technical breakthroughs in agentic systems, applied ML infrastructure, and LLM-based applications.
Define and evolve the ML/LLM strategy and technology roadmap in alignment with product development.
Act as a principal technical authority, making high-impact architectural and modeling decisions across teams.

Research → Prototype → Production
Develop prototypes for key technologies to validate new approaches and de-risk system design.
Own the full lifecycle from research and experimentation through production deployment, monitoring, and iteration.
Translate advances in ML into scalable, production-grade systems with measurable impact.

Agentic Systems & Applied ML
Design how LLMs operate within agent workflows, tool use, multi-step reasoning, and long-lived execution.
Implement and refine prompting strategies, multi-agent orchestration, memory management, and human-in-the-loop controls for safety and reliability.
Establish patterns for planning, decision-making, and tool orchestration within complex systems.

Evaluation, Quality & Reliability
Own end-to-end quality evaluation of ML-powered systems, including
defining metrics, benchmarks, and testing frameworks.
Establish evaluation systems that connect model performance to task success and system-level outcomes.
Ensure systems behave predictably, safely, and reliably in production through monitoring, regression testing, and robust failure handling.

ML Systems & Platform Integration
Contribute to the design of ML systems supporting the full lifecycle, including training, fine-tuning, evaluation, deployment, and monitoring.
Drive architecture decisions across model serving, routing, orchestration, and latency and cost optimization.
Work across infrastructure layers, including cloud and containerized systems, to ensure scalable and efficient deployment.

Enterprise & Regulated Deployments
Build and deploy enterprise-grade AI systems used by global customers in production environments.
Design systems that operate reliably in regulated and constrained settings, including on-premise, air-gapped, and secure cloud environments.
Ensure systems are auditable, explainable, and compliant with regulatory and organizational requirements.

Communication & Influence
Write technical reports and design documents summarizing R&D progress, system behavior, and key decisions.
Communicate complex ML concepts and tradeoffs clearly to both technical and n
