Netflix

Human Evaluation - Program Manager

6d ago

$230k - $340k · Management · United States
Program Management · Data Operations · AI Operations · Human Evaluation · Data Labeling · Senior

Job Description

At Netflix, our mission is to entertain the world. Together, we are writing the next episode: pushing the boundaries of storytelling and global fandom, and making the unimaginable a reality. We are a dream team obsessed with the uncomfortable excitement of discovering what happens when you merge creativity, intuition, and cutting-edge technology. Come be a part of what’s next.

About the Role

Netflix is building toward more intelligent and responsive systems, and thoughtful, high-quality evaluation is essential to making sure we’re moving in the right direction. Join a team that is creating the frameworks, tools, and workflows that ensure human judgment is applied with consistency, clarity, and care, whether we’re evaluating helpfulness, tone, safety, relevance, or creative quality.

You’ll not only shape how human and AI-driven evaluations are designed but also own the day-to-day execution of these efforts. From scoping and planning to rater onboarding and calibration, you’ll be accountable for driving delivery from start to finish. Just as critically, you’ll act as a thought partner and influencer, bringing stakeholders along as you introduce new ways of working, build alignment across teams, and establish a shared language around quality. Your work will help ensure that AI features at Netflix are not only high-performing but also aligned with our values, our users, and the creative integrity that defines our brand.

You’ll work in a small team to ensure that evaluation designs are not only rigorous and aligned but also effectively resourced, scoped, and executed at scale.

The Opportunity

This is a rare opportunity to get in on the ground floor of a function that will shape how we measure and guide the performance of AI systems at Netflix. In this role, you’ll partner across research, product, UX, and engineering to develop frameworks, rubrics, and workflows that enable rigorous, scalable human evaluation. But beyond shaping the "what" and "how," you’ll also lead the "when" and "done": you’ll be responsible for keeping evaluation projects on track, ensuring consistent execution, timely delivery, and high rater alignment. If you’re excited to bring structure to ambiguity and influence how Netflix develops responsible AI, while being accountable for tangible delivery, this is your chance to create meaningful impact from day one.

The Ideal Candidate

The ideal candidate brings a rare combination of structure and flexibility. You know how to create evaluation frameworks that are rigorous and scalable, and you’re also a driver who gets them out the door. You’re skilled at translating vision into workflows, defining milestones, and delivering consistent results in a dynamic environment. You can steer teams across functions, keep timelines on track, and ensure rater quality without micromanaging. You thrive in spaces where there’s no roadmap, and you take pride in making things real, not just possible.

Responsibilities

- Lead end-to-end execution of human evaluation and data operations initiatives, from intake and scoping to delivery
- Develop and operationalize frameworks for evaluating GenAI and ML outputs
- Collaborate across research, product, UX, and engineering to embed evaluation into model development cycles
- Build and maintain project timelines, proactively manage blockers, and ensure timely execution
- Develop clear, scalable guidelines and scoring rubrics to ensure consistent rater judgment
- Oversee rater onboarding, calibration, and QA workflows
- Define and monitor success metrics such as speed to inter-rater reliability (IRR), throughput, and task effectiveness
- Pilot and refine evaluation tasks to improve clarity, inter-rater reliability, and feedback quality
- Build foundational documentation and drive adoption of best practices across teams
- Track evaluation health and communicate progress to stakeholders clearly and proactively
- Anticipate and proactively resolve bottlenecks and blockers
- Act as the connective tissue across multiple partners to ensure alignment and effective execution of evaluations at scale

Qualifications

- 4+ years of experience in human evaluation, data collection, labeling, or annotation operations in GenAI/ML environments
- Track record of implementing process improvements or quality control systems for data collection needs
- Prior experience managing human annotation vendors, raters, or data labeling teams
- Strong understanding of evaluation design, including guidelines, rubrics, and scoring protocols
- Proven ability to manage complex, cross-functional programs end to end, with strong program management skills and clear accountability for successful delivery
- Experience with human labeling platforms
- Excellent written and verbal communication skills
- Ability to synthesize feedback into clear recommendations and process improvements
- Familiarity with responsible AI principles and how to embed them into evaluation design
- Strong organizational skills and executional focus; ability to track details while seeing the bigger picture
- Gener