Staff Test Engineer - AI

India · Hyderabad · Remote · Full-Time · Lead

Engineering · QA Engineer

About Outreach

Outreach, founded in 2014, is the only complete agentic AI platform for revenue teams. Outreach combines agentic AI, conversation intelligence, and assistive AI to power hundreds of use cases across revenue motions. From new logo prospecting to expansions, deal acceleration, driving retention, and forecasting, Outreach AI automates workflows and frees sellers to focus on more strategic conversations and actions. Revenue leaders benefit from connected account visibility, performance insights, and higher forecasting accuracy across every GTM team. World-leading enterprise organizations, including Databricks, SAP, Siemens, and Verizon, use Outreach to power their revenue teams.

Summary:

At Outreach, we build the technology that powers the world’s leading sales execution platform — and over the last year, we have been moving fast to bring AI to the center of how revenue teams work. We have shipped agents that research accounts, personalize outreach, run meetings, and drive revenue workflows end-to-end. With Ask Outreach, we have built a fully agentic, conversational platform on LangGraph that lets users interact with their Outreach data and workflows in entirely new ways.

As we scale the depth and breadth of our AI platform, quality is not an afterthought — it is foundational. We are looking for a Staff AI Test Engineer who is first and foremost an exceptional quality engineer, and who brings a genuine curiosity and working understanding of how AI and LLM-based systems behave, fail, and improve. If you are passionate about building rigorous test strategies for complex, probabilistic systems at scale, we want to talk to you.

Description:

We are seeking a Staff-level engineer to own quality for our GenAI platform and agent ecosystem. This is a high-impact, strategic role where you will define and lead testing practices across a rapidly evolving agentic platform — including the agents themselves, the tools they call, the LangGraph orchestration layer, and the underlying ML pipelines and data flows.

This role requires someone who understands the unique challenges of testing AI systems: outputs are not always deterministic, correctness is often contextual, and traditional pass/fail assertions are insufficient on their own. You will design and implement evaluation frameworks that combine deterministic validation with LLM-based grading, establish quality standards for agent behavior, and partner closely with Data Science, Engineering, and Product teams to make quality a shared discipline.

You will be a senior voice in how we build, ship, and continuously improve AI products at Outreach.

Responsibilities:

Own the AI Quality Strategy. Define and lead the end-to-end testing strategy for Outreach’s GenAI platform, including agentic workflows, LLM tool calls, LangGraph orchestration, and supporting ML pipelines.

Build Evaluation Frameworks. Design and implement evaluation systems that handle both deterministic and non-deterministic outputs — combining rule-based assertions, golden dataset testing, and LLM-as-Judge approaches to grade agent responses at scale.

Test Agents End-to-End. Own testing across Outreach’s suite of AI agents — Revenue Agent, Research Agent, Meeting Agent, Personalisation Agent, and Ask Outreach — covering functional correctness, tool selection accuracy, context handling, and response quality.

Partner with DS and Engineering. Work closely with Data Science, MLOps, and platform engineers to ensure testability is designed in from the start — not bolted on after.

Drive CI/CD for AI. Integrate evaluation pipelines into CI/CD workflows so that regressions in agent behavior are caught before they reach production.

Define Quality Metrics. Establish and track metrics that matter for AI systems: answer quality scores, tool invocation accuracy, hallucination rates, latency, and regression trends over model and prompt changes.

Champion Best Practices. Define standards for AI testing across the org — including prompt regression testing, retrieval quality evaluation, and agent behavior contracts.

Mentor and Influence. Raise the quality bar across engineering teams by mentoring engineers, reviewing designs for testability, and advocating for quality-driven development practices.

Stay Current. Actively track developments in AI evaluation tooling, LLM benchmarking, and testing research — and bring relevant advances into our practice.

Minimum Qualifications:

7–12 years of experience in software development and/or test automation, with demonstrated experience leading quality efforts on complex, distributed systems.

B.S. in Computer Science or a related technical field.

Strong programming skills in Python, with experience writing reusable, maintainable test frameworks.

Proven experience testing large-scale backend or platform systems, including microservices and API layers.

Deep understanding of test design principles, CI/CD integration, and scalable test automation.

Experience with test frameworks such as PyTest or equivalent.

Solid understanding of evaluation methodologies for non-deterministic systems — including statistical assertions, behavioral testing, and regression baselines.

Hands-on experience with Databricks for building and validating ML pipelines and data workflows.

Experience with MLflow for experiment tracking, model versioning, and pipeline observability.

Strong communication and collaboration skills across engineering, data science, and product functions.

Preferred Qualifications:

Experience testing GenAI products, LLM-based systems, or agentic AI platforms.

Experience with prompt engineering and prompt tuning — understanding how prompt changes affect model behavior and building regression suites to catch prompt-driven regressions.

Hands-on experience with LLM-as-Judge evaluation patterns — using LLMs to grade LLM outputs at scale.

Familiarity with LangGraph, LangChain, or similar agent orchestration frameworks.

Experience with ML pipelines, ML workflow tooling (e.g., MLflow, Kubeflow, Metaflow), or model evaluation workflows.

Understanding of RAG (Retrieval-Augmented Generation) architectures and how to evaluate retrieval quality.

Experience with cloud platforms (AWS, GCP, or Azure) and containerized environments (Docker, Kubernetes).

Domain knowledge in sales, sales engagement, or CRM platforms (e.g., Salesforce, HubSpot, or similar) — understanding the workflows, terminology, and data that sales teams operate with.

Prior experience contributing to AI quality strategies in a product or research environment.

We’re an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

Our success is reliant on building teams that include people from different backgrounds and experiences who can elevate assumptions and ideas with fresh perspectives. We're dedicated to hiring the whole human, not just a resume. To that end, we look for a diverse pool of applicants, including those from historically marginalized groups. We would like to invite you to apply even if you don't think you meet all of the requirements listed above. We don't want a few lines in a job description to get between us and the opportunity to meet you.