Head of AI Evaluation & Reliability Engineering
Location: Flexible / Hybrid
Reports To: Head of Engineering
Role Mission
Build and scale Codvo’s AI Evaluation & Reliability Engineering capability as a core engineering function supporting the design, validation, and continuous improvement of enterprise AI systems in production.
You will architect the frameworks, tooling, benchmark assets, and operational processes required to ensure AI systems deployed by Codvo and its customers meet enterprise standards for reliability, safety, performance, and governance.
This role is deeply embedded within engineering and serves as the quality and reliability backbone for Codvo’s AI platform and delivery organization.
Why This Role Matters
As AI systems move from pilots to business-critical workflows, reliability and evaluation become core engineering disciplines—not optional afterthoughts.
Codvo is building the infrastructure and operational rigor required to ensure every AI deployment is measurable, governed, and production-ready.
Core Responsibilities
Engineering Ownership
- Build Codvo’s AI Evaluation & Reliability Engineering function as a core platform/engineering capability.
- Define engineering standards for AI evaluation, testing, release gating, and runtime monitoring.
- Integrate evaluation/reliability frameworks into Codvo’s engineering and delivery lifecycle.
Evaluation Architecture
- Design reusable evaluation frameworks for:
  - LLM / multimodal quality
  - RAG grounding / evidence fidelity
  - Agent reasoning / decision quality
  - Tool / workflow execution success
  - Safety / policy / compliance adherence
  - Cost / latency / production economics
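As an illustration of the kind of reusable check such a framework might standardize (here for RAG grounding / evidence fidelity), the minimal Python sketch below scores what fraction of an answer's sentences overlap with retrieved evidence. Every name, threshold, and the metric itself are assumptions for illustration; a production framework would layer in stronger techniques such as embedding similarity, LLM-as-judge scoring, and human review.

```python
# Illustrative sketch only (not Codvo's actual framework): a toy grounding
# check of the kind an evaluation framework in this area would standardize.
# Plain token overlap stands in for stronger techniques such as embedding
# similarity, LLM-as-judge scoring, or human review.
from dataclasses import dataclass


@dataclass
class EvalResult:
    grounded_fraction: float  # share of answer sentences supported by evidence
    passed: bool              # did this sample clear the (assumed) release bar?


def grounding_score(answer: str, evidence: list[str],
                    overlap_threshold: float = 0.5,
                    pass_bar: float = 0.8) -> EvalResult:
    """Score how much of an answer is supported by the retrieved evidence."""
    evidence_tokens = {tok.lower() for doc in evidence for tok in doc.split()}
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    supported = 0
    for sentence in sentences:
        tokens = [tok.lower() for tok in sentence.split()]
        overlap = sum(tok in evidence_tokens for tok in tokens) / max(len(tokens), 1)
        if overlap >= overlap_threshold:
            supported += 1
    fraction = supported / max(len(sentences), 1)
    return EvalResult(grounded_fraction=fraction, passed=fraction >= pass_bar)


if __name__ == "__main__":
    print(grounding_score(
        answer="The invoice total is 1200 USD. Payment is due in 30 days.",
        evidence=["Invoice INV-042: total 1200 USD, payment due within 30 days."],
    ))
```

The same shape generalizes to the other dimensions listed above: a typed result, a scoring function per dimension, and an explicit pass bar that downstream release gating can consume.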
Benchmark Infrastructure
- Build benchmark packs, golden datasets, and regression suites for priority enterprise workflows.
- Define benchmark coverage and versioning standards.
- Establish processes for edge-case capture and benchmark expansion.
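For illustration only, the sketch below shows one possible shape for a golden-dataset regression gate of the kind described above. The JSONL schema, pass-rate threshold, and exact-match scoring are assumptions, standing in for whatever workflow-specific scoring a real benchmark pack would define.

```python
# Illustrative sketch only: one possible shape for a golden-dataset regression
# gate. The JSONL schema ({"input": ..., "expected": ...}), the pass-rate
# threshold, and exact-match scoring are all assumptions for this example.
import json
from pathlib import Path
from typing import Callable


def run_regression(golden_path: Path,
                   predict: Callable[[str], str],
                   min_pass_rate: float = 0.95) -> bool:
    """Replay a versioned golden dataset through the current system and gate the release."""
    cases = [json.loads(line) for line in golden_path.read_text().splitlines() if line.strip()]
    passed = sum(predict(case["input"]).strip() == case["expected"].strip() for case in cases)
    pass_rate = passed / max(len(cases), 1)
    print(f"{golden_path.name}: {passed}/{len(cases)} passed ({pass_rate:.1%})")
    return pass_rate >= min_pass_rate  # CI blocks the release when this is False
```

In CI, a thin wrapper would run this per benchmark pack and fail the build whenever the gate returns False; versioning the golden files alongside the benchmark definition keeps coverage auditable as edge cases are added.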
Runtime Reliability Systems
- Design systems/processes for:
  - Runtime drift / degradation monitoring
  - Failure mode analysis / incident diagnostics
  - Human review / escalation pathways
  - Continuous evaluation and improvement loops
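As a hedged sketch of the drift-monitoring idea above, the toy monitor below compares a rolling window of a per-request quality score against a fixed baseline. The window size, tolerance, and response to a detection are assumptions; a production system would add statistical tests and wire detections into the escalation and incident-diagnostics pathways listed above.

```python
# Illustrative sketch only: a toy drift monitor that compares a rolling window
# of a per-request quality score against a fixed baseline. Window size,
# tolerance, and what to do on detection are assumptions for this example.
from collections import deque
from statistics import mean


class DriftMonitor:
    def __init__(self, baseline_mean: float, window: int = 200, max_drop: float = 0.05):
        self.baseline = baseline_mean
        self.scores: deque[float] = deque(maxlen=window)
        self.max_drop = max_drop

    def record(self, score: float) -> bool:
        """Record one per-request quality score; return True when drift is detected."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False                 # not enough data for a verdict yet
        drop = self.baseline - mean(self.scores)
        return drop > self.max_drop      # quality fell beyond tolerance
```

A caller would feed record() from production telemetry and route detections into the human-review or incident-diagnostics path.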
Technical Leadership
- Partner closely with platform, product, and solution engineering teams.
- Serve as internal SME on AI reliability, benchmark design, and evaluation methodology.
- Help shape architecture standards for AI-native product and workflow delivery.
Team Leadership
- Build and lead a team of:
  - Evaluation Engineers
  - Benchmark / QA Engineers
  - Reliability / Observability Engineers
  - Domain Review / Feedback Ops Specialists
Required Qualifications
- 10+ years in engineering / AI / ML leadership roles.
- 5+ years building or operating production AI / ML systems.
- Proven experience designing or operating:
  - AI/LLM evaluation frameworks
  - Benchmark / regression systems
  - AI QA / testing / validation infrastructure
  - Production ML / observability / monitoring systems
  - Reliability engineering / quality engineering organizations
Technical Expertise
- LLM / multimodal evaluation methodologies
- Benchmark / golden dataset design
- Agent / tool-use / workflow evaluation
- RAG evaluation / grounding analysis
- AI observability / telemetry / tracing
- Human-in-the-loop feedback systems
- AI safety / governance / policy testing
- Release gating / CI/CD / engineering quality systems
Preferred Backgrounds
- AI Infrastructure / Evaluation Platforms
- AI Observability / MLOps Companies
- Enterprise AI Platform Teams
- Applied AI Product / Platform Organizations
- Reliability / QA Engineering Leadership in Complex Systems
Success Metrics
- Establish Codvo-wide AI evaluation/reliability standards
- Integrate evaluation frameworks into engineering lifecycle
- Launch reusable benchmark packs for target workflows
- Reduce AI production failure / exception rates across deployments
- Improve release confidence and deployment velocity for AI systems
- Increase benchmark/evaluation asset reuse across customers
Ideal Candidate Profile
- Systems/reliability engineer mindset with strong AI depth
- Product-minded builder who can create reusable engineering frameworks
- Obsessed with operational excellence and measurable quality
- Comfortable driving standards across engineering organizations
Note: Please apply via our official careers portal only, as applications sent directly to executives may not be considered.
Location & Eligibility
- Where is the job: Pune, India (on-site at the office)
- Who can apply: IN