braintrust

Eval Engineer

San Francisco, New York City | full-time | mid
Other | Engineer

Quick Summary

Technical Tools
notion, python, vercel, ab-testing

Braintrust is the AI observability platform. By connecting evals and observability in one workflow, Braintrust gives builders the visibility to understand how AI behaves in production and the tools to improve it.

Teams at Notion, Stripe, Zapier, Vercel, and Ramp use Braintrust to compare models, test prompts, and catch regressions — turning production data into better AI with every release.

About the Role

We’re hiring an Eval Engineer to design and run creative evaluations of new AI capabilities. Your job is to turn emerging AI ideas into measurable experiments and publish the results for the developer ecosystem.

When new models, agents, or frameworks appear, everyone has opinions about what works but few people actually test them. This role exists to change that.

You’ll design experiments that compare models, prompts, and agent architectures against real tasks. You’ll build the datasets, scoring logic, and evaluation harnesses. Then you’ll publish the results so builders understand what actually works.
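As a rough illustration of what "datasets, scoring logic, and evaluation harnesses" can look like in practice, here is a minimal sketch in plain Python. It does not use the Braintrust SDK; the names `Case`, `exact_match`, `run_eval`, and the two candidate functions are hypothetical stand-ins invented for this example, and the candidates are simple lookups rather than real model calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    """One dataset row: a task input and the expected answer."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scoring logic: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    candidates: Dict[str, Callable[[str], str]],
    dataset: List[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run every candidate on every case and return its mean score."""
    return {
        name: mean(scorer(task(case.input), case.expected) for case in dataset)
        for name, task in candidates.items()
    }


# Hypothetical dataset: map a country to its capital.
DATASET = [
    Case("France", "Paris"),
    Case("Japan", "Tokyo"),
    Case("Kenya", "Nairobi"),
    Case("Peru", "Lima"),
]

# Hypothetical stand-ins for model calls; a real harness would call an API here.
_KNOWN_A = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi", "Peru": "Lima"}
_KNOWN_B = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi", "Peru": "Cusco"}


def candidate_a(country: str) -> str:
    return _KNOWN_A.get(country, "")


def candidate_b(country: str) -> str:
    return _KNOWN_B.get(country, "")


CANDIDATES = {"candidate_a": candidate_a, "candidate_b": candidate_b}

if __name__ == "__main__":
    for name, score in run_eval(CANDIDATES, DATASET).items():
        print(f"{name}: {score:.2f}")
```

Keeping the dataset, the scorer, and the runner as separate pieces means each one can be swapped independently, for example replacing `exact_match` with a graded scorer without touching the comparison loop.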

This role sits at the intersection of engineering, experimentation, and technical storytelling.

  • Design and run evaluations of new AI capabilities

  • Compare frontier models, agent systems, and tool workflows

  • Turn emerging ideas into measurable benchmarks

  • Define datasets, tasks, and scoring logic for experiments

  • Design realistic workloads that reflect production environments

  • Create tests that expose failure modes and edge cases

  • Build evaluation harnesses using Braintrust

  • Run comparisons across models, prompts, and agent approaches

  • Analyze traces, outputs, and failure patterns

  • Invent novel ways to stress test AI systems

  • Design scenarios that break agents, prompts, and model reasoning

  • Build adversarial or complex datasets that reveal weaknesses

  • Write technical posts explaining evaluation methodology and results

  • Share datasets and scoring logic so experiments are reproducible

  • Help establish better evaluation patterns for the industry via courses

  • Develop reusable eval patterns for agents, RAG systems, and LLM apps

  • Create open source reference implementations developers can adopt

  • Contribute examples and guides that help teams build better evals

  • You’re an engineer who likes testing systems more than building features

  • You enjoy breaking things and understanding why they fail

  • You can design experiments that isolate meaningful differences between approaches

  • You understand how LLMs, agents, and RAG systems actually work

  • You write clearly for technical audiences

  • You ship experiments quickly and iterate often

  • You care about methodology and reproducibility

  • You’re curious, creative, and opinionated about how AI should be evaluated

Relevant Experience

  • Built or contributed to evaluation systems for LLM or agent applications

  • Designed experiments comparing models, prompts, or AI architectures

  • Written Python code to run tests across models or APIs

  • Built datasets or scoring logic for AI quality measurement

  • Investigated model failures or unexpected behaviors

  • Published technical blog posts, research notes, or engineering write-ups

  • Built prototypes quickly to test ideas

If you want to help the industry understand how to measure AI systems and design the evaluations everyone else learns from, this is the role.

What We Offer

Medical, dental, and vision insurance
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity
AI Stipend

Braintrust is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

Listing Details

Posted: March 13, 2026
First seen: May 6, 2026
Last seen: May 8, 2026
