Member of Technical Staff, Evaluation Execution
Quick Summary
Our technical team members are in our office in Berkeley 3-5 days/week. Please let us know in your application if this is a constraint.
We are a nonprofit research organization that develops scientific methods to assess AI capabilities, risks, and mitigations, with a specific focus on threats related to AI R&D automation and misalignment.
METR has consistently set precedents for catastrophic AI risk evaluations, including the first independent safety evaluations (working informally with Anthropic and OpenAI in 2022), the first loss-of-control evaluations and first agentic dangerous capability evaluations, the first evaluations using finetuning (mentioned briefly here), the first independent evaluations using internal information about training, the first review partnership for company risk analysis, the first embedded redteaming, and the first evaluations of internal deployments.
We’ve been consulted and/or favorably referenced by groups on opposite ends of various spectra, including a16z, Khosla, Gary Marcus, Obama, and Dean Ball, and are known for producing one of the most positive results on AI capabilities (the time horizon trend) and one of the most negative (our downlift study). We’re generally referenced as the canonical third-party assessor, e.g. as the obvious candidate to verify conditional pause agreements.
We believe it is robustly good for policymakers and civil society to have a clear understanding of risks from AI systems, and we are extremely excited to build a team of ambitious, excellent people to tackle one of the most important challenges of our time.
- Running models on tasks. Often this means integrating models into our agent scaffolds, running them on our infrastructure, and checking the results carefully. (METR both develops its own tasks internally and runs external evaluations.)
- Communicating results and takeaways. This includes designing useful graphs, writing up conclusions for different audiences (system cards, risk reports, regulators, X, etc.), and having great takes on what matters for risk.
- Building software to improve our evaluations. We don't just run the same evaluation over and over again. We also aim to run faster, more informative evaluations over time; this means making the right investments (with the support of our platform team).
- Project management. Live evaluations require keeping track of a bunch of threads and staying organized. With our recent risk report process, we were running many evaluations at once.
- Strong and professional communication. We run important and sensitive evaluations, so the team needs to coordinate with METR leadership, lab contacts, regulators, and others.
- As part of informing the world about risk from frontier AI systems, METR often runs and publishes evaluations of frontier models.
- Our evaluations are a central tool the world uses to understand AI progress. Our Time Horizon methodology has been included in system cards, has been called an "obsession" by the NYT, has wide reach online, and is used by governments to inform national policy.
- We’re expanding the ambition and scale of our evaluations. We have recently begun to measure model propensities and monitorability, and we are increasing the speed, reliability, and quantity of evaluations we aim to do so that we can keep the world informed.
- Time Horizon is close to saturation, so we’re currently working on Time Horizon 2.0, which we expect to be running on models over the next 6 to 18 months.
- We’re gearing up for our first large-scale publication on monitorability, which we believe will be similar to Time Horizon in helping folks understand trends over time.
- We spent the past three months working on a large, industry-wide third-party risk assessment program, which includes collecting information (and running evaluations!) for both monitorability and propensities/alignment. We expect to do much more work as part of our own risk assessment programs in the future.
In general, many ambitious impact stories for METR require us to have the capacity to run many more evaluations than we have run historically. For example, while our evaluations currently inform many key decisionmakers about AI capabilities, they are not yet consistently run with the scale, reliability, and speed necessary to play concrete, codified roles in regulatory frameworks. Unlocking this capacity is part of the near-future vision for evaluation execution.
Software engineering. You're a strong engineer with solid infra fundamentals. You can dig into unfamiliar systems, debug from logs, and identify and fix performance bottlenecks.
Speed and scrappiness. You get things done fast. You can quickly identify what the 80/20 looks like, and then do it.
High attention to detail. You read closely, can spot bugs in transcripts, and pay attention to the important fiddly bits.
Research understanding and taste. You understand research ideas and priorities, and have good intuitions for which plots are informative and which analyses are worth running to poke at the data.
Strong external communicator. You communicate well with external stakeholders, and we trust you to stay on the ball with communications with, e.g., lab contacts.
Project management. You can juggle many balls at once, keep stakeholders updated, and track and anticipate blockers.
Strong writing ability. You can be a solid contributor to METR’s writeups of evaluation results (see, e.g., our GPT-5 report).
Listing Details
- Posted: April 27, 2026
- First seen: July 3, 2026
- Last seen: July 3, 2026
