AI Engineer (Core)
About Build
Build is creating the agentic AI stack for the built world. We help institutional real estate teams automate complex development and acquisitions workflows so important projects can move from concept to completion faster, with less cost, delay, and operational drag.
Our customers include some of the largest built-world institutions: alternative asset investors, developers, infrastructure owners, energy companies, industrial operators, and public-sector partners. Their work shapes the physical world, but the workflows behind that work are still slow, fragmented, document-heavy, and dependent on expert coordination.
We believe the next generation of built-world software will not just organize work. It will help do the work. Agents will reason across documents, drawings, financial models, market data, approvals, constraints, and expert judgment. Human experts will stay in control, but they will operate with far more leverage.
We are backed by leading investors and operators, including executives from Blackstone and OpenAI, alongside top venture firms. We are building a generational company at the intersection of AI and the physical world.
About the Role
We are looking for an AI Engineer (Core) to build the infrastructure, systems, and quality loops behind Build’s agentic platform.
This is a hands-on engineering role for someone who wants to make agents reliable, observable, scalable, and safe enough for high-stakes real-world workflows. You will work close to the agent runtime, evaluation systems, retrieval layer, tool orchestration, tracing, workflow execution, and developer platform that power Build’s product.
You should be excited by the engineering problems that appear after the demo works: how agents plan and call tools, how context is assembled, how workflows resume after failure, how quality is measured, how regressions are caught, how cost and latency are controlled, and how engineers can ship agent improvements with confidence.
This is not a research-only role. It is infrastructure work for production AI systems. Your work will define the foundation that lets Build ship faster while keeping quality, trust, and reliability high.
Responsibilities
- Build the core agent platform used by product engineers to create, run, evaluate, debug, and deploy AI workflows.
- Design infrastructure for long-running agents, tool orchestration, workflow state, retries, fallbacks, human handoff, and resumability.
- Build context and retrieval systems that help agents use the right documents, structured data, prior decisions, project state, and tool outputs.
- Create eval infrastructure for agent behavior, document understanding, groundedness, workflow completion, visual reasoning, latency, cost, and regressions.
- Build observability systems for traces, prompts, model versions, tool calls, intermediate reasoning artifacts, failure modes, human overrides, and production quality metrics.
- Improve the reliability of LLM-powered systems through deterministic checks, structured outputs, validation layers, guardrails, monitoring, and failure recovery.
- Partner with product engineers to turn repeated workflow patterns into reusable primitives, SDKs, templates, and platform capabilities.
- Evaluate and integrate models, agent frameworks, retrieval techniques, multimodal capabilities, and AI infrastructure tools.
- Own performance, scalability, security, and maintainability across the AI platform.
- Help define the engineering standards for production agent systems at Build.
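To make the "deterministic checks, structured outputs, validation layers" item concrete, here is a minimal sketch of a validation layer for an LLM extraction step. The schema, field names, and groundedness check are hypothetical illustrations, not Build's actual types or methods:

```python
import json

# Toy validation layer: an LLM is asked for JSON matching a schema, and
# deterministic checks reject malformed or ungrounded output before it
# reaches downstream tools. All field names here are invented.
REQUIRED_FIELDS = {"parcel_id": str, "zoning_code": str, "confidence": float}

def validate_extraction(raw: str, source_text: str) -> dict:
    data = json.loads(raw)                      # hard failure on non-JSON
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"missing or mistyped field: {field}")
    if not (0.0 <= data["confidence"] <= 1.0):
        raise ValueError("confidence out of range")
    if data["zoning_code"] not in source_text:  # crude groundedness check
        raise ValueError("zoning_code not grounded in source document")
    return data

doc = "Parcel 042-1187 is zoned C-2 under the municipal code."
ok = validate_extraction(
    '{"parcel_id": "042-1187", "zoning_code": "C-2", "confidence": 0.93}', doc
)
```

A real system would layer richer checks (schema libraries, citation verification, policy rules) on the same pattern: fail deterministically, before model output touches anything stateful.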
Example projects you might take on:
- Build an agent runtime that supports durable execution, resumable workflows, retries, tool permissions, human approval gates, and production traceability.
- Build an eval platform where engineers can run offline tests, replay production traces, compare model and prompt changes, detect regressions, and review failure clusters.
- Build a context assembly layer that combines project documents, extracted entities, workflow state, customer configuration, user intent, and tool outputs into reliable agent inputs.
- Build a retrieval quality system for leases, zoning documents, drawings, investment memos, financial models, and market data, with ranking, citations, and freshness controls.
- Build observability dashboards for agent task success, cost, latency, model behavior, tool failure rates, human overrides, and customer-impacting regressions.
- Build reusable tool interfaces and execution policies so agents can safely query data, generate outputs, request approvals, and interact with external systems.
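As a rough illustration of what durable execution and resumability involve, a toy step runner might checkpoint each step's result so a restarted run skips completed work. Names and storage are invented for illustration; a real runtime would persist to a database and handle far more:

```python
import time

class WorkflowRun:
    """Toy durable workflow: step results are checkpointed so a crashed
    run can resume without redoing completed work. Illustrative only."""

    def __init__(self, store=None):
        self.store = store if store is not None else {}  # stand-in for a DB

    def run_step(self, name, fn, retries=3, backoff=0.01):
        if name in self.store:             # already completed: resume path
            return self.store[name]
        for attempt in range(1, retries + 1):
            try:
                result = fn()
                self.store[name] = result  # checkpoint before moving on
                return result
            except Exception:
                if attempt == retries:
                    raise
                time.sleep(backoff * attempt)

# Usage: a flaky step succeeds on retry; a "restarted" run resumes
# from the checkpoint instead of re-executing the step body.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "ok"

store = {}
WorkflowRun(store).run_step("extract", flaky)
resumed = WorkflowRun(store)               # simulate process restart
resumed.run_step("extract", flaky)         # skips: result already stored
```

The same shape generalizes to tool permissions and approval gates: each step checks state before executing, and every transition is recorded for traceability.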
In your first few months, you will improve the reliability and velocity of Build’s agent platform in ways product engineers can feel. That may mean a stronger eval harness, better traces, safer tool execution, more reliable context assembly, lower latency, lower cost, or fewer production regressions.
Over the longer term, your work will make it possible for Build to ship increasingly autonomous workflows while preserving trust, auditability, security, and operational control.
About You
- You are a strong systems engineer who wants to build infrastructure for production AI agents.
- You have deep experience with backend systems, distributed systems, data systems, workflow engines, observability, or developer platforms.
- You are fluent in Python and comfortable designing reliable APIs, services, queues, workers, storage models, and execution systems.
- You have built with LLM APIs, tool calling, structured outputs, RAG, evals, tracing, or agent frameworks.
- You care about reliability, debuggability, latency, cost, safety, and maintainability.
- You think in interfaces, abstractions, failure modes, and long-term platform leverage.
- You can separate what should be product-specific from what should become a reusable platform primitive.
- You move fast, but you care about the engineering discipline needed to make fast teams safe.
Nice to Have
- Experience with agentic frameworks, LLMs, workflow engines, vector databases, reranking, model gateways, or AI observability tools.
- Experience building eval systems, trace replay systems, regression infrastructure, prompt/model versioning, or LLM quality dashboards.
- Experience with document AI, multimodal systems, structured extraction, citation systems, or knowledge graph infrastructure.
- Experience designing permission systems, sandboxed tools, policy engines, secure execution layers, or audit trails for AI systems.
- Experience supporting product teams through internal SDKs, frameworks, platform abstractions, or developer tooling.
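As one hedged sketch of the "trace replay" and "regression infrastructure" items above, a minimal regression check might compare per-task scores between a baseline and a candidate model or prompt version. Task names, scores, and the tolerance are invented for illustration:

```python
# Toy regression check: replay recorded traces against a candidate
# version and flag tasks whose score drops beyond a tolerance.
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.05):
    regressions = []
    for task, base_score in baseline.items():
        cand_score = candidate.get(task, 0.0)
        if base_score - cand_score > tolerance:
            regressions.append((task, base_score, cand_score))
    # worst regressions first
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)

baseline = {"lease_abstraction": 0.91, "zoning_lookup": 0.84, "memo_summary": 0.88}
candidate = {"lease_abstraction": 0.92, "zoning_lookup": 0.71, "memo_summary": 0.86}
flagged = find_regressions(baseline, candidate)  # zoning_lookup dropped 0.13
```

A production version would score replayed traces with real graders and cluster failures, but the gate is the same idea: no candidate ships if a tracked task regresses past tolerance.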
Build is a high-ownership environment. We care about speed, judgment, technical depth, customer impact, and the quality of the systems we ship. Our customers operate in high-stakes environments where better software can change the pace of real-world projects, so our platform needs to be both fast-moving and trustworthy.
The people who thrive here take ownership, think clearly, act with integrity, and hold a high bar for their work. They are comfortable with ambiguity, direct feedback, ambitious goals, and close collaboration with product engineers and domain experts. They know that trust, judgment, and teamwork are what make speed sustainable.
What We Offer
Location & Eligibility