Staff Back End Engineer, Evals - Hazel AI

United States · San Francisco · Lead

About Altruist

Altruist is transforming the multi-trillion-dollar wealth management industry by building an AI platform for wealth professionals. We partner with financial advisors nationwide, empowering them to grow, optimize time and resources, and deliver superior outcomes for their clients.

We're looking for exceptional talent to help us achieve our mission of making financial advice better, more affordable, and accessible to all. If you're passionate about challenging the status quo and want to do the most important work of your life, we'd love to meet you!

But first, our values

Kindness - Kindness doesn’t just equal niceness. We listen to understand. We embrace and encourage healthy debate and diverse perspectives. We approach conflict openly, honestly, and respectfully.

Brilliance - Humility is the skill we’re most proud of, and possessing a growth mindset is always top of mind. We take ownership of everything we touch, regularly using our unique superpowers to reach a common goal as a team. We succeed and fail as one.

Grit - When challenges arise, we stay laser-focused on achieving our mission and finding a way forward, even when it’s hard. We are nimble and maintain a sense of urgency, swiftly adapting to change and overcoming obstacles.

About Hazel:

Hazel.ai is building the AI engine for wealth management that unlocks 10x growth, efficiency and value for financial advisors and their clients in a regulated industry. Since its launch last September, Hazel has organically and rapidly grown its user base.

Hazel is a part of Altruist’s broader mission to make financial advice better, more affordable, and accessible to all.

This role is hybrid, with four in-office days per week at our San Francisco FiDi location.

The opportunity:

Architect our evaluation platform from first principles – the observability, scoring, golden datasets, verification agents, and CI/CD integration that define standards of quality. You'll work shoulder-to-shoulder with backend engineers, product managers, and a growing bench of subject matter experts, including practicing CFPs, CPAs, and tax planners, to translate fiduciary-grade requirements into automated quality signals.

Your impact:

- Design and build Hazel's evals platform end-to-end – online scoring, offline benchmarks, regression suites, LLM-as-judge pipelines, and human-in-the-loop review workflows across every Hazel surface.
- Build production observability and monitoring for AI quality: hallucination rates, factual accuracy, refusal behavior, latency, cost, and domain-specific quality signals across tax planning, financial planning, investment analysis, and operational AI workflows.
- Architect data curation pipelines that turn real advisor interactions into evaluation datasets – with rigorous sampling strategies, labeling protocols, dataset versioning, and the privacy and consent controls required for regulated finance.
- Build and steward Hazel's golden datasets in close partnership with SMEs and a network of practicing advisors, CFPs, and tax professionals – translating their tacit expertise into precise, measurable eval criteria.
- Develop LLM verification agents that catch hallucinations, computational errors, and compliance violations before they ever reach an advisor or client.
- Integrate evals into our deployment pipeline so that every prompt change, model swap, harness modification, or RAG pipeline tweak runs against regression and acceptance criteria before shipping – making evals a first-class deployment gate, not a quarterly audit.
- Partner with the team building Hazel's model-agnostic orchestration harness to evaluate cross-model and cross-provider performance, surface tradeoffs, and inform routing decisions across Anthropic, OpenAI, and self-hosted models.
- Define quality SLOs for each Hazel surface and build alerting that catches regressions in production before our customers do – especially for high-stakes flows like tax and financial planning.
- Establish Hazel's eval methodology as a defensible competitive advantage – infrastructure good enough that model upgrades from frontier labs become accelerants for us, not threats.

What you bring:

- 8+ years of engineering experience, with at least 2 years focused on evaluation infrastructure, model quality, fine-tuning, or ML platform work for production systems.
- Deep familiarity with evaluation and scoring methodologies for modern AI systems – RAG evaluation, document processing, fine-tuned model assessment, agentic and tool-use system evaluation, LLM-as-judge frameworks, and human evaluation protocols.
- Experience designing and curating golden datasets – sampling strategies, inter-rater agreement, dataset versioning, and managing the long tail of edge cases.
- Comfort working across the stack – data engineering (SQL, dbt, warehouses), backend integration (APIs, async pipelines, queues), and observability tooling.
- Strong communication skills. You can translate fuzzy domain requirements from advisors and SMEs into precise, measurable, automatable eval criteria – and explain quality tradeoffs clearly to engineers, product managers, and leadership.
- A bias toward shipping. You believe great evals enable speed, not just safety, and you build tools that engineers actually want to use.

Bonus points:

- Prior experience at an applied AI company building evals, model quality, or applied research infrastructure.
- Experience evaluating multi-step agentic workflows, tool-use systems, or RAG pipelines in production.
- Familiarity with frameworks like Braintrust, Langfuse, or similar – including a clear point of view on when to use which.
- Background in regulated industries (financial services, healthcare, legal) where accuracy, auditability, and the cost of a wrong answer are unusually high.
- Experience building human-in-the-loop labeling workflows, annotation tooling, or red-teaming programs.
- Domain knowledge of wealth management, tax planning, or financial planning – or genuine excitement to learn it deeply alongside our SME bench.

San Francisco, CA salary range

$275,000–$325,000 USD

What we bring

Attracting and retaining top-tier talent is a priority. We are proud of the culture we’ve built and are cognizant of the ever-changing professional landscape. Our dynamic offering of perks and benefits is tailored to help you feel your best while doing your best.

- Stunning, amenity-filled office spaces in Culver City, CA, San Francisco, CA, and Dallas, TX. Our offices are intentionally designed for comfort, collaboration, and productivity.
- Competitive pay and equity for eligible positions.
- Premium healthcare, dental, and vision insurance plans (HMO and PPO).
- 401(k) savings plan with a 4% match and immediate vesting.
- 16 weeks of paid parental leave after one year of employment.
- Professional growth and development opportunities, including an employee mobility program and an annual L&D budget allocation for each employee.
- Company perks program (includes discounts on pet insurance, fitness, cell phone plans, travel, and more).
- Financial guidance program (includes counseling on navigating debt, tracking personal spend, saving and planning goals, home-purchasing preparedness, and more).
- One-month work-from-anywhere policy (with the exception of a few countries).

Total compensation includes a competitive benefits package, along with equity in the form of Stock Options (ISOs) for eligible roles. For salaried positions, a salary offer will be determined by a number of factors including experience, skill level, internal pay equity, geographic location, and other relevant business considerations. We review all employee pay and compensation programs regularly to ensure fair, equitable, and competitive pay. At Altruist, we are committed to providing fair, equitable, and competitive compensation by leveraging market data to inform our pay bands. Base salaries will be reviewed at regular intervals throughout the year, typically in conjunction with performance review cycles. By evaluating compensation on a regular basis, we are able to reward high performance and ensure all employees have opportunities for growth.

Don’t meet every single requirement? Studies have shown that women and people of color are less likely to apply to jobs unless they meet every single qualification. At Altruist, we are dedicated to building a diverse, inclusive, and authentic workplace, so if you’re excited about this role but your past experience doesn’t align perfectly with every qualification in the job description, we encourage you to apply anyway. You may be just the right candidate for this or other roles.