Exa · 9mo ago

Software Engineer, Infrastructure

United States · San Francisco · full-time · mid

Software Engineer · Software Engineering

Quick Summary

Overview

Exa is building a search engine from scratch to serve every AI agent. We build massive-scale infrastructure to crawl the web, train state-of-the-art embedding models to process it, and design high-performance vector databases in Rust to search over it.

Key Responsibilities

Build the Kubernetes orchestration on a $20m GPU cluster
Scale our AWS batchjob system to handle map reduce jobs over 10s of thousands of machines
Design GPU scheduling software so we max out our cluster utilization
Build observability into our…

Requirements Summary

You have experience designing and operating large-scale infrastructure: GPU clusters, large Kubernetes clusters, or cloud batchjob systems
You bring an obsessive mindset, always thinking about reliability, observability, and optimization across…

Technical Tools

AWS · Kubernetes · Rust

Exa is an applied AI lab building a search engine unlike any the world has ever seen. We build massive-scale infra to crawl the entire web, train state-of-the-art embedding models to process it, and design high-performance vector databases to retrieve over it. We now power search for Cursor, Cognition, HubSpot, and over 400,000 developers, and have raised $350m from Lightspeed, Benchmark, and a16z.

Our ultimate goal is to build perfect search over all the world's information, far beyond Google. If you want to build massive-scale ML systems that will define the way the new AI world consumes information, this is the place for you.

Our Infrastructure Team builds the underlying tooling and infrastructure that power all of Exa's systems. Basically, infra engineers build the machine that builds the machine so that we can move as fast as possible as an engineering org. That could mean building GPU cluster orchestration in Kubernetes, map-reduce batchjobs on Ray, or the best observability tooling in the world.

You’re obsessed with scale and complex distributed systems
You have an aversion to ClickOps and would always rather build automation
You max out OSS models’ GPU FLOPs utilization just for the challenge
You play with Arch or NixOS as a personal driver for fun or ideology
You’re passionate and discerning about AI tooling: AI can 10x our velocity but also 10x the strain on our systems (CI, etc.), and that challenge excites you

Responsibilities

→ Scale infrastructure to process the whole web on GPUs cost efficiently
→ Orchestrate multi-region, multi-cloud inference and training on GPUs
→ Ship a new version of our LLM gateway
→ Make the world's most advanced build cache with Nix
→ Build custom CI and code infrastructure that scales for our agent fleet
→ Automate software maintenance and improvements for the whole company

Location: This is an in-person opportunity in San Francisco.
Visas: We're happy to sponsor international candidates (e.g., STEM OPT, OPT, H1B, O1, E3). While we cannot guarantee your visa, we have historically been successful in sponsoring candidates from all over the world. If you receive an offer, our team will work hard to get you a visa.
Benefits: We offer premium healthcare benefits (medical, dental, vision), fertility benefits, 16 weeks of fully paid parental leave for all new parents, and a monthly wellness stipend to all of our employees.