fathom.video

AI Engineer - Model Performance

San Francisco (Hybrid) · Full-time · Mid-level
Machine Learning Engineer · Data

Overview

ABOUT FATHOM

We created Fathom to eliminate the needless overhead of meetings. Our AI assistant captures, summarizes, and organizes the key moments of your calls, so you and your team can stay fully present without sacrificing context or clarity.

We think you’ll be pretty excited about Fathom too if you give it a try. Sign up today (it’s free)!

ROLE OVERVIEW

We're hiring a Model Performance Engineer to own the speed, cost, and reliability of our model inference stack, and to build the fine-tuning infrastructure that makes the rest of the AI team faster.

This is not a research role. You'll be optimizing real systems serving millions of meetings — choosing between quantization trade-offs, debugging speculative decoding, or figuring out why one GPU family's tail latency explodes at high concurrency while another stays stable.
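
A minimal sketch of that kind of investigation (illustrative only, not Fathom's setup): hammer an OpenAI-compatible endpoint, which both vLLM and SGLang can expose, at increasing concurrency and watch whether p99 latency diverges from p50. The URL, model name, and payload below are assumptions.

```python
# Illustrative sketch only: measure tail latency of an assumed
# OpenAI-compatible endpoint (e.g. a local vLLM server) as concurrency grows.
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"  # assumption: local test server
PAYLOAD = {"model": "my-model", "prompt": "Summarize this meeting:", "max_tokens": 128}


async def timed_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    resp = await client.post(URL, json=PAYLOAD, timeout=120.0)
    resp.raise_for_status()
    return time.perf_counter() - start


async def run(concurrency: int, total: int) -> None:
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:

        async def worker() -> float:
            async with sem:
                return await timed_request(client)

        latencies = sorted(await asyncio.gather(*(worker() for _ in range(total))))
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99) - 1]
    print(f"concurrency={concurrency:>3}: p50={p50:.2f}s  p99={p99:.2f}s")


if __name__ == "__main__":
    for c in (1, 8, 32, 128):  # a healthy stack keeps p99 near p50 as c grows
        asyncio.run(run(concurrency=c, total=256))
```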

Requirements

  • Deep experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) — not just deploying them, but tuning them: attention backends, scheduling strategies, CUDA graph warmup, prefix caching

  • Hands-on quantization experience — you've gone beyond "apply FP8 and hope." You understand weight vs activation quantization, per-channel vs per-tensor scaling, and when dynamic quantization introduces more overhead than it saves (see the per-channel sketch after this list)

  • Production fine-tuning experience — LoRA/QLoRA SFT, familiarity with training frameworks (ms-swift, Axolotl, torchtune, or similar), understanding of data formatting, learning rate schedules, and how to diagnose training failures

  • Strong Python. You'll write serving infrastructure, benchmarking harnesses, and training pipelines — not notebooks

  • Comfort with GPU profiling and performance analysis. You should be able to look at a benchmark result and know whether the bottleneck is compute, memory bandwidth, or scheduling overhead

  • Cost modeling for GPU infrastructure — you've had to choose between GPU types and justify the tradeoff (see the toy cost model after this list)

  • Experience with multimodal models (audio/vision encoders + LLM decoders)

  • Experience with Modal, Ray Serve, or similar serverless GPU platforms

  • Understanding of audio processing (codecs, chunking, sample rates)

  • Experience building internal tooling that other engineers use — this role succeeds when the rest of the team ships faster
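
To make the per-channel vs per-tensor point in the quantization bullet concrete, a self-contained PyTorch sketch (illustrative, not from the posting): when output channels span very different magnitudes, a single per-tensor scale wastes most of the int8 range on the small channels.

```python
# Illustrative sketch: per-tensor vs per-channel int8 weight quantization.
import torch

# A weight matrix whose output channels span very different magnitudes.
w = torch.randn(4096, 4096) * torch.logspace(-3, 0, steps=4096).unsqueeze(1)

# Per-tensor: one scale for the whole matrix; small channels round to nothing.
s_tensor = w.abs().max() / 127.0
err_tensor = ((w / s_tensor).round().clamp(-127, 127) * s_tensor - w).abs().mean().item()

# Per-channel: one scale per output channel, so each row keeps its own range.
s_channel = w.abs().amax(dim=1, keepdim=True) / 127.0
err_channel = ((w / s_channel).round().clamp(-127, 127) * s_channel - w).abs().mean().item()

print(f"mean abs error: per-tensor={err_tensor:.6f}  per-channel={err_channel:.6f}")
```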

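The cost-modeling bullet, meanwhile, reduces to simple arithmetic once you trust your throughput measurements. A toy comparison, with made-up prices and throughputs:

```python
# Toy cost model: dollars per 1M generated tokens for candidate GPUs.
# Hourly prices and throughputs are made-up placeholders, not benchmarks.
gpus = {
    # name: (hourly_usd, tokens_per_second_at_target_p99)
    "gpu_a": (2.50, 3200.0),
    "gpu_b": (4.20, 6100.0),
}

for name, (hourly_usd, tok_per_s) in gpus.items():
    usd_per_million = hourly_usd / (tok_per_s * 3600) * 1_000_000
    print(f"{name}: ${usd_per_million:.3f} per 1M tokens")
```

The arithmetic is never the hard part; what matters is whether the throughput figure was measured at the latency target you actually have to hit.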

Not required

  • ML research background or publications

  • Prompt engineering expertise (we have a team for that)

  • Frontend or full-stack experience

  • A Master's or PhD (though it's fine if you have one)

What we offer
  • The opportunity to shape the foundational software services of a growing company

  • A role that balances innovation and incremental improvement

  • A dynamic and collaborative engineering team

  • Competitive compensation and benefits

  • A supportive environment that encourages innovation and personal growth

Why work with us
  • Opportunity for impact. We’re established enough to ship instead of fighting fires and early enough that your work will have a real impact.

  • Startup experience. You’ll work closely with our CEO, a 2X Founder/CEO with a background in computer science and product design.

  • We embrace being fully remote. We schedule meetings sparingly and instead rely heavily on async comms (Slack, Notion, Loom).

  • You’ll meet the entire team. We think it’s important that you get to meet everyone you’ll be working with.

  • No bullshit. Ask us anything you like. We've never understood why companies pretend they're something they're not in the hiring process: you're going to find out eventually, so we'd rather you know who we are up front and make sure this is a good fit for everyone involved.

  • Quick turnaround time. We know you have lots of options, so we move fast, usually in less than a week from start to finish.

How to apply

Include a brief write-up or demo of inference optimization or model serving work you've done. We care about the reasoning behind your decisions — why you chose a specific quantization strategy, how you diagnosed a performance regression, what tradeoffs you navigated. A GitHub repo, blog post, or even a few paragraphs in your cover letter works.

Location & Eligibility

Where is the job
San Francisco (Hybrid)
Hybrid — some on-site time required
Who can apply
Same as job location

Listing Details

Posted
May 5, 2026
First seen
May 6, 2026
Last seen
May 8, 2026

Posting Health

Days active
0
Repost count
0
Trust Level
54%
Scored at
May 6, 2026

Signal breakdown

Freshness · Source trust · Content trust · Employer trust
