Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)
About TrueFoundry
Every production AI system, whether it's powering customer support, writing code, analyzing financial data, or diagnosing medical conditions, needs the same foundational infrastructure: a way to route between models, and a way to manage tools and integrate them securely.
We're TrueFoundry, and we're building it. We're looking for a Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps) to join the team.
Companies are moving beyond simple chatbots to production agentic systems. These systems route between OpenAI, Anthropic, Google, and self-hosted models. They integrate dozens of tools via protocols like MCP. They orchestrate multi-agent workflows where agents coordinate with other agents.
The infrastructure to support this doesn't exist yet. You can't just duct-tape together a few API calls and call it production-ready.
You need a control plane that handles:
- Intelligent routing with observability, cost policies, and fallback logic
- Centralized tool and MCP server management with security and lifecycle controls
- Agent orchestration with governance and guardrails
- A unified compute layer to run self-hosted models, custom tools, and agents
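The routing piece above can be pictured with a minimal sketch. This is purely illustrative, not TrueFoundry's actual API: the provider names and the `call_model` stub are hypothetical stand-ins for real provider SDK calls, and a production gateway would layer cost policies and observability on top of the same try-then-fall-back loop.

```python
# Minimal sketch of gateway-style fallback routing (illustrative only).

class ProviderError(Exception):
    """Raised when a provider call fails (rate limit, outage, etc.)."""

def call_model(provider: str, prompt: str) -> str:
    """Hypothetical stub standing in for a real provider API call."""
    if provider == "openai":
        raise ProviderError("rate limited")  # simulate a provider failure
    return f"{provider}: response to {prompt!r}"

def route_with_fallback(prompt: str, providers: list[str]) -> str:
    """Try providers in priority order, falling back on failure."""
    errors: dict[str, str] = {}
    for provider in providers:
        try:
            return call_model(provider, prompt)
        except ProviderError as exc:
            errors[provider] = str(exc)  # record for observability
    raise RuntimeError(f"all providers failed: {errors}")

# The failing first provider is skipped and the request lands on the next one.
print(route_with_fallback("hello", ["openai", "anthropic"]))
```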
We've built two products to solve this:
AI Gateway is the control plane: five composable components (Prompts, LLM Gateway, MCP Gateway, Guardrails, Agent Gateway) that handle routing, orchestration, and governance.
AI Deploy is the compute layer: a Kubernetes-based platform that abstracts ML workloads as standard software primitives, so everything runs on unified infrastructure.
We're Series A, backed by Intel Capital and Sequoia. Companies like CVS, Mastercard, Siemens, Paytm, Synopsys, and Zscaler run production AI workloads on our platform.
We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is the place for you.
What You'll Do
- Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
- Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters using PyTorch, Kubeflow, and other orchestration tools.
- Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
- Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
- Help shape internal standards and best practices across the engineering team for high-scale ML workloads.
What We're Looking For
- 5+ years of hands-on experience building and deploying ML systems at scale.
- 5+ years of writing production-quality, high-performance code.
- Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework.
- Experience with PyTorch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
- Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
- A pragmatic mindset: you know when to optimize and when to ship.
- Bonus: familiarity with open-source LLM training/fine-tuning.
Listing Details
- Posted: May 2, 2025
- First seen: May 6, 2026
- Last seen: May 8, 2026