Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)
About TrueFoundry
Every production AI system, whether it's powering customer support, writing code, analyzing financial data, or diagnosing medical conditions, needs the same foundational infrastructure: a way to route between models, and a way to manage tools and integrate them securely.
We're TrueFoundry, and we're building it. We're looking for a Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps) to join the team.
Companies are moving beyond simple chatbots to production agentic systems. These systems route between OpenAI, Anthropic, Google, and self-hosted models. They integrate dozens of tools via protocols like MCP. They orchestrate multi-agent workflows where agents coordinate with other agents.
The infrastructure to support this doesn't exist yet. You can't just duct-tape together a few API calls and call it production-ready.
You need a control plane that handles:
- Intelligent routing with observability, cost policies, and fallback logic
- Centralized tool and MCP server management with security and lifecycle controls
- Agent orchestration with governance and guardrails
- A unified compute layer to run self-hosted models, custom tools, and agents
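The routing piece above can be pictured with a minimal sketch. This is purely illustrative, not TrueFoundry's actual API: the provider names and the `call_model` stub are hypothetical stand-ins for real provider SDK calls, and a production gateway would layer cost policies and observability on top of the same try-then-fall-back loop.

```python
# Minimal sketch of gateway-style fallback routing (illustrative only).

class ProviderError(Exception):
    """Raised when a provider call fails (rate limit, outage, etc.)."""

def call_model(provider: str, prompt: str) -> str:
    """Hypothetical stub standing in for a real provider API call."""
    if provider == "openai":
        raise ProviderError("rate limited")  # simulate a provider failure
    return f"{provider}: response to {prompt!r}"

def route_with_fallback(prompt: str, providers: list[str]) -> str:
    """Try providers in priority order, falling back on failure."""
    errors: dict[str, str] = {}
    for provider in providers:
        try:
            return call_model(provider, prompt)
        except ProviderError as exc:
            errors[provider] = str(exc)  # record for observability
    raise RuntimeError(f"all providers failed: {errors}")

# The failing first provider is skipped and the request lands on the next one.
print(route_with_fallback("hello", ["openai", "anthropic"]))
```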
We've built two products to solve this:
AI Gateway is the control plane: five composable components (Prompts, LLM Gateway, MCP Gateway, Guardrails, Agent Gateway) that handle routing, orchestration, and governance.
AI Deploy is the compute layer: a Kubernetes-based platform that abstracts ML workloads as standard software primitives, so everything runs on unified infrastructure.
We're Series A, backed by Intel Capital and Sequoia. Companies like CVS, Mastercard, Siemens, Paytm, Synopsys, and Zscaler run production AI workloads on our platform.
We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is the place for you.
What You'll Do
- Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
- Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters using PyTorch, Kubeflow, and other orchestration tools.
- Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
- Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
- Help shape internal standards and best practices across the engineering team for high-scale ML workloads.
What We're Looking For
- 5+ years of hands-on experience building and deploying ML systems at scale.
- 5+ years of writing production-quality, high-performance code.
- Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework.
- Experience with PyTorch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
- Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
- A pragmatic mindset: you know when to optimize and when to ship.
- Bonus: familiarity with open-source LLM training/fine-tuning.
Listing Details
- Posted: May 2, 2025
- First seen: May 6, 2026
- Last seen: May 8, 2026