Staff Machine Learning Infrastructure Engineer

United States · San Francisco

Atoms is building the machines that power the next era of progress.

Over the last decade, software has transformed the digital world. But the physical world, where food is made, minerals are mined, goods are moved, and industries are run, remains far less intelligent, far less efficient, and far more constrained. We’re changing that.

Atoms builds Physical AI: real-world robots for the industries that move civilization forward, starting with food, mining, and transport. Our systems are designed to understand, predict, and control the real world with precision, turning complex physical operations into something more reliable, more scalable, and more productive.

This work requires more than robotics. It requires deep integration across hardware, software, AI, operations, manufacturing, and real estate. We don’t just build machines in a lab. We deploy them into real environments, operate them, learn from them, and improve them until they work at scale.

We are roboticists, engineers, operators, and builders. We believe the next great technology companies will transform not only information but also the physical systems that shape everyday life.

If you want to work on hard problems with real-world impact, join us.

Responsibilities

We are seeking a foundational Machine Learning Infrastructure Engineer to design and build the large-scale ML training infrastructure that powers our next-generation autonomous transport models. In this role, you will build the high-performance training pipelines and validation environments that enable our world-class robotics and ML researchers to iterate rapidly. You will own the challenge of scaling distributed GPU workloads to support a high volume of concurrent training runs across an expanding vehicle fleet. The goal is a platform that can flexibly run on whatever GPU capacity is available, regardless of provider or environment, directly accelerating innovation across the platform.

→ Training Infrastructure: Design, implement, and scale repeatable machine learning infrastructure utilizing Kubernetes to support large-scale distributed GPU training of novel neural networks.
→ Distributed Computing & Orchestration: Leverage distributed compute frameworks to efficiently manage and execute a high volume of complex ML training jobs concurrently across large GPU clusters.
→ Experiment Tracking & MLOps: Integrate advanced model management and experiment tracking tools to provide researchers with deep observability into training metrics and run performance.
→ Data Engineering Pipelines: Build and optimize high-throughput data ingestion pipelines to seamlessly stream petabyte-scale multi-sensor vehicle logs into training environments.
→ Validation at Scale: Architect robust infrastructure for autonomous model validation and continuous integration testing, ensuring new vehicle policy releases are entirely regression-free.
→ Cross-Functional Collaboration: Partner closely with core robotics engineers and machine learning researchers to eliminate workflow bottlenecks and accelerate the deploy-to-vehicle lifecycle.

Requirements

8+ years of professional software engineering experience.
Strong backend systems programming skills with proficiency in Go, Python, Java, or similar (familiarity with or exposure to Rust is a plus).
Proficiency with Kubernetes for container orchestration and building cloud-agnostic environments from scratch.
Experience implementing distributed ML compute frameworks (e.g., Ray) to coordinate large pools of GPUs for heavy, multi-node workloads.
Hands-on experience building MLOps pipelines, metadata tracking architectures, and model registries using platforms like MLflow.
Prior experience managing high-throughput data pipelines using modern distributed data engines to feed data-hungry neural network architectures.

What We Offer

At Atoms, you’ll work on one of the defining challenges of our time: bringing automation into the physical world to drive real, lasting impact. We exist to uncover valuable unknown truths and turn them into progress, which means constantly pushing beyond what’s known and building what doesn’t yet exist. The work is ambitious and often challenging, but it’s grounded in a shared sense of purpose and a team committed to seeing it through together. Our work only matters if it serves others, and meaningful progress depends on the trust of the people we serve and the strength of our team. We invest in both, creating an environment where you can do your best work and grow.

✓ Medical, Dental, Vision, Disability, and Life Insurance
✓ Flexible Spending Account / Health Savings Account Options
✓ 401(k)
✓ Equity
✓ Sick Time, Unlimited Flexible Time Off, and Paid Holidays
✓ Paid Parental Leave
✓ Pre-Tax Commuter Benefit Plan
✓ Team lunch in our SoMa office every Tuesday and Thursday

This role is based in our San Francisco office. Atoms is a company driven by invention and continuous change: we are constantly reimagining our industries, building new products, and refining how we operate. We do our best work together, which is why all of our office-based teams work onsite five days a week.

The base salary range for this role is $224,000 - $280,000 per year.

Actual compensation will be determined on an individual basis and may vary depending on experience, skills, and qualifications.

Base salary is just one part of your total rewards package. You may also be eligible for equity awards and an annual performance-based bonus.
