Ifm Us · 12mo ago

USD 150000–450000/yr

Machine Learning Infrastructure Engineer

United States · Sunnyvale · Full-time · Mid

Data Science · Other · DevOps & Infrastructure · Machine Learning · Machine Learning Infrastructure Engineer · Data & AI

Technical Tools

kubernetes, python, pytorch, deep-learning, machine-learning, performance-optimization

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side-by-side with world-class researchers and engineers to:

• Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Implement distributed optimizers from mathematical specs

• Build robust config + launch systems across multi-node, multi-GPU clusters

• Own experiment tracking, metrics logging, and job monitoring for external visibility

• Improve training system reliability, maintainability, and performance

While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

• Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.

• Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.

• Launch Config & Debugging – Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.

• Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.

• Infra Engineering – Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:

• 5+ years of experience in ML systems, infra, or distributed training

• Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Strong software engineering fundamentals (Python, systems design, testing)

• Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/Gloo)

• Ability to implement algorithms across GPUs/nodes based on mathematical specs

• Experience working on an ML platform, infrastructure, and/or distributed inference optimization team

• Experience with large-scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:

• Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation

• Familiarity with performance profiling, kernel fusion, or memory optimization

• Open-source contributions or published research (MLSys, ICML, NeurIPS)

• CUDA or Triton kernel experience

• Experience with large-scale pre-training

• Experience building custom training pipelines at scale and modifying them for custom needs

• Deep familiarity with training infrastructure and performance tuning

Location & Eligibility

Where is the job: Sunnyvale, United States (on-site at the office)

Who can apply: listed under United States

Listing Details

Posted: July 18, 2025
First seen: March 26, 2026
Last seen: July 17, 2026

Posting Health

Days active: 113
Repost count: 0
Trust Level: 42%
Scored at: July 17, 2026

4 other jobs at Ifm Us

• Inference Optimization Intern – Performance Modeling · Intern | Fall

• Eval360 - Error Analysis Engineer · USD 150000–450000 · Full-time

• AI Research Internship - WM · USD 100000–140000 · Intern | Summer

• Research Scientist, Agentic Data & Benchmarking · USD 150000–450000 · Full-time

Similar Machine Learning Infrastructure Engineer jobs

• Coupang: Senior Staff Machine Learning Infrastructure Engineer – Search & Discovery

• Plus 2: Senior Machine Learning Infrastructure Engineer · $160k–$200k/yr · Full-time

• Airbnb: Senior Staff Machine Learning Engineer, Growth Platform Engineering · USD 244000-305000 · Remote

• Waabi: Senior / Staff Machine Learning Infrastructure Engineer · Full-time

• Newsbreak: Staff Machine Learning Engineer, Recommendation & AI Platform · USD 230000-300000

• Botauto: Software Engineer, Machine Learning Infrastructure
