Ifm Us · 2mo ago
USD 200,000–400,000/yr
Senior Distributed Systems Engineer
Other · Distributed Systems Engineer · Systems Engineer · Infrastructure & Cloud
Technical Tools
cpp · go · pytorch · rust · distributed-systems
About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability must be co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Reduce tail latency and improve determinism across thousands of GPUs
· Architect fault-tolerant distributed execution under real-world cluster failures
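To illustrate the expert-parallel communication pattern referenced above, here is a minimal, hedged sketch (pure Python, no GPUs; the rank and expert counts are hypothetical examples, not a description of IFM's systems) of the all-to-all dispatch plan that MoE communication layers compute before routing tokens to experts:

```python
def moe_dispatch_plan(token_experts_per_rank, experts_per_rank):
    """Given each rank's per-token expert assignments, compute the
    all-to-all send matrix: send[src][dst] = number of tokens that
    rank src must ship to rank dst (the rank hosting the chosen
    expert). These counts become the split sizes of the all-to-all."""
    world = len(token_experts_per_rank)
    send = [[0] * world for _ in range(world)]
    for src, experts in enumerate(token_experts_per_rank):
        for e in experts:
            dst = e // experts_per_rank  # contiguous expert-to-rank placement
            send[src][dst] += 1
    return send

# 2 ranks, 2 experts per rank (experts 0-1 on rank 0, experts 2-3 on rank 1).
plan = moe_dispatch_plan([[0, 2, 3], [1, 1, 2]], experts_per_rank=2)
# plan == [[1, 2], [2, 1]]: rank 0 keeps 1 token and sends 2 to rank 1;
# rank 1 sends 2 tokens to rank 0 and keeps 1.
```

The imbalance visible in such a matrix (hot experts drawing most tokens to one rank) is exactly what hierarchical and topology-aware MoE collectives try to mitigate.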
Core Technical Scope
· Communication-compute overlap and topology-aware collective optimization
· Deep debugging of NCCL, RDMA, and custom communication layers
· Hybrid expert parallel strategies in modern large-scale MoE systems
· Elastic and resilient distributed job orchestration concepts
· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Microbenchmarking and performance modeling for communication-heavy workloads
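As a concrete instance of the performance-modeling work in the last bullet, a hedged sketch of the standard alpha-beta cost model for a ring all-reduce (the latency and bandwidth numbers below are illustrative assumptions, not measurements from any real fabric):

```python
def ring_allreduce_time(n_bytes, p, alpha, beta):
    """Alpha-beta estimate for a ring all-reduce over p ranks:
    2*(p-1) steps (reduce-scatter then all-gather), each step moving
    n_bytes/p per link, paying per-step latency alpha and
    inverse-bandwidth beta (seconds per byte)."""
    steps = 2 * (p - 1)
    return steps * alpha + steps * (n_bytes / p) * beta

# Example: 1 GiB of gradients, 8 ranks, 10 us per-step latency,
# 50 GB/s per-link bandwidth (all assumed values).
t = ring_allreduce_time(1 << 30, p=8, alpha=10e-6, beta=1 / 50e9)
```

Comparing such a model against microbenchmark results is a quick way to spot when a collective is latency-bound (small messages, many ranks) versus bandwidth-bound, which drives the choice between ring, tree, and hierarchical algorithms.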
Expected Technical Depth
· Hybrid expert parallel communication for Mixture-of-Experts training
· Scaling behavior under network pressure
· Distributed orchestration for elastic, large-scale training
· Fault detection and recovery in distributed GPU workloads
· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background
· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
· Deep familiarity with NCCL and/or UCX internals
· Strong systems programming ability (C/C++, Rust, or Go)
· Strong familiarity with modern model training frameworks such as PyTorch
· Ability to troubleshoot and profile training performance issues related to communication bottlenecks
· Ability to translate research ideas into production-grade optimizations
· Experience debugging distributed hangs, desynchronization, and performance regressions
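One common first step in the hang debugging described above (a simplified sketch; production systems typically rely on NCCL watchdog timeouts or flight-recorder dumps rather than this hypothetical helper) is comparing the last completed collective sequence number reported by each rank:

```python
def find_straggler_ranks(seq_numbers, slack=0):
    """Given the last completed collective sequence number reported
    by each rank, flag ranks lagging the leader by more than `slack`
    collectives -- usually the first clue to where a hang lives."""
    lead = max(seq_numbers)
    return [r for r, s in enumerate(seq_numbers) if lead - s > slack]

# Ranks 0-3 report their last finished collective; rank 2 is stuck.
stuck = find_straggler_ranks([1042, 1042, 987, 1042])  # -> [2]
```

From there, the stuck rank's stack trace (CPU and GPU) tells you whether the hang is in the communication library, the runtime, or application logic.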
What We Mean by "Hardcore"
· You can explain why a communication pattern degrades at scale and how to fix it
· You have improved real cluster throughput via communication redesign
· You can trace a distributed hang across ranks and identify the root cause
· You are comfortable working at the boundary between hardware and runtime
Application Requirements
· Include a link to your GitHub (required)
· Provide links to relevant distributed systems, HPC, or large-scale training projects
· Include a list of publications and/or public technical reports (if applicable)
· Describe the hardest distributed debugging problem you solved
· Include measurable performance improvements you have delivered
Academic Qualifications
Master’s degree, or a Bachelor’s degree plus one year of relevant experience.
Location & Eligibility
Where is the job
Sunnyvale, United States
On-site at the office
Who can apply
US
Listed under
United States
Listing Details
- Posted: March 3, 2026
- First seen: March 26, 2026
- Last seen: May 14, 2026
Posting Health
- Days active: 48
- Repost count: 0
- Trust level: 42%
- Scored at: May 14, 2026
Signal breakdown
freshness · source trust · content trust · employer trust
Salary
USD 200,000–400,000 per year
External application · ~5 min on Ifm Us's site
Please let Ifm Us know you found this job on Jobera.