Machine Learning Engineer — Training Optimization
About the Role
We’re looking for an ML Engineer focused on training optimization to help us scale and improve large-scale model training. You’ll work at the intersection of research and production, optimizing training pipelines for speed, stability, and cost—while collaborating closely with researchers pushing model architecture and capability forward.
This is a high-impact role with real ownership: your work directly affects how fast we can iterate, how large we can scale, and how efficiently we deploy new models.
Responsibilities
- Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
- Improve distributed training strategies (data, model, and pipeline parallelism)
- Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
- Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
- Collaborate with researchers on architecture-aware training strategies
- Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
- Evaluate and integrate new training techniques (e.g., gradient checkpointing, ZeRO, FSDP, custom kernels)
- Own training performance metrics and continuously push them forward
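As a small illustration of the scheduler tuning mentioned above, here is a minimal, dependency-free sketch of a linear-warmup-plus-cosine-decay learning-rate schedule, a common choice in large-model training. All numbers (`peak_lr`, `warmup_steps`, etc.) are hypothetical defaults, not values used by this team.

```python
import math

def lr_at(step, *, peak_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup followed by cosine decay (illustrative values only)."""
    if step < warmup_steps:
        # Linear ramp from near 0 up to peak_lr over the warmup window.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice the same shape is usually expressed through a framework scheduler (e.g. a PyTorch `LambdaLR`) rather than hand-rolled, but the arithmetic is identical.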
Requirements
- Strong experience training large neural networks (LLMs or similarly large models)
- Hands-on experience with training optimization (not just model usage)
- Solid understanding of:
  - Backpropagation, optimization algorithms, and training dynamics
  - Distributed systems for ML training
- Experience with PyTorch (required)
- Comfort working close to hardware (GPU, memory, and networking constraints)
- Ability to move fluidly between research ideas and production-ready code
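To make the backpropagation fundamentals listed above concrete, here is a toy, hand-derived gradient step on a one-parameter linear model with squared loss. Everything here (the model, data point, and learning rate) is a made-up illustration, not part of the role.

```python
def sgd_step(w, b, x, y, lr=0.1):
    """One SGD step on y_hat = w*x + b with loss L = (y_hat - y)**2."""
    y_hat = w * x + b
    err = y_hat - y
    # Backprop by hand: dL/dw = 2*err*x, dL/db = 2*err.
    grad_w = 2.0 * err * x
    grad_b = 2.0 * err
    return w - lr * grad_w, b - lr * grad_b

# Repeated steps drive the prediction toward the target on one data point.
w, b = 0.0, 0.0
for _ in range(50):
    w, b = sgd_step(w, b, x=1.0, y=2.0)
```

The same chain-rule bookkeeping, scaled up by autodiff, is what PyTorch's `backward()` performs across billions of parameters.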
Nice to Have
- Experience with large-scale distributed training (multi-node, multi-GPU)
- Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
- Experience optimizing training on AMD or NVIDIA GPUs
- Contributions to open-source ML infrastructure or research codebases
- Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)
What We Offer
Location & Eligibility
Listing Details
- Posted: January 22, 2026
- First seen: May 6, 2026
- Last seen: May 8, 2026