LLM Pre-training & Distributed Engineer (AI Infrastructure)
Quick Summary
Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM. Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to orchestrate large-scale machine learning training runs and optimize distributed infrastructure. The ideal candidate has a deep understanding of GPU clusters and extensive systems engineering experience to keep training efficient and reliable.
Responsibilities
- Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
- Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors (see the activation-checkpointing sketch after this list).
- Automate checkpointing and failure recovery during month-long training runs (a resumable-checkpoint sketch follows below).
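On the memory side, one widely used lever is activation checkpointing: intermediate activations are recomputed during the backward pass instead of being stored, trading extra compute for a much smaller footprint. A minimal sketch using PyTorch's `torch.utils.checkpoint`; the `Block` module, depth, and tensor sizes are illustrative assumptions, not taken from the posting:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """A stand-in transformer block; a real model would be far larger."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)

class Model(nn.Module):
    def __init__(self, depth: int = 24, dim: int = 1024):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Don't keep this block's activations; recompute them on backward.
            x = checkpoint(block, x, use_reentrant=False)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Model().to(device)
x = torch.randn(8, 512, 1024, device=device, requires_grad=True)
model(x).sum().backward()
```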
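For the checkpointing and failure-recovery item, the core pattern is to save model and optimizer state atomically at a fixed interval and resume from the last good file on restart. A minimal single-process sketch in plain PyTorch; the path, interval, and toy model are hypothetical, and a real run at this scale would use sharded, distributed checkpoints:

```python
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoints/latest.pt"  # hypothetical location

def save_checkpoint(step, model, opt, path=CKPT_PATH):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    torch.save(
        {"step": step, "model": model.state_dict(), "opt": opt.state_dict()},
        tmp,
    )
    os.replace(tmp, path)  # atomic rename: a crash mid-save never corrupts the file

def load_checkpoint(model, opt, path=CKPT_PATH):
    if not os.path.exists(path):
        return 0  # fresh run
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    return state["step"] + 1  # resume just after the saved step

model = nn.Linear(16, 16)                    # toy stand-in for the real model
opt = torch.optim.AdamW(model.parameters())
start = load_checkpoint(model, opt)          # resumes automatically after a crash
for step in range(start, 10_000):
    loss = model(torch.randn(4, 16)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        save_checkpoint(step, model, opt)
```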
Requirements
- Deep expertise in 3D parallelism (data, tensor, pipeline); see the device-mesh sketch after this list.
- Experience managing SLURM or Kubernetes-based GPU clusters.
- Strong systems engineering background (C++, CUDA, Python).
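To make the 3D-parallelism requirement concrete: the three dimensions are orthogonal process groups laid over the same set of GPUs, one each for data-parallel gradient all-reduce, pipeline-stage communication, and tensor-sharded matmuls. A minimal sketch using PyTorch's `torch.distributed.device_mesh`, assuming an 8-GPU job launched with `torchrun`; the 2×2×2 shape and dimension names are illustrative, and frameworks like Megatron-LM or DeepSpeed manage these groups internally:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")                       # one process per GPU
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # torchrun sets LOCAL_RANK

# Factor the 8-rank world into three orthogonal communicators.
mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))

dp_group = mesh.get_group("dp")  # data-parallel replicas (gradient all-reduce)
pp_group = mesh.get_group("pp")  # pipeline stages (activation send/recv)
tp_group = mesh.get_group("tp")  # tensor-parallel shards (sharded matmuls)

print(f"rank {dist.get_rank()} -> mesh coordinate {mesh.get_coordinate()}")
dist.destroy_process_group()
```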
Listing Details
- Posted: April 24, 2026
- First seen: April 24, 2026
- Last seen: May 4, 2026
Posting Health
- Days active: 10
- Repost count: 0
- Trust level: 35%
- Scored at: May 4, 2026
Hyphenconnect is a Web3 and AI talent recruitment agency based in Hong Kong with 700+ placements globally.