High Performance Computing Software Engineer - Supercomputing
Quick Summary
About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.
IFM is building the foundational compute infrastructure that will power tomorrow’s breakthroughs in AI and computational science.
Proven experience developing and optimizing software for large-scale ML workloads (1000+ GPUs preferred). Deep understanding of Linux kernel internals and accelerator (GPU) kernel development.
Responsibilities
- Design and implement high-performance, distributed software solutions for large-scale AI/ML training.
- Optimize low-level system components including Linux kernel, GPU/accelerator kernels, and interconnects.
- Develop and tune communication libraries such as NCCL, MPI, UCX, RCCL, and RDMA-based systems.
- Partner with ML researchers and engineers to support frameworks like PyTorch, MegatronLM, and DeepSpeed in large-scale production environments.
- Contribute to our scheduling, orchestration, and job management systems, including Slurm and Kubernetes.
- Debug and resolve complex issues across the stack, from kernel to container to model.
- Work closely with hardware vendors, upstream open-source communities, and internal teams to drive performance and reliability improvements.
Requirements
- Proven experience developing and optimizing software for large-scale ML workloads (1000+ GPUs preferred).
- Deep understanding of Linux kernel internals and accelerator (GPU) kernel development.
- Proficiency with distributed communication libraries (e.g., NCCL, RCCL, MPI, UCX, SHARP, Libfabric).
- Experience with ML frameworks like PyTorch, TensorFlow, JAX, or MegatronLM.
- Strong knowledge of HPC job scheduling and orchestration tools (e.g., Slurm, Kubernetes, Pyxis).
- Excellent debugging and systems performance tuning skills.
- A collaborative mindset with a focus on shared success and technical excellence.
Listing Details
- Posted: April 3, 2026
- First seen: April 4, 2026
- Last seen: June 1, 2026
Posting Health
- Days active: 60
- Repost count: 0
- Trust Level: 42%
- Scored at: June 3, 2026