Site Reliability Engineer, AI Infrastructure (Remote) - 59419267332
Quick Summary
Location: Remote (LATAM, South Africa, or PH)
Contract: Minimum 6-month contract with the potential for an indefinite extension based on performance.
Schedule: Full-Time, Monday-Friday, PST or PH timezone.
Reports to: Head of Infrastructure / SRE
Required Experience & Skills
Deep SRE/HPC Background: 5+ years in SRE, systems engineering, or HPC operations.
SLURM Expertise: Extensive production experience with SLURM at scale (accounting/slurmdbd, prolog/epilog scripts, cgroups, GRES, topology…
We operate state-of-the-art AI Factories across Europe and the US, running large-scale NVIDIA GPU clusters (H100, H200, B200, B300) on bare metal for frontier AI workloads. We design, build, and operate the full stack: datacenter power and cooling, InfiniBand fabrics, SLURM and Kubernetes orchestration, storage, and the control plane that turns raw iron into reliable compute for our customers.
We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for US hours, drive incident response on multi-thousand GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.
Responsibilities
Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.
Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.
Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.
Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.
Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.
Observability Stack: Build and maintain observability for GPU fleets using Prometheus, Grafana, DCGM, node exporter, and custom exporters.
Capacity & Rollouts: Contribute to capacity planning, firmware rollout strategy, and cluster bring-up for new sites.
Workload Optimization: Partner with customer workload teams on NCCL tuning, job scheduling policy, QoS, and fairshare.
Operational Excellence: Lead post-mortems, write comprehensive runbooks, and improve change management processes across global regions.
On-Call Leadership: Participate in the on-call rotation for US hours and handle escalations from international sites when necessary.
Nice to Have
What We Offer
Frontier Infrastructure: You will touch clusters that train world-class models, working with the most advanced hardware available.
Engineering Culture: We maintain a flat structure with direct access to leadership and a culture built around technical craftsmanship and ownership.
Remote-First: Full remote flexibility with occasional travel for team summits and datacenter site visits.
Location & Eligibility
Listing Details
- First seen: May 6, 2026
- Last seen: May 8, 2026
Posting Health
- Days active: 0
- Repost count: 0
- Trust Level: 44%
- Scored at: May 6, 2026