Site Reliability Engineer, AI Infrastructure (Remote) - 59419267332

Remote / Mid
Engineering / DevOps Engineer
Quick Summary

Overview

Location: Remote (LATAM, South Africa or PH)
Contract: Minimum 6-month contract with the potential for an indefinite extension based on performance.
Schedule: Full-Time, Monday-Friday, PST or PH timezone.
Reports to: Head of Infrastructure / SRE

Key Responsibilities

I. Cluster Operations & Hardening

Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.

Requirements Summary

Required Experience & Skills

Deep SRE/HPC Background: 5+ years in SRE, systems engineering, or HPC operations.

SLURM Expertise: Extensive production experience with SLURM at scale (accounting/slurmdbd, prolog/epilog scripts, cgroups, GRES, topology…

Technical Tools
Ansible, Grafana, Kubernetes, Prometheus, Python, PyTorch, Terraform, Linux, networking

We operate state-of-the-art AI Factories across Europe and the US, running large-scale NVIDIA GPU clusters (H100, H200, B200, B300) on bare metal for frontier AI workloads. We design, build, and operate the full stack: datacenter power and cooling, InfiniBand fabrics, SLURM and Kubernetes orchestration, storage, and the control plane that turns raw iron into reliable compute for our customers.

We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for US hours, drive incident response on multi-thousand GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.

Responsibilities

Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.

Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.

Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.

Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.

Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.

Observability Stack: Build and maintain observability for GPU fleets using Prometheus, Grafana, DCGM, node exporter, and custom exporters.

Capacity & Rollouts: Contribute to capacity planning, firmware rollout strategy, and cluster bring-up for new sites.

Workload Optimization: Partner with customer workload teams on NCCL tuning, job scheduling policy, QoS, and fairshare.

Operational Excellence: Lead post-mortems, write comprehensive runbooks, and improve change management processes across global regions.

On-Call Leadership: Participate in the on-call rotation for US hours and handle escalations from international sites when necessary.

Nice to Have

What We Offer

Frontier Infrastructure: You will touch clusters that train world-class models, working with the most advanced hardware available.

Engineering Culture: We maintain a flat structure with direct access to leadership and a culture built around technical craftsmanship and ownership.

Remote-First: Full remote flexibility with occasional travel for team summits and datacenter site visits.

Location & Eligibility

Where is the job: Worldwide (fully remote, anywhere in the world)
Who can apply: Same as job location

Listing Details

First seen: May 6, 2026
Last seen: May 8, 2026

Posting Health

Days active: 0
Repost count: 0
Trust Level: 44%
Scored at: May 6, 2026
