Software Engineer, Infrastructure
Quick Summary
You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, including provisioning, health monitoring, error detection, and recovery. When something breaks that automation can’t fix, you drive resolution with partners.
Responsibilities
- Build and maintain a Python fleet tracking system that manages the full lifecycle of servers, including contracting and procurement, target use, pricing, availability, health, and RMAs
- Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting
- Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)
- Leverage AI to an extreme level to build tools and automate alerting and recovery
- Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
- Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
- Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
- Develop a suite of automated error detection and recovery processes
- Work with partners to solve technical issues
Requirements
- 3+ years of experience managing bare-metal and cloud-based server fleets at scale (100+ nodes)
- Strong software engineering skills in Python; you write production tooling, not scripts
- Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling
- Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init
- Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning
- Familiarity with hardware diagnostics and failure modes (GPUs, NVMe, NICs, memory)
- Experience building internal tools or dashboards for infrastructure visibility
- Excellent communication and ability to drive technical decisions across teams
- Self-starter who executes quickly, takes ownership, and constantly seeks improvement
Nice to Have
- Familiarity with network configuration and diagnostics (VLAN, VXLAN, ECMP, BGP, tcpdump)
- Experience with NVIDIA GPU infrastructure: driver management, health monitoring, DCGM, NVLink/NVSwitch diagnostics, RDMA, InfiniBand/RoCEv2
- Experience with AMD GPUs
- Experience with bare-metal and VM provisioning (PXE/iPXE, Kickstart, libvirt, QEMU/KVM)
- Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)
Location & Eligibility
- San Francisco, CA (we are open to remote in the US for Senior and Staff levels)
Listing Details
- Posted: February 23, 2026
- First seen: March 26, 2026
- Last seen: May 5, 2026
Posting Health
- Days active: 40
- Repost count: 0
- Trust Level: 31%
- Scored at: May 5, 2026
Please let Fal know you found this job on Jobera.
