Crusoe

Staff Network Engineer, Operations

United States · San Francisco · Full-time · Lead
Network Engineer · Infrastructure & Cloud

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About the Role

Crusoe Cloud is seeking a Staff Network Operations Engineer to help own production reliability across our global network infrastructure, including edge, backbone, data center fabric, and GPU cluster interconnects. This is a hands-on production ownership role focused on incident response, root cause analysis, and operational excellence initiatives that keep our hyperscale AI infrastructure running. Your work will directly affect the availability of AI workloads running across thousands of GPUs worldwide.

The ideal candidate is a seasoned network engineer with deep operational experience in large-scale environments who thrives in high-pressure situations and takes pride in keeping systems healthy. You'll contribute to defining SLIs and SLOs, improving observability tooling, building automation to reduce toil, and mentoring peers — all while serving as a key escalation point during high-severity network events.

Key Responsibilities

  • Production Reliability: Help own uptime across Crusoe's global edge, backbone, data center, and GPU cluster network, directly supporting AI workloads at scale.

  • Incident Response: Lead and contribute to end-to-end response for high-severity network events, including mitigation, stakeholder communication, and postmortem documentation.

  • Root Cause Analysis: Drive RCAs for production incidents, identify systemic issues, and author remediation plans tracked through to closure.

  • Observability Improvements: Contribute to and improve Crusoe's network monitoring stack using streaming telemetry, SNMP, NetFlow, and tools such as Kentik, Grafana, Prometheus, and ThousandEyes.

  • Operational Standards: Author and maintain runbooks, escalation playbooks, and SOPs used across the operations team.

  • Operational Automation: Write Python-based tooling to reduce toil, automate common remediation workflows, and accelerate mean time to resolution.

  • SLI/SLO Contribution: Partner with Architecture and SRE teams to define and track network reliability metrics and service level objectives backed by real-time dashboards.

  • Mentorship: Provide technical guidance to Senior engineers and contribute to a culture of operational excellence and continuous learning.

Requirements

  • 8+ years of production network engineering experience with a focus on operations, incident response, and reliability in large-scale or internet-scale environments.

  • Hands-on experience with observability and monitoring tools including streaming telemetry, SNMP, NetFlow/sFlow, Grafana, Prometheus, and ThousandEyes.

  • Experience operating RDMA/RoCE lossless fabrics for GPU or HPC workloads, including familiarity with PFC, ECN, and DCQCN tuning.

  • Expert hands-on knowledge of BGP, EVPN-VXLAN, IS-IS, OSPF, MPLS, QoS, and TCP/IP in production data center environments.

  • Proficiency with Arista (EOS) and Juniper (Junos) platforms in leaf-spine Clos architectures across multi-vendor environments.

  • Python proficiency for writing auto-remediation scripts, diagnostic tooling, and operational automation.

  • Comfort operating large device fleets across multi-region environments with on-call responsibility, including experience as an escalation point during critical events.

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent practical experience.

Nice to Have

  • Experience with NVIDIA/Mellanox networking platforms in GPU cluster environments.

  • Familiarity with Kentik or Arbor for traffic analysis and DDoS visibility.

  • Experience defining or contributing to SLIs and SLOs in partnership with SRE or product teams.

  • Exposure to operating 10K+ device fleets across hyperscale or cloud environments.

  • Background contributing to post-incident learning programs or operational excellence initiatives org-wide.

What We Offer

  • Competitive compensation and equity packages
  • Restricted Stock Units
  • Paid time off, paid holidays & leave of absence programs
  • Comprehensive health, dental & vision insurance
  • Employer contributions to HSA
  • Paid parental leave
  • Paid life insurance, short-term and long-term disability
  • Professional development & tuition reimbursement
  • Mental health & wellness support
  • Commuter benefits (parking & transit)
  • Cell phone stipend
  • 401(k) retirement plan with company match up to 4% of salary
  • Volunteer time off
  • Global travel insurance & emergency assistance
  • Daily meal allowance
  • Additional perks & programs specific to location

Location & Eligibility

Where is the job: San Francisco, United States (on-site at the office)
Who can apply: Open to applicants worldwide

Listing Details

Posted: June 5, 2026
First seen: June 6, 2026
Last seen: June 6, 2026

Posting Health

Days active: 0
Repost count: 0
Trust level: 52%
Scored at: June 6, 2026

Signal breakdown: freshness, source trust, content trust, employer trust