Senior Production Engineer
Quick Summary
Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up,
Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.
We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.
We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.
If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.
About the Role
~1 min readDesign and operate reliable managed AI services with a focus on serving and scaling LLM workloads
Build automation and reliability tooling to support distributed AI pipelines and inference services
Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments
Strong software engineering background — experience building production-grade systems beyond scripting or Bash
Demonstrated experience in distributed systems design and implementation
Hands-on work with large language models (LLMs) or AI/ML infrastructure
SRE mindset and experience (whether or not under the SRE title) including:
Defining and measuring SLIs/SLOs
Building monitoring and observability systems
Driving performance and reliability improvements
Designing fault-tolerant systems and automated testing strategies
Proficiency in at least one modern programming language (Python, Go, Java, C++)
Familiarity with Kubernetes or container orchestration platforms
Strong collaboration and communication skills
Ability to thrive in a fast-paced, mission-driven environment
Nice to Have
~1 min readExperience scaling inference or training workloads for LLMs
What We Offer
~1 min readLocation & Eligibility
Listing Details
- Posted
- June 1, 2026
- First seen
- June 2, 2026
- Last seen
- June 2, 2026
Posting Health
- Days active
- 0
- Repost count
- 0
- Trust Level
- 52%
- Scored at
- June 2, 2026
Signal breakdown
Please let crusoe know you found this job on Jobera.
3 other jobs at crusoe
View all →Explore open roles at crusoe.
Similar Production Engineer jobs
View all →Browse Similar Jobs
Stay ahead of the market
Get the latest job openings, salary trends, and hiring insights delivered to your inbox every week.
No spam. Unsubscribe at any time.