bespokelabs

DevOps / Site Reliability Engineer

Remote · Contract · Mid-level
Engineering · DevOps Engineer

Technical Tools
argocd, aws, datadog, github-actions, grafana, kubernetes, prometheus, pulumi, python, terraform, ci-cd, distributed-systems, etl, technical-writing

Bespoke Labs is an AI research and data company building the datasets, benchmarks, and evaluation infrastructure that power frontier AI models. We're backed by leading investors, trusted by top AI labs, and have research accepted at venues like ICLR 2026. Our team is small, moves fast, and has an outsized impact on how the next generation of AI is built.

We're looking for a mid-level DevOps / Site Reliability Engineer to own and scale our cloud infrastructure. You'll work closely with engineering and ML teams to keep our systems reliable, observable, and fast — directly supporting the infrastructure that powers AI data pipelines at scale.

Responsibilities

  • Own cloud infrastructure on AWS — EC2, EKS, RDS, S3, IAM, VPC

  • Manage Kubernetes clusters and container orchestration end-to-end

  • Build and maintain CI/CD pipelines using GitHub Actions or similar

  • Implement monitoring, alerting, and observability stacks (Prometheus, Grafana, or DataDog)

  • Improve reliability, performance, and security of production systems

  • Automate infrastructure with Terraform or similar IaC tools

  • Debug and resolve issues across complex, distributed systems

  • Participate in design reviews and help raise the infrastructure bar

Requirements

  • 3–5 years in DevOps, SRE, or infrastructure engineering

  • Strong AWS experience — EKS, EC2, RDS, S3, IAM

  • Kubernetes — deployment, scaling, troubleshooting in production

  • CI/CD pipelines — GitHub Actions, ArgoCD, or similar

  • Infrastructure as Code — Terraform, Pulumi, or CDK

  • Python or Go scripting

  • Experience working in production environments with real users

  • Comfort with ambiguity and ability to operate autonomously

Nice to Have

  • Experience supporting ML training workloads or GPU clusters

  • Familiarity with distributed computing or large-scale data pipelines

  • Prior work at an AI, ML, or data company

  • Open-source contributions or published technical writing

What We Offer

  • Competitive compensation and meaningful equity

  • Direct impact on frontier AI model training and evaluation infrastructure

  • Flexible, remote-friendly environment with low bureaucracy

  • A small, high-caliber team with deep AI research expertise

  • Health, wellness, and learning & development benefits

Location & Eligibility

Where is the job: Worldwide (fully remote, anywhere in the world)
Who can apply: Same as job location

Listing Details

Posted: May 5, 2026
First seen: May 6, 2026
Last seen: May 8, 2026

Posting Health

Days active: 0
Repost count: 0
Trust level: 59%
Scored at: May 6, 2026

Signal breakdown: freshness, source trust, content trust, employer trust