Site Reliability Engineer, Platform

Bay Area | Full-time | Mid-level

Engineering | DevOps Engineer

Quick Summary

Overview

About Arena Intelligence

Arena Intelligence is the open platform for evaluating how AI models perform in the real world. Created by researchers from UC Berkeley’s SkyLab, our mission is to measure and advance the frontier of AI for real-world use.

Technical Tools

aws, datadog, discord, github-actions, go, grafana, javascript, nextjs, postgresql, prometheus, pulumi, python, terraform, typescript, vercel, ci-cd, oauth

Millions of people use Arena Intelligence each month to explore how frontier systems perform — and we use our community’s feedback to build transparent, rigorous, and human-centered model evaluations. Leading enterprises and AI labs rely on our evaluations to understand real-world reliability, alignment, and impact. Our leaderboards are the gold standard for AI performance — trusted by leaders across the AI community and shaping the global conversation on model reliability and progress.

We’re a team of researchers, engineers, academics, and builders from places like UC Berkeley, Google, Stanford, DeepMind, and Discord. We seek truth, move fast, and value craftsmanship, curiosity, and impact over hierarchy. We’re building a company where thoughtful, curious people from all backgrounds can do their best work. Everyone on our team is a deep expert in their field — our office radiates excellence, energy, and focus.

About the Role

Arena Intelligence is seeking a Site Reliability Engineer to own the reliability, performance, and operational security of the platform that millions of people depend on to evaluate frontier AI. This is the first dedicated SRE hire on the team — you'll build observability, incident response, and infrastructure hardening practices from scratch while also owning the CI/CD and developer tooling that keeps our engineering team moving fast.

Our stack runs on Vercel (Next.js, Hono API on Nitro), Supabase (Postgres, GoTrue auth), Cloudflare (Workers, R2, bot management), and AWS (CloudFront, Lambda). You'll work across the full request path — from edge-layer DDoS mitigation to auth hardening to production monitoring — partnering closely with security and product engineering to keep the platform fast, reliable, and resilient under adversarial traffic conditions.

Responsibilities

Harden auth infrastructure against volumetric attacks: edge-layer rate limiting in front of Supabase GoTrue, connection pool tuning, token caching, and origin shielding so DDoS traffic is filtered before it reaches the database
Extend CloudFront WAF rules and Cloudflare Worker bot management to cover auth endpoints and close gaps in application-layer rate limiting
Define and implement SLOs/SLIs across the full request path — CDN edge through serverless functions to Supabase
Build monitoring, alerting, and dashboards on top of existing Datadog and PostHog instrumentation that surface degradations before users notice them
Collaborate with security engineering to ensure clean handoff between edge-layer defenses and application-layer anti-abuse systems
Own and improve CI/CD pipelines (GitHub Actions, Turborepo) and expand infrastructure-as-code (Terraform) across cloud environments
Proactively load-test and stress-test infrastructure, model capacity limits, and drive cost optimization across our multi-cloud footprint
Enhance developer workflows to make building, testing, and deploying faster and more reliable
Mentor engineers across the company on building reliable, performant, and observable systems

Requirements

6+ years of experience in SRE, platform engineering, or infrastructure engineering, including operating production systems at scale (millions of users / billions of requests)
Direct experience mitigating DDoS attacks and configuring edge security — WAF rules, CDN architecture, rate limiting, and traffic analysis
Hands-on experience building observability systems (Datadog, Grafana, Prometheus, or similar) and running incident response processes
Strong understanding of auth infrastructure under adversarial load — connection pooling, token caching, and rate limiting on login/signup endpoints
Experience with serverless architectures and managed platforms — you know how to make them reliable and observable at scale
Experience with infrastructure-as-code (Terraform, Pulumi) and CI/CD pipeline design
Track record of collaborating with security and product engineering to deliver both foundational systems and user-facing reliability improvements

Nice to Have

Experience with Vercel, Supabase (GoTrue, Supavisor), Cloudflare Workers, or CloudFront specifically.
Experience with Node.js, TypeScript, Python, or Go in production backend environments.
Background in platforms with voting, reputation, or community-driven systems.
Experience being the first or early infrastructure hire at a startup.
Experience hardening auth systems under load (OAuth, JWT, PKCE flows, connection pooling).

What We Offer

✓ We offer competitive compensation and equity aligned to the markets where our team members are based. The base salary range will depend on the candidate’s permanent work location.

✓ Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs.

✓ The opportunity to work on cutting-edge AI with a small, mission-driven team

✓ A culture that values transparency, trust, and community impact