onebrief
$180K – $220K/yr • Offers Equity

Senior Site Reliability Engineer (Arlington, VA)

Northern Virginia (DC Metro) • Full-time • Senior
Engineering • DevOps Engineer

Quick Summary

Key Responsibilities

Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana).

Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. By transforming how staffs work, Onebrief makes the staff as a whole superhuman: faster, smarter, and more efficient.

We take ownership, seek excellence, and play to win with the seriousness and camaraderie of an Olympic team. Onebrief operates as an all-remote company, though many of our employees work alongside our customers at military commands around the world.

Onebrief was founded in 2019 by a group of experienced planners. Today, its team spans veterans from all forces and global organizations, and technologists from leading-edge software companies. We’ve raised $320M+ from top-tier investors, including Battery Ventures, General Catalyst, Sapphire Ventures, Insight Partners, and Human Capital, and Onebrief is now valued at $2.15B. With this continued growth, Onebrief is able to make an impact where it matters most.

This role requires regularly working on-site at customer locations in Arlington, VA.

If you are not currently within commuting distance, you must be willing to relocate (note that Onebrief will provide relocation assistance).

About the Role

We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You’ll work closely with fellow SREs, security, and customer success.

You will be the first line of support for our mission-critical deployments and responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premises DoD environments and AWS cloud environments. Your lessons from the field will shape how our team works, from policy to implementation.

In addition to working at customer sites, you will contribute directly to solutions that increase the stability, performance, and security of our deployments and improve the overall experience of deploying and managing Onebrief on premises.

You care deeply about reliability and treat it as a core feature of any application or platform, with a bias toward “reliability over novelty.” You think about infrastructure and operability as products to be automated, well-documented, and continuously improved, and you aim to leave systems easier to operate than you found them.

You are equally comfortable leading a post-incident review or diving into a kubectl shell to triage a complex production issue. You don't just fix problems; you translate constraints and failure modes into clear, automated guardrails and scalable, resilient architecture. For you, robust monitoring, actionable alerting, and insightful runbooks are core parts of the engineering process, not afterthoughts.

You mentor others, fostering a culture of blameless postmortems and proactive reliability. You collaborate naturally with application and platform teams, helping them move quickly but safely by building the tools, processes, and observability that make "fast recovery" a reality.

Responsibilities

You'll own the reliability, scalability, and security of the production application and/or platform, drawing on the following skills:

  • Infrastructure as Code: Terraform (or CloudFormation), Ansible.

  • Containers and orchestration: Kubernetes design, deployment, and operations.

  • CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).

  • Scripting: proficiency with at least one of Python, Go, or Bash.

  • Cloud: familiarity with AWS or AWS GovCloud.

  • Observability: Grafana stack, ELK stack, or Datadog.

  • Networking fundamentals: core protocols and secure configurations.

Nice to Have

  • Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503).

  • GitOps practices and toolchains.

  • Security‑minded design for sensitive environments.

  • Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems.

  • Familiarity with on‑prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.).

  • Service mesh exposure (Istio, Linkerd).

  • Relevant certifications (e.g., AWS DevOps Engineer, CKA/CKAD).

  • Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain a valid credential within 3 months of employment.


Notice to Third Party Recruitment Agencies

Please note that Onebrief does not accept unsolicited resumes from recruiters or employment agencies. In the absence of an executed Recruitment Services Agreement, there will be no obligation to pay any referral compensation or recruiter fee. In the event a recruiter or agency submits a resume or candidate without an agreement, Onebrief explicitly reserves the right to pursue and hire those candidate(s) without any financial obligation to the recruiter or agency. Any unsolicited resumes, including those submitted to hiring managers, shall be deemed the property of Onebrief.

Location & Eligibility

Where is the job: Northern Virginia (DC Metro), on-site at the office
Who can apply: Same as job location

Listing Details

Posted: April 8, 2026
First seen: May 19, 2026
Last seen: May 19, 2026
