Senior Site Reliability Engineer
Quick Summary
Founded in 2017, Obsidian Security was created to close a critical gap: securing the SaaS applications where modern business happens—platforms like Microsoft 365, Salesforce, and hundreds more.
The DevOps/SRE team at Obsidian ensures that engineering excellence translates into stable, scalable, and high-performing production systems. We work closely with Engineering, Quality Engineering, and Customer Support to deliver end-to-end services that bring code to life and maintain our world-class SaaS security platform.
As part of our Sydney team, you will also play a foundational role in building Sherlock — our AI-powered SRE agent — owning the infrastructure that enables autonomous incident detection, root cause analysis, and remediation at scale.
Responsibilities
- Support and maintain the service quality of our customer-facing SaaS security platform
- Address complex challenges around scalability, reliability, observability, and cost efficiency
- Collaborate with Engineering teams to maintain and enhance Helm charts, application deployment, monitoring, and CI/CD pipelines
- Embed with the engineering team to build a deep understanding of the application
- Define service verification strategies and implement them as part of the CI/CD process to meet SLAs
- Improve developer experience by optimizing CI/CD workflows and performance
- Participate in the on-call rotation, providing 24/7 support in coordination with our global SRE team
- Monitor, debug, and optimize production infrastructure and services on AWS/GCP
- Own and evolve the observability stack: design and maintain Prometheus/Mimir metrics pipelines, Grafana dashboards, Loki log aggregation, and distributed tracing (e.g. Tempo, Jaeger, or OpenTelemetry)
- Define and instrument SLIs/SLOs across services; build alerting strategies that reduce noise and surface actionable signals
- Own the Kubernetes infrastructure for Sherlock: five independently-scaled worker pools, each tuned for its agent’s compute profile with HPA autoscaling
- Design and maintain the Cloud SQL schema, migration pipeline, task queue (SKIP LOCKED), and pgvector IVFFlat index for 1,000+ root cause analysis (RCA) entries
- Build Grafana dashboards covering queue depth, worker latency, agent error rates, accuracy trends, and P50/P95 speed
- Own and maintain the benchmark CI gate in GitLab that blocks any prompt-version merge that regresses accuracy by more than 5% or speed by more than 15%
- Deliver capacity planning and cost dashboards for Sherlock’s GKE node pools
- By month 3, serve as the primary on-call engineer for all Sherlock infrastructure
Qualifications
- 4+ years of experience in a DevOps or SRE role supporting SaaS services on GCP and/or AWS
- Bachelor’s degree in Computer Science or related field
- Production Kubernetes experience: authored and owned Deployments, HPAs, and resource limits, not just applied YAML
- Strong proficiency in Kubernetes, microservices architecture, Helm, GitLab CI/CD, and ArgoCD
- Deep hands-on experience with the Grafana observability stack: Prometheus/Mimir (metrics), Loki (logs), and distributed tracing (Tempo, Jaeger, or OpenTelemetry)
- Ability to design SLI/SLO frameworks, build alerting rules, and reduce alert fatigue across complex microservices
- PostgreSQL fluency: schema design, indexing, migrations, and query optimisation
- Async / queue-based architecture experience: debugged stuck queues, consumer lag, and duplicate processing
- Programming proficiency in Python or Go
- Strong ownership mindset and comfort with production on-call responsibility
Preferred Qualifications
- GCP expertise: Cloud SQL, GKE, IAM, Pub/Sub
- pgvector or other vector database experience
- CI/CD pipeline ownership (GitLab CI or GitHub Actions)
- Familiarity with LLM APIs (Anthropic, Bedrock, or Vertex)
- Understanding of AI agent design patterns and frameworks
- Experience with Kafka, Elasticsearch, ScyllaDB, Databricks, Dagster, Sentry, or Kong
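To make the benchmark CI gate described under Responsibilities more concrete, here is a minimal sketch of the kind of check such a gate might run. Only the two thresholds (5% accuracy, 15% speed) come from the posting. The `BenchmarkResult` type, the `gate_failures` function, the use of P95 latency as the speed metric, and the choice to measure regressions relative to the baseline are all assumptions made for illustration, not a description of Obsidian's actual pipeline.

```python
from dataclasses import dataclass

# Thresholds from the posting; how they are measured is an assumption.
MAX_ACCURACY_DROP = 0.05      # block if accuracy regresses by more than 5%
MAX_SPEED_REGRESSION = 0.15   # block if speed regresses by more than 15%


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical summary of one benchmark run for a prompt version."""
    accuracy: float        # fraction of benchmark cases judged correct, 0..1
    p95_latency_s: float   # assumed speed metric: P95 latency in seconds


def gate_failures(baseline: BenchmarkResult, candidate: BenchmarkResult) -> list[str]:
    """Return the reasons a candidate should be blocked (empty list = pass).

    Regressions are computed relative to the baseline, which is an assumption;
    the posting does not say whether the limits are relative or absolute.
    """
    failures = []

    accuracy_drop = (baseline.accuracy - candidate.accuracy) / baseline.accuracy
    if accuracy_drop > MAX_ACCURACY_DROP:
        failures.append(f"accuracy regressed by {accuracy_drop:.1%}")

    slowdown = (candidate.p95_latency_s - baseline.p95_latency_s) / baseline.p95_latency_s
    if slowdown > MAX_SPEED_REGRESSION:
        failures.append(f"P95 latency regressed by {slowdown:.1%}")

    return failures


if __name__ == "__main__":
    baseline = BenchmarkResult(accuracy=0.90, p95_latency_s=10.0)
    candidate = BenchmarkResult(accuracy=0.80, p95_latency_s=12.0)
    problems = gate_failures(baseline, candidate)
    for problem in problems:
        print(f"BLOCKED: {problem}")
    # A CI job would typically exit non-zero here to fail the merge request.
    raise SystemExit(1 if problems else 0)
```

In a GitLab pipeline, a script like this would usually run as a job on merge requests that touch prompt versions, with the non-zero exit code failing the job and blocking the merge.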
Listing Details
- Posted: June 1, 2026