Senior Site Reliability Engineer
Quick Summary
Saviynt's AI-powered identity platform manages and governs human and non-human access to all of an organization's applications, data, and business processes.
We’re a fast-moving AI Security Company building AI-native infrastructure and applications powered by LLMs and autonomous agents. Our stack is deeply integrated with AWS, Kubernetes, and OpenAI-based systems, and we’re rethinking reliability in a world where software can reason, adapt, and self-heal.
We’re hiring a Senior Site Reliability Engineer (SRE) to own reliability across our cloud-native and AI-driven platform. You’ll work at the intersection of distributed systems, Kubernetes operations, and LLM-powered automation, building systems that don’t just scale but think and fix themselves.
Requirements
- 5+ years in SRE / DevOps / Platform Engineering.
- Strong hands-on experience with:
  - AWS infrastructure at scale
  - Kubernetes (production-grade clusters)
- Proven ability to debug complex distributed systems under pressure.
- Strong coding skills (Python or Go); you build internal platforms and tools.
- Experience implementing monitoring, alerting, and incident management systems.
Nice to Have
- Experience working with LLM APIs such as the OpenAI API.
- Familiarity with agent frameworks like:
  - LangChain
  - AutoGen
- Built or experimented with:
  - AI agents for DevOps / SRE workflows
  - Retrieval-Augmented Generation (RAG) systems
  - Vector databases (Pinecone, Weaviate, etc.)
- Exposure to AIOps or intelligent automation systems.
Responsibilities
- Own uptime, reliability, and performance of services running on AWS + Kubernetes (EKS).
- Design and implement self-healing infrastructure using automation and AI agents.
- Build LLM-powered operational tooling using APIs such as the OpenAI API for:
  - Intelligent alert triage
  - Incident summarization
  - Root cause analysis
  - Runbook automation
- Manage and scale Kubernetes workloads:
  - Deployments, autoscaling, resource optimization
  - Cluster reliability and cost efficiency
- Build and evolve observability systems:
  - Metrics (Prometheus), dashboards (Grafana)
  - Logs (ELK / OpenSearch)
  - Tracing (OpenTelemetry)
- Define and enforce SLOs, SLAs, and error budgets tied to business metrics.
- Automate infrastructure using Terraform and CI/CD pipelines.
- Lead incident response, postmortems, and continuous reliability improvements.
- Introduce chaos engineering practices to proactively test system resilience.
Listing Details
- Posted: September 25, 2025
- First seen: March 26, 2026
- Last seen: April 28, 2026

Saviynt is a leading provider of cloud-native identity and governance platform solutions, empowering enterprises to secure their digital transformation, safeguard critical assets, and meet regulatory compliance.