Okta · 1mo ago

Principal Site Reliability Engineer

India · Bangalore · Lead

Engineering · DevOps Engineer


Okta is The World’s Identity Company. We free everyone to safely use any technology—anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at the heart of business security and growth.

At Okta, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box; we’re looking for lifelong learners and people who can make us better with their unique experiences.

Join our team! We’re building a world where Identity belongs to you.

We are seeking a Principal Site Reliability Engineer to serve as a technical leader for reliability engineering within Okta's Emerging Products Group (EPG).

This role extends beyond operating production systems. You will define technical strategy, influence platform architecture, establish reliability standards, and lead transformational initiatives that improve scalability, resilience, security, and operational excellence for one of Okta's fastest-growing product areas.

Initially, you will partner closely with the Spera / Identity Security Posture Management (ISPM) engineering organization to establish reliability strategy, operational excellence, and platform maturity. Over time, you will help drive broader reliability initiatives across EPG and contribute to the evolution of reliability engineering practices across multiple products including Workflows, IGA, PAM, and ISPM.

You will work closely with engineering leadership, product leadership, architects, and Staff engineers to shape the future of Okta's cloud infrastructure and reliability practices.

The ideal candidate combines deep technical expertise with strong organizational influence and has a proven track record of leading large-scale engineering initiatives that drive measurable business outcomes.

Responsibilities

Define and drive the reliability strategy for critical product and platform services.
Establish standards for availability, resilience, observability, incident management, and operational readiness.
Lead architecture reviews for critical services and platform initiatives.
Partner with engineering leaders to ensure reliability objectives align with business priorities and customer expectations.
Create frameworks, standards, and operational guardrails that enable engineering teams to operate safely at scale.
Guide service architecture toward simplicity, scalability, resilience, and operational excellence.
Drive major initiatives that improve platform maturity and long-term sustainability.

Own reliability architecture and operational excellence for the Spera / ISPM product area.
Collaborate closely with engineering leadership to establish reliability objectives and technical roadmaps.
Lead large-scale scalability, resiliency, and performance initiatives.
Partner with platform and product engineering teams to build self-service operational capabilities that improve developer productivity while strengthening reliability and security.
Influence technical direction through data-driven recommendations, engineering expertise, and collaborative leadership.
Support highly available, large-scale cloud environments as part of an on-call rotation.

Design, build, and operate large-scale cloud infrastructure and production services.
Develop software, automation, and infrastructure using Go, Python, Terraform, and related technologies.
Eliminate operational toil through automation, tooling, and platform engineering.
Improve deployment safety, operational workflows, and platform consistency through GitOps and Infrastructure-as-Code practices.
Collaborate on modernizing existing workloads and aligning them with evolving platform capabilities.
Lead complex engineering initiatives from conception through production rollout and long-term operational ownership.

Mentor Staff and Senior engineers across multiple teams and organizations.
Lead technical reviews, design reviews, and operational readiness assessments.
Build engineering consensus across teams with differing priorities and objectives.
Help develop the next generation of technical leaders within Okta.
Drive adoption of reliability engineering best practices across EPG.
Share patterns, tooling, and operational practices across Workflows, Inbox, PAM, and ISPM teams.
Influence technical direction through expertise, collaboration, and execution rather than organizational authority.

Lead the exploration and adoption of AI-assisted reliability engineering practices across EPG.
Design and champion agentic systems that accelerate troubleshooting, incident response, root-cause analysis, and operational decision-making.
Evaluate emerging AI technologies and identify practical opportunities to improve reliability engineering workflows.
Establish best practices for safe, effective, and measurable use of AI within production operations.
Drive initiatives that reduce operational toil and improve engineering productivity through intelligent automation.

Infrastructure/Orchestration: Kubernetes (EKS/GKE), Terraform, Helm, Git, ArgoCD, GitOps
Programming: Golang, Python
Observability: Datadog, Splunk
Data Stores: PostgreSQL, Redis, OpenSearch

Extensive experience designing and operating large-scale production systems in AWS and/or GCP.
Deep expertise with Kubernetes in production environments.
Experience designing reliability strategies for Kubernetes-based platforms.
Strong expertise troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle challenges.
Extensive experience with Infrastructure as Code technologies such as Terraform and Helm.
Strong software engineering skills in Golang and/or Python.
Experience building internal platforms, developer tooling, and operational automation.
Deep understanding of distributed systems architecture and cloud-native application design.
Strong understanding of cloud networking fundamentals including DNS, service discovery, ingress, load balancing, TLS, traffic management, and multi-region architectures.
Experience operating and troubleshooting distributed data platforms such as PostgreSQL, Redis, OpenSearch, MySQL, Cassandra, or similar technologies.
Experience establishing observability standards, monitoring strategies, and operational best practices across engineering organizations.
Experience with or strong interest in AI-assisted engineering and operational automation.

Strong expertise operating customer-facing production systems at scale.
Deep understanding of reliability engineering principles including SLIs, SLOs, error budgets, capacity planning, and resilience engineering.
Experience leading major incident response efforts and driving long-term operational improvements.
Strong understanding of CI/CD, GitOps, deployment strategies, and automation-first operational practices.
Proven success driving large-scale reliability transformations and architectural modernization efforts.
Ability to balance reliability, scalability, security, customer experience, and engineering velocity.

Strong understanding of cloud security fundamentals, IAM, secrets management, and secure infrastructure design.
Experience operating systems within security-sensitive or regulated environments is a plus.
Familiarity with operational controls, compliance requirements, and security best practices in cloud-native environments.

Demonstrated success leading complex technical initiatives across multiple teams and organizations.
Demonstrated ability to drive technical strategy and influence engineering outcomes across multiple organizations and leadership teams.
Proven ability to influence technical direction without direct organizational authority.
Experience working effectively within globally distributed engineering organizations spanning multiple time zones and cultures.
Strong collaboration, communication, and stakeholder management skills.
Experience mentoring Staff engineers and helping develop future technical leaders.
Ability to translate business objectives into technical strategy and measurable reliability outcomes.
Experience working closely with engineering leadership, architects, and product leaders to drive organizational change.

Requirements

Experience operating SaaS platforms serving millions of users.
Experience supporting globally distributed production environments.
Experience leading platform engineering or reliability transformation initiatives.
Experience implementing AI-assisted operational tooling, agentic workflows, or intelligent automation platforms.
Prior experience as a Staff, Principal, or equivalent senior technical leader in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or Cloud Operations.

#LI-Hybrid
#P25309_3469886

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.

Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and conviction records, consistent with applicable laws.

If reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding, please use this Form to request an accommodation.

Notice for New York City Applicants & Employees: Okta may use Automated Employment Decision Tools (AEDT), as defined by New York City Local Law 144, that use artificial intelligence, machine learning, or other automated processes to assist in our recruitment and hiring process. In accordance with NYC Local Law 144, if you are an applicant or employee residing in New York City, please click here to view our full NYC AEDT Notice.