ukg

Manager of Site Reliability Engineering

mid
Other
Site Reliability Engineering

Quick Summary


Technical Tools
gcp, github-actions, grafana, java, jenkins, jira, terraform, agile, ci-cd, linux, networking, people-management, saas
- Be a technology leader by driving roadmap execution, running current projects, and planning new ones
- Help drive change across the company, working towards a common methodology based on Site Reliability Engineering and solid systems engineering practices
- Lead the team in driving further adoption of Site Reliability practices such as chaos engineering, SLOs, error budgets, release safety, load testing, and disaster recovery strategies
- Build teams through hiring and people growth while balancing your ownership workload through delegation; define and review individual and team goals (OKRs)
- Guide and encourage the personal and technical development, engagement, and growth of your direct reports
- Own application performance, scalability, and availability in production environments
- Diagnose and resolve systemic reliability issues across application, OS, and infrastructure layers
- Lead major incident response and act as the escalation point for platform-related reliability issues
- Ensure post-incident reviews result in measurable improvements to platform stability and application performance
- Partner with application teams to influence design decisions that impact runtime reliability
- Collaborate across the organization to ensure successful delivery with the wider functions, including but not limited to Security, Architecture, Operations, and Product Managers
- Engineering degree or a degree in a related technical discipline, or equivalent work experience
- Knowledge of public cloud based applications and containerization technologies, specifically Google Cloud Platform (compute and storage), GKE, and basic networking
- Demonstrated understanding of best practices in metric generation and collection, log aggregation pipelines, time-series databases, and distributed tracing
- Experience transforming teams and successfully leading them through change
- 5+ years of people management experience leading a technical team
- Deep understanding of Infrastructure as Code (Terraform) and CI/CD automation (Jenkins, GitHub Actions)
- Production experience operating Java applications on a multi-regional Google Cloud deployment on Rocky Linux 9 virtual machines and containers
- Experience with observability through Grafana Cloud (custom metrics, traces, synthetics) and OpenTelemetry agents using the four golden signals
- Experience leading incident response and postmortems using PagerDuty and Grafana Cloud
- Cloud cost, capacity planning, and disaster recovery planning for cloud-based SaaS platforms
- Experience working with Agile development methodology and managing dashboards and workflows in JIRA
- Experience working in a GCP environment
- Experience hiring for SRE, DevOps, or similar engineering teams

Location & Eligibility

Where is the job
Location terms not specified

Listing Details

Posted
May 7, 2026
First seen
May 7, 2026
Last seen
May 7, 2026

Posting Health

Days active
0
Repost count
0
Trust Level
51%
Scored at
May 7, 2026

Signal breakdown

freshness, source trust, content trust, employer trust