Manager of Site Reliability Engineering

mid






Technical Tools

gcp, github-actions, grafana, java, jenkins, jira, terraform, agile, ci-cd, linux, networking, people-management, saas

- Be a technology leader: drive roadmap execution and run current projects while planning new ones
- Help drive change across the company, working towards a common methodology based on Site Reliability Engineering and solid systems engineering practices
- Lead the team in driving further adoption of Site Reliability practices such as chaos engineering, SLOs, error budgets, release safety, load testing, and disaster recovery strategies
- Build teams through hiring and people growth, balance your own workload through delegation, and define and review individual and team goals (OKRs)
- Guide and encourage the personal and technical development, engagement, and growth of your direct reports
- Own application performance, scalability, and availability in production environments
- Diagnose and resolve systemic reliability issues across application, OS, and infrastructure layers
- Lead major incident response and act as the escalation point for platform-related reliability issues
- Ensure post-incident reviews result in measurable improvements to platform stability and application performance
- Partner with application teams to influence design decisions that impact runtime reliability
- Collaborate across the organization to ensure successful delivery with the wider functions, including but not limited to Security, Architecture, Operations, and Product Managers

- Engineering degree, or a related technical discipline, or equivalent work experience
- Knowledge of public cloud-based applications and containerization technologies, specifically Google Cloud Platform (compute and storage), GKE, and basic networking
- Demonstrated understanding of best practices in metric generation and collection, log aggregation pipelines, time-series databases, and distributed tracing
- Experience transforming teams and successfully leading them through change
- 5+ years of people management experience leading a technical team
- Deep understanding of Infrastructure as Code (Terraform) and CI/CD automation (Jenkins, GitHub Actions)
- Production experience operating Java applications on a multi-regional Google Cloud deployment on Rocky Linux 9 virtual machines and containers
- Experience with observability through Grafana Cloud (custom metrics, traces, synthetics) and OpenTelemetry agents using the four golden signals
- Experience leading incident response and postmortems using PagerDuty and Grafana Cloud
- Experience with cloud cost, capacity planning, and disaster recovery planning for cloud-based SaaS platforms
- Experience working with Agile development methodology and managing dashboards and workflows in Jira
- Experience working in a GCP environment
- Experience hiring for SRE, DevOps, or similar engineering teams