Director, Site Reliability Engineering & Cloud Operations (SRE)

United States · Austin · Hybrid · Executive

Other · Site Reliability Engineering

Overview

At Resideo, we imagine a world where homes and buildings are good for the planet, and where technology works to simplify everyday life. In that world, people are healthy, happy, and secure. To help create this future, we work every day to simplify the connected world so people have peace of mind and can focus on what matters most. Resideo is making a large investment in our engineering group. With global reach and impact, we are committed to building our team as we develop new products and introduce them to consumers around the world (new product introduction, NPI). As an established leader in the connected products space, we will give you a platform to work on new and innovative projects as a member of a team of intelligent innovators who are developing products that truly align with our mission of protecting what matters most.

This is an exciting opportunity to lead cloud operations for one of the largest IoT ecosystems in the world, shaping the future of cloud infrastructure, SRE, and AI-driven operations. You'll work alongside world-class engineering talent and cutting-edge technologies to advance Resideo’s mission of simplifying everyday life through innovative connected products. As a leader, you will drive the platform engineering transformation across a global, multi-team organization, delivering on business priorities while collaborating with development leaders and executives to define and advance best practices.

Resideo is seeking a strategic and experienced leader to oversee global cloud infrastructure, Site Reliability Engineering (SRE), and CloudOps for our large-scale connected products ecosystem. This role will drive the performance, reliability, security, and operational excellence of our multi-cloud environments (Azure), supporting millions of IoT devices and trillions of data points and events. The ideal candidate will have deep expertise in cloud infrastructure, IoT, and large-scale SaaS platforms, and be passionate about fostering a culture of innovation, reliability, and automation.

Cloud Infrastructure & SRE Strategy
- Define and execute global cloud operations and SRE strategies, ensuring 99.999%+ uptime for mission-critical IoT services.
- Architect, implement, and optimize multi-cloud infrastructure to support IoT devices with low-latency data processing, scalability, and high availability.
- Drive cost optimization strategies while balancing performance, redundancy, and financial efficiency across cloud platforms (Azure).
- Develop automated deployment, monitoring, and recovery systems using technologies like Kubernetes, Terraform, Ansible, and CI/CD pipelines.

Reliability, Performance & Incident Management
- Establish and refine SLOs, SLIs, and KPIs for service reliability, performance, and capacity planning.
- Build and optimize incident management, disaster recovery, and resilience engineering frameworks.
- Leverage AI/ML-driven automation for proactive failure detection and remediation.

Security & Compliance
- Implement robust cloud security practices, ensure compliance with standards such as SOC2, GDPR, and NIST, and oversee the zero-trust security model for IoT data protection.
- Collaborate with security and compliance teams to manage risk and ensure regulatory adherence across cloud platforms.

Team Leadership & Cross-Functional Collaboration
- Lead and mentor a global team of Cloud Engineers, SREs, and software professionals, fostering a culture of continuous learning and innovation.
- Partner with product management, software engineering, and customer support to optimize IoT device onboarding, firmware updates, and cloud-to-edge performance.
- Collaborate with finance and executive leadership to develop long-term cloud investment strategies.

Requirements

15+ years in Computer Science, Electrical Engineering, or a related field
15+ years of experience in Cloud Operations, SRE, or Infrastructure Engineering, with 8+ years in technical leadership roles
5+ years of experience managing large-scale, distributed IoT cloud environments supporting billions of data points per day
5+ years of deep professional experience in Azure cloud platforms including networking, storage, compute, and database services
5+ years of experience in Kubernetes, Terraform, CI/CD pipelines, and observability tools (e.g., Prometheus, Grafana, ELK)
5+ years of experience in large-scale systems design and architecture, with a focus on reliability, performance, and scalability of cloud-native platforms
5+ years of hands-on experience with Infrastructure-as-Code (IaC) tools such as Terraform, Ansible, CDK, and Pulumi, and with managing cloud-native architectures

Strong background in AI/ML-driven automation for cloud infrastructure monitoring, self-healing, and optimization
Solid understanding of security-first cloud architectures, DevSecOps, and compliance standards (SOC2, GDPR, NIST)
Proven ability to manage teams across multiple global time zones, ensuring operational excellence and driving performance in large, distributed environments
Expertise in incident management, disaster recovery, and building resilience engineering frameworks
Ability and desire to review code, system designs, and engage in system engineering discussions and decisions
Experience managing Consumer IoT ecosystems with large-scale sensor data processing and real-time analytics
Expertise in serverless architecture, edge computing, and IoT protocol optimization
Strong financial acumen in cloud cost management and forecasting
Familiarity with regulatory compliance frameworks such as SOC2, GDPR, and ISO 27001
Relevant certifications, such as Azure Expert