Senior Cloud Site Reliability Engineer AP

India · Nagar · Full Time · Senior

Requirements Summary

your specific skills and experience, geographic location, or other relevant factors. The salary range for this position may be tailored to be lower or higher in different talent markets.

Lighthouse is built on a foundation of unique, compassionate, highly driven individuals. We elevate the strengths and talents of those around us while leveraging opportunities for growth. We offer the experience of solving complex problems while continuing to grow multiple facets of your career. Lighthouse is where innovation meets support and where collaboration is the key ingredient to success. We grow together and are stronger together.

About the Role

The Senior Cloud Site Reliability Engineer (Senior Cloud SRE) is responsible for ensuring the reliability, scalability, availability, performance, security, and operational excellence of Lighthouse’s cloud platforms and critical product infrastructure.

This role combines software engineering, cloud engineering, automation, observability, and operational governance practices to build highly resilient and self-healing platforms across hybrid and cloud-native environments. The ideal candidate will drive SRE best practices, improve service reliability through automation, establish observability standards, and partner closely with Engineering, Product, Security, DBA, and DevEx teams to improve operational maturity across the organization.

The role requires deep expertise in cloud infrastructure, Kubernetes, DevOps/SRE principles, telemetry, incident management, monitoring, and automation, along with strong collaboration and communication skills.

Drive and implement Site Reliability Engineering (SRE) best practices across cloud platforms and services.
Define, maintain, and improve:
- Service Level Indicators (SLIs)
- Service Level Objectives (SLOs)
- Service Level Agreements (SLAs)
- Error Budgets
Improve service reliability, resiliency, scalability, and operational efficiency.
Establish operational standards, reliability governance, and production readiness practices.
Conduct Root Cause Analysis (RCA), postmortems, and reliability improvement initiatives.
Participate in on-call rotations, incident management, and major incident resolution activities.
Continuously improve operational processes to reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

Design, implement, and maintain enterprise observability and telemetry platforms.
Build operational dashboards, reliability scorecards, and service health monitoring solutions.
Configure proactive alerting, anomaly detection, and incident correlation mechanisms.
Implement centralized monitoring and telemetry using:
- Grafana
- Prometheus
- Azure Monitor
- Log Analytics
- ELK Stack / Elasticsearch
- Power BI dashboards
Develop actionable operational metrics and telemetry reporting for engineering and leadership teams.
Enhance visibility into infrastructure, application, Kubernetes, and platform health.

Drive automation-first operational practices across infrastructure and platform services.
Develop Infrastructure-as-Code (IaC) solutions using:
- Terraform
- ARM/Bicep
- Ansible
Build operational automation scripts using:
- Python
- Bash
- PowerShell
Develop self-healing and auto-remediation capabilities for recurring operational incidents.
Automate infrastructure provisioning, monitoring, scaling, backup, recovery, and deployment workflows.
Reduce manual operational effort and improve engineering productivity through intelligent automation.

Collaborate closely with:
- Cloud Engineering teams
- Product Engineering teams
- DevEx teams
- Security teams
- DBA teams
- Operations teams
Support engineering teams in improving production readiness and operational maturity.
Contribute to continuous improvement initiatives, reliability reviews, and operational excellence programs.

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience/certification).
Knowledge of Python, scripting, or Infrastructure-as-Code tools (e.g., Terraform, Ansible, ARM/Bicep).
Experience managing cloud platforms (e.g., Azure, AKS, Pivotal Cloud Foundry, or equivalent).
Strong understanding of Kubernetes and containerization concepts.
Experience with application packaging, deployment automation, and release management.
Solid knowledge of relational databases (MS-SQL) and exposure to NoSQL technologies (e.g., Redis, Elasticsearch, MongoDB).
Experience with CI/CD tools (Azure DevOps, Jenkins, GitHub Actions, or similar).
Familiarity with monitoring and logging tools (Grafana, ELK Stack, Prometheus, Power BI, etc.).
Proficiency with Git and modern branching/merging workflows.
Strong Linux administration and troubleshooting skills.
Excellent problem-solving, communication, and teamwork skills.

Duties are performed in a typical office environment while at a desk or computer table.
Duties require the ability to use a computer, communicate over the telephone, and read printed material, in a quiet and professional setting.
Duties may require being on call periodically and working outside normal working hours (evenings and weekends).