Senior Site Reliability Engineer

India · Hyderabad · Mid

Engineering · DevOps Engineer

As a Senior Site Reliability Engineer, you will play a pivotal role in owning critical incidents through to resolution, minimizing the impact on our critical business applications and on our customers' business operations. You will also serve as a reliability advisor focused on improving the resilience, observability, and operability of critical platforms and services.

Lead incident response for business-critical services, coordinating cross-functional teams and suggesting troubleshooting steps to restore service quickly and minimize business impact.
Proactively notify internal stakeholders of potential issues impacting service performance and provide regular status updates as required.
Drive blameless postmortems and root cause analysis, turning incidents and recurring issues into systemic fixes, corrective actions, and long-term reliability improvements.
Proactively identify and reduce sources of instability in systems by analyzing how our systems fail in production and driving architectural or operational improvements.
Serve as a senior technical resource and reliability advisor to internal teams, sharing best practices and guiding teams toward sustainable operational excellence.
Partner with engineering and product teams to shift reliability left through design and readiness reviews, resilience reviews, and operational acceptance for launches and changes.
Champion a culture of reliability across business domains and act as a force multiplier by creating clear documentation that enables other teams to adopt reliability improvements at scale.

Requirements

Experience with incident management and response: you can navigate ambiguous, high-pressure production issues, drive coordinated response, and follow through with durable improvements.
Track record of proactively identifying reliability risks and gaps through metrics, incidents, architecture reviews, or resilience testing.
Exceptional problem-solving and critical systems-thinking skills, and familiarity with chaos engineering concepts.
Strong collaboration and influence skills: you communicate clearly, build trust with partner teams, and can guide engineering teams toward better reliability practices.
Growth mindset and curiosity: you are eager to learn, comfortable challenging assumptions (including your own), and motivated by continuous improvement of systems, processes, and yourself.
Minimum of 4-6 years of relevant experience, or an equivalent combination of education and experience, in Senior Incident Management (with a tech focus), SRE, Production Engineering, Software Engineering, DevOps Engineering, or a similar role operating business-critical, high-traffic services in production.
Good business English skills (written and spoken).
Diploma or equivalent work experience required.

Fluency in modern infrastructure: proven hands-on experience with public cloud technologies and understanding of containerized and orchestrated platforms such as Kubernetes.
Practical experience or a strong demonstrated interest in operating LLM-based systems, RAG pipelines, or agentic workloads, and understanding the reliability challenges of non-deterministic systems.
General knowledge of Diebold Nixdorf products and services is a plus.