Coupang

Senior Staff Cloud Backend Engineer - Observability and Site Reliability

Bengaluru, Senior

Quick Summary

Overview

Company Introduction

We exist to wow our customers. We know we’re doing the right thing when we hear our customers say,


As a Senior Staff Data Centre Observability and Site Reliability Engineer, you will design, build, and operate scalable observability and reliability solutions for large-scale datacenter infrastructure. This role focuses on developing high-performance monitoring and telemetry platforms, ensuring system reliability, and driving operational excellence through automation, performance optimization, and SRE best practices. The ideal candidate will work across the full service lifecycle—design, deployment, and continuous improvement—while collaborating with cross-functional teams to enhance visibility, resilience, and efficiency of critical systems.

Responsibilities

  • Design, implement, and maintain observability solutions for datacenter infrastructure, including monitoring, logging, alerting, and telemetry systems.
  • Develop, deploy, and operate large-scale observability and telemetry platforms with a focus on real-time monitoring, high performance, and scalability.
  • Own and contribute to the full lifecycle of observability services—from design and development to deployment and ongoing optimization.
  • Build and enhance monitoring systems to ensure high availability, reliability, and performance of infrastructure.
  • Create and manage dashboards, alerts, and reports to provide clear visibility into system health, performance, and capacity trends.
  • Apply SRE principles and best practices to improve reliability, scalability, and operational efficiency of datacenter services.
  • Develop and maintain automation for infrastructure provisioning, monitoring, and system management.
  • Lead root cause analysis (RCA) and post-incident reviews, driving corrective actions to prevent recurrence and improve system resilience.
  • Analyze system and application performance across the datacenter infrastructure to identify bottlenecks and improvement areas.
  • Implement optimization strategies to enhance performance, efficiency, and resource utilization.
  • Partner with cross-functional engineering teams to understand observability and reliability requirements and deliver effective solutions.
  • Collaborate with hardware and software vendors to evaluate, integrate, and optimize new technologies within the ecosystem.
  • Ensure observability and reliability solutions adhere to organizational security policies and industry standards.
  • Implement and maintain appropriate security controls to safeguard infrastructure, systems, and data.
  • Provide hands-on support for observability and reliability issues, including debugging complex hardware and software problems.
  • Develop and maintain documentation, including troubleshooting guides and operational best practices, to support efficient issue resolution.
  • Stay current with emerging trends, tools, and technologies in observability and SRE, and incorporate them into the platform.
  • Continuously enhance the scalability, reliability, and operational efficiency of datacenter services through proactive improvements.

 

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • 12+ years of progressive software engineering experience, with a heavy emphasis on distributed systems, cloud-native architectures, or platform operations.
  • Proven experience in managing and optimizing large-scale datacenter environments.
  • Strong proficiency in Go or Python, with a deep understanding of networked systems and performance optimization.
  • Expert-level knowledge of Kubernetes internals (scheduling, controllers) and containerization ecosystems.
  • Proven experience with load balancing, service mesh, and request routing at scale.
  • Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack).
  • Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform).
  • Familiarity with cloud platforms (AWS, Azure, GCP) and their observability and reliability services.

 

Preferred Qualifications

  • Prior experience building infrastructure specifically for LLM inference or large-scale training clusters. 
  • Familiarity with inference optimization, including mixed precision, kernel tuning, or custom hardware accelerators.
  • Experience managing hybrid-cloud or multi-AZ deployments across AWS, Azure, or GCP. 
  • Experience operating in regulated environments with strict security and compliance requirements.

 

  • Hybrid
    Our hybrid work model: Coupang's hybrid work model is designed to enable a culture of collaboration that acts as a catalyst to enrich the employee experience. Employees are required to work in the office at least 3 days per week, with the flexibility to work from home 2 days a week, depending on role requirements. Some businesses may require more time in the office due to the nature of the work.

Details to consider 

  • Those eligible for employment protection (recipients of veteran’s benefits, people with disabilities, etc.) may receive preferential treatment in employment in accordance with applicable laws.
     


Location & Eligibility

Where is the job: Bengaluru, on-site at the office
Who can apply: Same as job location

Listing Details

Posted: May 14, 2026
First seen: May 14, 2026
Last seen: May 14, 2026

Posting Health

Days active: 0
Repost count: 0
Trust level: 67%
Scored at: May 14, 2026


Coupang is a U.S. retail company known for its fast delivery services and commitment to customer satisfaction.

Employees: 5k+
Founded: 2010