Nexla · 2mo ago

Senior Site Reliability Engineer

India · Bangalore · Senior

Engineering · DevOps Engineer

Quick Summary

Overview

About Nexla: Nexla is the leading integration platform, built with AI, for AI. Nexla takes a metadata-driven approach to converge diverse integrations across Data, Documents, Agents, Applications, and APIs into a single design pattern.

Key Responsibilities

Kubernetes Management: Take end-to-end ownership of Amazon EKS infrastructure, specifically managing the creation, scaling, and seamless version upgrading of clusters.

Requirements Summary

Experience: 8+ years of total experience, with a strong focus on DevOps and SRE.
EKS Mastery (Must Have): Proven hands-on experience creating and upgrading Amazon EKS clusters.

Technical Tools

ansible, aws, github-actions, gitlab-ci, jenkins, kubernetes, python, terraform, ci-cd, cybersecurity, linux, networking

Nexla is the leading integration platform, built with AI, for AI. Nexla takes a metadata-driven approach to converge diverse integrations across Data, Documents, Agents, Applications, and APIs into a single design pattern. We accelerate the development of solutions for GenAI, Analytics, and Inter-company data. Nexla makes data users and developers up to 10x more productive by delivering a true blend of no-code, low-code, and pro-code interfaces.

Leading companies including DoorDash, LinkedIn, Johnson & Johnson, and LiveRamp trust Nexla for mission-critical data. Nexla was named in the 2022, 2023, and 2024 Gartner Magic Quadrant™ for Data Integration Tools and is top-rated by customers on Gartner Peer Insights. The company is headquartered in San Mateo, California.

At Nexla, our culture is built around our core values: Have Empathy, Be Curious, Be Intellectually Honest, Achieve Excellence, and Remember to Relax. We put our customers at the heart of everything we do, foster a data-driven mindset, take ownership of our work, and believe in the power of teamwork to achieve ambitious goals.

Responsibilities

→Streaming & Data Plane Reliability: Own the health of our Kafka-based runtime (managed via Strimzi on Kubernetes) - broker health, topic lifecycle and count management, partition and throughput tuning, certificate/secret rotation, and version upgrades - at a scale of hundreds of thousands of topics and hundreds of billions of rows per day.
→Distributed Processing Engines: Operate and tune distributed-system workloads in production, in collaboration with backend teams - resource allocation, autoscaling, checkpointing, backpressure, and failure recovery for both batch and streaming jobs.
→Stateful Services: Run Redis clusters and other stateful systems reliably - failover, persistence, liveness/readiness tuning, and capacity planning under heavy and bursty load.
→Kubernetes & Operators: Take end-to-end ownership of Amazon EKS, Google GKE, and the operators (Strimzi and others) running our stateful data workloads - cluster lifecycle, scaling, version upgrades, and resource governance.
→Observability: Build deep, data-aware monitoring - consumer lag, throughput, partition skew, job latency, error rates - not just host and CPU metrics. Make the data plane's behavior legible before it breaks.
→Incident Management: Lead root-cause analysis for distributed-systems failures (broker outages, crashloops, sink decommissions, control-plane race conditions) and drive durable fixes. Mitigate fast, but design out the recurrence.
→Infrastructure as Code & Automation: Provision and manage cloud infrastructure with Terraform; build operational runbooks and automation, including for air-gapped / private enterprise installs (pre-staged images, operator-facing procedures).
→Collaboration: Partner with platform, runtime, and connector engineering - and with SREs and support - to ship and scale new data-movement features reliably in a large-scale Linux environment.

Requirements

Experience: 8+ years in infrastructure, SRE, or DevOps, with significant time spent operating production distributed data systems (not just application/cloud infra).
Kafka: Deep, hands-on operational experience running Kafka at scale in production - ideally on Kubernetes via Strimzi - including upgrades, topic/partition management, performance tuning, and TLS/secret rotation.
Distributed Processing (Strong Plus): Production experience operating one or more of Spark, Flink, or Ray - resource tuning, checkpointing, failure recovery.
Stateful Systems (Must Have): Production experience with Redis (clustering, persistence, failover) and a solid understanding of operating stateful workloads on Kubernetes (StatefulSets, PVCs, probes, operators).
Data Warehouses: Familiarity with operating against Snowflake, BigQuery, or similar, and an understanding of JDBC connectivity and sink reliability.
Kubernetes & EKS: Strong hands-on EKS experience - cluster creation, scaling, version upgrades, and operator management.
Infrastructure as Code: Advanced proficiency with Terraform.
Programming: Proficiency in Python (or similar) for automation and tooling. Comfort reading and debugging JVM-based systems is a strong plus.
Reliability Mindset: Demonstrated ownership of incident management, RCA, capacity planning, and performance tuning for high-throughput systems.
CI/CD: Solid understanding of CI/CD methodology (Jenkins, GitHub Actions, or GitLab CI) for containerized and non-containerized apps. This is a supporting skill, not the core of the role.
Nice to Have: Configuration management (Ansible preferred); broader AWS services (IAM, VPC, EC2, S3, Lambda); AWS CloudFormation.
Soft Skills: Excellent communication and organizational skills; ability to coordinate effectively within a team and with customers.

You own the hard part. The stateful, distributed systems that move billions of rows are the platform's most demanding reliability problems - and they'd be yours.
Impact at scale from day one. Your work keeps mission-critical data flowing for companies like DoorDash and LinkedIn.
The AI wave is real for us. We're not bolting AI onto a legacy product. Intelligent connectors, context-aware data movement, and agentic workflows are the core of what we're building next - on top of the runtime you'd run.

Small team, big problems. Direct access to the CTO, real influence over product direction, and the autonomy to make significant technical bets.
Recognized platform, startup energy. Enterprise validation with the speed and ownership of an early-stage company.

Location
Pune (preferred) or Bengaluru

Why Build Your Future at Nexla?

We are standing at the edge of the GenAI revolution, but the biggest bottleneck isn't the models; it's the data. By joining Nexla, you aren’t just entering a company; you are stepping into the critical layer of the modern data stack that powers the AI economy. We are the Data Fabric that enables industry titans like LinkedIn, DoorDash, and J&J to turn messy, siloed data into ready-to-use products for RAG and predictive models. This is your opportunity to move beyond simple tooling and build the actual infrastructure that democratizes data access for the next decade of innovation. If you want to solve the hardest problems in data engineering and own a piece of a market projected to hit billions, your career belongs here.