grubtech

Site Reliability Engineer

Colombo 07 | Contractor | Mid
Engineering | DevOps Engineer


Grubtech is a unified commerce engine purpose-built for the food and beverage industry. We serve a wide 
range of customers - from SMBs to mid-market and enterprise brands - helping them manage and scale 
their operations across multiple digital and physical channels. 
Our platform integrates online ordering, POS, delivery aggregators, loyalty, and more - giving restaurants 
the tools they need to thrive in a digital-first world. 



Role Overview 
This is a key role focused on improving the reliability, availability, performance, and operational maturity 
of Grubtech's production systems. This individual will manage and improve AWS-based cloud 
environments, including ECS-based workloads, strengthen monitoring, alerting, logging, and observability 
capabilities, and support effective incident management for mission-critical workloads. The role will 
partner closely with application, DevOps, infrastructure, and support teams to prevent incidents, respond 
quickly when issues occur, improve production readiness, and reduce operational toil through automation 
and continuous improvement. 


Profile: 
• Bachelor’s degree in Computer Science, Software Engineering, or a related field.


• Minimum 5 years of hands-on experience in Site Reliability Engineering, DevOps, cloud platform 
engineering, infrastructure operations, or production engineering. 


• Strong hands-on experience operating, troubleshooting, and improving production workloads in 
AWS; Azure or on-prem deployments would be an added advantage. 


• Experience with core AWS services and production operations, including VPC, EC2, ECS, IAM, Load 
Balancers, CloudWatch, RDS, Security Groups, and related cloud services. 


• Hands-on working experience with Datadog is a must, including monitoring, alerting, application 
performance monitoring, logging, dashboards, and service health visibility. 


• Ability to continuously improve existing Datadog dashboards, monitors, alert thresholds, and 
operational views as services evolve and production needs change. 


• Experience managing and improving incident management capabilities, including incident triage, 
escalation, communication, root-cause analysis, post-incident reviews, and follow-up actions. 


• Experience defining and improving reliability practices such as SLOs, SLIs, error budgets, runbooks, 
playbooks, operational readiness checks, and on-call processes. 


• Experience troubleshooting distributed systems, AWS infrastructure, ECS workloads, networking, 
databases, and application performance issues in production environments. 


• Experience with multiple scripting languages such as Python, Bash, PowerShell, and JavaScript.


• Experience with managed data platforms such as MongoDB Atlas, Confluent Cloud, Couchbase,
PlanetScale, ClickHouse, Redis, and Postgres.


• Experience supporting mission-critical Linux systems at scale; Windows experience is optional but
good to have.


• Experience supporting cloud networking, including DNS, Web Application Firewalls, Security Groups,
Network Access Control Lists, and load balancers.


• Experience supporting containerized workloads using Docker and AWS ECS. 


• Expertise with cloud monitoring and management systems. 


• Experience with cloud security principles and best practices. 


• Familiarity with GitHub and GitHub Actions for managing CI/CD pipelines, release workflows, and 
deployment automation. 


• Experience with monitoring and management tools such as Datadog, Prometheus, Grafana, and ELK.


• Ability to analyze current technology and operational processes, then develop practical steps to 
improve reliability, alert quality, scalability, and operational efficiency. 


• Willingness to participate in incident response and on-call support for production systems when 
required. 


• Strong problem-solving and analytical skills.


• Strong English communication skills. 


• Ability to multitask, work well under pressure, and prioritize work against competing deadlines
and changing business priorities.

Location & Eligibility

Where is the job
Colombo 07
On-site at the office

Listing Details

Posted: May 19, 2026
First seen: May 21, 2026
Last seen: May 23, 2026

Posting Health

Days active: 0
Repost count: 0
Trust level: 52%
Scored at: May 21, 2026
