Site Reliability Engineer (SRE)

Taiwan · Taipei City · Full-time · Mid

Engineering · DevOps Engineer

Our client is an innovative technology company operating large-scale cloud and edge infrastructure supporting AI-driven products and services. As the platform continues to expand, they are looking for a Site Reliability Engineer to help build highly reliable, observable, and secure systems that power mission-critical applications.

This role offers the opportunity to work across cloud infrastructure, Kubernetes, observability, security, automation, and emerging AI operational platforms in a fast-growing environment.

Design and maintain monitoring, alerting, and dashboarding systems across cloud and edge environments.
Build visibility into system health through metrics, logs, traces, and performance analytics.
Define and manage SLIs, SLOs, and service reliability targets.
Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.

Deploy, manage, and optimize containerized workloads running on Kubernetes.
Maintain scalable cloud infrastructure across production environments.
Improve system performance, availability, and operational efficiency.
Support infrastructure provisioning through Infrastructure as Code practices.

Implement secure access controls and audit mechanisms across infrastructure environments.
Monitor for cybersecurity threats, unauthorized access attempts, and service disruptions.
Develop alerting and response procedures for security-related incidents.
Contribute to operational security best practices and governance initiatives.

Automate repetitive operational tasks to reduce manual effort and improve reliability.
Build tooling and scripts to streamline infrastructure operations.
Support CI/CD workflows and deployment automation.
Promote documentation, operational standards, and continuous improvement.

Participate in on-call rotations and incident management.
Lead troubleshooting efforts during production incidents.
Conduct root-cause analysis and post-mortem reviews.
Drive long-term improvements that enhance system resilience.

Work closely with software, AI, machine learning, hardware, and product teams.
Ensure new services are production-ready with appropriate monitoring, security, and reliability measures.
Support the operational needs of both cloud-based and distributed edge computing environments.

3+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Production Operations.

Hands-on experience with AWS or other major cloud platforms.

Strong understanding of observability and monitoring tools such as Grafana, Prometheus, or similar platforms.

Solid Linux administration and troubleshooting skills.

Experience with Docker, Kubernetes, and containerized workloads.

Experience with Infrastructure as Code tools such as Terraform.

Proficiency in at least one scripting or programming language (Python, Bash, etc.).

Understanding of networking fundamentals and infrastructure security concepts.

Experience supporting production systems and participating in incident response.

Strong automation mindset and commitment to operational excellence.

Experience operating large-scale edge computing or IoT deployments.

Familiarity with zero-trust access management platforms.