Braze

Platform Support Engineer

São Paulo, mid-level

At Braze, we have found our people. We’re a genuinely approachable, exceptionally kind, and intensely passionate crew.

We seek to ignite that passion by setting high standards, championing teamwork, and creating work-life harmony as we collectively navigate rapid growth on a global scale while striving for greater equity and opportunity – inside and outside our organization.

To flourish here, you must be prepared to set a high bar for yourself and those around you. There is always a way to contribute: Acting with autonomy, having accountability and being open to new perspectives are essential to our continued success.

Our deep curiosity to learn and our eagerness to share diverse passions with others give us balance and inject a one-of-a-kind vibrancy into our culture.

If you are driven to solve exhilarating challenges and have a bias toward action in the face of change, you will be empowered to make a real impact here, with a sharp and passionate team at your back. If Braze sounds like a place where you can thrive, we can’t wait to meet you.

Responsibilities

Platform Support Engineers (PSEs) are the first line of defense in ensuring the health and availability of Braze’s platform and systems. As part of a global triage team, they actively monitor system performance, respond to alerts, and execute runbooks, SOPs (Standard Operating Procedures), and MOPs (Maintenance Operating Procedures) to address operational issues.

Braze operates at massive scale, with over 3.3 billion monthly active users across our customers, collecting hundreds of billions of data points each month and sending billions of messages to end users daily. We use a diverse technology stack rooted in Ruby on Rails, MongoDB, Redis, Kafka, Kubernetes, and more. The Braze Operations Team optimizes our response mechanisms by centralizing triage and monitoring responsibilities. This allows our other engineering teams to focus on what they do best while we do what we do best. As a Platform Support Engineer at Braze, you will focus on maintaining uptime and reliability, collaborating with engineers to escalate complex issues, and contributing to the continuous improvement of operational processes.

Main responsibilities:

Active System Monitoring:

→ Use monitoring tools (e.g., Datadog, Prometheus, or similar) to continuously observe the health of platform systems and services
→ Proactively identify and respond to performance anomalies, outages, or unusual system behavior
→ Maintain awareness of ongoing incidents and collaborate with relevant teams to ensure timely resolution

Incident Response and Triage:

→ Act as the first responder to system alerts, determining the severity and scope of issues
→ Execute predefined runbooks, SOPs, and MOPs to mitigate incidents and restore services
→ When incidents exceed the scope of triage procedures, escalate issues to appropriate engineering teams (e.g., SREs or Platform Engineers)

Operational Procedures:

→ Follow and improve operational processes for incident management, system health checks, and routine maintenance tasks
→ Maintain and update runbooks, ensuring accuracy and relevance to current systems and practices
→ Participate in post-incident reviews to improve documentation and operational readiness

Collaboration and Communication:

→ Provide clear, concise communication during incidents, ensuring stakeholders know the status and progress of the resolution
→ Collaborate with SREs, Platform Engineers, and other teams to enhance monitoring, alerting, and operational tools
→ Actively participate in training sessions to stay current on new systems and tools introduced by engineering teams

Continuous Improvement:

→ Identify gaps in monitoring, documentation, and procedures, and suggest improvements to enhance efficiency and effectiveness
→ Assist in testing new runbooks, tools, and processes to improve incident response times
→ Contribute to the automation of routine tasks to reduce manual toil

Experience:

1-3 years of experience in technical operations, system administration, or entry-level cloud engineering roles
Familiarity with cloud platforms (AWS, GCP, Azure), Kubernetes, and basic compute, storage, and networking concepts
Experience with monitoring and alerting tools (e.g., Datadog, Prometheus, Grafana) is a plus

Skills:

Strong troubleshooting and problem-solving skills, with the ability to follow processes and escalate appropriately
Proficiency in scripting or automation tools (e.g., Python, Bash) is a bonus
Familiarity with incident management processes and ITIL best practices

Mindset:

Detail-oriented and committed to maintaining system health and uptime
Eager to learn and grow, with a passion for operational excellence
Collaborative and communicative, able to work effectively in a global, distributed team

#LI-Hybrid