GoDaddy · 1mo ago

Senior Manager System Engineering

Colombia · Senior

Overview

Location Details: Colombia, remote. At GoDaddy, the future of work looks different for each team.

Join GoDaddy's Forge Ops team at the intersection of Data, Infrastructure, and AI-driven operations. As Senior Manager, Systems Engineering, you will lead the reliability, cost efficiency, and agentic operation of the Data & AI ecosystem that serves GoDaddy. This is a deeply technical leadership role, not a hands-off management position. You will operate as GoDaddy's L1/L2 authority over critical analytics and data platforms while advancing Forge Operations: a structured operating model designed to move platform operations from hero-based, expert-dependent support to system-based, agent-assisted, self-improving operations. If you can translate a business problem into a technical architecture and that architecture into team execution, and you want to build the AI Ops pattern for a large-scale data organization, this role is for you.

Responsibilities

→Own and operate GoDaddy's analytical and data intelligence platforms (Redshift, QuickSight, FeedDB, Protegrity, Alation) as the authoritative L1/L2 platform owner, driving reliability, deployment standards, cost optimization, and user enablement across an ecosystem with a 50PB+ data lake and thousands of consumers.
→Lead 24/7 incident management and production operations across 10+ Data & AI platforms, owning MTTR/MTTD targets, AAR rigor, and a root-cause-to-control loop that converts every incident into a runbook, monitoring improvement, or automation — not just a resolved ticket.
→Architect and advance Forge Ops OS, the team's agent-based operating model, which uses history-informed early warning, auto-recovery agents, runbook intelligence, and bounded agentic orchestration. The team transitions from operating systems itself to leading all aspects of the agents that operate those systems.
→Drive data platform cost efficiency through unit economics (cost per query, cost per workload, cost per dashboard visit), translating AWS spend into measurable business metrics and continuous optimization across Redshift, QuickSight, DPaaS, and ML infrastructure.
→Manage operational planning and executive reporting on weekly, monthly, and quarterly cadences. Run a sprint-based improvement program with a strategic allocation of roughly 70%. Provide clear traceability from team execution to company goals and landmark outcomes.

5+ years of proven 24/7 production operations leadership: leading incident response end-to-end, owning MTTR performance, leading post-mortems (AARs) that produce controls, and driving the systemic fixes that reduce incident recurrence.
Hands-on AWS architecture/platform expertise — Redshift, EMR/Airflow, Lambda, EKS, S3, IAM/RBAC, and CDK/CloudFormation — with end-to-end operational and cost ownership of at least two production data or analytics platforms.
Systems and software architecture fluency: able to translate business requirements into scalable technical designs, reason about architectural trade-offs, and decompose solutions into actionable engineering tasks without deferring all technical judgment to individual contributors.
Data platform operations at scale (ETL/ELT pipelines, data lakes, orchestration frameworks such as Airflow and EMR, and BI tooling), with a deep understanding of data quality, SLAs, lineage, and the dependency chains that connect producers to executive-facing consumers.
Technical team leadership with operational rigor: proven ability to lead engineers through sprint-based planning, capacity management, and cross-functional delivery, while maintaining the hands-on technical credibility to unblock, review, and elevate the team's output.

Experience with AI/agentic operations — building or operating LLM-based tools such as automated runbooks, incident response agents, AAR generation systems, or bounded auto-recovery workflows.
Familiarity with graph databases or lineage/observability architectures (e.g., Neptune or equivalent) for dependency mapping, early warning, and blast-radius analysis in large data ecosystems.
Hands-on experience with Databricks or analytical compute platforms (Lakehouse, feature stores, ML infrastructure) in a production operations context.
Experience with data protection platforms (e.g., Protegrity) and PII/tokenization workflows in large-scale data lake or analytics environments.
Familiarity with ServiceNow/CMDB or equivalent incident management systems (Jira, PagerDuty) as operational systems of record — including MTTR/MTTD tracking and CI/lineage integration.

We encourage you to apply even if your experience or abilities don’t align perfectly with every requirement. We value a wide range of backgrounds and transferable skills, and we are excited to support learning and growth.