Director, Support Engineering

San Francisco | Executive

About the Role

We’re hiring a Support Leader to own and scale Together AI’s customer support function across two distinct, technically demanding domains: API Support (billing, serverless inference, and dedicated inference) and GPU Support (large-scale GPU infrastructure for model training workloads). You’ll work closely with Together AI’s VP of Customer Experience and partner tightly with SRE, Inference Platform, and Engineering to represent customers internally and drive resolution at speed. This is a player-coach role: you’ll be hands-on in escalations.

Our support operation runs 24/7. Our GPU infrastructure customers hold us to high-stakes SLAs on training workloads. Our API customer base spans thousands of PLG and enterprise accounts relying on our serverless and dedicated inference endpoints. Both domains need a leader who can keep pace technically and build the operational muscle to scale.

Responsibilities

Directly manage and develop a team of support engineers and technical account specialists across API Support and GPU Support functions.
Establish clear performance expectations, career growth paths, and a coaching culture. Identify skill gaps and build training programs to close them.
Run structured 1:1s, team reviews, and escalation retrospectives.

Assess and overhaul support workflows, SLA frameworks, and escalation playbooks.
Build triage, prioritization, and handoff protocols that allow the team to scale with customer growth without proportional headcount growth.
Define and own support KPIs: SLA attainment, time-to-resolution, escalation rate, and CSAT.

Jump into complex, active GPU infrastructure issues alongside your team. Investigate problems such as NCCL and InfiniBand failures, SSH connection stalls, kubelet TLS misconfigurations, GPU/RDMA provisioning timeouts, NFS RDMA mount failures, VAST storage failures, and network fabric degradation.
Manage high-stakes SLA obligations with GPU cloud customers running multi-thousand-GPU training workloads.
Coordinate closely with SRE and infrastructure engineering on hardware-level issues and cluster bring-up.

Own the support surface for Together AI’s API platform: serverless inference, dedicated inference endpoints (self-serve and managed), billing, rate limits, model upload (BYOM), and API authentication.
Represent the team on complex cases: dedicated endpoint startup failures, safetensors validation errors, NFS/storage performance issues on inference clusters, billing disputes and negative-balance enforcement, and rate limit escalations.
Work with the Inference Platform, Commerce, and Product teams to surface patterns and drive fixes upstream.

Be the escalation point for your team’s highest-severity customer issues — triage fast, communicate clearly to customers and internal stakeholders, and drive to resolution.
Partner with SRE, Engineering, and Sales on shared priorities. Represent the support team’s perspective in cross-functional planning.
Own the relationship with support tooling vendors and drive improvements to alerting, SLA tracking, and ticket routing.

Systematically analyze ticket patterns and surface product and infrastructure gaps to Engineering and Product. Turn support signal into actionable roadmap input.
Build documentation and self-service resources that reduce inbound volume over time.

Requirements

10+ years of support engineering or technical support leadership experience, with at least 3 years managing a team.
Demonstrated experience leading infrastructure support or cloud operations. You understand how large-scale workloads behave on distributed systems.
Working knowledge of AI infrastructure. You know how APIs work, can reason about latency and throughput issues, and understand the operational surface of a managed inference platform.
Technical depth to be a credible player-coach. You can guide engineers through root cause analysis and bring credibility to customer-facing escalations.
Experience running SLA-driven support operations with real accountability. Familiarity with Pylon or equivalent support ticketing platforms (Zendesk, etc.) and PagerDuty-style alerting systems.
Strong communication skills, especially under pressure. You can write a clear, concise customer-facing update in the middle of a live incident and distill a complex infrastructure issue into a crisp internal escalation.
Startup mindset. You’re comfortable building process where none exists, and you thrive in environments where priorities shift fast.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

What We Offer

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $290,000 - $310,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our Privacy Policy at https://www.together.ai/privacy