Senior Technical Project Manager (Hardware Automation)
Quick Summary
Why work at Nebius
Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams.
Own end-to-end delivery of complex, cross-functional programs across multiple hardware infrastructure workstreams simultaneously.
5+ years in technical program or project management, with a track record of delivering across multiple hardware, software or infrastructure domains.
Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure.
Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI.
Listed on Nasdaq (NBIS) and headquartered in Amsterdam, we have a global footprint with R&D hubs across Europe, the UK, North America and Israel. Our team of 1,500+ includes hundreds of engineers with deep expertise across hardware, software and AI R&D.
Requirements
- 5+ years of technical project or program management experience, with at least 2 years managing automation, platform engineering, or infrastructure tooling programs in a data centre or cloud infrastructure environment.
- Technical fluency in hardware or software.
- Demonstrated experience managing programs that automate operational workflows.
- Strong operational risk awareness: track record of structuring staged rollouts, defining rollback criteria, and insisting on safety gates before automation runs unsupervised on production infrastructure.
- Experience coordinating across engineering and operations stakeholders with different risk tolerances — bridging teams that move fast and teams that prioritise stability without letting that tension stall delivery.
Responsibilities
- Own end-to-end delivery of hardware automation programs: zero-touch provisioning, firmware update pipelines, automated burn-in and validation, fault detection and self-healing, and operational tooling for data centre technicians.
- Translate high-level automation goals (reduce mean time to provision, eliminate manual firmware update toil, automate 80% of common fault remediation actions) into structured project plans with clear milestones, owners, dependencies, and success metrics.
- Manage the full program lifecycle from discovery and scoping through engineering delivery, staged rollout, production validation, and handoff to operations, ensuring automation is reliable enough to trust before it runs unsupervised on live infrastructure.
- Run structured delivery cadences: sprint planning, weekly engineering syncs, milestone reviews, and go/no-go gates for automation rollout to production, without creating process overhead that slows the engineering team down.
- Coordinate across Hardware Automation engineering, Hardware Infrastructure, data centre operations, network engineering, baremetal software, and cloud control plane teams — identifying and resolving the cross-team dependencies that are the most common source of program delay.
- Partner with data centre operations leadership to understand the manual workflows targeted for automation, ensure engineering solutions match operational reality, and manage the change process when new automated systems replace established manual procedures.
- Manage relationships with hardware vendors and firmware teams whose delivery timelines and API roadmaps directly gate automation program schedules — tracking commitments, escalating slippage early, and building contingency plans where vendor dependency is unavoidable.
- Facilitate technical alignment across engineering teams when automation programs require changes to adjacent systems: asset management platforms, monitoring infrastructure, ITSM integrations, or cloud control plane APIs.
- Maintain a program risk register with specific focus on operational safety risks: automation failures that could affect production fleet availability, firmware rollouts with insufficient rollback capability, or self-healing workflows with poorly bounded blast radius.
- Own the staged rollout framework for automation deployments: define canary criteria, rollout gates, rollback triggers, and monitoring requirements that must be in place before any automation system runs unsupervised at scale.
- Ensure every automation program has a clearly documented failure mode analysis — what happens when the automation breaks, who gets alerted, and how the fleet returns to a known-good state — reviewed by engineering and operations before production deployment.
- Proactively identify automation initiatives where the risk-benefit trade-off warrants a slower, more conservative rollout cadence, and make that case clearly to engineering and leadership rather than defaulting to schedule pressure.
- Define and track the outcome metrics that demonstrate automation program value: manual operations hours eliminated, mean time to provision, mean time to remediate faults, firmware update cycle time, and human-error-driven incident reduction.
- Build program dashboards that give engineering, operations, and leadership real-time visibility into automation coverage (what percentage of the fleet lifecycle is automated), reliability (automation success rates by workflow), and backlog (outstanding manual toil targeted for automation).
- Produce concise, data-driven status reports and executive summaries that translate engineering progress into business outcomes — connecting automation delivery to fleet scalability, operational cost, and reliability metrics that matter to Nebius leadership.
- Own the Hardware Automation team roadmap: maintain a prioritised backlog of automation opportunities, scored by toil reduction potential, operational risk, and engineering complexity, and drive quarterly planning that allocates engineering capacity to the highest-value initiatives.
- Facilitate structured toil assessment processes with data centre operations and infrastructure engineering — identifying, quantifying, and ranking manual workflows that are the best candidates for automation investment.
- Track the automation landscape at Nebius competitors and in the broader hyperscale infrastructure community; bring external best practices and reference implementations into roadmap conversations to accelerate Nebius automation ambitions.
- Develop and maintain program templates, playbooks, and rollout checklists specific to hardware automation delivery — capturing the operational safety requirements, stakeholder sign-off gates, and monitoring baselines that every automation program must satisfy before going live.
- Build post-mortem and retrospective practices that capture learnings from automation incidents and near-misses, and feed those learnings back into engineering standards and future program planning.
- Champion a culture of measurable automation impact: every program ships with defined success metrics, and post-launch measurement is treated as a first-class engineering activity, not an afterthought.
Nice to Have
- Background as a systems engineer, SRE, or infrastructure software engineer before transitioning to program management: someone who has written automation scripts or operated large fleets rather than only managed people who do.
- Experience with specific hardware automation domains: zero-touch provisioning (Ironic, Tinkerbell, or similar), firmware orchestration (Redfish, vendor-specific update tools), automated hardware testing frameworks, or self-healing remediation systems.
- Familiarity with infrastructure-as-code and configuration management tooling (Terraform, Ansible, Salt) from a delivery and rollout management perspective.
- Exposure to data centre operations — rack and stack, cabling, power and cooling management — sufficient to understand what automation means to a technician on the floor.
- Experience with site reliability engineering (SRE) practices: toil measurement, error budgets, and the discipline of treating operational automation as a first-class engineering investment.
- Knowledge of observability platforms (Prometheus, Grafana, Datadog, or equivalent) used to monitor automation system health and measure operational outcomes.
What We Offer
- Fast moving
- Bold thinking
- Constant growth
- Meaningful impact
- Trust and real ownership
- Opportunity to shape the future of AI
Nebius is an equal opportunity employer. We are committed to fostering an inclusive and diverse workplace and to providing equal employment opportunities in all aspects of employment. We do not discriminate on the basis of race, color, religion, sex (including pregnancy), national origin, ancestry, age, disability, genetic information, marital status, veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by applicable law.
Applicants must be authorized to work in the country in which they apply and will be required to provide proof of employment eligibility as a condition of hire.
If you need accommodations during the application process, please let us know.
Listing Details
- Posted: April 30, 2026
- First seen: April 30, 2026
- Last seen: May 27, 2026
Posting Health
- Days active: 26
- Repost count: 0
- Trust Level: 31%
- Scored at: May 27, 2026
Nebius is a cutting-edge AI cloud platform that offers scalable infrastructure for developing and deploying AI solutions.