About MBZUAI
The Institute for Foundation Models (IFM) operates some of the world's largest AI supercomputing environments.
Position Summary
This role provides operational coverage during Abu Dhabi overnight hours and serves as a primary point of contact for infrastructure monitoring, incident triage, researcher support, and production operations.
Responsibilities
• Monitor health, performance, and availability of large-scale GPU clusters.
• Respond to incidents and perform first-level triage.
• Support researchers and troubleshoot job failures.
• Execute operational runbooks and recovery procedures.
• Validate cluster deployments, upgrades, and maintenance activities.
• Track infrastructure utilization and operational metrics.
• Develop automation and monitoring tools.
• Contribute to documentation and reporting.
Qualifications
• Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or a related discipline.
• 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
• Strong Linux troubleshooting skills.
• Experience with scripting using Python or Bash.
• Slurm.
• GPU infrastructure.
• AWS, Azure, or GCP.
• Grafana, Prometheus, Datadog, or similar tools.
• Containers and Kubernetes.
• AI/ML infrastructure exposure.
• Research computing environments.
Benefits Include
• Comprehensive medical, dental, and vision benefits
• Bonus
• 401(k) plan
• Generous paid time off, sick leave, and holidays
• Paid parental leave
• Employee Assistance Program
• Life insurance and disability
Location & Eligibility
Location: Sunnyvale, United States (on-site at the office)
Who can apply: US
Listing Details
- Posted: June 1, 2026
- First seen: June 2, 2026
- Last seen: June 2, 2026
Salary
USD 150,000–300,000 per year
HPC Engineer: USD 150,000–300,000 per year