remotestar-team · ~1d ago
Senior System Engineer (Munich, Germany)
Other · System Engineer
Quick Summary
Overview
About the client: A well-funded, fast-growing deep-tech company founded in 2019. It is the biggest quantum software company in the EU and one of the 100 most promising AI companies in the world (according to CB Insights, 2023), with 150+ employees and growing, fully multicultural and…
Technical Tools
ansible, kubernetes, python, pytorch, terraform, distributed-systems, linux, networking, performance-optimization
Requirements
- Systems Programming Expertise: 10+ years of software engineering experience with strong proficiency in Python. You must be comfortable building system agents, APIs, and CLI tools.
- Deep Kubernetes Knowledge: You understand K8s internals beyond simple deployment. Experience with Custom Resource Definitions (CRDs), Operators, and the Kubernetes API server architecture.
- GPU Ecosystem Experience: Hands-on experience managing NVIDIA GPU clusters. Familiarity with NVIDIA drivers, CUDA toolkit, and the container runtime (NVIDIA Container Toolkit).
- Linux Internals: Deep understanding of the Linux kernel, cgroups, namespaces, and system performance tuning.
- Infrastructure as Code: Mastery of declarative infrastructure tools (Terraform, Ansible), with a focus on provisioning physical hardware rather than only cloud VMs.
- Problem Solving: A proven track record of debugging complex distributed systems where the root cause could be code, network, or silicon.
Requirements
- HPC Background: Experience working with traditional supercomputing schedulers (Slurm, PBS) or modern batch schedulers (Volcano, Kueue, Ray).
- Bare Metal Provisioning: Experience with tools like Cluster API (CAPI), Metal3, Tinkerbell, Canonical MaaS, or OpenStack Ironic.
- High-Speed Networking: Knowledge of RDMA, InfiniBand, GPUDirect, and how to expose these technologies to containerized workloads.
- AI/ML Familiarity: Understanding of how distributed training works (e.g., PyTorch Distributed, Megatron-LM, DeepSpeed) and the infrastructure requirements of Large Language Models (LLMs).
- Observability: Experience building monitoring for hardware health (DCGM) and distributed tracing for long-running jobs.
Responsibilities
- Building the Control Plane: Designing and developing the software layer (APIs, Controllers, Agents) that automates the lifecycle of bare-metal AI infrastructure.
- Orchestrating High-Scale Compute: Architecting scheduling solutions for large-scale distributed training jobs across massive clusters of GPUs (NVIDIA H200/B200/B300), ensuring efficient bin-packing and gang scheduling.
- Optimizing the Fabric: Tuning the software-defined networking layer to support low-latency interconnects (InfiniBand/RDMA/RoCEv2) essential for multi-node training.
- Developing Kubernetes Extensions: Writing custom Kubernetes Operators and CRDs to abstract complex hardware realities (topology awareness, GPU partitioning) into usable interfaces for our Data Scientists.
- Hardware-Level Debugging: Investigating and resolving deep systems issues, ranging from PCIe bus errors and NCCL communication timeouts to kernel panics on bare-metal nodes.
- Defining Standards: Creating the "Golden Image" for AI workloads, managing drivers, firmware, and OS optimizations to squeeze maximum performance out of the hardware.
What We Offer
- Indefinite contract.
- Equal pay guaranteed.
- Variable performance bonus.
- Signing bonus.
- Relocation package (if applicable).
- Private health insurance.
- Eligibility for educational budget according to internal policy.
- Hybrid opportunity.
- Flexible working hours.
- A fast-paced environment working on cutting-edge technologies.
- Career plan. Opportunity to learn and teach.
- Progressive company. Happy people culture.
Location & Eligibility
Where is the job
Cambourne, United Kingdom
Remote within one country
Who can apply
GB
Listing Details
- First seen: May 6, 2026
- Last seen: May 8, 2026
Posting Health
- Days active: 0
- Repost count: 0
- Trust Level: 59%
- Scored at: May 6, 2026
Signal breakdown
freshness, source trust, content trust, employer trust