HPC Infrastructure Engineer

Part-Time | Mid

Engineering | DevOps Engineer

Quick Summary

Key Responsibilities

Design, build, and maintain high-performance computing (HPC) infrastructure. Develop and implement scalable distributed systems for complex computing tasks.

Requirements Summary

At least 3-5 years of experience in HPC infrastructure, systems engineering, or a similar role. Strong understanding of systems engineering principles and scaling strategies.

Insight Softmax is a leading organization in the fields of machine learning, data science, and high-performance computing (HPC). We are dedicated to providing innovative solutions in computing technology to solve complex problems in various sectors including research, science, manufacturing, finance, and data analytics.

Description:

We are hiring an HPC Infrastructure Engineer for our engineering team. Within the HPC realm, we design and build CFD (computational fluid dynamics) and simulation services.
The ideal candidate will have a strong background in building and maintaining high-performance computing services, expertise in distributed systems, and a passion for systems engineering, scaling, and storage architecture. This role involves working closely with our development teams to design, implement, and optimize HPC solutions that meet our growing needs.
This opportunity is unusual in that it sits within our skunkworks team at Insight Softmax. While the exact nature and purpose of this work are confidential, it is an extremely exciting set of projects and objectives that touch the HPC, ML, Simulation, and Quantum swimlanes. Our HPC team is building CFD services and architecture that will integrate directly with ML and Simulation services. You will interface and work with some of the most experienced and talented people on the planet across these swimlanes.
The infrastructure team we are putting together requires individuals with a focus on delivery and results, a hunger to learn, and an ability to adapt, who also thrive in an environment where they will not be given all the training or instructions needed to get their deliverables done. You will be given top-level engineering goals and milestones, and you are responsible for figuring out and delivering what is required to get there. Folks with startup experience may be more comfortable in this position, as will people who have a mental itch that can never be fully scratched. While senior-level individuals often bring much-needed experience to the table, we are always open to less-experienced individuals who have attributes well suited to our team culture.
Depending on engineering objectives, priorities, and team member skill sets, we may in the future swap team members between responsibilities, or we may all team up to finish one project more quickly, so you may have the opportunity to work on other projects (ML, Simulation, Quantum, etc.) outside of HPC.

Key Responsibilities:

Design, build, and maintain high-performance computing (HPC) infrastructure.
Develop and implement scalable distributed systems for complex computing tasks.
Scale services and systems from smaller deployments of ~50 nodes into larger clusters of ~250+ nodes.
Use orchestration, configuration management, virtualization, Linux, the CLI, deployment tools, monitoring, and APM.
Monitor HPC system performance and implement improvements to ensure scalability and efficiency.
Optimize engineering work and deliverables to balance feature development, product quality, and service reliability.

Qualifications:

At least 3-5 years of experience in HPC infrastructure, systems engineering, or a similar role.
Strong understanding of systems engineering principles and scaling strategies.
Deep knowledge of at least one of AWS, GCP, or Azure, preferably AWS.
Strong Linux chops.
Experience with data architecture and large-scale data processing.

Optional Experience and Skills:

HPC CFD experience.
Scaling services and systems from single-region clusters into multi-regional deployments.
Experience with clusters of greater than 100 nodes.
Building on-prem systems (on-premises, data center, rack-n-stack, etc.).
Distributing workloads onto both cloud and on-prem systems.
Balancing business needs, product delivery dates, and customer satisfaction.

Bonus Experience and Skills:

Backend work like SQL, API development, and serverless.
Software engineering.
Python and/or other programming languages.
Harnessing creativity, generating innovative products and features, and ensuring a delightful customer experience.
Using Linux as your primary desktop workstation environment.

Attributes that we value:

Transparency
Grit
Effectiveness
Willingness to learn

Location and Hours:

Full-time position
Remote work environment
80% of your work schedule will be during the business time zones for Latin America and Canada.
Some customer meetings will occur each week at non-standard / nighttime hours, as we currently support a global team across 3-4 continents.
Travel for up to 4 weeks per year to customer destinations