Lightning AI · ~2mo ago
$150,000 – $195,000/yr

Machine Learning Solutions Engineer (ML + Infrastructure Focus)

United States · New York · mid
Data Science · Sales · Machine Learning Engineer · Sales Engineer · Data & AI

Quick Summary

Requirements Summary

Lightning AI is looking for a Machine Learning Solutions Engineer with a focus on ML and infrastructure to join our Sales team in New York.

Technical Tools
docker, kubernetes, python, pytorch, ab-testing, customer-success, distributed-systems, etl, machine-learning, networking, system-design

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

 

Lightning AI is looking for a Machine Learning Solutions Engineer with a focus on ML and infrastructure to join our Sales team in New York. As a Machine Learning Solutions Engineer, you will operate at the intersection of machine learning, distributed systems, and cloud infrastructure. You will partner with customers to design and deploy end-to-end AI systems, spanning:

  • Model development and training
  • GPU infrastructure and cluster design
  • Distributed inference and production deployment

This role goes beyond traditional ML solutions engineering—you will act as a technical architect, helping customers make critical decisions across compute, orchestration, and system design.

The role is hybrid out of our New York City office hub, with an in-office requirement of at least 3 days per week and occasional team and company offsites. We are not able to provide visa sponsorship for this role at this time.

 

Responsibilities

  • Partner with customers to understand ML workloads, infrastructure constraints, and scaling requirements
  • Architect end-to-end solutions across:
    • Data pipelines (CPU → GPU workflows)
    • Distributed training (multi-node, multi-GPU)
    • High-throughput inference systems
  • Translate business goals (latency, cost, throughput) into technical system design decisions
  • Design and optimize workloads across GPU clusters (H100, H200, B200, etc.)
  • Advise on:
    • Training vs inference cluster design
    • Interconnect choices (Ethernet vs InfiniBand / RDMA vs RoCE)
    • Storage strategies (local NVMe vs networked / object storage)
  • Model and optimize for:
    • Tokens/sec, tokens/$
    • Throughput vs latency tradeoffs
    • GPU utilization and scheduling efficiency
  • Design and support deployments on Kubernetes (EKS, GKE, on-prem clusters)
  • Work with:
    • GPU scheduling (time-slicing, MIG, bin-packing)
    • Autoscaling and workload orchestration
    • Helm-based deployments and multi-tenant environments
  • Help customers balance:
    • Raw Kubernetes flexibility vs platform abstraction (Lightning)
  • Build and deliver technical demos and POCs that showcase:
    • Distributed training workflows
    • Scalable inference endpoints
    • End-to-end ML pipelines on Lightning AI
  • Scope and lead POCs aligned to customer success metrics (latency, cost, reliability)
  • Act as the bridge between customers, product, and engineering
  • Provide feedback on:
    • Platform gaps in infrastructure, orchestration, and performance
    • Emerging patterns in GPU usage and distributed systems
  • Influence roadmap across ML workflows and infrastructure capabilities
  • Create technical content:
    • Architecture guides (e.g., high-throughput LLM inference systems)
    • Best practices for GPU utilization and scaling
  • Educate customers on modern AI infrastructure patterns

 

Requirements

  • 3–6+ years of experience in:
    • Machine Learning / AI Engineering
    • Solutions Engineering / Sales Engineering / ML Consulting
  • Strong understanding of:
    • Training vs inference workloads
    • Model optimization (quantization, batching, caching, etc.)
  • Experience working with:
    • GPU clusters (NVIDIA stack preferred)
    • Distributed training or inference systems
  • Familiarity with:
    • NCCL, CUDA, or GPU performance profiling
    • Networking concepts (RDMA, RoCE, InfiniBand, high-throughput systems)
  • Hands-on experience with:
    • Kubernetes (EKS, GKE, or on-prem)
    • Slurm 
    • Containerization (Docker)
  • Exposure to:
    • GPU scheduling in Kubernetes environments
    • Multi-tenant or production ML deployments
  • Strong Python skills (PyTorch preferred)
  • Experience building:
    • ML pipelines
    • APIs or inference services
  • Familiarity with Lightning AI, PyTorch Lightning, or similar frameworks is a plus
  • Ability to:
    • Explain complex infrastructure and ML tradeoffs clearly
    • Run technical discovery and uncover quantifiable success metrics
  • Experience working cross-functionally with:
    • Sales, product, and engineering teams

 

What We Offer

The annual base pay range for this role is $150,000 - $195,000, in addition to a variable pay component and meaningful equity. 

 

We offer a comprehensive and competitive benefits package designed to support our employees’ health, well-being, and long-term success. Benefits may vary by location, team, and role.

Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
Generous paid time off, plus holidays
Paid parental leave
Professional development support
Wellness and work-from-home stipends
Flexible work environment

Location & Eligibility

Where is the job
New York, United States
Hybrid, at least 3 days per week in the New York City office
Who can apply
US
Listed under
United States

Listing Details

First seen
March 26, 2026
Last seen
May 28, 2026
