Machine Learning Engineer — Inference Optimization
About the Role
We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.
This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.
Responsibilities
- Optimize inference latency, throughput, and cost for large-scale ML models in production
- Profile GPU/CPU inference pipelines and diagnose bottlenecks (memory, kernels, batching, IO)
- Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization & reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
- Collaborate with research engineers to productionize new model architectures
- Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
- Improve system reliability, observability, and cost efficiency under real workloads
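To give a flavor of the quantization work listed above, here is a minimal sketch of symmetric int8 weight quantization. Everything in it (function names, example weights) is illustrative, not part of this posting; a production stack would use framework tooling such as `torch.ao.quantization` rather than hand-rolled code like this.

```python
def quantize_int8(weights):
    """Map float weights to int8 with a single symmetric scale."""
    # One scale for the whole tensor; `or 1.0` guards the all-zero case.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.42, -1.3, 0.07, 0.99]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-weight reconstruction error is bounded by roughly scale / 2.
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The int8 weights take a quarter of the memory of fp32, which is where the latency and cost wins in this role come from: less memory traffic per token and denser matmuls on hardware with int8 support.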
Requirements
- Strong experience in ML inference optimization or high-performance ML systems
- Solid understanding of deep learning internals (attention, memory layout, compute graphs)
- Hands-on experience with PyTorch (or similar) and model deployment
- Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
- Experience scaling inference for real users (not just research benchmarks)
- Comfort working in a fast-moving startup environment with ownership and ambiguity
Nice to Have
- Experience with LLM or long-context model inference
- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
- Experience optimizing across different hardware vendors
- Open-source contributions in ML systems or inference tooling
- Background in distributed systems or low-latency services
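The benchmarking side of the role can be sketched in plain Python. The workload below is a stand-in (an assumption for illustration, not anything from this posting); the warmup pass and percentile reporting are the parts that carry over to benchmarking a real model server.

```python
import statistics
import time

def benchmark(fn, *, warmup=10, iters=100):
    """Measure per-call latency of fn and report p50/p95 in milliseconds."""
    for _ in range(warmup):  # warm caches/JITs so timed calls are steady-state
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * len(samples)) - 1],
    }

# Stand-in workload: sum of squares (a real harness would call the model).
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
```

Reporting tail latency (p95/p99) rather than the mean matters in serving work, since batching and cache behavior make the tail the part users actually feel.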
What We Offer
Location & Eligibility
Listing Details
- Company: featherlessai
- Posted: January 22, 2026
- Source: Jobera