Machine Learning Engineer — Inference Optimization
About the Role
We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.
This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.
Responsibilities
- Optimize inference latency, throughput, and cost for large-scale ML models in production
- Profile GPU/CPU inference pipelines and diagnose bottlenecks (memory, kernels, batching, IO)
- Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization & reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
- Collaborate with research engineers to productionize new model architectures
- Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
- Improve system reliability, observability, and cost efficiency under real workloads
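To give a flavor of the quantization work listed above, here is a minimal sketch of symmetric int8 weight quantization. Everything in it (function names, example weights) is illustrative, not part of this posting; a production stack would use framework tooling such as `torch.ao.quantization` rather than hand-rolled code like this.

```python
def quantize_int8(weights):
    """Map float weights to int8 with a single symmetric scale."""
    # One scale for the whole tensor; `or 1.0` guards the all-zero case.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.42, -1.3, 0.07, 0.99]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-weight reconstruction error is bounded by roughly scale / 2.
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The int8 weights take a quarter of the memory of fp32, which is where the latency and cost wins in this role come from: less memory traffic per token and denser matmuls on hardware with int8 support.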
Requirements
- Strong experience in ML inference optimization or high-performance ML systems
- Solid understanding of deep learning internals (attention, memory layout, compute graphs)
- Hands-on experience with PyTorch (or similar) and model deployment
- Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
- Experience scaling inference for real users (not just research benchmarks)
- Comfort working in a fast-moving startup environment with ownership and ambiguity
Nice to Have
- Experience with LLM or long-context model inference
- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
- Experience optimizing across different hardware vendors
- Open-source contributions in ML systems or inference tooling
- Background in distributed systems or low-latency services
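The benchmarking side of the role can be sketched in plain Python. The workload below is a stand-in (an assumption for illustration, not anything from this posting); the warmup pass and percentile reporting are the parts that carry over to benchmarking a real model server.

```python
import statistics
import time

def benchmark(fn, *, warmup=10, iters=100):
    """Measure per-call latency of fn and report p50/p95 in milliseconds."""
    for _ in range(warmup):  # warm caches/JITs so timed calls are steady-state
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * len(samples)) - 1],
    }

# Stand-in workload: sum of squares (a real harness would call the model).
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
```

Reporting tail latency (p95/p99) rather than the mean matters in serving work, since batching and cache behavior make the tail the part users actually feel.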
What We Offer
Location & Eligibility
Listing Details
- Company: featherlessai
- Posted: January 22, 2026
- Source: Jobera