LLM Inference Kernel Engineer MLA
Quick Summary
Location: Remote, United States

A high-growth, venture-backed AI innovator is pushing the boundaries of large-scale model performance, focusing on next-generation inference systems that operate at the intersection of model architecture and GPU execution.
Responsibilities
- Design and implement high-performance GPU kernels tailored for large language model inference workloads
- Optimize CUDA kernels with a focus on memory efficiency, execution speed, and latency reduction
- Enhance token generation performance, KV cache utilization, and decoding efficiency in large-scale models
- Collaborate on integrating optimized kernels into modern inference serving frameworks such as vLLM or similar systems
- Work closely with a small, highly technical team to rapidly prototype, test, and deploy performance improvements
- Apply advanced techniques such as kernel fusion, tiling strategies, and warp-level optimization to improve throughput
- Translate complex attention mechanisms into production-ready, scalable GPU implementations
Requirements
- Strong experience developing GPU kernels using CUDA C or C++ in performance-critical environments
- Hands-on experience optimizing inference workloads for large language models rather than purely research-based modeling
- Solid understanding of attention mechanisms, with exposure to advanced implementations such as fused attention or similar approaches
- Familiarity with modern inference stacks and serving frameworks
- Deep knowledge of GPU architecture, including memory hierarchy, bandwidth constraints, and latency tradeoffs
- Ability to operate in a fast-paced, highly iterative environment with minimal oversight
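Several of the responsibilities above center on KV-cache decoding. As a rough orientation only, here is a minimal NumPy sketch of a single-head decode step that attends one new query against a cache of past keys and values; names, shapes, and the single-head simplification are illustrative, not the production CUDA path this role would implement:

```python
import numpy as np

def decode_step(q, k_cache, v_cache):
    # q: (d,) query for the newest token; k_cache, v_cache: (t, d) past tokens
    scores = k_cache @ q / np.sqrt(q.shape[0])   # (t,) scaled dot-product scores
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()                     # attention weights over cached positions
    return weights @ v_cache                     # (d,) attention output for the new token

rng = np.random.default_rng(0)
d, t = 8, 5
q = rng.standard_normal(d)
k_cache = rng.standard_normal((t, d))
v_cache = rng.standard_normal((t, d))
out = decode_step(q, k_cache, v_cache)
```

On a GPU, this per-token pass over the cache is typically memory-bandwidth bound, which is exactly where techniques like kernel fusion and warp-level optimization pay off.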
Preferred Qualifications
- Experience working with advanced attention techniques such as latent attention or similar architectures
- Exposure to large-scale or distributed model inference environments, including mixture-of-experts systems
- Contributions to performance optimization projects, open-source kernels, or inference tooling
- Familiarity with GPU profiling and performance analysis tools
- Background that bridges model architecture, systems engineering, and deployment layers
This is not a traditional machine learning engineering position. The work sits at one of the most performance-critical layers in the AI stack, where low-level optimization directly impacts real-world model capability. You will have the opportunity to shape how advanced models operate at scale, contributing to meaningful innovations in inference performance and system efficiency.
Blue Signal is an award-winning executive search firm. Our recruiters have a proven track record of placing top-tier talent across industry verticals. Learn more at bit.ly/46Gs4yS
Listing Details
- First seen: May 6, 2026
- Last seen: May 8, 2026