LLM Inference Kernel Engineer MLA
Quick Summary
Location: Remote, United States

A high-growth, venture-backed AI innovator is pushing the boundaries of large-scale model performance, focusing on next-generation inference systems that operate at the intersection of model architecture and GPU execution.
Responsibilities
- Design and implement high-performance GPU kernels tailored for large language model inference workloads
- Optimize CUDA kernels with a focus on memory efficiency, execution speed, and latency reduction
- Enhance token generation performance, KV cache utilization, and decoding efficiency in large-scale models
- Collaborate on integrating optimized kernels into modern inference serving frameworks such as vLLM or similar systems
- Work closely with a small, highly technical team to rapidly prototype, test, and deploy performance improvements
- Apply advanced techniques such as kernel fusion, tiling strategies, and warp-level optimization to improve throughput
- Translate complex attention mechanisms into production-ready, scalable GPU implementations
Requirements
- Strong experience developing GPU kernels using CUDA C or C++ in performance-critical environments
- Hands-on experience optimizing inference workloads for large language models rather than purely research-based modeling
- Solid understanding of attention mechanisms, with exposure to advanced implementations such as fused attention or similar approaches
- Familiarity with modern inference stacks and serving frameworks
- Deep knowledge of GPU architecture, including memory hierarchy, bandwidth constraints, and latency tradeoffs
- Ability to operate in a fast-paced, highly iterative environment with minimal oversight
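Several of the responsibilities above center on KV-cache decoding. As a rough orientation only, here is a minimal NumPy sketch of a single-head decode step that attends one new query against a cache of past keys and values; names, shapes, and the single-head simplification are illustrative, not the production CUDA path this role would implement:

```python
import numpy as np

def decode_step(q, k_cache, v_cache):
    # q: (d,) query for the newest token; k_cache, v_cache: (t, d) past tokens
    scores = k_cache @ q / np.sqrt(q.shape[0])   # (t,) scaled dot-product scores
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()                     # attention weights over cached positions
    return weights @ v_cache                     # (d,) attention output for the new token

rng = np.random.default_rng(0)
d, t = 8, 5
q = rng.standard_normal(d)
k_cache = rng.standard_normal((t, d))
v_cache = rng.standard_normal((t, d))
out = decode_step(q, k_cache, v_cache)
```

On a GPU, this per-token pass over the cache is typically memory-bandwidth bound, which is exactly where techniques like kernel fusion and warp-level optimization pay off.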
Preferred Qualifications
- Experience working with advanced attention techniques such as latent attention or similar architectures
- Exposure to large-scale or distributed model inference environments, including mixture-of-experts systems
- Contributions to performance optimization projects, open-source kernels, or inference tooling
- Familiarity with GPU profiling and performance analysis tools
- Background that bridges model architecture, systems engineering, and deployment layers
This is not a traditional machine learning engineering position. The work sits at one of the most performance-critical layers in the AI stack, where low-level optimization directly impacts real-world model capability. You will have the opportunity to shape how advanced models operate at scale, contributing to meaningful innovations in inference performance and system efficiency.
Blue Signal is an award-winning executive search firm. Our recruiters have a proven track record of placing top-tier talent across industry verticals. Learn more at bit.ly/46Gs4yS
Listing Details
- First seen: May 6, 2026
- Last seen: May 8, 2026