aiand

Member of Technical Staff - Inference Optimization

Yokohama · Full-time · Lead
Other · Member of Technical Staff

Quick Summary

Overview

ai& is a new global AI technology company with a twofold vision: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider.

Key Responsibilities

Custom Kernel Development: Design and implement high-performance kernels for core AI primitives, including GEMM, attention, normalization, and convolution.

Technical Tools
PyTorch · i18n

ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider. We are building a unified, optimized global platform that integrates next-generation data centers and infrastructure, heterogeneous compute serving, and advanced model services. We believe that the most effective way to build and scale AI is to own the stack from top to bottom.

At ai&, we empower small teams with the autonomy needed to tackle significant challenges. Our approach is to deconstruct large problems into manageable components and solve complex issues collaboratively. We seek highly motivated, mission-driven individuals who demonstrate strong personal agency. We value curiosity as the foundation of talent, and we are looking for people eager to develop alongside our evolving technology and expanding business.

We are actively hiring worldwide, with a presence in Tokyo, San Francisco, Austin, and Toronto. We are more than happy to meet exceptional talent where they are.

As a Kernel Optimization Engineer, your objective is to extract maximum performance from heterogeneous GPU hardware. This means going below the framework layer to write, profile, and tune the custom CUDA and ROCm/HIP kernels that sit at the heart of our inference and training stack. You will work across NVIDIA and AMD silicon, understand the deep architectural differences between the two, and write code that is optimal for each.

This is not a role about deploying existing kernels; it is about authoring them. You will identify bottlenecks in the execution loop, including memory bandwidth saturation, warp divergence, occupancy limits, and cache thrashing, and build solutions from first principles. You will work closely with our inference and serving team to ensure that the kernels you build translate into real-world performance gains, but your domain is the kernel layer and everything below it.
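To make one of those bottleneck classes concrete, consider a memory-bandwidth-bound elementwise op: it is limited by HBM traffic rather than arithmetic, so fusing two passes into a single kernel roughly halves its runtime. A minimal CUDA sketch, with hypothetical names and not code from this posting:

#include <cuda_runtime.h>

// Hypothetical fused scale + bias kernel. Running "scale" and
// "add bias" as two kernels would stream the tensor through HBM
// twice; fusing them does one read of x, one read of bias, and
// one write of y, the minimum traffic this op can incur.
__global__ void fused_scale_bias(const float* __restrict__ x,
                                 const float* __restrict__ bias,
                                 float* __restrict__ y,
                                 float scale, int n) {
    // Grid-stride loop keeps every SM busy regardless of n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        y[i] = x[i] * scale + bias[i];
    }
}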

The scope spans attention mechanisms, quantization primitives, custom activation functions, fused operators, and the communication kernels that tie multi-GPU systems together. The ideal candidate has a hardware-first intuition: they think in warps, tiles, and memory hierarchies before they think in frameworks. They are equally comfortable reading PTX and roofline charts. And they are never done optimizing.
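As one small instance of "thinking in warps" (an illustrative sketch, not code from ai&): a warp-wide sum can stay entirely in registers using shuffle intrinsics, with no shared memory round-trip at all.

// Reduction across the 32 lanes of a warp. Each step halves the
// number of live partial sums; after five steps, lane 0 holds
// the sum contributed by all 32 lanes.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}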

Qualifications

  • Deep Kernel Authorship: You have written production CUDA or ROCm kernels from scratch. You understand warp execution, shared memory bank conflicts, occupancy, and instruction-level parallelism at an intuitive level (see the first sketch after this list). Strong proficiency in C++11 or newer, CUDA, Triton, and ideally LLVM/MLIR.

  • Hardware Architecture Knowledge: Strong familiarity with NVIDIA Hopper/Ampere and AMD CDNA architectures. You know the differences in HBM bandwidth profiles, cache sizes, and execution units, and you write code that reflects that knowledge. Deep understanding of memory layout, vectorization, thread and block scheduling, and cache behavior.

  • Precision & Numerical Fluency: Solid grasp of numerical stability, mixed-precision arithmetic, and modern precision formats (see the second sketch after this list). Experience making principled trade-offs between precision and performance in production systems.

  • Profiling Fluency: Comfortable with Nsight Compute, rocprof, Perfetto, VTune, and roofline modeling. You do not guess where the bottleneck is; you measure it.

  • Parallel Programming Breadth: Strong background across parallel programming models, including CUDA, Triton, SYCL, OpenCL, and OpenMP. Experience optimizing irregular algorithms such as sparse linear algebra or graph computations.

  • Systems Thinking: Ability to reason about how individual kernels compose into larger execution graphs, and how kernel-level decisions propagate up through the inference or training stack.

  • Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.
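
Two hypothetical sketches, not ai& code, to ground the bullets above. The first illustrates the bank-conflict point from the first bullet: in a shared-memory transpose, padding each row by one element is the classic fix for a 32-way conflict.

#include <cuda_runtime.h>

#define TILE 32

// 32x32 tile transpose; assumes blockDim = (TILE, TILE).
// Without the +1 padding, the column reads in the second phase
// would map all 32 threads of a warp to the same shared memory
// bank and serialize; the extra element skews each row into a
// different bank.
__global__ void transpose32(const float* __restrict__ in,
                            float* __restrict__ out,
                            int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    // Swap block coordinates for the coalesced write side.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}

The second illustrates the precision bullet: half-precision storage halves memory traffic, while a float accumulator keeps a long reduction numerically stable.

#include <cuda_fp16.h>

// Hypothetical dot product: fp16 inputs, fp32 accumulation.
__global__ void dot_fp16_acc_fp32(const __half* __restrict__ a,
                                  const __half* __restrict__ b,
                                  float* out, int n) {
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x)
        acc += __half2float(a[i]) * __half2float(b[i]);
    // A per-thread atomic is enough for a sketch; a production
    // kernel would reduce within the warp and block first.
    atomicAdd(out, acc);
}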

Location & Eligibility

Where is the job?
Yokohama (hybrid; some on-site time required)
Who can apply?
Open to applicants worldwide

Listing Details

Posted
March 20, 2026
First seen
May 6, 2026
Last seen
May 8, 2026

Posting Health

Days active
0
Repost count
0
Trust Level
42%
Scored at
May 6, 2026

Signal breakdown

freshness · source trust · content trust · employer trust
