GPU Performance Engineer (CUDA / AI Workloads)
About the Role
We’re looking for a GPU Performance Engineer to optimize and scale high-performance AI workloads at the kernel level.
This role focuses on deep performance engineering: working directly with CUDA kernels, GPU architecture, and profiling tools to unlock efficiency across training and inference systems. You'll work on compute-intensive workloads where every microsecond and memory access pattern matters.
If you enjoy low-level optimization, understanding how hardware really works, and pushing systems to their limits, this role is for you.
What You’ll Do
- Design, optimize, and debug CUDA kernels for high-performance AI workloads
- Improve performance of compute-intensive operations (e.g., GEMM, attention, MoE, graph workloads)
- Apply low-level optimization techniques across memory hierarchy, warp execution, tensor cores, and compute efficiency
- Profile and eliminate GPU bottlenecks using tools like Nsight Systems and Nsight Compute
- Integrate optimized kernels into modern AI training and inference stacks
- Build internal tooling for benchmarking, profiling, and performance analysis
- Collaborate with engineering teams on performance-critical model infrastructure
- Contribute reusable performance improvements and best practices across the stack
What We’re Looking For
- Strong hands-on experience with CUDA kernel development and optimization
- Deep understanding of GPU architecture (memory hierarchy, warp scheduling, synchronization)
- Strong C/C++ skills for high-performance systems programming
- Proven experience identifying and resolving GPU performance bottlenecks
- Experience optimizing compute-heavy workloads in real-world systems
- Familiarity with GPU profiling tools (Nsight Systems, Nsight Compute, or similar)
- Experience integrating custom kernels into ML frameworks
- Strong debugging skills and performance-oriented mindset
Nice to Have
- Experience with PTX and architecture-specific optimizations
- Familiarity with modern GPU architectures (e.g., Hopper, Blackwell)
- Experience optimizing Transformer workloads (e.g., attention, FlashAttention)
- Familiarity with Triton, CUTLASS, Thrust, or CUB
- Experience with multi-GPU systems (e.g., NVLink, NCCL)
- Open-source contributions or relevant research in GPU / HPC
Why This Role
- Work on cutting-edge AI systems with real performance impact
- Solve complex low-level engineering problems at the hardware boundary
- High ownership over performance-critical components
- Opportunity to push modern GPU systems to their limits
Required Languages
- English: C1 (Advanced)
Published 15 April