GPU Performance Engineer (CUDA / AI Workloads)


About the Role

We’re looking for a GPU Performance Engineer to optimize and scale high-performance AI workloads at the kernel level.

This role focuses on deep performance engineering — working directly with CUDA kernels, GPU architecture, and profiling tools to unlock efficiency across training and inference systems. You’ll work on compute-intensive workloads where every microsecond and memory access pattern matters.

If you enjoy low-level optimization, understanding how hardware really works, and pushing systems to their limits — this role is for you.

 

What You’ll Do

  • Design, optimize, and debug CUDA kernels for high-performance AI workloads
  • Improve performance of compute-intensive operations (e.g., GEMM, attention, MoE, graph workloads)
  • Apply low-level optimization techniques across memory hierarchy, warp execution, tensor cores, and compute efficiency
  • Profile and eliminate GPU bottlenecks using tools like Nsight Systems and Nsight Compute
  • Integrate optimized kernels into modern AI training and inference stacks
  • Build internal tooling for benchmarking, profiling, and performance analysis
  • Collaborate with engineering teams on performance-critical model infrastructure
  • Contribute reusable performance improvements and best practices across the stack
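To give a flavor of the memory-hierarchy work listed above, here is a minimal sketch of a classic coalescing optimization: a shared-memory tiled matrix transpose. This is an illustrative example only, not code from our stack; the kernel name, `TILE` size, and launch geometry are arbitrary choices for the sketch.

```cuda
#define TILE 32  // tile edge; one 32x32 thread block handles one tile

// Transpose a height x width row-major matrix `in` into `out`.
// A naive transpose makes either its loads or its stores strided
// (uncoalesced); staging the tile in shared memory lets both the
// global load and the global store be coalesced.
__global__ void transpose_tiled(float* __restrict__ out,
                                const float* __restrict__ in,
                                int width, int height) {
    // +1 padding skews rows across banks, avoiding shared-memory
    // bank conflicts on the column-wise reads below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load

    __syncthreads();

    // Swap the block indices so the store is coalesced as well.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}
```

Measuring the difference between this and the naive version in Nsight Compute (global memory efficiency, shared-memory bank-conflict counters) is representative of the day-to-day profiling loop in this role.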

 

What We’re Looking For

  • Strong hands-on experience with CUDA kernel development and optimization
  • Deep understanding of GPU architecture (memory hierarchy, warp scheduling, synchronization)
  • Strong C/C++ skills for high-performance systems programming
  • Proven experience identifying and resolving GPU performance bottlenecks
  • Experience optimizing compute-heavy workloads in real-world systems
  • Familiarity with GPU profiling tools (Nsight Systems, Nsight Compute, or similar)
  • Experience integrating custom kernels into ML frameworks
  • Strong debugging skills and performance-oriented mindset

 

Nice to Have

  • Experience with PTX and architecture-specific optimizations
  • Familiarity with modern GPU architectures (e.g., Hopper, Blackwell)
  • Experience optimizing Transformer workloads (e.g., attention, FlashAttention)
  • Familiarity with Triton, CUTLASS, Thrust, or CUB
  • Experience with multi-GPU systems (e.g., NVLink, NCCL)
  • Open-source contributions or relevant research in GPU / HPC

 

Why This Role

  • Work on cutting-edge AI systems with real performance impact
  • Solve complex low-level engineering problems at the hardware boundary
  • High ownership over performance-critical components
  • Opportunity to push modern GPU systems to their limits

 

Required languages

English C1 - Advanced
Published 15 April