GPU Performance Engineer (CUDA / AI Workloads)


About the Role

We’re looking for a GPU Performance Engineer to optimize and scale high-performance AI workloads at the kernel level.

This role focuses on deep performance engineering — working directly with CUDA kernels, GPU architecture, and profiling tools to unlock efficiency across training and inference systems. You’ll work on compute-intensive workloads where every microsecond and memory access pattern matters.

If you enjoy low-level optimization, understanding how hardware really works, and pushing systems to their limits — this role is for you.

 

What You’ll Do

  • Design, optimize, and debug CUDA kernels for high-performance AI workloads
  • Improve performance of compute-intensive operations (e.g., GEMM, attention, MoE, graph workloads)
  • Apply low-level optimization techniques across memory hierarchy, warp execution, tensor cores, and compute efficiency
  • Profile and eliminate GPU bottlenecks using tools like Nsight Systems and Nsight Compute
  • Integrate optimized kernels into modern AI training and inference stacks
  • Build internal tooling for benchmarking, profiling, and performance analysis
  • Collaborate with engineering teams on performance-critical model infrastructure
  • Contribute reusable performance improvements and best practices across the stack
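To give a flavor of the memory-hierarchy work listed above, here is a minimal sketch of a classic coalescing optimization: a shared-memory tiled matrix transpose. This is an illustrative example only, not code from our stack; the kernel name, `TILE` size, and launch geometry are arbitrary choices for the sketch.

```cuda
#define TILE 32  // tile edge; one 32x32 thread block handles one tile

// Transpose a height x width row-major matrix `in` into `out`.
// A naive transpose makes either its loads or its stores strided
// (uncoalesced); staging the tile in shared memory lets both the
// global load and the global store be coalesced.
__global__ void transpose_tiled(float* __restrict__ out,
                                const float* __restrict__ in,
                                int width, int height) {
    // +1 padding skews rows across banks, avoiding shared-memory
    // bank conflicts on the column-wise reads below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load

    __syncthreads();

    // Swap the block indices so the store is coalesced as well.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}
```

Measuring the difference between this and the naive version in Nsight Compute (global memory efficiency, shared-memory bank-conflict counters) is representative of the day-to-day profiling loop in this role.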

 

What We’re Looking For

  • Strong hands-on experience with CUDA kernel development and optimization
  • Deep understanding of GPU architecture (memory hierarchy, warp scheduling, synchronization)
  • Strong C/C++ skills for high-performance systems programming
  • Proven experience identifying and resolving GPU performance bottlenecks
  • Experience optimizing compute-heavy workloads in real-world systems
  • Familiarity with GPU profiling tools (Nsight Systems, Nsight Compute, or similar)
  • Experience integrating custom kernels into ML frameworks
  • Strong debugging skills and performance-oriented mindset

 

Nice to Have

  • Experience with PTX and architecture-specific optimizations
  • Familiarity with modern GPU architectures (e.g., Hopper, Blackwell)
  • Experience optimizing Transformer workloads (e.g., attention, FlashAttention)
  • Familiarity with Triton, CUTLASS, Thrust, or CUB
  • Experience with multi-GPU systems (e.g., NVLink, NCCL)
  • Open-source contributions or relevant research in GPU / HPC

 

Why This Role

  • Work on cutting-edge AI systems with real performance impact
  • Solve complex low-level engineering problems at the hardware boundary
  • High ownership over performance-critical components
  • Opportunity to push modern GPU systems to their limits

 

Required languages

English C1 - Advanced
Published 15 April