MLOps Engineer (Real-Time Video Inference)

About the Project

We are building a cutting-edge, real-time AI sports commentator. This system analyzes live video streams of games to generate and deliver high-quality, natural-sounding commentary, just like a human expert.

To achieve this, we run a complex pipeline of machine learning models that must process video in real time. Success depends entirely on our ability to minimize latency and maximize throughput.

Role Summary

We are seeking a specialized MLOps Engineer with deep expertise in optimizing high-performance, low-latency inference pipelines. You will be responsible for the core infrastructure that allows our many models to run efficiently in parallel on GPUs.

This is not a standard MLOps role. We are looking for a specialist with particular experience in GPU optimization, CUDA programming, and real-time streaming systems. Your primary goal is to minimize end-to-end delay and make the entire system as efficient as possible.

Key Responsibilities

  • Architect & Optimize Pipelines: Design, build, and maintain our multi-model, multi-GPU inference pipelines for low-latency deployment.
  • Maximize Concurrency: Structure our streaming pipeline so that multiple models (e.g., detection, tracking, OCR) can run in parallel without blocking one another.
  • Performance Profiling & Debugging: Use tools like NVIDIA Nsight Systems, Nsight Compute, and the PyTorch Profiler to identify and eliminate performance bottlenecks (see the profiling sketch after this list).
  • Solve Complex Bottlenecks: Diagnose and fix challenging issues, such as low GPU utilization while the CPU is maxed out, or performance degradation when distributing work across multiple GPUs.
  • CUDA-Level Optimization: Manage CUDA streams and synchronous/asynchronous execution so that GPU kernels overlap effectively rather than serializing (see the streams sketch below).
  • Minimize Data Overhead: Reduce GPU↔CPU data-transfer overhead using techniques like pinned memory or zero-copy operations (see the pinned-memory sketch below).
  • Deploy & Serve Models: Configure and manage Triton Inference Server to run multiple models concurrently (see the client sketch below).
  • Model Acceleration: Use TensorRT to compile and optimize models for inference, significantly improving latency over base PyTorch or ONNX Runtime (see the compilation sketch below).
  • Handle Complex Streaming Logic: Design the system to manage models with different execution frequencies (e.g., some running every frame, others every few seconds) while processing a 20 FPS video stream (see the scheduler sketch below).
  • Ensure Data Integrity: Implement mechanisms that guarantee frame-order integrity when multiple asynchronous processes handle video frames (see the reordering sketch below).
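
To give candidates a concrete feel for the profiling work, here is a minimal PyTorch Profiler sketch; the model and input shape are illustrative placeholders, not our production pipeline:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-in for one stage of the pipeline.
model = torch.nn.Conv2d(3, 16, 3).cuda().eval()
frame = torch.randn(1, 3, 720, 1280, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(frame)

# Rank operators by GPU time to find the kernels worth optimizing first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```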
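
The streams sketch referenced above: a minimal example of launching two models on separate CUDA streams so their kernels can overlap instead of serializing on the default stream (both models are stand-ins):

```python
import torch

detector = torch.nn.Conv2d(3, 16, 3).cuda().eval()  # stand-in "detection" model
ocr = torch.nn.Conv2d(3, 8, 3).cuda().eval()        # stand-in "OCR" model
frame = torch.randn(1, 3, 720, 1280, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

with torch.no_grad():
    # Each model is enqueued on its own stream, so the GPU is free
    # to interleave their kernels.
    with torch.cuda.stream(s1):
        boxes = detector(frame)
    with torch.cuda.stream(s2):
        text = ocr(frame)

# Block the host until both streams finish before consuming results.
torch.cuda.synchronize()
```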
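
The pinned-memory sketch referenced above: page-locked host buffers let host-to-device copies run asynchronously, which is one standard way to hide transfer latency:

```python
import torch

# Pinned (page-locked) host memory enables truly asynchronous
# cudaMemcpyAsync transfers.
host_frame = torch.empty(3, 720, 1280, pin_memory=True)

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    # non_blocking=True lets this host-to-device copy overlap with
    # compute queued on other streams.
    gpu_frame = host_frame.to("cuda", non_blocking=True)

copy_stream.synchronize()
```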
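
The client sketch referenced above, using the tritonclient package. The server URL, model name, and tensor names are assumptions and must match the model repository's configuration (per-model concurrency itself is configured server-side, e.g., via instance_group in config.pbtxt):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

frame = np.random.rand(1, 3, 720, 1280).astype(np.float32)
inp = httpclient.InferInput("input__0", list(frame.shape), "FP32")
inp.set_data_from_numpy(frame)

# "detector" and the tensor names are placeholders for whatever
# the deployed model repository actually exposes.
result = client.infer(model_name="detector", inputs=[inp])
print(result.as_numpy("output__0").shape)
```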
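
The compilation sketch referenced above, using the torch_tensorrt frontend (one of several routes to a TensorRT engine); the model and shapes are placeholders:

```python
import torch
import torch_tensorrt

model = torch.nn.Conv2d(3, 16, 3).cuda().eval()  # placeholder model

# Compile to a TensorRT engine; FP16 typically cuts latency
# substantially for conv-heavy models versus eager PyTorch.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 720, 1280))],
    enabled_precisions={torch.float16},
)

frame = torch.randn(1, 3, 720, 1280, device="cuda")
with torch.no_grad():
    out = trt_model(frame)
```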
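
The scheduler sketch referenced above: a simplified loop that runs one model every frame and a heavier one every two seconds while holding a 20 FPS cadence. All three functions are hypothetical stubs:

```python
import time

FPS = 20

def grab_frame(i):
    """Stand-in for pulling the next frame from the stream."""
    return f"frame-{i}"

def run_detector(frame):
    """Stand-in for a model that must run every frame (20 Hz)."""

def run_ocr(frame):
    """Stand-in for a heavy model that only needs to run every 2 s."""

start = time.monotonic()
for frame_id in range(100):
    frame = grab_frame(frame_id)
    run_detector(frame)
    if frame_id % (FPS * 2) == 0:  # 40 frames = 2 seconds at 20 FPS
        run_ocr(frame)
    # Sleep until the next frame slot to hold a steady 20 FPS.
    deadline = start + (frame_id + 1) / FPS
    time.sleep(max(0.0, deadline - time.monotonic()))
```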
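
The reordering sketch referenced above: a small buffer that holds results finished out of order and releases them strictly in sequence:

```python
import heapq

class FrameReorderer:
    """Buffers out-of-order results and releases them in sequence order."""

    def __init__(self):
        self._heap = []   # (sequence_number, frame) pairs
        self._next = 0    # next sequence number to release

    def push(self, seq, frame):
        heapq.heappush(self._heap, (seq, frame))
        ready = []
        # Release every frame that is now contiguous with the output.
        while self._heap and self._heap[0][0] == self._next:
            ready.append(heapq.heappop(self._heap)[1])
            self._next += 1
        return ready

# Workers finish out of order; emitted order is still 0, 1, 2, 3.
buf = FrameReorderer()
for seq in (1, 0, 3, 2):
    for frame in buf.push(seq, f"frame-{seq}"):
        print(frame)
```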

Required Qualifications & Experience

  • 3+ years of demonstrated experience in a role focused on high-performance MLOps and GPU optimization.
  • Knowledge of the NVIDIA ecosystem is essential. You must have hands-on experience with:
    • CUDA (especially streams and async execution)
    • TensorRT
    • Triton Inference Server
    • NVIDIA Profiling Tools (Nsight Systems/Compute)
  • Proven experience in debugging and optimizing complex, multi-model GPU pipelines. You should be able to diagnose why the latency of a sequential pipeline grows non-linearly and prioritize optimization efforts based on timing logs.
  • Strong understanding of video processing pipelines. Experience with NVIDIA DeepStream is a significant advantage.
  • Expertise in low-latency techniques, such as managing batch sizes, minimizing data transfer, and understanding the pitfalls of running multiple models on the same GPU.
  • Familiarity with tensor-sharing mechanisms like DLPack (see the sketch after this list).
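
The sketch referenced in the last item: a minimal DLPack handoff between CuPy and PyTorch. It is zero-copy, so both views share one GPU buffer (assumes CuPy is installed against a matching CUDA toolkit):

```python
import cupy as cp
import torch

# A CuPy array, e.g., the output of a custom preprocessing kernel.
cp_frame = cp.random.rand(3, 720, 1280, dtype=cp.float32)

# Zero-copy handoff via the DLPack protocol: no device-to-device
# copy is made; both tensors alias the same GPU memory.
torch_frame = torch.from_dlpack(cp_frame)

torch_frame[0, 0, 0] = 1.0
assert cp_frame[0, 0, 0] == 1.0  # same underlying buffer
```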

What We Offer

  • Medical Insurance in Ukraine and Multisport program in Poland;
  • Offices in Ukraine and Poland (Wroclaw);
  • All official holidays;
  • Paid vacation and sick leaves;
  • Tax & accounting services for Ukrainian contractors;
  • The company is ready to provide all the necessary equipment;
  • English classes up to three times a week;
  • Mentoring and Educational Programs;
  • Regular Activities on a Corporate level (Incredible parties, Team Buildings, Sports Events, and Tech Events);
  • Advanced Bonus System.

Required skills experience

  • Machine Learning: 3 years
  • AI/ML: 1 year
  • GPU optimization: 6 months
  • NVIDIA: 6 months

Required domain experience

  • Machine Learning / Big Data: 2 years

Required languages

  • English: B2 (Upper Intermediate)