MLOps Engineer (Real-Time Video Inference)
About the Project
We are building a cutting-edge, real-time AI sports commentator. This system analyzes live video streams of games to generate and deliver high-quality, natural-sounding commentary, just like a human expert.
To achieve this, we are running a complex pipeline of machine learning models that must process video in real-time. Success depends entirely on our ability to minimize latency and maximize throughput.
Role Summary
We are seeking a specialized MLOps Engineer with deep expertise in optimizing high-performance, low-latency inference pipelines. You will be responsible for the core infrastructure that allows our many models to run efficiently in parallel on GPUs.
This is not a standard MLOps role. We are looking for a specialist with deep, hands-on experience in GPU optimization, CUDA programming, and real-time streaming systems. Your primary goal is to minimize end-to-end delay and keep the entire system running as efficiently as possible.
Key Responsibilities
- Architect & Optimize Pipelines: Design, build, and maintain our multi-model, multi-GPU inference pipelines for low-latency deployment.
- Maximize Concurrency: Structure our streaming pipeline so that multiple models (e.g., detection, tracking, OCR) can run in parallel without blocking one another.
- Performance Profiling & Debugging: Use tools like NVIDIA Nsight Systems, Nsight Compute, and the PyTorch Profiler to identify and eliminate performance bottlenecks.
- Solve Complex Bottlenecks: Diagnose and fix challenging issues, such as low GPU utilization while the CPU is maxed out, or performance degradation when distributing work across multiple GPUs.
- CUDA-Level Optimization: Manage CUDA streams and synchronous/asynchronous execution to ensure GPU kernels are overlapping effectively rather than serializing.
- Minimize Data Overhead: Implement solutions to reduce GPU↔CPU data transfer overhead, using techniques like pinned memory or zero-copy operations.
- Deploy & Serve Models: Configure and manage Triton Inference Server to run multiple models concurrently.
- Model Acceleration: Use TensorRT to compile and optimize models for inference, significantly improving latency compared to base PyTorch or ONNX Runtime.
- Handle Complex Streaming Logic: Design the system to manage models with different execution frequencies (e.g., some running every frame, others every few seconds) while processing a 20 FPS video stream.
- Ensure Data Integrity: Implement mechanisms to ensure frame order integrity when multiple asynchronous processes are handling video frames.
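As a flavor of the frame-order problem above: when asynchronous workers finish frames out of order, a small reordering buffer can hold results in a min-heap until every earlier frame has arrived. This is an illustrative sketch only; the class and method names are hypothetical, not part of our codebase:

```python
import heapq

class FrameReorderBuffer:
    """Emits processed frames strictly in sequence-number order,
    even when asynchronous workers finish them out of order."""

    def __init__(self):
        self._heap = []      # min-heap keyed by frame sequence number
        self._next_seq = 0   # next sequence number allowed to be emitted

    def push(self, seq, frame):
        """Accept a processed frame with its original sequence number."""
        heapq.heappush(self._heap, (seq, frame))

    def pop_ready(self):
        """Return all frames that are now contiguous and in order."""
        ready = []
        while self._heap and self._heap[0][0] == self._next_seq:
            _, frame = heapq.heappop(self._heap)
            ready.append(frame)
            self._next_seq += 1
        return ready

# Frame 2 finishes first, so it is held until frames 0 and 1 arrive.
buf = FrameReorderBuffer()
buf.push(2, "frame-2")
assert buf.pop_ready() == []
buf.push(0, "frame-0")
buf.push(1, "frame-1")
assert buf.pop_ready() == ["frame-0", "frame-1", "frame-2"]
```

In a real pipeline the same idea sits between the parallel model stages and the commentary generator, so downstream logic always sees a monotonically ordered stream.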
Required Qualifications & Experience
- 3+ years of demonstrated experience in a role focused on high-performance MLOps and GPU optimization.
- Knowledge of the NVIDIA ecosystem is essential. You must have hands-on experience with:
  - CUDA (especially streams and asynchronous execution)
  - TensorRT
  - Triton Inference Server
  - NVIDIA profiling tools (Nsight Systems / Nsight Compute)
- Proven experience in debugging and optimizing complex, multi-model GPU pipelines. You should be able to diagnose why the latency of a sequential pipeline increases non-linearly and how to prioritize optimization efforts based on timing logs.
- Strong understanding of video processing pipelines. Experience with NVIDIA DeepStream is a significant advantage.
- Expertise in low-latency techniques, such as managing batch sizes, minimizing data transfer, and understanding the pitfalls of running multiple models on the same GPU.
- Familiarity with tensor sharing mechanisms like DLPack.
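To give a sense of the Triton work involved: concurrent serving is configured per model, and a minimal `config.pbtxt` might run two instances of a (hypothetical) detector side by side on one GPU with dynamic batching. This is an illustrative fragment, not our production configuration:

```
name: "detector"
platform: "tensorrt_plan"
max_batch_size: 8
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

Tuning `count` and the batching delay per model is exactly the kind of latency-vs-throughput trade-off this role owns.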
What We Offer
- Medical Insurance in Ukraine and Multisport program in Poland;
- Offices in Ukraine and Poland (Wroclaw);
- All official holidays;
- Paid vacation and sick leaves;
- Tax & accounting services for Ukrainian contractors;
- All necessary equipment provided by the company;
- English classes up to three times a week;
- Mentoring and Educational Programs;
- Regular Activities on a Corporate level (Incredible parties, Team Buildings, Sports Events, and Tech Events);
- Advanced Bonus System.
Required skills & experience

| Skill | Minimum experience |
| --- | --- |
| Machine Learning | 3 years |
| AI/ML | 1 year |
| GPU optimization | 6 months |
| NVIDIA | 6 months |
Required domain experience

| Domain | Minimum experience |
| --- | --- |
| Machine Learning / Big Data | 2 years |
Required languages

| Language | Level |
| --- | --- |
| English | B2 - Upper Intermediate |