Senior Multimodal ML Engineer (OCR / Layout / ASR / Vision)
We’re expanding and looking for a Senior Multimodal ML Engineer to lead and own our multimodal direction across documents, images, and audio (with video on the way).
You’ll be responsible for designing and scaling pipelines for OCR & document layout recovery, image understanding/captioning, face detection/recognition, and speech-to-text.
We work primarily with open-source models deployed locally (not on managed cloud platforms), so you should be comfortable running, profiling, and optimizing everything on-premise.
We expect you to deeply understand how things work, not just how to run them. You’ll have the autonomy to define the architecture, choose the models and ensure high performance in local environments.
What You Will Do
- Design & build APIs and pipelines for OCR, document layout recovery, image captioning, face detection/recognition, and speech-to-text; prepare foundations for video-to-text.
- Implement robust OCR + structure extraction (tables, headers, coordinates, reading order) for scanned and digital documents.
- Run & serve components locally using vLLM (for LLM parts), TensorRT, ONNX Runtime, or OpenVINO – ensuring efficient on-prem inference.
- Select, fine-tune, and optimize models across CV/ASR stacks (Whisper/faster-whisper, PaddleOCR/Tesseract/Docling, YOLO/Detectron, BLIP/CLIP or similar).
- Develop scalable data pipelines for training and evaluation; build regression tests for structure accuracy, WER/CER, and latency.
- Collaborate with Data Engineering on reliable message passing (Kafka / RabbitMQ / MCP) and real-time data flow.
- Set up observability for models and infrastructure: metrics (Prometheus), dashboards (Grafana), logging (ELK Stack).
- Automate model lifecycle: CI/CD for training, validation, and deployment via GitHub Actions or GitLab CI.
- Continuously explore and evaluate new models and research, staying up to date with the latest open-source releases and applying them to real-world use cases.
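To give a concrete flavor of the structure-extraction work above (reading order over OCR word boxes), here is a minimal, purely illustrative sketch. It assumes each OCR word comes as a dict with a `"text"` field and an axis-aligned `"box"` of `(x, y, w, h)` page coordinates with y growing downward; the function name and heuristic are ours, not from PaddleOCR, Tesseract, or Docling.

```python
# Minimal reading-order recovery for OCR word boxes (illustrative sketch).
# Groups words into lines by vertical overlap, then sorts lines
# top-to-bottom and words within each line left-to-right.

def reading_order(words, line_overlap=0.5):
    """Return word texts in reading order from a list of OCR word boxes."""
    lines = []  # each line is a list of words sharing vertical overlap
    for w in sorted(words, key=lambda w: w["box"][1]):
        _, y, _, h = w["box"]
        placed = False
        for line in lines:
            _, ly, _, lh = line[0]["box"]
            # vertical overlap ratio against the line's first word
            top, bot = max(y, ly), min(y + h, ly + lh)
            if bot - top > line_overlap * min(h, lh):
                line.append(w)
                placed = True
                break
        if not placed:
            lines.append([w])
    for line in lines:
        line.sort(key=lambda w: w["box"][0])
    return [w["text"] for line in lines for w in line]
```

Real layout models handle multi-column pages and rotated text; a heuristic like this is only a baseline against which learned reading-order models get evaluated.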
What You Need to Join Us
- Strong expertise in OCR processing and document layout recovery – ability to extract both text and structural information using open-source tools like PaddleOCR, Tesseract, Docling, etc.
- Solid experience with ASR pipelines (Whisper stack), timestamps/diarization, and post-processing.
- Deep understanding of transformer/CV architectures and practical optimization techniques.
- Proven experience in fine-tuning and serving models locally (no managed ML cloud services).
- Hands-on experience with high-performance inference optimization (TensorRT/ONNX/OpenVINO; vLLM for LLM-backed pieces).
- Strong Python skills, including clean, modular service design (FastAPI, Flask, or similar).
- Understanding of distributed systems (Kafka, RabbitMQ, MCP).
- English proficiency (technical reading; conversational level is a plus).
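The WER metric named among the regression targets above is standard word-level Levenshtein distance normalized by reference length; the sketch below is a self-contained illustration (the function name is ours, not from any ASR library), and CER is the same computation over characters instead of words.

```python
# Word error rate (WER) via word-level edit distance -- illustrative sketch.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

In a regression suite, assertions like `wer(ref, model_output) <= baseline + tolerance` run against a fixed evaluation set on every model or pipeline change.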
Nice to Have
- Experience with document layout detection (YOLO/Detectron), table structure recovery, and image captioning.
- Familiarity with on-device inference (Edge AI) and optimization for limited resources (ARM, CPU-only).
- Understanding of Active/Continual Learning or multimodal RAG.
- Experience with Ray / Ray Serve for distributed inference and training.
- MLOps & Databases:
  - Docker/Kubernetes, DVC, CI/CD for models;
  - Experience with relational, NoSQL, and vector DBs.
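For orientation on the vector-DB item above: at its core, vector search ranks stored embeddings by similarity to a query embedding. The toy stand-in below does this by brute force with cosine similarity; production vector DBs replace the linear scan with approximate nearest-neighbor indexes. All names here are illustrative.

```python
import math

# Brute-force cosine-similarity search -- a toy stand-in for what a
# vector DB does at scale with ANN indexes. Purely illustrative.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, corpus, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```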
Required languages
- English: B1 (Intermediate)