Senior NLP / IR (RAG) Engineer
MilTech
🪖
We’re expanding and looking for a Senior NLP / IR (RAG) Engineer to lead and own our text-focused ML direction.
You’ll be responsible for designing and scaling the text stack – from research to production – driving innovation across retrieval, reranking, and generative pipelines.
We work primarily with open-source models deployed locally (not on managed cloud platforms), so you should be comfortable running, profiling, and optimizing everything on-premises.
We expect you to deeply understand how things work, not just how to run them. You’ll have the autonomy to define the architecture, choose the models, and ensure high performance in local environments.
What You Will Do
- Design & build APIs and pipelines for summarization, classification, NER, QA, and RAG chat systems.
- End-to-end RAG (a minimal sketch follows this list):
  - chunking/normalization, index construction;
  - hybrid retrieval (BM25 + vector), reranking (BGE/ColBERT, etc.), context policies, caching, latency budgeting, and offline evaluation (RAGAS/TruLens).
- Run and serve models locally using vLLM, TensorRT, ONNX Runtime – ensuring efficient inference on our own GPU servers.
- Select, fine-tune, and optimize transformer models (LLaMA, Mistral, Falcon, DeepSeek, Gemma, etc.) for specific domains.
- Develop scalable data pipelines for model training and evaluation: annotation, augmentation, class balancing, and dataset curation.
- Collaborate with Data Engineering on reliable message passing (Kafka / RabbitMQ / MCP) and real-time data flow.
- Set up observability for models and infrastructure: metrics (Prometheus), dashboards (Grafana), logging (ELK Stack).
- Automate model lifecycle: CI/CD for training, validation, and deployment via GitHub Actions or GitLab CI.
- Continuously evaluate new models and research, staying current with the latest open-source releases and applying them to real-world use cases.
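
To make the RAG bullet above concrete, here is a minimal hybrid-retrieval sketch: a BM25 leg and a dense leg fused with Reciprocal Rank Fusion (RRF), followed by a cross-encoder rerank. It is an illustration, not our production stack; the model names (BAAI/bge-small-en-v1.5, BAAI/bge-reranker-base), the toy corpus, and the RRF constant k=60 are assumptions chosen for brevity.

```python
"""Hybrid retrieval sketch: BM25 + dense vectors fused with RRF,
then cross-encoder reranking. Models and corpus are illustrative."""
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

corpus = [
    "vLLM serves large language models with PagedAttention.",
    "BM25 is a lexical ranking function used in search engines.",
    "ColBERT performs late interaction over token embeddings.",
]
query = "How do lexical rankers like BM25 work?"

# Sparse leg: BM25 over whitespace-tokenized text.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_rank = np.argsort(bm25.get_scores(query.lower().split()))[::-1]

# Dense leg: cosine similarity of normalized embeddings.
encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # example model
doc_emb = encoder.encode(corpus, normalize_embeddings=True)
q_emb = encoder.encode(query, normalize_embeddings=True)
dense_rank = np.argsort(doc_emb @ q_emb)[::-1]

# Fuse with Reciprocal Rank Fusion: score = sum over legs of 1/(k + rank).
k, rrf = 60, {}
for ranking in (sparse_rank, dense_rank):
    for rank, doc_id in enumerate(ranking):
        rrf[doc_id] = rrf.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
candidates = sorted(rrf, key=rrf.get, reverse=True)[:3]

# Rerank the fused candidate set with a cross-encoder.
reranker = CrossEncoder("BAAI/bge-reranker-base")  # example model
ce_scores = reranker.predict([(query, corpus[i]) for i in candidates])
for i in np.argsort(ce_scores)[::-1]:
    print(round(float(ce_scores[i]), 3), corpus[candidates[i]])
```

RRF is used here because it fuses rankings without score calibration; in practice the fusion depth, candidate count, and choice of reranker would come out of offline evaluation (e.g., with RAGAS).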
What You Need to Join Us
- Strong expertise in NLP and Information Retrieval (RAG, hybrid retrieval, reranking).
- Deep understanding of transformer architectures and practical optimization techniques.
- Experience in fine-tuning and serving models locally (no managed ML cloud services).
- Hands-on experience with vLLM and high-performance inference optimization (a serving sketch follows this list).
- Strong Python skills, including clean, modular service design (FastAPI, Flask, or similar).
- Understanding of distributed systems (Kafka, RabbitMQ, MCP).
- English proficiency (technical reading; conversational level is a plus).
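
As a concrete (and deliberately simplified) picture of local serving, below is a sketch that loads an open-weights model with vLLM's offline LLM API and exposes it through a FastAPI endpoint. The model name is an example, and the blocking generate call is a simplification; real traffic would typically go through vLLM's OpenAI-compatible server or AsyncLLMEngine.

```python
"""Minimal local-serving sketch: vLLM behind FastAPI.
Model name is an example; the offline API shown here blocks per request."""
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
# Loads weights onto the local GPU once, at process start-up.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model

class Prompt(BaseModel):
    text: str
    max_tokens: int = 256

@app.post("/generate")
def generate(req: Prompt) -> dict:
    params = SamplingParams(temperature=0.7, max_tokens=req.max_tokens)
    # Blocking call: fine for a sketch, not for concurrent traffic.
    out = llm.generate([req.text], params)[0]
    return {"completion": out.outputs[0].text}
```

Started with `uvicorn app:app` on a GPU host, this keeps the weights resident in local GPU memory for the life of the process.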
Nice to Have
- Experience building custom NER or QA models from scratch.
- Familiarity with on-device inference (Edge AI) and optimization for limited resources (ARM, CPU-only).
- Understanding of Active Learning, Continual Learning, and RAG evaluation frameworks.
- Experience with Ray / Ray Serve for distributed inference and training (a small sketch follows this list).
- MLOps & Databases (plus):
  - Docker/Kubernetes, DVC, CI/CD for models;
  - experience with relational, NoSQL, and vector DBs.
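
For the Ray Serve item above, here is a minimal sketch of a horizontally scaled deployment. The deployment class, replica count, and placeholder scoring logic are hypothetical; a real replica would load a model (e.g., a cross-encoder) in __init__.

```python
"""Minimal Ray Serve sketch: a replicated HTTP deployment.
Names, replica count, and the dummy score are illustrative assumptions."""
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Reranker:
    def __init__(self) -> None:
        # A real deployment would load a model here (e.g., a cross-encoder).
        self.version = "toy-0.1"

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # Placeholder score; swap in model.predict(...) in practice.
        return {"version": self.version,
                "score": float(len(payload.get("text", "")))}

app = Reranker.bind()
# Run with:  serve run module_name:app
```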
Required languages
- English: B1 (Intermediate)
Published 5 November