Senior Data Engineer
Our client is a legal tech startup focused on AI and machine learning, specifically building chatbots that answer legal questions for lawyers. They are looking for a Senior Data Engineer for a high-impact project: digitizing the law of Morocco and the wider African continent and creating the first AI-queryable legal knowledge base.
Their ambition is to build a platform capable of answering legal questions in a reliable, well-sourced, and traceable way, based on a massive corpus of heterogeneous legal documents.
🚀 Why this project is different
You will join a true “knowledge infrastructure” mission:
- Contribute to making the law more accessible
- Build a durable asset: a structured database of Moroccan law (in French), extensible to Africa
- Work on a concrete and deep technical challenge: transforming unstructured data into usable, reliable, and maintainable data at scale
Required skills:
- 3+ years of experience in Data Engineering and/or applied Document AI / NLP
- Strong proficiency in Python
- Hands-on experience with unstructured documents: PDF parsing, OCR, cleaning, structuring
- Experience delivering to production: robust pipelines, observability, quality, performance
🛠 Stack/skills (indicative)
- Storage: AWS
- Document processing: OCR/parsing tools, text preprocessing pipelines
- Testing & quality: metrics, sampling, automated validation
⭐ Nice to have
- Experience with legal / regulatory corpora or high-precision content
- Familiarity with multilingual issues and encoding
- Basic knowledge of downstream needs (vector DBs, retrieval, citation)
Scope of work:
You will be responsible for the “documents → structured data” pipeline that will feed the client's AI (RAG) engine.
At the core of the role (technical focus)
Build a structured database of Moroccan law in French from highly heterogeneous data:
- PDFs (text-based and scanned), Word files, images, text files, sometimes noisy or incomplete
- Text extraction (parsing + OCR when needed), cleaning
- Structuring: detection of titles/chapters/sections/articles, hierarchy, normalization
- Intelligent chunking (based on legal structure rather than arbitrary size), with traceability (source, page, identifiers); a minimal sketch follows this list
- Metadata: date, type of text (law/decree/circular/case law, etc.), source, version, article numbers, etc.
- Deduplication & versioning: redundant documents, amendments, consolidated versions
- Industrialization: orchestration, logs, retries, idempotence, monitoring, quality tests
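To make the chunking item above concrete, here is a minimal sketch of structure-aware chunking with traceability metadata. Everything in it (the `Chunk` dataclass, the `ARTICLE_RE` regex, the `chunk_by_article` function) is an illustrative assumption, not the client's actual pipeline:

```python
# Minimal sketch: structure-aware chunking of a French legal text.
# All names and the heading regex are illustrative assumptions.
import re
from dataclasses import dataclass, field
from typing import Optional

# French legal texts typically number provisions "Article premier", "Article 2", ...
ARTICLE_RE = re.compile(r"^Article\s+(premier|\d+)", re.MULTILINE)

@dataclass
class Chunk:
    text: str                 # the article's full text
    source: str               # originating document (file name, URL, ...)
    page: Optional[int]       # page in the source PDF, if known
    article_id: str           # normalized identifier, e.g. "article-1"
    metadata: dict = field(default_factory=dict)

def chunk_by_article(text: str, source: str, page: Optional[int] = None) -> list:
    """Split cleaned legal text on article boundaries rather than a fixed
    token count, so each chunk stays a self-contained, citable unit."""
    chunks = []
    matches = list(ARTICLE_RE.finditer(text))
    for i, m in enumerate(matches):
        start = m.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        number = "1" if m.group(1) == "premier" else m.group(1)
        chunks.append(Chunk(
            text=text[start:end].strip(),
            source=source,
            page=page,
            article_id=f"article-{number}",
        ))
    return chunks

if __name__ == "__main__":
    sample = "Article premier\nLa présente loi...\n\nArticle 2\nSont abrogées..."
    for c in chunk_by_article(sample, source="bo_1234.pdf", page=3):
        print(c.article_id, "->", c.text[:30])
```

Splitting on article boundaries, and carrying source and page through each chunk, is what makes downstream retrieval answers citable and traceable rather than approximate.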
Required skills & experience
| Skill | Experience |
| --- | --- |
| Data Engineering | 4 years |
| NLP | 3 years |
| Python | 4 years |
| AWS | 4 years |
| OCR | 3 years |
Required languages
| Language | Level |
| --- | --- |
| English | B2 - Upper Intermediate |