Senior Data Engineer

Our client is a legal tech startup that focuses on AI and machine learning, specifically building chatbots that answer legal questions for lawyers. They are looking for a Senior Data Engineer for a high-impact project: digitizing law in Morocco and Africa and creating the first AI-queryable legal knowledge base.

Their ambition is to build a platform capable of answering legal questions in a reliable, well-sourced, and traceable way, based on a massive corpus of heterogeneous legal documents.

 

🚀 Why this project is different
You will join a true “knowledge infrastructure” mission:

  • Contribute to making the law more accessible
  • Build a durable asset: a structured database of Moroccan law (in French), extensible to Africa
  • Work on a concrete and deep technical challenge: transforming unstructured data into exploitable, reliable, and maintainable data at scale

 

Required skills:

  • 3+ years of experience in Data Engineering and/or applied Document AI / NLP
  • Strong proficiency in Python
  • Hands-on experience with unstructured documents: PDF parsing, OCR, cleaning, structuring
  • Used to delivering to production: robust pipelines, observability, quality, performance
     

🛠 Stack/skills (indicative)

  • Storage: AWS
  • Document processing: OCR/parsing tools, text preprocessing pipelines
  • Testing & quality: metrics, sampling, automated validation


Nice to have

  • Experience with legal / regulatory corpora or high-precision content
  • Familiarity with multilingual issues and encoding
  • Basic knowledge of downstream needs (vector DBs, retrieval, citation)

 

Scope of work:

You will be responsible for the “documents → structured data” pipeline that will feed the client's AI (RAG) engine.

 

At the core of the role (technical focus)
Build a structured database of Moroccan law in French from highly heterogeneous data:

  • PDFs (text-based and scanned), Word files, images, text files, sometimes noisy or incomplete
  • Text extraction (parsing + OCR when needed), cleaning
  • Structuring: detection of titles/chapters/sections/articles, hierarchy, normalization
  • Intelligent chunking (based on legal structure rather than arbitrary size), with traceability (source, page, identifiers)
  • Metadata: date, type of text (law/decree/circular/case law, etc.), source, version, article numbers, etc.
  • Deduplication & versioning: redundant documents, amendments, consolidated versions
  • Industrialization: orchestration, logs, retries, idempotence, monitoring, quality tests
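
To give a concrete idea of what "chunking based on legal structure, with traceability" could look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not the client's implementation: the `Chunk` fields, the `chunk_by_article` helper, and the `ARTICLE_RE` heading pattern are all hypothetical, and a real corpus would need many more heading variants (titles, chapters, sections) plus OCR noise handling.

```python
import hashlib
import re
from dataclasses import dataclass

# Hypothetical heading pattern: assumes articles start with a line such as
# "Article 12", "Article 2 bis" or "Article premier".
ARTICLE_RE = re.compile(
    r"^Article\s+(premier|\d+(?:\s*(?:bis|ter))?)\b", re.IGNORECASE
)


@dataclass(frozen=True)
class Chunk:
    source: str        # file name or URL of the original document
    page: int          # 1-based page on which the article starts
    article: str       # article identifier as found in the text
    text: str          # full text of the article, heading included
    content_hash: str  # stable key usable for deduplication / idempotent loads


def _finish(source: str, article: str, page: int, lines: list) -> Chunk:
    text = "\n".join(lines)
    # Hash a whitespace- and case-normalized form so trivial differences
    # between two copies of the same article do not create duplicates.
    normalized = " ".join(text.split()).lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return Chunk(source, page, article, text, digest)


def chunk_by_article(source: str, pages: list) -> list:
    """Split already-extracted page texts into one chunk per article."""
    chunks = []
    current = None  # (article id, start page, accumulated lines)
    for page_no, page_text in enumerate(pages, start=1):
        for line in page_text.splitlines():
            stripped = line.strip()
            match = ARTICLE_RE.match(stripped)
            if match:
                if current is not None:
                    chunks.append(_finish(source, *current))
                current = (match.group(1).lower(), page_no, [stripped])
            elif current is not None and stripped:
                # An article may continue across a page break.
                current[2].append(stripped)
    if current is not None:
        chunks.append(_finish(source, *current))
    return chunks


if __name__ == "__main__":
    pages = [
        "Article premier\nLa présente loi fixe les règles.",
        "Elle s'applique à tous.\nArticle 2\nAutre disposition.",
    ]
    for chunk in chunk_by_article("loi_exemple.pdf", pages):
        print(chunk.source, chunk.page, chunk.article, chunk.content_hash[:8])
```

Each chunk keeps its source, start page, and article identifier, so an answer produced downstream can cite exactly where it came from. The content hash is one simple way to detect redundant documents and to make reloads idempotent; handling amendments and consolidated versions would need additional metadata beyond this sketch.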

Required skills experience

Data Engineering 4 years
NLP 3 years
Python 4 years
AWS 4 years
OCR 3 years

Required languages

English B2 - Upper Intermediate
Published 22 January