Lead Data Engineer

Why we're hiring a Lead Data Engineer

We're building the expert intelligence layer for scientific research: a knowledge graph that connects the world to leading experts, built from publications & clinical trials organised into precise ontologies. You'll design the pipelines that ingest millions of life-science records & shape how that knowledge is modelled, enriched & served.

This is true greenfield work. Your decisions will lay the data foundations for our entire expert intelligence platform.

What You'll Do

You will be working at the intersection of science, data engineering & AI to build expert intelligence.

  • Own data end-to-end: design & run the pipelines that turn millions of scientific records into a knowledge graph.
  • Implement precise entity resolution & enrichment: disambiguate & enrich experts drawn from noisy data sources.
  • Utilise LLM workflows where they make sense: entity extraction, relationship inference & quality validation (see the sketch after this list).
  • Develop vector embeddings & semantic search capabilities to power expert discovery & similarity matching.
  • Model life-science entities & relationships, ontologies, author networks, publication & clinical trial metadata.
  • Build performant, accessible, reliable, observable & testable graph & vector data access.
  • Move fast & ship value incrementally: done-and-iterating beats perfect-and-pending.
  • Radiate intent & document your thinking openly, collaborating async-first in a hybrid environment.
  • Lead when you're the expert, follow when someone else is, & challenge assumptions when necessary.
  • Use AI as a daily force multiplier across coding, schema design, debugging, optimisation & validation.
  • Destroy your colleagues at Geoguessr (optional but strongly encouraged).
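
To make the day-to-day concrete, below is a minimal sketch of the kind of pipeline step we mean: turning one publication record into graph nodes & edges, with an LLM handling concept extraction. Everything here (record shape, labels, the call_llm helper) is an illustrative assumption, not our actual stack.

```python
# Illustrative only, not production code: one pipeline step that turns a raw
# publication record into knowledge-graph nodes & edges. The record shape,
# labels & the call_llm helper are assumptions made for this sketch.
import json
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    label: str                      # e.g. "Expert", "Publication", "Concept"
    props: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: str
    dst: str
    rel: str                        # e.g. "AUTHORED", "MENTIONS"

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM client; returns canned JSON here so
    the sketch runs without any external service."""
    return '["knowledge graph", "ontology"]'

def extract_concepts(abstract: str) -> list[str]:
    # Ask for strictly-JSON output so the step stays machine-validatable.
    prompt = ("Extract biomedical concepts from this abstract as a JSON "
              f"list of strings:\n{abstract}")
    return json.loads(call_llm(prompt))

def publication_to_graph(record: dict) -> tuple[list[Node], list[Edge]]:
    pub = Node(record["doi"], "Publication", {"title": record["title"]})
    nodes, edges = [pub], []
    for author in record["authors"]:
        expert = Node(author["orcid"], "Expert", {"name": author["name"]})
        nodes.append(expert)
        edges.append(Edge(expert.id, pub.id, "AUTHORED"))
    for concept in extract_concepts(record["abstract"]):
        c = Node(f"concept:{concept.lower()}", "Concept", {"name": concept})
        nodes.append(c)
        edges.append(Edge(pub.id, c.id, "MENTIONS"))
    return nodes, edges

record = {
    "doi": "10.1234/example",
    "title": "An example paper",
    "abstract": "An abstract about knowledge graphs & ontologies.",
    "authors": [{"orcid": "0000-0000-0000-0001", "name": "A. Researcher"}],
}
nodes, edges = publication_to_graph(record)
print(len(nodes), "nodes,", len(edges), "edges")
```

In production a step like this would sit inside an orchestrated pipeline (Airflow or Dagster) with quality validation between stages.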
     

What You'll Need

Technical Skills

  • Graph Databases: Neo4j, ArangoDB, Neptune; schema design, relationship modelling, query optimisation.
  • Python Data Engineering: ETL development; pandas/polars; distributed processing with Spark or Dask.
  • Entity Resolution: Deduplication, merging, enrichment across heterogeneous scientific data sources.
  • AI-Assisted Data Extraction: LLM entity extraction, schema generation & quality validation.
  • Vector Search: Experience with Pinecone, FAISS, Qdrant, or Weaviate; embeddings, hybrid retrieval (a toy example follows this list).
  • Workflow Orchestration: Robust, observable pipelines using Airflow or Dagster.
  • Data Formats & Standards: Parquet, JSONL, RDF/Turtle; selecting formats for graph & semantic use cases.
  • Embedding Models: Understanding of HuggingFace/OpenAI models, dimensionality tradeoffs & cost.
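
Since vector search is core here, a toy example of the pattern, assuming FAISS for exact inner-product search, with random vectors standing in for the real profile embeddings a HuggingFace or OpenAI model would produce:

```python
# Illustrative only: nearest-neighbour expert matching with FAISS. Random
# vectors stand in for real profile embeddings.
import faiss
import numpy as np

dim, n_experts = 384, 10_000
rng = np.random.default_rng(42)

# Stand-in embeddings, L2-normalised so inner product == cosine similarity.
embeddings = rng.standard_normal((n_experts, dim)).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

index = faiss.IndexFlatIP(dim)      # exact inner-product search
index.add(embeddings)

# Query with one expert's own embedding: its 5 most similar profiles.
scores, ids = index.search(embeddings[:1], 5)
print(ids[0], scores[0])
```

IndexFlatIP is exact & fine at toy scale; approximate indexes (IVF, HNSW) become the usual trade-off as the corpus grows.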
     

Executive Skills

  • Ownership mindset: Treat data & schemas as products powering multiple domains.
  • Strategic evaluation: Choose tech aligned with our scale, latency expectations, & roadmap needs.
  • Process engineering: Build reliable, repeatable & maintainable workflows.
  • Cross-functional communication: Bridge product engineers & scientific domain teams.
  • Comfort with scientific data realities: Expect deep rabbit holes of sprawling complexity.

Strong Bonus

  • Life Sciences familiarity: Publication, clinical trial & institutional data; ontologies such as MeSH, SNOMED & the Gene Ontology.
  • Hands-on with scientific datasets: OpenAlex, PubMed/MEDLINE, ORCID, Semantic Scholar, ClinicalTrials.gov (see the quick taster below).
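
Never touched these datasets? OpenAlex is the easiest way in. A quick, illustrative taster against its public works endpoint, merging authors on ORCID where one exists (field names per the OpenAlex docs; error handling kept minimal):

```python
# Illustrative only: pull a few works from the public OpenAlex API and
# collect distinct authors, keyed by ORCID where available.
import requests

resp = requests.get(
    "https://api.openalex.org/works",
    params={"search": "knowledge graph", "per-page": 5},
    timeout=30,
)
resp.raise_for_status()

authors: dict[str, str] = {}
for work in resp.json()["results"]:
    for authorship in work.get("authorships", []):
        author = authorship["author"]
        # ORCID is a clean merge key; fall back to the OpenAlex author ID.
        key = author.get("orcid") or author["id"]
        authors[key] = author["display_name"]

for key, name in authors.items():
    print(name, "->", key)
```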
     

Why You Might Hate It Here

  • You want predictability & routine.
  • You dislike documenting or sharing your thinking openly.
  • You see AI as a threat rather than an amplifier.
  • You're looking for a "safe" corporate environment - we're not that.

We mean this sincerely: if those points don't sit right with you, you'll be happier elsewhere.

Why You'll Love Working Here

  • Real Autonomy: You'll own outcomes, not tickets. This is your domain - you'll define data strategy.
  • Greenfield Opportunity: Build the data platform from scratch. Your decisions shape our data capabilities for years.
  • Mission That Matters: Your work directly enables research - accelerating scientific breakthroughs.
  • AI-First Culture: We use AI as a creative & operational partner across every function.
  • High Impact: Every domain depends on what you build. Expert coverage directly drives our success.

Success Metrics (6-month target)

  • Expert Coverage: Knowledge graph spans 1+ million experts with rich profile data & relationships.
  • AI & Platform Enablement: AI features & other product domains are consuming knowledge-graph insights.

Required Skills & Experience

  • Neo4j: 1 year
  • ArangoDB: 1 year
  • Python: 1 year
  • Spark: 1 year
  • Pinecone: 1 year
  • FAISS: 1 year
  • Qdrant: 1 year
  • Airflow: 1 year
  • Dagster: 1 year

Required Languages

  • English: C1 - Advanced