AI Dataset Architect / Metadata / Ontology Specialist (Multimodal / Vision)
We are building a premium dataset of roughly 3 million images spanning art, culture, and history (paintings, objects, architecture, historical scenes, and more). Our goal is to license it to leading AI companies (OpenAI, Meta, Google, Amazon, Apple, and others) for LLM and multimodal model training.
Our focus is to structure the collection for the current requirements of multimodal training, alignment, retrieval (including RAG), and multimodal reasoning, and to make the dataset “buyer-ready” for enterprise AI labs.
Role mission
We’re looking for a specialist who understands what AI model developers truly need from training datasets and can help us design and refine the dataset’s structure, metadata, annotation standards, documentation, packaging, and legal readiness for commercial licensing.
Key responsibilities
- Advise on which image + metadata pairs are most valuable for VLM/LLM developers.
- Design or refine a scalable metadata schema (JSON-based; optionally aligned with LIDO / CDWA / CIDOC CRM) suitable for:
  - Vision-Language Models (VLMs)
  - Retrieval / RAG workflows
  - Fine-tuning and alignment
- Review and improve our current schema (~17 fields per image across visual, cultural, technical, and legal metadata).
- Define annotation standards for:
  - Objects/attributes and composition
  - Style, period, and cultural context
  - Multi-level captioning (short / detailed / expert-grade)
  - VQA (Visual Question Answering) pairs
  - Synthetic negatives for robustness
- Recommend packaging and delivery formats preferred by major AI labs (JSONL, Parquet, TFRecord, etc.).
- Ensure compliance with FAIR principles (Findable, Accessible, Interoperable, Reusable).
- Provide guidance on provenance, licensing language, and legal compliance for AI training usage.
- Help position the dataset as a premium, commercially licensable AI training asset.
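As a rough illustration of the kind of record these responsibilities point toward, the Python sketch below builds one image record with grouped metadata, three caption levels, a VQA pair with a synthetic negative, and JSONL serialization. Every field name and value here (`image_id`, `captions`, `is_negative`, and so on) is a hypothetical placeholder for discussion, not our current schema; we expect the specialist to propose the real structure.

```python
import json

# Illustrative only: field names and values are assumptions for this sketch,
# not the project's actual schema.
REQUIRED_GROUPS = ("visual", "cultural", "technical", "legal")
CAPTION_LEVELS = ("short", "detailed", "expert")


def make_example_record() -> dict:
    """Build one hypothetical image record with captions and VQA pairs."""
    return {
        "image_id": "img-0000001",
        "visual": {"objects": ["vase", "flowers"], "composition": "centered still life"},
        "cultural": {"style": "example style", "period": "example period"},
        "technical": {"width_px": 4000, "height_px": 3000, "format": "tiff"},
        "legal": {"license": "CC0", "provenance": "example collection"},
        "captions": {
            "short": "A vase of flowers.",
            "detailed": "A ceramic vase holding mixed flowers on a wooden table.",
            "expert": "Placeholder for an expert-grade caption.",
        },
        "vqa": [
            {"question": "What does the vase hold?", "answer": "Flowers"},
            # Synthetic negative: a plausible but wrong answer, flagged as such.
            {"question": "What does the vase hold?", "answer": "Fruit", "is_negative": True},
        ],
    }


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing group: {g}" for g in REQUIRED_GROUPS if g not in record]
    captions = record.get("captions", {})
    problems += [f"missing caption level: {lvl}"
                 for lvl in CAPTION_LEVELS if lvl not in captions]
    return problems


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Grouping fields under visual, cultural, technical, and legal keys mirrors the four metadata areas named above; whether a flat or nested layout suits Parquet or TFRecord delivery better is one of the questions we would like advice on.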
Required expertise
Demonstrated experience in one or more of:
- LLM or multimodal training data (vision-language datasets, CLIP-style image-text pairs, VQA, object detection data for models such as DETR, etc.)
- Dataset design for AI labs, research institutions, or large-scale ML pipelines
- Cultural heritage / museum / archival metadata standards (LIDO, CDWA, CIDOC CRM, Getty AAT)
- Structuring datasets for zero-shot learning, cross-modal reasoning, and multilingual AI
- Data licensing for AI training (public domain, CC0, custom licenses)
- Data evaluation, benchmarking, or alignment
Nice to have
- Prior work with OpenAI, Meta, Google, Amazon, Apple, or similar AI organizations
- Experience with large image datasets (100k+ items)
- Familiarity with RAG systems and embedding-based retrieval
- Experience preparing datasets for commercial sale or enterprise clients
Deliverables
Depending on scope, deliverables may include:
- Written recommendations on dataset structure and field definitions
- Improved/final metadata schema with JSON examples
- Annotation and captioning guidelines
- Enterprise-grade dataset documentation for AI buyers
- Optional: review of a sample subset of images and metadata
Required languages
- English: B2 (Upper Intermediate)