Senior AI/ML Engineer (Multimodal AI / LLM Evaluation / Trust and Safety)

We are looking for a Senior AI/ML Engineer experienced in multimodal AI pipelines, LLM evaluation frameworks, agent orchestration, and scalable Python backend systems on AWS.

A Korean company is building and hardening an AI verification and compliance layer for AIGC/UGC financial content.

The system includes:

four evaluation/verification models
an automated compliance rule engine
a Trust & Safety moderation pipeline

The core architecture processes video-based content through a multi-stage pipeline:

STT → OCR → image/video analysis → financial rules engine → LLM-based decision layer

Final output decision types:

Auto-Block
Auto-Pass
Human Review
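
For illustration only, the final routing step could be sketched roughly as below. This is a minimal sketch and not the company's actual implementation: the `Evidence` fields, the `BANNED_PHRASES` list, the `llm_violation_score` input, and both thresholds are hypothetical stand-ins for the outputs of the STT, OCR, rules-engine, and LLM stages.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    AUTO_BLOCK = "Auto-Block"
    AUTO_PASS = "Auto-Pass"
    HUMAN_REVIEW = "Human Review"


@dataclass
class Evidence:
    """Outputs collected from the earlier pipeline stages (hypothetical shape)."""
    transcript: str = ""        # text produced by the STT stage
    on_screen_text: str = ""    # text produced by the OCR stage
    rule_hits: list[str] = field(default_factory=list)  # filled by run_rules


# Hypothetical keyword rules; a real financial rule set would be far richer.
BANNED_PHRASES = ("guaranteed returns", "risk-free profit")


def run_rules(evidence: Evidence) -> list[str]:
    """Return every banned phrase found in the transcript or on-screen text."""
    text = f"{evidence.transcript} {evidence.on_screen_text}".lower()
    return [phrase for phrase in BANNED_PHRASES if phrase in text]


def decide(evidence: Evidence, llm_violation_score: float,
           block_at: float = 0.9, pass_below: float = 0.2) -> Decision:
    """Combine rule hits with an assumed LLM violation score in [0, 1]."""
    if evidence.rule_hits and llm_violation_score >= block_at:
        return Decision.AUTO_BLOCK      # rules and LLM agree on a violation
    if not evidence.rule_hits and llm_violation_score < pass_below:
        return Decision.AUTO_PASS       # rules and LLM agree the content is clean
    return Decision.HUMAN_REVIEW        # any disagreement or uncertainty


ev = Evidence(transcript="Guaranteed returns every month!")
ev.rule_hits = run_rules(ev)
print(decide(ev, llm_violation_score=0.95).value)
```

In this sketch, anything the rules and the LLM score do not clearly agree on falls through to Human Review, so the automatic outcomes are reserved for the unambiguous cases.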

This role requires strong production engineering skills combined with hands-on experience in multimodal LLM evaluation systems.

Required Skills & Experience

Multimodal LLM evaluation pipeline design
(LLM-as-a-judge, rubric-based scoring, prompt/few-shot optimization, schema design, agent orchestration)
Model evaluation engineering
(golden dataset creation, metric definition: precision/recall/human agreement, A/B testing, benchmarking automation)
AI serving & inference optimization
(AWS Bedrock integration, vLLM / GPU inference, latency optimization, cost-aware design, fallback strategies)
Strong Python backend engineering (~7–10 years)
(async systems, queues, microservices, service-oriented architecture)
Trust & Safety / compliance systems
(rule engines: keyword + context IF-THEN logic, moderation workflows, HITL queues, severity-based escalation)

Nice to Have

Experience fine-tuning or improving LLM / video understanding models
Background in content moderation or compliance systems
Experience with Korean-language content processing
(STT, OCR optimization, NSFW detection pipelines)

Tech Stack

Languages / APIs: Python (FastAPI), React (TypeScript)
LLM / Multimodal: AWS Bedrock, VLMs (e.g. Qwen2.5-Omni), vLLM
Agent frameworks: LangGraph, CrewAI, Google ADK, LangChain
Evaluation: LLM-as-judge, rubric scoring, DeepEval-style eval harnesses
Cloud / Infra: AWS Bedrock, SageMaker, RDS (PostgreSQL), S3, EKS
Observability: Langfuse, prompt testing & tracing tools
Data platform: Databricks
Media processing: Whisper, OCR pipelines, image moderation tools (e.g. AWS Rekognition)

Required languages

English B2 - Upper Intermediate

Ukrainian Native

Published 5 June · Updated 12 June

Minimum 6 years of experience
Full Remote
Worldwide

ML / AI

Employment: Full-time
Domain: Other
Outstaff
A test task is required
