Senior AI/ML Engineer (Multimodal AI / LLM Evaluation / Trust and Safety)
We are looking for a Senior AI/ML Engineer experienced in multimodal AI pipelines, LLM evaluation frameworks, agent orchestration, and scalable Python backend systems on AWS.
A Korean company is building and hardening an AI verification and compliance layer for AIGC/UGC financial content.
The system includes:
- four evaluation/verification models
- an automated compliance rule engine
- a Trust & Safety moderation pipeline
The core architecture processes video-based content through a multi-stage pipeline:
STT → OCR → image/video analysis → financial rules engine → LLM-based decision layer
Final output decision types:
- Auto-Block
- Auto-Pass
- Human Review
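As a rough illustration of the three decision types above, the final stage could map a risk score and a rules-engine flag to one outcome. This is a minimal sketch only: the `route` function, the score range, and both thresholds are hypothetical and are not taken from the actual system.

```python
from enum import Enum


class Decision(Enum):
    AUTO_BLOCK = "Auto-Block"
    AUTO_PASS = "Auto-Pass"
    HUMAN_REVIEW = "Human Review"


def route(risk_score: float, rule_violation: bool,
          block_at: float = 0.9, pass_below: float = 0.2) -> Decision:
    """Map a risk score in [0, 1] and a rules-engine flag to a decision.

    The thresholds are illustrative assumptions, not production values.
    """
    if rule_violation or risk_score >= block_at:
        return Decision.AUTO_BLOCK
    if risk_score < pass_below:
        return Decision.AUTO_PASS
    # Anything in between is ambiguous and goes to a human moderator.
    return Decision.HUMAN_REVIEW
```

In practice the thresholds would be tuned against a golden dataset so that the Human Review queue stays manageable.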
This role requires strong production engineering skills combined with hands-on experience in multimodal LLM evaluation systems.
Required Skills & Experience
- Multimodal LLM evaluation pipeline design
(LLM-as-a-judge, rubric-based scoring, prompt/few-shot optimization, schema design, agent orchestration)
- Model evaluation engineering
(golden dataset creation, metric definition: precision/recall/human agreement, A/B testing, benchmarking automation)
- AI serving & inference optimization
(AWS Bedrock integration, vLLM / GPU inference, latency optimization, cost-aware design, fallback strategies)
- Strong Python backend engineering (~7-10 years)
(async systems, queues, microservices, service-oriented architecture)
- Trust & Safety / compliance systems
(rule engines: keyword + context IF-THEN logic, moderation workflows, HITL queues, severity-based escalation)
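To give a sense of the "keyword + context IF-THEN" rule logic mentioned above, here is a minimal sketch. The `Rule` fields, the `evaluate` helper, and the sample rules are all hypothetical; a real engine would likely use richer matching than plain substring checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    keyword: str   # trigger term found in the transcript or OCR text
    context: str   # term that must also appear for the rule to fire
    severity: str  # used for severity-based escalation (assumed labels)


def evaluate(text: str, rules: list[Rule]) -> list[Rule]:
    """Return every rule whose keyword AND context both occur in the text."""
    lowered = text.lower()
    return [r for r in rules if r.keyword in lowered and r.context in lowered]


# Illustrative rules only, not real compliance policy.
RULES = [
    Rule(keyword="guaranteed", context="return", severity="high"),
    Rule(keyword="invest", context="today only", severity="medium"),
]

hits = evaluate("Guaranteed 20% return every month!", RULES)
```

Here only the first rule fires, because both its keyword and its context term appear in the text.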
Nice to Have
- Experience fine-tuning or improving LLM / video understanding models
- Background in content moderation or compliance systems
- Experience with Korean language content processing
(STT, OCR optimization, NSFW detection pipelines)
Tech Stack
Languages / APIs: Python (FastAPI), React (TypeScript)
LLM / Multimodal: AWS Bedrock, VLMs (e.g. Qwen2.5-Omni), vLLM
Agent frameworks: LangGraph, CrewAI, Google ADK, LangChain
Evaluation: LLM-as-judge, rubric scoring, DeepEval-style eval harnesses
Cloud / Infra: AWS Bedrock, SageMaker, RDS (PostgreSQL), S3, EKS
Observability: Langfuse, prompt testing & tracing tools
Data platform: Databricks
Media processing: Whisper, OCR pipelines, image moderation tools (e.g. AWS Rekognition)
Required languages
| English | B2 - Upper Intermediate |
| Ukrainian | Native |