AI QA Engineer (RAG, Agent Evaluation)
We are looking for a highly skilled QA Engineer with deep experience in AI systems, particularly RAG pipelines and agent-based architectures. This role is ideal for someone who thrives in a fast-moving startup environment, is passionate about validating intelligent systems, and wants to directly influence the reliability and performance of our AI products.
You will work closely with AI engineers, backend teams, and product stakeholders to ensure that every component of our retrieval, generation, and agent behavior stack is thoroughly tested, benchmarked, and continuously improved.
Key Responsibilities
● Design and maintain automated QA workflows for RAG components and AI-agent behavior testing.
● Define metrics and evaluation criteria for retrieval accuracy, generation relevance, and agent reliability, including faithfulness, context precision, retrieval recall, factual consistency, answer completeness, and user intent alignment.
● Apply evaluation frameworks such as Ragas, LangSmith, Weights & Biases, or equivalent tools to assess and optimize RAG pipeline performance (a minimal Ragas sketch follows this list).
● Develop custom Python scripts and utilities to test, validate, and monitor retrieval and generation quality at scale.
● Collaborate directly with AI engineers, backend engineers, and data teams to identify edge cases, hallucinations, and inconsistencies in retrieval and generation workflows.
● Build and curate synthetic and real-world evaluation datasets for testing document retrieval, context injection, and agent reasoning quality.
● Create dashboards and reporting systems to track performance, regressions, and quality trends over time (see the regression-gate sketch after this list).
● Improve internal QA processes, retrieval versioning practices, and automation coverage.
● Operate effectively in a startup environment — showing adaptability, initiative, and willingness to contribute beyond defined responsibilities when required.
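To make the evaluation work above concrete, here is a minimal sketch of scoring a single RAG output with Ragas. It assumes the classic `ragas.evaluate` interface (column names and APIs vary between Ragas versions) and an LLM judge configured via the `OPENAI_API_KEY` environment variable; the sample question, answer, and contexts are entirely hypothetical.

```python
# Minimal Ragas sketch (assumes a 0.1.x-style API; details vary by version).
# Metrics here are scored by an LLM judge, configured via OPENAI_API_KEY.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Hypothetical evaluation sample: one question, the pipeline's answer,
# the contexts the retriever returned, and a reference answer.
samples = {
    "question": ["What is the refund window for annual plans?"],
    "answer": ["Annual plans can be refunded within 30 days of purchase."],
    "contexts": [[
        "Refund policy: annual subscriptions are refundable within 30 days.",
        "Monthly subscriptions are not refundable.",
    ]],
    "ground_truth": ["Annual plans are refundable within 30 days of purchase."],
}

dataset = Dataset.from_dict(samples)

# LLM-judged scores are estimates, not ground truth; track them over time
# rather than treating any single run as definitive.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)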
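Regression tracking is also called out above. The simplest possible gate compares the current run's scores against a stored baseline and fails (for example, in CI) when any metric drops by more than a tolerance. The baseline file name, metric names, and thresholds below are illustrative, not a prescribed setup.

```python
# Regression-gate sketch: fail when any metric drops more than TOLERANCE
# below its stored baseline. File name and numbers are illustrative, and
# the baseline file is assumed to exist from a previous accepted run.
import json
import sys

TOLERANCE = 0.05  # maximum acceptable absolute drop per metric

def check_regression(current: dict, baseline_path: str = "baseline_scores.json") -> bool:
    with open(baseline_path) as f:
        baseline = json.load(f)
    ok = True
    for metric, base_score in baseline.items():
        score = current.get(metric)
        if score is None:
            print(f"missing metric in current run: {metric}")
            ok = False
        elif score < base_score - TOLERANCE:
            print(f"regression: {metric} {base_score:.3f} -> {score:.3f}")
            ok = False
    return ok

if __name__ == "__main__":
    # Scores would normally come from an evaluation run like the one above.
    current_scores = {"faithfulness": 0.91, "context_precision": 0.78}
    sys.exit(0 if check_regression(current_scores) else 1)
```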
Qualifications
● 3–5 years of hands-on experience in software QA, AI/ML evaluation, or intelligent system testing.
● Strong Python proficiency, including scripting for automation, evaluation, and data analysis.
● Practical experience with AI evaluation tools (e.g., Ragas, LangChain evaluation tooling, Weights & Biases, PromptLayer).
● Solid understanding of RAG architectures, vector databases, embeddings, and retrieval evaluation (see the retrieval-metrics sketch after this list).
● Familiarity with modern MLOps tooling, CI/CD practices (e.g., Jenkins), and version control workflows (e.g., GitHub).
● Strong analytical mindset and a data-driven approach to evaluating system quality.
● Excellent communication skills and ability to collaborate across AI engineering, data science, and product teams.
● Experience working with user-feedback loops and RLHF methodologies is a plus.
● Advanced English: strong verbal and written communication skills.
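As a concrete reading of "retrieval evaluation," here is a small, dependency-free sketch of two standard retrieval metrics, recall@k and precision@k, computed over document IDs. The document IDs and gold labels in the usage example are made up.

```python
# Retrieval-metrics sketch: recall@k and precision@k over document IDs.
# Pure Python, no dependencies; the example IDs below are made up.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k == 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

retrieved = ["doc3", "doc7", "doc1", "doc9", "doc2"]  # ranked retriever output
relevant = {"doc1", "doc2", "doc4"}                   # gold labels for the query

print(recall_at_k(retrieved, relevant, k=5))     # 2/3 ~ 0.667
print(precision_at_k(retrieved, relevant, k=5))  # 2/5 = 0.4
```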
Required languages
| Language | Level |
| --- | --- |
| English | C1 - Advanced |