AI Quality Analyst
What You Will Be Doing
- Architect Automated Evaluation Frameworks: Design, implement, and maintain scalable evaluation pipelines (Evals) for LLMs and agent graphs using modern tooling like LangSmith, DeepEval, Ragas, or Opik.
- Curate Ground-Truth Benchmarks: Collaborate with domain experts to build, version, and sanitize robust gold-standard datasets, synthetic evaluation profiles, and edge-case testing matrices reflecting real-world business scenarios.
- Own Non-Deterministic Quality Tracking: Define, monitor, and enforce quality KPIs across multi-agent workflows, focusing specifically on tool-calling accuracy, intent-recognition safety, structured output formatting, and context-retrieval (RAG) precision.
- Mitigate and Quantify Systemic Risk: Lead rigorous failure and hallucination analyses on production outputs. Implement structured LLM-as-Judge patterns, validation metrics, and guardrail heuristics, and actively verify that the judge profiles themselves remain free of evaluation bias.
- Enforce CI/CD Evaluation Gates: Partner directly with MLOps and Backend Engineering teams to integrate automated testing gates into our deployment pipelines, preventing regressions and behavioral drift from reaching production.
- Drive Optimization for Latency & Cost: Regularly analyze the efficiency of prompt templates, few-shot structures, and model selections (e.g., GPT, Claude, LLaMA) to balance execution throughput, sub-second latency, and platform compute costs.
Who You Are
- A Data-Savvy Automation Advocate: You have strong software engineering fundamentals and concrete Python coding experience, so you can script custom evaluation routines and query multi-tenant databases without hand-holding.
- An Analytical Thinker with an AI Lens: You understand that testing non-deterministic LLMs requires a completely different mindset from traditional QA. You have deep intuition for token behavior, retrieval dynamics, prompt engineering nuances, and failure states.
- Radically Autonomous & Collaborative: You do not wait for static technical specifications. You independently coordinate syncs with AI leads, domain backend engineers, and product stakeholders to identify and patch system vulnerabilities.
- Rigorously Quality-Oriented: You have a low ego and high standards for system stability. You are deeply passionate about separating market hype from practical, measurable production metrics.
