Key takeaways

Eval is the new MLOps. Braintrust raised $80M Feb 2026 at $800M valuation. LangSmith, Galileo, Confident AI all competing. Production LLM apps without evals fail silently — model upgrades regress quietly, prompt edits break edge cases.

Vibe checks fail in 3 modes: they catch the wrong type of bug, they normalise to the team’s preferences (not the user’s), and they do not regression-test prompt or model changes. The cost is silent quality erosion.

5 evaluation categories: reference-based (BLEU, ROUGE), reference-free (perplexity, semantic similarity), model-graded (LLM-as-judge), human evaluation, application-specific business metrics. Each catches different bug classes; mature deployments use 3+.

The golden dataset is the hardest step. 100–200 query-answer pairs covering main use cases plus edge cases, curated by domain experts. Without it, you are tuning blindly. With it, every prompt or model change is testable.

CI/CD quality gates beat post-hoc dashboards. Regression test on every deploy; block the deploy when eval scores drop. Braintrust and LangSmith both support this; building it from scratch takes 2–3 sprints.

Why Fora Soft wrote this playbook

Fora Soft has shipped 9+ production LLM applications since 2024. Voice agents on OpenAI Realtime and LiveKit Agents; RAG implementations across BrainCert, VALT and meeting platforms; MCP servers for video and surveillance.

Each ships with an evaluation harness — golden dataset, RAGAS or Braintrust integration, CI/CD quality gates. The patterns in this guide come from those engagements plus public references — Braintrust’s 2026 platform comparisons, LangSmith documentation, RAGAS framework, the McKinsey 67 % RAG adoption stat.

If you are running an LLM app in production and you are not catching regressions before customers do, this guide gives you the framework, the tooling matrix, and the practical setup — especially the often-skipped golden-dataset construction.

Need an evaluation harness for your LLM app?

Send us your LLM app + a few production traces. We will build a 50-case golden dataset and run a baseline eval in 1 week, free.

Book a 30-min call → WhatsApp → Email us →

Why ‘vibe checks’ fail

Most teams “evaluate” their LLM app by running a few queries and asking “does this look good?”. This works for v0 demos and nothing else. Three failure modes:

1. Sample bias. The team tries the queries they think of — usually well-formed, in-domain, happy-path. Real users send weird queries: typos, languages the team did not anticipate, edge cases nobody considered. Vibe checks miss the long tail entirely.

2. Anchor effect. Once the team is happy with v1, every change is judged relative to v1. Slow degradation goes unnoticed. We have seen 6 months of model upgrades, prompt edits and retrieval tuning quietly drop long-tail accuracy by 15 % before anyone noticed.

3. No regression testing. The team upgrades to GPT-4 Turbo / Claude 3.5; somebody changes a prompt; somebody adds a retrieval filter. Without an automated test suite, the only feedback is customer complaints — and customers churn before they complain.

The fix is automated, golden-dataset-based, regression-tested evaluation in CI/CD. Once in place, every prompt or model change is testable; quality drift surfaces before customers see it.

The 5 categories of LLM evaluation

| Category | Examples | When to use |
| --- | --- | --- |
| 1. Reference-based | BLEU, ROUGE, METEOR | Translation, summarisation with reference text |
| 2. Reference-free | Perplexity, toxicity classifiers, semantic similarity | Quality scoring without ground truth |
| 3. Model-graded | LLM-as-judge (GPT-4 grades the output) | Open-ended outputs, qualitative criteria |
| 4. Human evaluation | Domain experts annotate samples; A/B tests | Gold standard for quality; expensive |
| 5. Business metrics | Conversion rate, containment rate, NPS | Production ground truth; lagging |

Reference-based. BLEU and ROUGE compare model output to a reference. Useful for translation and summarisation where reference text exists. Limited for open-ended generation — many valid answers do not match the reference word-for-word.
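
To make the word-overlap limitation concrete, here is a minimal sketch using the open-source rouge-score package (the example strings are invented): a near-verbatim answer scores high, while an equally correct paraphrase scores low.

```python
# Reference-based scoring with the rouge-score package.
# Shows why overlap metrics under-reward valid paraphrases.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "The invoice was paid on 12 March."
close_match = "The invoice was paid on March 12."
paraphrase = "Payment for that bill went through in mid-March."

for candidate in (close_match, paraphrase):
    score = scorer.score(reference, candidate)["rougeL"].fmeasure
    print(f"{candidate!r}: ROUGE-L F1 = {score:.2f}")
```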

Reference-free. Quality classifiers run on the output: toxicity, hallucination, semantic similarity to context. Cheaper, no reference needed, useful for filtering bad outputs in production.

Model-graded (LLM-as-judge). A separate LLM (often GPT-4) grades the output against criteria you specify (“is this answer factually grounded?” “does it address the question?”). Powerful and cheap; suffers from biases (LLMs prefer their own outputs, longer outputs, certain phrasings). Use carefully — never as the sole evaluation method.
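
A minimal LLM-as-judge sketch using the OpenAI Python SDK; the rubric, the 1–5 scale, and the model choice are illustrative assumptions, not a fixed recipe:

```python
# LLM-as-judge sketch. Rubric and scale are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Score 1-5 for groundedness (are the claims supported by the context?)
and return JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge(question: str, context: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",   # prefer a different family than the model under test
        temperature=0,    # deterministic grading
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)
```

Temperature 0 and a structured JSON response keep grading repeatable; using a judge from a different model family than the one under test (see the FAQ below) reduces self-preference bias.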

Human evaluation. Domain experts annotate samples on dimensions you care about. Gold standard but expensive ($1–$5 per sample) and slow. Used for golden-dataset construction and periodic recalibration of automated metrics.

Business metrics. The ground truth that matters — user retention, conversion, NPS, containment rate (for voice agents). Lagging indicator; useful for confirming evaluation metrics correlate with what you actually care about.

RAG-specific evaluation (RAGAS)

RAG apps need RAG-specific metrics. The RAGAS framework (open-source) defines four core metrics:

Context relevance. Are the retrieved chunks actually relevant to the user’s query? Catches retrieval failures — the 73 % rule (when RAG fails, it’s retrieval, not generation). Target: >0.8.

Answer relevance. Does the LLM’s answer actually address the question (vs going off-topic)? Catches LLM behaviour issues. Target: >0.85.

Groundedness (faithfulness). Are the claims in the answer actually supported by the retrieved chunks? Catches hallucination. Target: >0.85.

Context recall (with reference). If you have a labelled validation set, did the retriever find the chunks containing the answer? The hardest metric to compute (needs reference labelling) but the most diagnostic for retrieval quality.

RAGAS uses LLM-as-judge under the hood. Run it on a 100–200-question golden dataset; track scores over time; alert on regressions. See our RAG for video/audio guide for the full retrieval architecture this evaluates.
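
A minimal RAGAS run might look like the sketch below. Column and metric names follow the RAGAS 0.1-era API (newer releases rename some of them), and the sample data is invented:

```python
# RAGAS eval sketch; requires ragas and datasets installed, plus an
# OpenAI API key, since RAGAS uses LLM-as-judge under the hood.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

data = {
    "question": ["How do I cancel my subscription?"],
    "answer": ["Go to Settings > Billing and click Cancel."],
    "contexts": [["Subscriptions are cancelled from Settings > Billing."]],
    "ground_truth": ["Cancel from Settings > Billing."],  # needed for context_recall
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.96, ...}
```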

Voice agent evaluation

Voice agents have evaluation dimensions that text apps do not:

Latency budget. Voice-to-voice latency p50 (<800 ms) and p95 (<1.4 s). Tracked per turn; alert on drift. See our OpenAI Realtime guide.

Tool-call accuracy. Did the agent call the right tool with the right arguments? Tracked per tool. Target: >96 % success.

Interruption handling. When user interrupts mid-agent-speech, does the agent stop within 200 ms? Track barge-in latency and false-positive interruptions.

Containment rate. Percentage of calls completed without human handoff. The single most important business metric for voice agents.

Trace replay. Capture full audio + transcript on every call; replay later for golden-dataset construction or regression tests. Helicone, LangSmith, Braintrust all support this.
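
A sketch of the per-release metric rollup computed from exported traces; the trace schema and field names here are assumptions, since every tracing platform exports a different shape:

```python
# Voice-agent metric rollup from exported call traces (hypothetical schema).
from statistics import quantiles

traces = [
    {"turn_latency_ms": [640, 710, 900], "tool_calls_ok": 5,
     "tool_calls_total": 5, "handed_off": False},
    # ...one dict per call, exported from your tracing platform
]

latencies = [ms for t in traces for ms in t["turn_latency_ms"]]
cuts = quantiles(latencies, n=100)
p50, p95 = cuts[49], cuts[94]

tool_acc = sum(t["tool_calls_ok"] for t in traces) / sum(t["tool_calls_total"] for t in traces)
containment = sum(not t["handed_off"] for t in traces) / len(traces)

assert p50 < 800 and p95 < 1400, "voice-to-voice latency budget exceeded"
assert tool_acc > 0.96, "tool-call accuracy below target"
print(f"p50={p50:.0f}ms p95={p95:.0f}ms tools={tool_acc:.1%} containment={containment:.1%}")
```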

Tooling matrix — Braintrust, LangSmith, Galileo, Confident AI, Helicone

| Tool | Strength | Pricing | Best for |
| --- | --- | --- | --- |
| Braintrust | Eval + tracing + CI/CD gates unified; $80M raised Feb 2026 | $249/mo flat (unlimited users) | Teams running multiple LLM features; want one platform |
| LangSmith | Deep LangChain integration; mature tracing | $39/seat/mo (Plus); enterprise custom | LangChain-first stacks |
| Galileo | Production monitoring + hallucination detection | Enterprise tier; usage-based | Enterprise compliance and observability |
| Confident AI / DeepEval | Open-source DeepEval framework | Free OSS + paid hosted | Self-hosted, code-first eval |
| Helicone | Tracing + cost tracking + caching | Per-request pricing; OSS option | Cost-conscious teams; observability-first |
| Langfuse | Open-source tracing + eval; self-hosted option | Free OSS + paid cloud | Compliance / data residency requirements |
| RAGAS | Open-source RAG-specific framework | Free | RAG eval in any tooling stack |

Reach for Braintrust when: you have multiple LLM features, want eval + tracing + CI/CD in one platform, flat pricing for many users.

Reach for LangSmith when: stack is LangChain or LangGraph; need framework-native tracing.

Reach for Langfuse / Confident AI when: compliance demands self-hosted; you want eval data inside your VPC.

Reach for RAGAS + your platform when: RAG-specific eval is the priority; combine RAGAS metrics with Braintrust or LangSmith for full pipeline.

Regression testing in CI/CD

The pattern: every PR that touches prompts, retrieval logic, or model selection runs the eval suite against the golden dataset. PR is blocked if eval scores regress beyond threshold.

What to gate on. Top-line metrics: RAGAS faithfulness, RAGAS answer relevance, custom domain metrics. Set quality thresholds (e.g., faithfulness must stay >0.85) and regression thresholds (no metric may drop more than 3 % from main).

The pipeline. PR opened → CI runs eval suite (5–15 minutes for a 100-question dataset) → results posted as PR comment + status check → PR mergeable only if quality thresholds pass.
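
A minimal gate script, runnable as the final CI step; the file paths, results schema, and the absolute 3-point regression threshold are illustrative assumptions:

```python
# scripts/eval_gate.py (hypothetical path) -- exits non-zero to fail
# the CI status check and block the PR.
import json
import sys
from pathlib import Path

QUALITY_FLOORS = {"faithfulness": 0.85, "answer_relevancy": 0.85}
MAX_REGRESSION = 0.03  # no metric may drop more than 3 points vs main

current = json.loads(Path("eval_results/pr.json").read_text())    # this PR's scores
baseline = json.loads(Path("eval_results/main.json").read_text()) # main-branch scores

failures = []
for metric, floor in QUALITY_FLOORS.items():
    if current[metric] < floor:
        failures.append(f"{metric}={current[metric]:.3f} below floor {floor}")
for metric, base in baseline.items():
    if current.get(metric, 0.0) < base - MAX_REGRESSION:
        failures.append(f"{metric} regressed: {base:.3f} -> {current[metric]:.3f}")

if failures:
    print("EVAL GATE FAILED\n" + "\n".join(failures))
    sys.exit(1)
print("eval gate passed")
```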

Cost control. Eval runs cost LLM tokens. A 100-question RAGAS run costs ~$0.50–$2 per execution. Control via: (1) cache identical prompts/outputs across runs, (2) run full suite on release branches; smaller suite on feature branches, (3) batch eval runs nightly for non-blocking metrics.

Production trace replay. Capture production traces; sample 5–10 % into a “recent production” eval set. Run nightly. Catches drift faster than the static golden dataset alone.
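
One way to implement the sampling: hash the trace ID instead of calling a random generator, so a given trace is always in or out of the sample regardless of when the job runs. Field names are illustrative:

```python
# Deterministic 5% sampling of production traces into a nightly eval set.
import hashlib

SAMPLE_RATE = 0.05  # 5-10% balances cost and coverage

def sampled(trace_id: str) -> bool:
    # Hash-based sampling is stable across re-runs, unlike random.random().
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (digest % 10_000) / 10_000 < SAMPLE_RATE

production_traces = [{"id": "trace-8f3a"}, {"id": "trace-91bc"}]  # stand-in data
nightly_set = [t for t in production_traces if sampled(t["id"])]
```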

How to construct a golden dataset

The hardest part of evaluation. Done well, the golden dataset is your most valuable LLM-engineering asset. Done poorly, every eval score is meaningless.

Step 1: define use cases. List 5–10 main use cases your LLM app handles. For each, list 3–5 representative queries. This gives you ~30–50 happy-path queries.

Step 2: collect edge cases. Mine production logs (or beta-tester sessions) for queries that surprised the team. Misspellings, multilingual, long, short, edge interpretations. Aim for 30–50 edge cases.

Step 3: domain-expert annotation. For each query, a domain expert (not the engineer building the app) writes the “correct” answer or labels acceptable answers. Costs $1–$5 per query for general domain; $5–$20 for medical / legal / specialised.

Step 4: review and refine. Engineering team reviews labels for consistency. Disagreements surface ambiguity in the use case definition. Iterate.

Step 5: maintain over time. Add new queries quarterly from production traces. Retire queries that no longer reflect product behaviour. Re-label annually if the product or domain shifts.
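
For concreteness, one record of a golden dataset stored as JSONL might look like the sketch below; the schema is an assumption, not a format any of the tools above require:

```python
# One golden-dataset record appended to a JSONL file (hypothetical schema).
import json

record = {
    "id": "billing-refund-017",
    "use_case": "billing",
    "query": "can i get a refund if i cancel mid-month??",  # keep real-user typos
    "reference_answer": "Yes, unused days are refunded pro rata.",
    "acceptable_if": ["mentions pro-rata refund", "does not promise a full refund"],
    "labelled_by": "support-lead",  # domain expert, not the engineer
    "added": "2026-01-15",          # enables quarterly refresh and retirement
}

with open("golden_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```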

Cost of evaluation

Setup cost. Golden dataset construction: $300–$2,000 for 100–200 questions in a standard domain; specialised domains run 3–5× more. Tooling integration (Braintrust / LangSmith / RAGAS): 1–2 weeks of engineering.

Run cost. Per-eval run: $0.50–$2 for 100 questions with a GPT-4-class judge. Daily runs: $30–$60/mo. CI/CD per-PR runs: roughly $0.50 × number of PRs (assuming the reduced feature-branch suite). Typical run cost at an active development pace: $50–$200/mo.

Tooling cost. Braintrust $249/mo flat (unlimited users). LangSmith $39/seat/mo. Galileo enterprise. Self-hosted Langfuse / Confident AI: free + ~$50/mo Postgres.

ROI. A single regression caught before customers see it pays for the entire annual eval setup. The harder ROI to justify is “we did not regress” — eval is insurance, not feature work.

Want a golden dataset built for your LLM app?

Send us your app + a few production traces. We will deliver a 50-case golden dataset and baseline eval in 1 week, free.

Book a 30-min call → WhatsApp → Email us →

Mini case — voice agent eval saves $40k/month in churn

A B2B voice-AI customer-support platform (NDA, ~3M customer calls per month) approached us in late 2025 with rising churn. Customer feedback: “the bot used to work, now it gives wrong answers.” The engineering team had been iterating on prompts and upgrading models for 6 months without an evaluation harness.

The 4-week intervention. Week 1: built golden dataset of 240 query-answer pairs across 8 use cases by mining 6 weeks of production calls; domain experts (their support managers) labelled. Week 2: integrated Braintrust + RAGAS; ran baseline eval against current production; established quality thresholds. Week 3: rebuilt CI/CD pipeline with eval gates on every PR. Week 4: ran 6-month historical regression analysis, identified the prompt change that started the quality slide.

Outcome. Faithfulness score on golden dataset jumped from 0.71 (post-regression) back to 0.89 after rolling back the bad prompt change and re-tuning. Customer-reported quality issues dropped 60 % in the next month. Churn that the team attributed to “bot quality” (~$40k/month MRR loss) reversed within 6 weeks. Investment: $32k for the 4-week project plus $250/mo Braintrust ongoing. Book a 30-min call for a similar audit.

A decision framework — pick eval stack in five questions

Q1. Stack type? LangChain-heavy: LangSmith. Mixed / agnostic: Braintrust. Self-hosted / compliance: Langfuse or Confident AI.

Q2. Number of LLM features? 1: open-source RAGAS or DeepEval is enough. 2–3: managed platform pays back. 5+: managed platform is non-optional.

Q3. Compliance posture? Standard: any cloud option. HIPAA / SOC 2 / EU residency: self-hosted Langfuse / Confident AI inside your VPC.

Q4. Team size? 1–3 engineers: code-first OSS (DeepEval, RAGAS). 4–15: managed UI helps. 15+: enterprise tier with role-based access.

Q5. Production volume? <10k LLM calls/day: any tool. >100k calls/day: prioritise tools with strong sampling and cost-control features.

Pitfalls to avoid

1. LLM-as-judge as sole metric. LLM judges have biases (longer outputs, certain phrasings, their own model family). Always combine with at least one other category — reference-based, business metric, or human evaluation.

2. Golden dataset stale within 6 months. Add new queries quarterly from production. Without refresh, you are testing yesterday’s app, not today’s.

3. No CI/CD integration. Eval that runs nightly catches regressions days late. Eval that gates PRs catches them before merge. The former is observability; the latter prevents quality drift.

4. Engineering team labels their own data. Engineers know what the LLM should do; they label charitably. Domain experts (the actual users or their proxies) label rigorously. Always use external labellers.

5. Treating eval as one-time project. Eval is continuous. Set up the harness as v0; iterate the dataset, metrics, and gates as the product evolves.

KPIs to measure

Quality KPIs. Top-line eval scores (faithfulness > 0.85, answer relevance > 0.85, etc.). Regression rate (% of PRs that fail eval gates). Time-to-detect for production drift (target: <24 hours).

Business KPIs. Customer-reported quality issues per month (target: declining trend). Containment rate (voice agents) or task-success rate (text agents). Churn attributable to LLM quality.

Reliability KPIs. Eval pipeline uptime (target: 99 %+). False positive rate on quality gates (target: <10 % — gates that block good PRs erode developer trust).

FAQ

Braintrust or LangSmith?

Braintrust for non-LangChain stacks, multi-feature teams, flat-pricing preference. LangSmith for LangChain-heavy stacks where framework-native tracing matters. Both are excellent; pick by stack alignment.

Is RAGAS enough on its own?

For RAG apps, RAGAS covers retrieval and generation quality well. Pair with a tracing platform (Helicone, Langfuse) for production observability. CI/CD integration needs Braintrust or LangSmith (or DIY Python).

How big should my golden dataset be?

100–200 questions for a v1 dataset. Fewer than 50 leaves too much variance to trust eval scores. Above 500 the marginal value drops; better to refresh quarterly than over-invest in a single huge static set.

Can I use GPT-4 to evaluate GPT-4?

Yes — this is the LLM-as-judge pattern. Caveat: GPT-4 has biases towards GPT-4 outputs (and longer outputs, and certain phrasings). Best practice: use a different model family as judge (Claude judging GPT, or vice versa) and supplement with human eval periodically.

What are A/B tests good for?

A/B tests measure user behaviour, not output quality. Good for: comparing prompt v1 vs v2 in production, validating that eval-score wins translate to user-behaviour wins. Bad for: catching regressions on edge cases (most users never hit them).

How do I evaluate without a reference answer?

Reference-free metrics: LLM-as-judge against criteria (“is this answer factually grounded?”), semantic similarity to retrieved context, hallucination classifiers, toxicity. RAGAS uses several of these under the hood. Less reliable than reference-based but applicable everywhere.

Should I evaluate every LLM call in production?

Sample-based, not exhaustive. 5–10 % sampling rate balances cost and coverage. Add 100 % sampling for high-risk paths (medical, financial advice, legal). Trace 100 % for debugging; eval-score 5–10 %.

How long does it take to set up production eval?

Greenfield: 2–3 weeks for golden dataset + Braintrust integration + CI/CD gates. Retrofit on existing app: 4–6 weeks because of golden-dataset construction lag. With our pattern reuse from prior LLM apps we typically deliver in 3–4 weeks.

Related guides

OpenAI Realtime Production (Voice AI): a voice agent that needs eval against production traces.

RAG for Video & Audio (RAG): the retrieval architecture this article evaluates.

MCP for Video Apps (AI Infra): tool-call accuracy is part of MCP server eval.

LiveKit AI Agents (SDK): the voice agent stack that eval covers end-to-end.

NFR Checklist (NFR): eval is part of the observability NFR for LLM apps.

Ready to ship LLM apps that actually hold quality?

Eval is the new MLOps. Production LLM apps without evals fail silently — model upgrades regress quietly, prompt edits break edge cases, customers churn before complaining. The 5-category framework (reference-based, reference-free, model-graded, human, business metrics) catches different bug classes; mature deployments use 3+.

Golden dataset is the hardest step and the most valuable asset. CI/CD quality gates beat post-hoc dashboards. Tooling matrix is rich (Braintrust, LangSmith, Galileo, Confident AI, Helicone, Langfuse) — pick by stack and compliance posture. RAGAS for RAG-specific metrics works alongside any platform.

Want a 4-week eval harness for your LLM app?

Send us your app, stack, production traces. We will deliver golden dataset + Braintrust integration + CI/CD gates + baseline eval in 4 weeks, fixed price.

Book a 30-min call → WhatsApp → Email us →
