How to Pick the Right LLM Evals

Jun 05, 2026

Picking the right LLM evals starts with a specific question: what would make this AI workflow unsafe, useless, too expensive, or too slow in production?

If your eval suite cannot answer that question, it will give you false confidence. A 92% pass rate looks good until you learn the test set only covers clean inputs, short prompts, happy-path tool calls, and examples from last quarter’s product behavior.

Good LLM evaluation connects product risk to measurable checks. It should tell you whether a prompt, model, retrieval change, tool schema, or agent workflow got better or worse before users feel the impact.

Start with failure modes, not generic metrics

Many teams start with broad metrics such as “accuracy,” “helpfulness,” or “quality.” Those labels are too vague for production LLM systems. They do not tell you what failed, who it affected, or what to fix.

Start by listing the ways your workflow can fail. Then map each failure mode to an eval that can catch it.

Example: failure-mode-to-eval mapping

| Workflow | Failure mode | Eval type | Pass condition | Owner |
| --- | --- | --- | --- | --- |
| Support answer generation | Invents refund policy details | Reference-based factuality check | No unsupported policy claim | Support engineering |
| SQL agent | Runs a destructive query | Static query policy eval | No DELETE, DROP, UPDATE, or unrestricted write query | Data platform |
| RAG assistant | Misses the key document in a long context window | Retrieval and answer attribution eval | Uses the expected source and cites it | AI platform |
| Agentic onboarding flow | Calls the wrong tool after user correction | Tool-call sequence eval | Correct tool name and required arguments | Product engineering |

Recommended visual: include a screenshot of this mapping table in your internal eval spec. It helps product, engineering, and QA agree on what the eval suite is supposed to protect.
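
The SQL agent row in the table can be enforced with plain code. Here is a minimal sketch of a static query policy check, assuming the generated SQL arrives as a string. The function names and the blocklist entries beyond DELETE, DROP, and UPDATE are illustrative assumptions. Keyword matching is a coarse filter, and a real SQL parser would be more robust against comments and unusual quoting.

```python
import re

# Hypothetical blocklist: DELETE, DROP, and UPDATE come from the table above;
# the remaining keywords are assumptions about what "write query" covers.
BLOCKED_KEYWORDS = ("DELETE", "DROP", "UPDATE", "INSERT", "TRUNCATE", "ALTER")
BLOCKED_PATTERN = re.compile(r"\b(" + "|".join(BLOCKED_KEYWORDS) + r")\b", re.IGNORECASE)


def violates_query_policy(sql: str) -> bool:
    """Return True if the SQL text contains a blocked keyword outside string literals."""
    # Blank out single-quoted literals so a value like 'drop-off' is not flagged.
    without_literals = re.sub(r"'(?:[^']|'')*'", "''", sql)
    return BLOCKED_PATTERN.search(without_literals) is not None


def query_policy_pass_rate(generated_queries: list[str]) -> float:
    """Fraction of generated queries that stay inside the policy."""
    if not generated_queries:
        return 1.0
    passed = sum(not violates_query_policy(q) for q in generated_queries)
    return passed / len(generated_queries)
```

A check like this runs in microseconds, so it can gate every generated query as well as every eval run.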

Pick evals based on the job your LLM performs

Different LLM features need different evals. A chatbot, extraction pipeline, coding agent, and retrieval workflow should not share one generic test suite.

For classification and routing

  • Use: exact match, precision, recall, confusion matrix, and per-class pass rates.
  • Watch for: rare classes with poor recall. A router can look strong overall while missing urgent cases.
  • Example: a support triage classifier should report separate recall for “billing dispute,” “account takeover,” and “legal request.”
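
To see how a router can look strong overall while missing urgent cases, here is a small sketch of per-class recall. The label lists are made-up illustration data, and the function name is an assumption.

```python
from collections import Counter


def per_class_recall(expected: list[str], predicted: list[str]) -> dict[str, float]:
    """Recall per label: correct predictions for a label / examples carrying that label."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    totals = Counter(expected)
    hits = Counter(e for e, p in zip(expected, predicted) if e == p)
    return {label: hits[label] / totals[label] for label in totals}


# Illustration data: 8 billing disputes and 2 account takeovers.
expected = ["billing dispute"] * 8 + ["account takeover"] * 2
# The router gets every billing dispute right but misroutes one takeover.
predicted = ["billing dispute"] * 8 + ["billing dispute", "account takeover"]

overall_accuracy = sum(e == p for e, p in zip(expected, predicted)) / len(expected)
recall = per_class_recall(expected, predicted)
```

In this toy data, overall accuracy is 90% while recall on "account takeover" is 50%, which is the gap an aggregate number hides.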

For extraction

  • Use: field-level accuracy, schema validity, null handling, and tolerance rules for dates, currency, and names.
  • Watch for: valid JSON with wrong values. Schema checks alone are not enough.
  • Example: invoice extraction should score vendor_name, invoice_total, currency, and due_date separately.
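
Field-level scoring with tolerance rules might look like the sketch below. The field names come from the invoice example. The specific tolerances (one cent on totals, ISO dates, case-insensitive text) are assumptions you would tune to your own data.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def field_matches(field: str, expected, actual) -> bool:
    """Compare one extracted field using a per-field tolerance rule."""
    if expected is None or actual is None:
        # Null handling: a missing value only matches an expected null.
        return expected is None and actual is None
    try:
        if field == "invoice_total":
            # Assumed tolerance: totals may differ by at most one cent.
            return abs(Decimal(str(expected)) - Decimal(str(actual))) <= Decimal("0.01")
        if field == "due_date":
            # Assumes both sides are ISO 8601 dates (YYYY-MM-DD).
            return date.fromisoformat(str(expected)) == date.fromisoformat(str(actual))
    except (InvalidOperation, ValueError):
        return False  # An unparseable value counts as a wrong value.
    # Default rule: ignore case and surrounding whitespace.
    return str(expected).strip().casefold() == str(actual).strip().casefold()


def score_extraction(expected: dict, actual: dict) -> dict[str, bool]:
    """Score every expected field separately instead of the record as a whole."""
    return {f: field_matches(f, expected[f], actual.get(f)) for f in expected}


expected_row = {"vendor_name": "Acme Corp", "invoice_total": "1250.00",
                "currency": "USD", "due_date": "2026-07-01"}
model_output = {"vendor_name": "acme corp ", "invoice_total": 1250,
                "currency": "EUR", "due_date": "2026-07-01"}
scores = score_extraction(expected_row, model_output)
```

Here the output is valid and well formed, yet the currency field fails, which a schema check alone would not catch.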

For retrieval-augmented generation

  • Use: retrieval recall, context relevance, citation correctness, answer faithfulness, and refusal behavior when context is missing.
  • Watch for: the model answering from prior knowledge when it should use retrieved documents.
  • Example: if the expected policy document was retrieved at rank 7 but your context only includes the top 5 chunks, the answer eval may fail for the wrong reason. Track retrieval and generation separately.
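
One way to track retrieval and generation separately is to label each eval result with the stage that failed. This is a sketch under assumptions: the outcome labels and function names are hypothetical, and `answer_passed` stands in for whatever answer check you already run.

```python
def retrieval_hit(ranked_doc_ids: list[str], expected_doc_id: str, k: int) -> bool:
    """True if the expected document is among the top-k chunks passed to the model."""
    return expected_doc_id in ranked_doc_ids[:k]


def classify_rag_result(ranked_doc_ids: list[str], expected_doc_id: str,
                        k: int, answer_passed: bool) -> str:
    """Attribute an eval outcome to retrieval or generation."""
    hit = retrieval_hit(ranked_doc_ids, expected_doc_id, k)
    if hit and answer_passed:
        return "pass"
    if hit:
        return "generation_failure"       # The source was in context but the answer was wrong.
    if answer_passed:
        return "answered_without_source"  # Likely answered from prior knowledge.
    return "retrieval_miss"               # The model never saw the expected document.


# The expected policy document sits at rank 7, but only the top 5 chunks are used.
ranked = [f"doc_{i}" for i in range(1, 7)] + ["refund_policy_v3"]
outcome = classify_rag_result(ranked, "refund_policy_v3", k=5, answer_passed=False)
```

With this split, the rank-7 example is reported as a retrieval miss instead of counting against the prompt or model.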

For long-context prompts

  • Use: targeted needle-in-context tests, source position analysis, and multi-document reasoning checks.
  • Watch for: lost-in-the-middle failures, where the model misses important information placed deep inside the context.
  • Example: place the decisive policy clause near the beginning, middle, and end of the prompt, then compare pass rates by position.
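
The position test can be sketched in two small helpers: one that builds the prompt variants and one that compares pass rates. The helper names and the filler paragraphs are illustrative assumptions, and the pass/fail values would come from your own answer check.

```python
from collections import defaultdict


def place_clause(filler_paragraphs: list[str], clause: str, position: str) -> str:
    """Insert the decisive clause at the beginning, middle, or end of the context."""
    index = {
        "beginning": 0,
        "middle": len(filler_paragraphs) // 2,
        "end": len(filler_paragraphs),
    }[position]
    parts = filler_paragraphs[:index] + [clause] + filler_paragraphs[index:]
    return "\n\n".join(parts)


def pass_rate_by_position(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (position, passed) pairs into a pass rate per position."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for position, ok in results:
        total[position] += 1
        passed[position] += int(ok)
    return {position: passed[position] / total[position] for position in total}


context = place_clause(["a", "b", "c", "d"], "DECISIVE CLAUSE", "middle")
rates = pass_rate_by_position([
    ("beginning", True), ("middle", False), ("middle", True), ("end", True),
])
```

A clear drop in the "middle" rate compared with the other two positions is the signal to look for.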

For agents and tool use

  • Use: tool selection accuracy, argument validation, step count, task completion, recovery after tool errors, and budget limits.
  • Watch for: workflows that pass final-answer checks but make five unnecessary tool calls along the way.
  • Example: a calendar agent should be scored on whether it asks for missing timezone information before creating an event.
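
A tool-use eval can inspect the recorded trace instead of the final answer. The sketch below assumes each agent step is logged as a tool name plus arguments. The `ToolCall` shape and the tool names `ask_user` and `create_event` are hypothetical, not a real agent framework's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


def check_tool_calls(actual: list[ToolCall], expected_names: list[str],
                     required_args: dict[str, set[str]], max_steps: int) -> list[str]:
    """Return a list of failure messages; an empty list means the trace passed."""
    failures = []
    names = [call.name for call in actual]
    if names != expected_names:
        failures.append(f"expected tools {expected_names}, got {names}")
    if len(actual) > max_steps:
        failures.append(f"used {len(actual)} steps, budget is {max_steps}")
    for call in actual:
        missing = required_args.get(call.name, set()) - set(call.arguments)
        if missing:
            failures.append(f"{call.name} is missing {sorted(missing)}")
    return failures


# The agent created the event without first asking for the missing timezone.
trace = [ToolCall("create_event", {"title": "Sync", "start": "2026-06-10T09:00"})]
failures = check_tool_calls(
    trace,
    expected_names=["ask_user", "create_event"],
    required_args={"create_event": {"title", "start", "timezone"}},
    max_steps=3,
)
```

Returning messages instead of a single boolean keeps each failure tied to a specific behavior, such as the skipped clarification step here.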

Do not evaluate only happy-path examples

A test set with clean, obvious examples will hide the failures your users actually report. Your eval set should include normal traffic, edge cases, adversarial inputs, and known regressions.

A practical starting split for many production LLM apps is:

  • 50% common production cases: the routine requests your system sees every day.
  • 20% edge cases: ambiguous wording, missing fields, unusual formatting, long inputs, multilingual inputs, or unsupported requests.
  • 20% historical failures: examples from bug reports, support escalations, and incident reviews.
  • 10% adversarial or abuse cases: prompt injection, policy bypass attempts, unsafe requests, or malicious tool arguments.

Do not freeze this mix forever. If your production traffic changes, your eval set should change too. For example, a customer support bot that starts handling enterprise plan questions needs examples for contract terms, SSO setup, audit logs, and account permissions.

Build a labeled golden dataset

Unlabeled or stale datasets create noisy evals. If no one knows the expected behavior, your pass rate means very little.

A golden dataset should include the input, expected output or scoring criteria, metadata, and the reason the example exists. Metadata matters because it lets you slice results by customer segment, language, product area, risk level, prompt version, and model version.

Example: golden dataset row

| Field | Example value |
| --- | --- |
| example_id | support_refund_0142 |
| input | “I bought the annual plan 42 days ago. Can I get a refund?” |
| expected_behavior | State that the standard refund window is 30 days. Do not promise a refund. Offer to connect the user with billing support. |
| must_not_include | “You are eligible for a refund” |
| source_document | refund_policy_v3 |
| risk_level | high |
| created_from | production escalation |
| last_reviewed_at | 2026-05-18 |

Recommended visual: show one sample golden dataset row with labels, expected behavior, source document, and review date. This makes it clear that eval quality depends on dataset quality.
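
The same row can be represented as a typed record so that checks and staleness reviews run against it directly. This is a minimal sketch: the field names mirror the example row, while the helper functions and the 90-day review window are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    input: str
    expected_behavior: str
    must_not_include: tuple[str, ...]
    source_document: str
    risk_level: str
    created_from: str
    last_reviewed_at: date


def forbidden_phrases_found(example: GoldenExample, output: str) -> list[str]:
    """Case-insensitive check for phrases the answer must never contain."""
    lowered = output.casefold()
    return [p for p in example.must_not_include if p.casefold() in lowered]


def is_stale(example: GoldenExample, today: date, max_age_days: int = 90) -> bool:
    """Flag examples not reviewed within the window (90 days is an assumed default)."""
    return (today - example.last_reviewed_at).days > max_age_days


row = GoldenExample(
    example_id="support_refund_0142",
    input="I bought the annual plan 42 days ago. Can I get a refund?",
    expected_behavior="State the 30-day window, do not promise a refund, offer billing support.",
    must_not_include=("You are eligible for a refund",),
    source_document="refund_policy_v3",
    risk_level="high",
    created_from="production escalation",
    last_reviewed_at=date(2026, 5, 18),
)
violations = forbidden_phrases_found(row, "Good news! You are eligible for a refund.")
```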

Use model graders carefully

LLM-as-judge evaluation can be useful when exact matching is too rigid. It works well for criteria such as tone, instruction following, answer completeness, and faithfulness to supplied context. It can also create noise if you treat the judge as automatically correct.

When you use LLM-as-judge evaluation, define a rubric, calibrate it against labeled examples, and track judge agreement over time.

Example: LLM-as-judge rubric

| Score | Meaning | Decision rule |
| --- | --- | --- |
| 1 | Incorrect or unsafe | Contradicts the source, invents policy, exposes sensitive data, or follows a malicious instruction. |
| 2 | Partially correct | Answers part of the request but misses a required condition, caveat, or source detail. |
| 3 | Correct | Answers the user’s request using the supplied context and follows all required constraints. |

For calibration, take 100 examples and have domain reviewers label them. Then compare the judge's scores against those labels. If the judge agrees with reviewers only 70% of the time, do not use it as a hard release gate for high-risk behavior. Instead, improve the rubric, add examples to the judge prompt, or restrict the judge to narrower criteria.
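
The calibration comparison can be computed in a few lines. The sketch below reports raw agreement and also Cohen's kappa, a standard statistic that corrects for agreement expected by chance. Kappa is an addition here and is not part of the workflow described above. The ten scores are made-up illustration data using the 1 to 3 rubric.

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Share of examples where the judge score equals the reviewer score."""
    if not human or len(human) != len(judge):
        raise ValueError("score lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Illustration data on the rubric scale (1 = incorrect, 3 = correct).
human_scores = [3, 3, 3, 3, 1, 1, 2, 2, 3, 3]
judge_scores = [3, 3, 3, 3, 3, 1, 2, 3, 3, 3]

raw = agreement_rate(human_scores, judge_scores)
kappa = cohens_kappa(human_scores, judge_scores)
```

When most examples share one score, raw agreement can look high even for a judge that always returns that score, which is why a chance-corrected number is a useful second view.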

Recommended visual: include a screenshot of the judge prompt, rubric, and a few graded outputs. Show at least one disagreement between the judge and a reviewer, because that is where teams learn how reliable the judge really is.

Separate deterministic checks from judgment-based checks

Some evals should be simple code. Do not ask a model to judge whether JSON is valid, whether a required field exists, or whether a tool argument matches an enum.

Use deterministic checks for:

  • JSON schema validity
  • Required fields
  • Regex patterns
  • Blocked terms
  • Tool names and argument types
  • Token count limits
  • Latency thresholds
  • Cost thresholds

Use model-graded checks for:

  • Faithfulness to retrieved context
  • Completeness
  • Helpfulness within a narrow task definition
  • Tone and brand constraints
  • Correct refusal behavior

A strong eval suite usually combines both. For example, a support answer can pass schema validation, pass citation checks, and then receive a model-graded score for whether it answered the user’s actual question.
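
The deterministic layer needs nothing beyond the standard library. Here is a minimal sketch that checks JSON validity, required fields, and an enum value. The field names and the allowed `priority` values are hypothetical.

```python
import json

# Hypothetical output contract for a support answer.
REQUIRED_FIELDS = {"answer", "citations", "priority"}
ALLOWED_PRIORITIES = {"low", "medium", "high"}


def deterministic_checks(raw_output: str) -> list[str]:
    """Return failure messages; an empty list means every check passed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]

    failures = []
    missing = REQUIRED_FIELDS - set(payload)
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    if "priority" in payload:
        priority = payload["priority"]
        if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
            failures.append(f"priority not in allowed values: {priority!r}")
    return failures
```

Only outputs that return an empty list here need to go on to the slower, costlier model-graded checks.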

Include cost and latency in your eval suite

Accuracy is not enough. A prompt that improves pass rate by 2 percentage points but doubles cost may be a bad trade. An agent that produces better answers but adds 12 seconds of latency may hurt conversion or support deflection.

Track these metrics in every eval run:

  • Average cost per request: for example, $0.018 per completed support answer.
  • p50 and p95 latency: for example, 1.8 seconds p50 and 7.4 seconds p95.
  • Token counts: input tokens, output tokens, retrieved context tokens, and tool-call overhead.
  • Retry rate: retries caused by JSON parsing, tool errors, rate limits, or safety filters.
  • Step count: especially for agents that can loop or call multiple tools.

Use release thresholds that reflect your product. A backend summarization job may tolerate 20 seconds of latency. An autocomplete feature may need responses under 300 milliseconds. A user-facing chat agent may need a fast first token and predictable p95 latency.
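
Latency percentiles and average cost are simple to compute from per-request measurements. This sketch uses the nearest-rank percentile definition, which is one of several common conventions. The sample numbers are made up.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_run(latencies_s: list[float], costs_usd: list[float]) -> dict[str, float]:
    """Collect the latency and cost numbers an eval run should report."""
    return {
        "p50_latency_s": percentile(latencies_s, 50),
        "p95_latency_s": percentile(latencies_s, 95),
        "avg_cost_usd": sum(costs_usd) / len(costs_usd),
    }


# Illustration data for ten requests.
latencies = [1.2, 1.5, 1.8, 2.0, 2.1, 2.4, 3.0, 3.5, 4.8, 7.4]
costs = [0.010, 0.012, 0.011, 0.015, 0.018, 0.009, 0.014, 0.013, 0.020, 0.028]
summary = summarize_run(latencies, costs)
```

In this sample, the median is 2.1 seconds while p95 is 7.4 seconds, so a threshold on the average alone would miss the slow tail.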

Version prompts and eval criteria together

One common mistake is versioning prompts but leaving eval criteria floating in a document, spreadsheet, or test file with no clear history. That makes regressions hard to explain.

Every eval run should record:

  • Prompt version
  • Model and model version
  • Dataset version
  • Eval criteria version
  • Judge prompt version, if you use a model grader
  • Retrieval configuration
  • Tool schemas
  • Runtime parameters such as temperature, max tokens, and timeout

This is where LLM observability becomes part of evaluation. You need enough trace data to compare two runs and know what changed. Without that record, a pass-rate drop can turn into hours of guesswork.
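
One lightweight way to keep that record is a single immutable object per eval run, plus a diff helper that answers "what changed?". The class, field, and version names below are illustrative assumptions, not a specific tool's schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EvalRunRecord:
    prompt_version: str
    model: str
    dataset_version: str
    eval_criteria_version: str
    judge_prompt_version: Optional[str]
    retrieval_config: dict
    tool_schemas: dict
    runtime_params: dict

    def fingerprint(self) -> str:
        """Short stable hash of the full configuration, useful as a run label."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_fields(baseline: EvalRunRecord, candidate: EvalRunRecord) -> list[str]:
    """Names of the fields that differ between two runs."""
    before, after = asdict(baseline), asdict(candidate)
    return [name for name in before if before[name] != after[name]]


baseline = EvalRunRecord(
    prompt_version="support_v12", model="example-model-2026-01",
    dataset_version="golden_v7", eval_criteria_version="criteria_v3",
    judge_prompt_version="judge_v2", retrieval_config={"top_k": 5},
    tool_schemas={}, runtime_params={"temperature": 0.0, "max_tokens": 512},
)
candidate = replace(baseline, prompt_version="support_v13")
diff = changed_fields(baseline, candidate)
```

If a pass-rate drop arrives with a diff containing only `prompt_version`, the search for the cause is already narrowed to one change.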

Run evals in CI before prompt and model changes ship

LLM evals should run where engineering decisions happen. For many teams, that means CI on prompt changes, model changes, retrieval changes, and tool schema changes.

Example: CI eval result

| Eval suite | Baseline | Candidate | Threshold | Status |
| --- | --- | --- | --- | --- |
| Refund policy factuality | 96.0% | 94.5% | 95.0% | Fail |
| JSON schema validity | 99.8% | 99.9% | 99.0% | Pass |
| p95 latency | 4.2s | 5.8s | 5.0s max | Fail |
| Average cost | $0.012 | $0.015 | $0.016 max | Pass |

Recommended visual: include a CI screenshot that shows pass and fail rows by eval suite. The most useful screenshots show the baseline, candidate, threshold, and exact examples that regressed.
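
The pass and fail logic behind a table like this is small enough to live in a CI script. The sketch below reuses the candidate and threshold values from the example. The `Gate` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    name: str
    candidate: float
    threshold: float
    higher_is_better: bool = True  # False for "max" limits such as latency and cost.

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.candidate >= self.threshold
        return self.candidate <= self.threshold


gates = [
    Gate("Refund policy factuality (%)", 94.5, 95.0),
    Gate("JSON schema validity (%)", 99.9, 99.0),
    Gate("p95 latency (s)", 5.8, 5.0, higher_is_better=False),
    Gate("Average cost (USD)", 0.015, 0.016, higher_is_better=False),
]

failed = [gate.name for gate in gates if not gate.passed()]
# A CI job would typically exit non-zero when any gate fails.
exit_code = 1 if failed else 0
```

Printing the names in `failed` alongside the regressed examples makes the CI log actionable without opening a dashboard.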

Do not block every small change on every eval. Use tiers:

  • Smoke evals: 20 to 50 examples that run on every pull request in a few minutes.
  • Core regression evals: 200 to 1,000 examples that run before release.
  • Full evals: larger suites that run nightly or before major model migrations.

Use dashboards that compare versions

A single aggregate pass rate hides the patterns you need. Your dashboard should break results down by prompt version, model version, dataset slice, failure mode, cost, and latency.

Example: dashboard breakdown

| Prompt version | Model | Overall pass rate | High-risk pass rate | p95 latency | Avg cost |
| --- | --- | --- | --- | --- | --- |
| support_v12 | gpt-4.1-mini | 91.8% | 86.0% | 4.9s | $0.010 |
| support_v13 | gpt-4.1-mini | 94.2% | 93.5% | 5.2s | $0.011 |
| support_v13 | claude-3-5-sonnet | 95.1% | 94.0% | 6.8s | $0.019 |

Recommended visual: include a dashboard screenshot showing pass rates by prompt or model version. Add filters for dataset slice, failure mode, and risk level. This lets your team see whether a change improved common cases while making high-risk cases worse.

Choose thresholds based on risk

Do not use the same pass threshold for every eval. A harmless formatting issue and an unsafe tool call should not carry the same release weight.

Example thresholds:

  • Safety and privacy: 100% pass required on the critical test set.
  • Destructive tool calls: 100% pass required before release.
  • Policy factuality: 98% or higher for high-risk flows.
  • Extraction schema validity: 99% or higher if downstream systems depend on the output.
  • Tone: a lower threshold may be acceptable if failures are low-risk and get reviewed.
  • Latency: set limits on p95 latency, chosen from user experience requirements, rather than on average latency.

For high-risk systems, require example-level review for failures. A pass rate alone is not enough. One failure that leaks sensitive data or runs the wrong transaction can matter more than 200 clean examples.

Refresh evals with production data

Your eval suite should grow as your product grows. Add examples when users complain, when a model behaves unexpectedly, when new product features ship, and when retrieval content changes.

A simple maintenance rhythm works well:

  • Weekly: add recent production failures and support escalations.
  • Monthly: review stale examples and update expected behavior against current product policy.
  • Before model migrations: run full evals and inspect failures by category.
  • After incidents: add regression examples that would have caught the issue.

Keep old failures unless they no longer reflect your product. Regression examples are valuable because they protect against bugs that return after a prompt rewrite, model swap, or tool change.

A practical checklist for picking LLM evals

  1. Define the workflow: What task does the LLM perform, and what systems does it touch?
  2. List failure modes: Include factual errors, unsafe outputs, wrong tool calls, bad formatting, latency, and cost.
  3. Map each failure to an eval: Use deterministic checks where possible and model graders where judgment is needed.
  4. Create a labeled dataset: Include expected behavior, source references, risk level, and review dates.
  5. Calibrate model graders: Compare judge output against labeled examples before using it as a gate.
  6. Track versions: Record prompt, model, dataset, eval criteria, judge prompt, retrieval config, and tool schemas.
  7. Run evals in CI: Use smoke, regression, and full suites based on release risk.
  8. Monitor cost and latency: Treat performance and budget as release criteria.
  9. Compare slices: Break down pass rates by failure mode, prompt version, model version, customer segment, and risk level.
  10. Keep the suite fresh: Add production failures and remove examples that no longer match current behavior.

The right evals make failures actionable

The best eval suite is not the one with the most metrics. It is the one that helps your team decide whether a specific LLM change is safe to ship.

If a prompt change fails, you should know which behavior regressed. If a model swap improves quality, you should know the cost and latency tradeoff. If a judge disagrees with reviewers, you should know whether the rubric needs work. If production behavior changes, your eval dataset should change with it.

Start small if needed. Pick one important workflow, list its top 10 failure modes, create 50 labeled examples, and run those evals on every prompt or model change. That is enough to catch real regressions and build the habit of evaluating before shipping.


PromptLayer helps AI teams manage prompts, datasets, evals, traces, and version history in one place, so you can compare changes before they reach production. To start building reliable eval workflows, create a PromptLayer account.
