How to Choose LLM Evaluation Metrics

Jun 03, 2026
Choosing LLM evaluation metrics is an engineering decision. The right metrics should tell you whether a prompt, model, agent, or workflow is ready to ship, whether a change caused a regression, and whether users are getting the outcome they came for.

A weak metric set creates false confidence. A chatbot can score high on generic accuracy while still refusing valid requests, inventing citations, taking 12 seconds to respond, or spending too much per run. A strong metric set connects eval results to real product behavior.

If you need a baseline definition first, start with this guide to LLM evaluation. Then use the framework below to choose metrics that fit your application.

Start with the user outcome

Before picking metrics, write down what a successful interaction means in product terms. Avoid starting with generic labels like “accuracy” or “quality.” They are too broad to guide prompt changes or release decisions.

Use this format:

  • User goal: What is the user trying to accomplish?
  • System responsibility: What must the LLM do correctly?
  • Failure modes: What errors would hurt the user or the business?
  • Decision: What will the eval result decide? Ship, block, route, roll back, or investigate?

For example, a support answer bot should not be evaluated only on whether the answer “sounds right.” It should be evaluated on whether it uses approved policy, resolves the issue, avoids unsafe claims, and returns fast enough for a live support flow.

| Application | User outcome | Useful metrics | Poor primary metric |
| --- | --- | --- | --- |
| Customer support assistant | User gets a correct, policy-compliant answer | Resolution rate, groundedness, policy compliance, escalation correctness, latency | Generic answer accuracy |
| RAG search assistant | User gets an answer backed by retrieved sources | Retrieval recall, citation correctness, groundedness, answer completeness | Fluency score |
| Code generation agent | Generated code passes tests and fits the repo | Unit test pass rate, build success, patch size, static analysis errors, cost per task | LLM judge preference only |
| Sales email generator | Draft is accurate, usable, and on-brand | Personalization accuracy, prohibited-claim rate, edit distance after review, tone compliance | Open-ended quality score |

Choose metrics by evaluation layer

Most LLM applications need more than one metric because they have more than one failure point. A RAG system can fail during retrieval, prompt construction, generation, formatting, or post-processing. An agent can fail during planning, tool use, state tracking, or final response generation.

Group metrics by layer so failures are easier to debug.

| Layer | What to measure | Example metric | Common failure caught |
| --- | --- | --- | --- |
| Input handling | Classification, routing, extraction, validation | Intent classification F1 | User sent to the wrong workflow |
| Retrieval | Whether the right context was retrieved | Recall@5, source coverage | Correct answer impossible because context is missing |
| Prompt response | Correctness, completeness, instruction following | Task success score, rubric score | Model ignores required format or misses key facts |
| Grounding | Whether claims are supported by context | Unsupported claim rate | Answer invents details outside retrieved docs |
| Tool use | Correct tool selection and arguments | Tool call accuracy | Agent calls refund API with wrong account ID |
| Operational performance | Latency, cost, errors, retry rate | P95 latency, cost per successful task | Prompt is correct but too slow or expensive |

Connect these metrics to traces when possible. LLM observability helps you see the prompt version, model, inputs, retrieved context, tool calls, response, latency, cost, and eval result in one place.
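As one concrete layer-level metric, Recall@5 from the retrieval row can be computed in a few lines. This is a minimal sketch: the function name `recall_at_k` and the document IDs are hypothetical, and it assumes each eval example lists which documents are relevant.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical example: the document IDs are made up for illustration.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1", "doc_3"]
relevant = ["doc_2", "doc_3"]

# doc_3 is ranked sixth, so only one of the two relevant docs is in the top 5.
print(recall_at_k(retrieved, relevant, k=5))  # 0.5
```

A low score here points at retrieval rather than generation: the model cannot ground an answer in a document it never received.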

Use a small set of primary and guardrail metrics

A practical eval suite usually has 2 to 4 primary metrics and 3 to 6 guardrail metrics.

Primary metrics decide whether the system solves the task. Guardrail metrics catch unacceptable behavior even when the main task succeeds.

| Metric type | Purpose | Examples | Release use |
| --- | --- | --- | --- |
| Primary | Measure product success | Task success rate, resolution rate, test pass rate | Must improve or stay above threshold |
| Quality guardrail | Prevent bad outputs | Hallucination rate, policy violation rate, format failure rate | Must stay below threshold |
| Operational guardrail | Control production performance | P95 latency, cost per request, timeout rate | Must stay within budget |
| Experience guardrail | Protect usability | Refusal rate, clarification rate, verbosity score | Must not regress |

For example, a legal document summarizer might use:

  • Primary metric: Key issue coverage score greater than or equal to 4 out of 5.
  • Guardrail: Unsupported legal claim rate less than 2%.
  • Guardrail: Required disclaimer present in 99% of applicable outputs.
  • Guardrail: P95 latency under 8 seconds.
  • Guardrail: Average cost under $0.06 per summary.
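The legal summarizer thresholds above could be encoded as a release gate along these lines. This is a sketch, not a prescribed implementation: the `Gate` class, the metric names, and the sample results are all hypothetical, and it assumes the coverage score is averaged across the dataset.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    metric: str
    kind: str  # "primary" or "guardrail"
    compare: Callable[[float, float], bool]
    threshold: float


# Thresholds taken from the legal summarizer example; metric names are made up.
GATES = [
    Gate("key_issue_coverage", "primary", operator.ge, 4.0),
    Gate("unsupported_legal_claim_rate", "guardrail", operator.lt, 0.02),
    Gate("disclaimer_present_rate", "guardrail", operator.ge, 0.99),
    Gate("p95_latency_seconds", "guardrail", operator.lt, 8.0),
    Gate("avg_cost_usd", "guardrail", operator.lt, 0.06),
]


def failed_gates(results):
    """Return the names of every gate whose threshold is not met."""
    return [g.metric for g in GATES if not g.compare(results[g.metric], g.threshold)]


# Made-up eval results for illustration.
results = {
    "key_issue_coverage": 4.2,
    "unsupported_legal_claim_rate": 0.031,
    "disclaimer_present_rate": 0.995,
    "p95_latency_seconds": 6.4,
    "avg_cost_usd": 0.05,
}

failures = failed_gates(results)
print("BLOCK" if failures else "SHIP", failures)
# BLOCK ['unsupported_legal_claim_rate']
```

Here the primary metric passes, but one guardrail fails and the release is still blocked, which is the behavior guardrails exist to enforce.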

Match the metric to the task type

Different LLM tasks need different scoring methods. Exact match works for classification. It fails for open-ended writing, multi-step agents, and summaries where several valid outputs may exist.

| Task type | Good metric choices | Notes |
| --- | --- | --- |
| Classification | Accuracy, precision, recall, F1 | Use F1 when classes are imbalanced. For example, safety violations may be rare but important. |
| Extraction | Field-level precision and recall, JSON validity, schema compliance | Score each field separately so one bad field does not hide systemic errors. |
| RAG answers | Answer correctness, groundedness, citation accuracy, retrieval recall | Separate retrieval failure from generation failure. |
| Summarization | Coverage, factual consistency, compression ratio, prohibited omission rate | Use rubrics tied to required facts, not generic readability alone. |
| Agents | Task completion, tool call accuracy, step count, recovery rate, cost per completed task | Evaluate intermediate actions, not only the final answer. |
| Code generation | Unit test pass rate, build success, lint errors, security findings | Prefer executable checks when possible. |
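To illustrate the extraction row, field-level precision and recall might be scored as follows. This is a sketch under stated assumptions: a field counts as correct only on exact match, `None` means the field was not extracted or is not expected, and the invoice fields are invented for the example.

```python
from collections import defaultdict


def field_level_scores(predictions, references):
    """Per-field precision and recall over paired prediction/reference dicts."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, ref in zip(predictions, references):
        for field in set(pred) | set(ref):
            p, r = pred.get(field), ref.get(field)
            c = counts[field]
            if p is not None and p == r:
                c["tp"] += 1
            else:
                if p is not None:
                    c["fp"] += 1  # extracted a value that is wrong or not expected
                if r is not None:
                    c["fn"] += 1  # reference value was missed or wrong
    scores = {}
    for field, c in counts.items():
        predicted = c["tp"] + c["fp"]
        expected = c["tp"] + c["fn"]
        scores[field] = {
            "precision": c["tp"] / predicted if predicted else 0.0,
            "recall": c["tp"] / expected if expected else 0.0,
        }
    return scores


# Hypothetical invoice extraction outputs and labels.
predictions = [
    {"invoice_id": "A-1", "total": "40.00"},
    {"invoice_id": "A-2", "total": "15.00"},
    {"invoice_id": "A-3", "total": None},
]
references = [
    {"invoice_id": "A-1", "total": "40.00"},
    {"invoice_id": "A-2", "total": "51.00"},
    {"invoice_id": "A-3", "total": "9.99"},
]

scores = field_level_scores(predictions, references)
for field in sorted(scores):
    print(field, scores[field])
```

In this made-up data, `invoice_id` scores perfectly while `total` has precision 0.5 and recall of one third. A single record-level accuracy number would hide which field is failing.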

Create rubrics that produce actionable failures

Rubrics should help engineers fix prompts and workflows. A vague 1 to 5 “quality” score does not tell you what went wrong. A useful rubric names the criteria, defines each score, and includes examples.

Here is a compact rubric for grounded RAG answers:

| Score | Groundedness definition | Action |
| --- | --- | --- |
| 5 | All factual claims are directly supported by retrieved sources. | Pass |
| 4 | Minor wording extrapolation, but no unsupported material claim. | Pass with review if high-risk domain |
| 3 | One unsupported claim that does not change the core answer. | Investigate |
| 2 | One or more unsupported claims that could mislead the user. | Fail |
| 1 | Answer mostly ignores or contradicts retrieved sources. | Fail and block release |

A judge prompt should ask for structured output. For example:

{
  "score": 2,
  "passed": false,
  "failure_reason": "The answer states that refunds are available after 60 days, but the provided policy says refunds are limited to 30 days.",
  "unsupported_claims": [
    "Refunds are available after 60 days."
  ]
}
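Because judges are model calls, their structured output is worth validating before it feeds a metric. The sketch below parses the judgment shown above. The `parse_judgment` helper is hypothetical, and the pass threshold of 4 is an assumption drawn from the rubric, where scores 4 and 5 pass.

```python
import json

REQUIRED_FIELDS = {
    "score": int,
    "passed": bool,
    "failure_reason": str,
    "unsupported_claims": list,
}
PASS_THRESHOLD = 4  # assumption: scores 4 and 5 pass, per the rubric above


def parse_judgment(raw):
    """Parse judge output and reject anything that does not fit the schema."""
    data = json.loads(raw)
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name} should be {expected_type.__name__}")
    # bool is a subclass of int in Python, so reject true/false as a score.
    if isinstance(data["score"], bool) or not 1 <= data["score"] <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    if data["passed"] != (data["score"] >= PASS_THRESHOLD):
        raise ValueError("passed flag disagrees with score")
    return data


raw = """
{
  "score": 2,
  "passed": false,
  "failure_reason": "The answer states that refunds are available after 60 days, but the provided policy says refunds are limited to 30 days.",
  "unsupported_claims": ["Refunds are available after 60 days."]
}
"""

judgment = parse_judgment(raw)
print(judgment["score"], judgment["passed"], len(judgment["unsupported_claims"]))
# 2 False 1
```

Counting rejected judgments as their own failure category keeps malformed judge output from silently inflating or deflating the quality score.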

Screenshot suggestion: judge rubric

Show a judge configuration screen with criteria for groundedness, policy compliance, and completeness. Include the score scale, pass threshold, and a sample failed judgment with the reason field expanded.

Calibrate LLM judges before trusting them

LLM judges are useful for open-ended outputs, but they are still model calls. They can be inconsistent, biased toward longer answers, overly forgiving, or too harsh on outputs that differ from their preferred wording.

If you use LLM-as-a-judge, calibrate it against a labeled set before using it as a release gate.

A basic calibration process:

  1. Collect 50 to 200 representative examples.
  2. Have domain reviewers label the expected score or pass/fail outcome.
  3. Run the LLM judge on the same examples.
  4. Compare agreement rate, false passes, and false fails.
  5. Revise the rubric and judge prompt.
  6. Freeze a judge version so metric trends stay comparable.

| Calibration metric | Target | What it tells you |
| --- | --- | --- |
| Agreement with reviewer labels | 80% or higher for low-risk tasks, higher for regulated domains | Whether the judge generally matches your standard |
| False pass rate | Less than 5% for high-risk failures | Whether bad outputs slip through |
| False fail rate | Low enough to avoid blocking good releases | Whether the judge creates too much noise |
| Score stability | Same result on repeated runs at least 90% of the time | Whether the judge is consistent enough for trend tracking |
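The comparison in step 4 can be sketched as below. The denominators are an assumption, since the rates can be defined in more than one way: here the false pass rate is the share of reviewer-failed examples that the judge passed, and the false fail rate is the share of reviewer-passed examples that the judge failed. The labels are made up.

```python
def calibration_report(reviewer_pass, judge_pass):
    """Compare judge pass/fail decisions against reviewer labels."""
    if len(reviewer_pass) != len(judge_pass) or not reviewer_pass:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(reviewer_pass, judge_pass))
    agreements = sum(r == j for r, j in pairs)
    reviewer_fails = sum(not r for r, _ in pairs)
    reviewer_passes = len(pairs) - reviewer_fails
    false_passes = sum(j and not r for r, j in pairs)
    false_fails = sum(r and not j for r, j in pairs)
    return {
        "agreement_rate": agreements / len(pairs),
        "false_pass_rate": false_passes / reviewer_fails if reviewer_fails else 0.0,
        "false_fail_rate": false_fails / reviewer_passes if reviewer_passes else 0.0,
    }


# Hypothetical labels for ten examples (True means "pass").
reviewer = [True, True, True, True, True, True, False, False, False, False]
judge = [True, True, True, True, True, False, True, False, False, False]

report = calibration_report(reviewer, judge)
print(report)
```

In this toy data the judge agrees with reviewers 80% of the time, yet it passes one of the four outputs that reviewers failed, a 25% false pass rate. Agreement alone would hide that.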

Do not use an uncalibrated judge as your only production quality signal. Pair it with deterministic checks, reviewer labels, user feedback, and operational metrics.

Build an eval dataset that reflects production

Your metrics are only as good as the examples they run on. Ten hand-picked examples can catch obvious regressions, but they will not tell you whether a system is ready for real traffic.

A useful eval dataset usually includes:

  • Common cases: The top user intents or workflows by traffic.
  • Edge cases: Ambiguous prompts, missing context, conflicting instructions, malformed inputs.
  • High-risk cases: Refunds, medical advice, legal claims, account access, billing changes, security-sensitive actions.
  • Regression cases: Past production failures that should never return.
  • Adversarial cases: Prompt injection, policy bypass attempts, irrelevant context, tool misuse triggers.

As a starting point, use 100 to 300 examples for a meaningful offline eval of a single prompt or workflow. For high-risk agents, use more. For quick local iteration, keep a smaller smoke test of 20 to 50 examples that runs on every prompt change.
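One way to derive the smaller smoke test from the full dataset is to sample the same number of examples from each category, so that rare but important cases are not dropped. This is a sketch: the `category` field, the helper name, and the fixed seed are assumptions, not part of any particular tool.

```python
import random
from collections import defaultdict

CATEGORIES = ["common", "edge", "high_risk", "regression", "adversarial"]


def build_smoke_set(examples, per_category, seed=0):
    """Pick up to per_category examples from each category, reproducibly."""
    by_category = defaultdict(list)
    for example in examples:
        by_category[example["category"]].append(example)
    rng = random.Random(seed)  # fixed seed keeps the smoke set stable across runs
    smoke = []
    for category in sorted(by_category):
        group = by_category[category]
        smoke.extend(rng.sample(group, min(per_category, len(group))))
    return smoke


# Hypothetical full dataset: 200 examples, 40 per category.
examples = [
    {"id": f"{category}-{i}", "category": category}
    for category in CATEGORIES
    for i in range(40)
]

smoke = build_smoke_set(examples, per_category=6)
print(len(smoke))  # 30
```

A fixed seed matters here: if the smoke set changes between runs, a score change could come from the sample rather than from the prompt.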

Screenshot suggestion: eval dataset

Show a dataset table with columns for input, expected behavior, reference answer, tags, risk level, source document, and past failure ID. Include tags such as “billing,” “policy,” “prompt-injection,” and “regression.”

Set thresholds before comparing versions

Define pass thresholds before you run a model bake-off or prompt comparison. If you set thresholds after seeing the results, you can easily rationalize a risky release.

Use thresholds that match the product risk. A casual brainstorming assistant can tolerate more variability than an agent that changes customer subscriptions.

| Metric | Example threshold | Release rule |
| --- | --- | --- |
| Task success rate | At least 85% | Must improve by 3 percentage points or stay within 1 point of current production |
| Unsupported claim rate | Less than 2% | Any increase above threshold blocks release |
| JSON schema validity | At least 99.5% | Must not regress |
| P95 latency | Under 4 seconds | Must stay under budget for target route |
| Average cost per successful task | Under $0.03 | Must not increase by more than 10% without approval |

For agent workflows, track cost per successful task instead of cost per model call. A cheaper model that needs retries, extra tool calls, or escalation may cost more in practice.
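The arithmetic behind that claim is easy to check. In the sketch below, all prices, call counts, and success rates are invented for illustration: a larger model costs $0.030 per task and succeeds 90% of the time, while a cheaper model costs $0.008 per call but averages three calls per task and succeeds 60% of the time.

```python
def cost_per_successful_task(runs):
    """Total spend across all runs divided by the number of successful runs."""
    total_cost = sum(run["cost_usd"] for run in runs)
    successes = sum(1 for run in runs if run["success"])
    if successes == 0:
        return float("inf")
    return total_cost / successes


# Made-up numbers: 100 tasks per model.
large_model_runs = [{"cost_usd": 0.030, "success": i < 90} for i in range(100)]
small_model_runs = [{"cost_usd": 0.008 * 3, "success": i < 60} for i in range(100)]

large_cost = cost_per_successful_task(large_model_runs)
small_cost = cost_per_successful_task(small_model_runs)

print(f"large model: ${large_cost:.4f} per successful task")  # $0.0333
print(f"small model: ${small_cost:.4f} per successful task")  # $0.0400
```

Per call, the small model is almost four times cheaper. Per successful task, it costs about 20% more, because failed runs still consume budget.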

Track metric trends over time

A single eval run tells you whether one version passed on one dataset at one point in time. Production systems need trend tracking. Prompts change, model providers update behavior, traffic shifts, and retrieval corpora drift.

Track metrics by prompt version, model, dataset version, route, customer segment, and release. This makes regressions easier to isolate.
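Slicing results by those dimensions can be as simple as a grouped success rate. This is a minimal sketch that assumes each eval record is a dictionary carrying the dimensions as fields; the field names and values are hypothetical.

```python
from collections import defaultdict


def success_rate_by(records, *keys):
    """Success rate for each combination of the given record fields."""
    totals = defaultdict(lambda: [0, 0])  # group -> [successes, count]
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group][0] += int(record["success"])
        totals[group][1] += 1
    return {group: wins / count for group, (wins, count) in totals.items()}


# Made-up eval records: prompt v1 succeeds on 9 of 10 runs, v2 on 7 of 10.
records = [
    {"prompt_version": "v1", "model": "model-a", "success": i < 9}
    for i in range(10)
] + [
    {"prompt_version": "v2", "model": "model-a", "success": i < 7}
    for i in range(10)
]

rates = success_rate_by(records, "prompt_version", "model")
print(rates)
```

With the same model on both sides, the drop from 0.9 to 0.7 can be attributed to the prompt change. Adding `dataset_version` or `route` as further keys narrows a regression in the same way.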

Screenshot suggestion: metric trends dashboard

Show a dashboard with task success rate, groundedness score, P95 latency, cost per request, and judge failure rate over the last 30 days. Include annotations for prompt releases and model changes.

Trend tracking also helps you catch slow degradation. For example, if retrieval recall drops after a documentation migration, answer quality may decline even though the generation prompt did not change.

Debug failures with traces

When an eval fails, engineers need the full context behind the score. Store the prompt version, input, expected output, retrieved documents, model response, tool calls, judge reasoning, latency, token usage, and cost.

A trace should answer these questions:

  • Was the input routed to the right prompt or agent?
  • Did retrieval return the needed context?
  • Did the prompt include conflicting instructions?
  • Did the model ignore the context or fail to follow the format?
  • Did a tool call use the wrong arguments?
  • Did retries or fallbacks change the final response?

Screenshot suggestion: failed response trace

Show a trace where a support assistant gives the wrong refund window. Include the user input, retrieved policy document, generated answer, judge failure reason, latency, token cost, and prompt version diff.

Common mistakes to avoid

Using only accuracy

Accuracy is useful for narrow tasks with clear labels. It is weak for open-ended generation, RAG, agents, and workflows with safety or cost constraints. Add metrics for groundedness, instruction following, latency, cost, and user outcome.

Trusting LLM judges without calibration

An LLM judge can make your eval suite faster, but it needs a rubric and a labeled comparison set. Track false passes carefully. False passes are usually more dangerous than false fails because they let bad behavior ship.

Optimizing for one metric

If you optimize only for answer completeness, the model may become verbose and slow. If you optimize only for cost, quality may drop. Use a small metric portfolio with clear tradeoff rules.

Ignoring latency and cost

A prompt that improves quality by 2 percentage points but doubles P95 latency may hurt the product. Measure latency and cost on every eval run, especially when comparing models or adding agent steps.

Evaluating on too few examples

Small datasets are good for fast iteration, but they can hide regressions. Keep a smoke test for development and a larger release eval for shipping decisions.

Choosing metrics that do not match user outcomes

A metric can be technically valid and still useless. BLEU, ROUGE, or cosine similarity may be poor fits when users care about policy compliance, correct tool use, or whether the task was completed.

A practical metric selection checklist

Use this checklist when adding or revising evals for a prompt, agent, or workflow:

  1. Write the user outcome in one sentence.
  2. List the top 5 failure modes seen in production or expected during launch.
  3. Pick 2 to 4 primary metrics tied to task success.
  4. Add guardrails for safety, grounding, format, latency, and cost.
  5. Create a dataset with common, edge, high-risk, and regression examples.
  6. Define pass thresholds before running comparisons.
  7. Calibrate any LLM judge against reviewer labels.
  8. Store traces for every failed eval.
  9. Track metric trends by prompt version, model, and dataset version.
  10. Review failures after each release and add new regression cases.

Example metric set for a RAG support assistant

| Metric | Type | Target | Scoring method |
| --- | --- | --- | --- |
| Answer resolves user issue | Primary | At least 85% | Reviewer label or calibrated judge |
| Groundedness | Primary | Average score at least 4.3 out of 5 | Calibrated judge with source context |
| Citation correctness | Guardrail | At least 95% | Source matching check plus spot review |
| Policy violation rate | Guardrail | Less than 1% | Rules plus judge review |
| Escalation correctness | Guardrail | At least 90% | Classification against expected route |
| P95 latency | Operational | Under 5 seconds | Trace data |
| Cost per resolved issue | Operational | Under $0.04 | Token and provider cost data |

This metric set is useful because each failure points to a likely fix. Poor groundedness may require retrieval changes or prompt constraints. Bad escalation correctness may require routing changes. High latency may require a smaller model, fewer retrieved chunks, caching, or shorter context.

Final rule: choose metrics you will act on

Every metric should support a decision. If nobody knows what to do when a metric changes, remove it or redefine it.

Good LLM metrics help your team answer practical questions:

  • Is this prompt version safer or riskier than the current one?
  • Did the new model improve task success enough to justify higher cost?
  • Which failures came from retrieval, generation, tools, or routing?
  • Can we ship this change, or should we block it?

When metrics match user outcomes, evals become part of the release process instead of a separate research task.


PromptLayer helps AI teams manage prompts, run evaluations, inspect traces, compare versions, and monitor quality over time. To start building a reliable eval workflow for your LLM application, create a PromptLayer account.
