How to Choose LLM Evaluation Metrics
Choosing LLM evaluation metrics is an engineering decision. The right metrics should tell you whether a prompt, model, agent, or workflow is ready to ship, whether a change caused a regression, and whether users are getting the outcome they came for.
A weak metric set creates false confidence. A chatbot can score high on generic accuracy while still refusing valid requests, inventing citations, taking 12 seconds to respond, or spending too much per run. A strong metric set connects eval results to real product behavior.
If you need a baseline definition first, start with this guide to LLM evaluation. Then use the framework below to choose metrics that fit your application.
Start with the user outcome
Before picking metrics, write down what a successful interaction means in product terms. Avoid starting with generic labels like “accuracy” or “quality.” They are too broad to guide prompt changes or release decisions.
Use this format:
- User goal: What is the user trying to accomplish?
- System responsibility: What must the LLM do correctly?
- Failure modes: What errors would hurt the user or the business?
- Decision: What will the eval result decide? Ship, block, route, roll back, or investigate?
For example, a support answer bot should not be evaluated only on whether the answer “sounds right.” It should be evaluated on whether it uses approved policy, resolves the issue, avoids unsafe claims, and returns fast enough for a live support flow.
| Application | User outcome | Useful metrics | Poor primary metric |
|---|---|---|---|
| Customer support assistant | User gets a correct, policy-compliant answer | Resolution rate, groundedness, policy compliance, escalation correctness, latency | Generic answer accuracy |
| RAG search assistant | User gets an answer backed by retrieved sources | Retrieval recall, citation correctness, groundedness, answer completeness | Fluency score |
| Code generation agent | Generated code passes tests and fits the repo | Unit test pass rate, build success, patch size, static analysis errors, cost per task | LLM judge preference only |
| Sales email generator | Draft is accurate, usable, and on-brand | Personalization accuracy, prohibited-claim rate, edit distance after review, tone compliance | Open-ended quality score |
Choose metrics by evaluation layer
Most LLM applications need more than one metric because they have more than one failure point. A RAG system can fail during retrieval, prompt construction, generation, formatting, or post-processing. An agent can fail during planning, tool use, state tracking, or final response generation.
Group metrics by layer so failures are easier to debug.
| Layer | What to measure | Example metric | Common failure caught |
|---|---|---|---|
| Input handling | Classification, routing, extraction, validation | Intent classification F1 | User sent to the wrong workflow |
| Retrieval | Whether the right context was retrieved | Recall@5, source coverage | Correct answer impossible because context is missing |
| Prompt response | Correctness, completeness, instruction following | Task success score, rubric score | Model ignores required format or misses key facts |
| Grounding | Whether claims are supported by context | Unsupported claim rate | Answer invents details outside retrieved docs |
| Tool use | Correct tool selection and arguments | Tool call accuracy | Agent calls refund API with wrong account ID |
| Operational performance | Latency, cost, errors, retry rate | P95 latency, cost per successful task | Prompt is correct but too slow or expensive |
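As one concrete instance of the retrieval layer, Recall@k can be computed from the ranked document IDs a retriever returns and a hand-labeled set of relevant IDs. The sketch below is illustrative only: the function name and the document IDs are assumptions, not part of any specific tool.

```python
from typing import Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical example: two documents are needed to answer the question,
# but only one of them is ranked inside the top 5.
retrieved = ["doc-7", "doc-2", "doc-9", "doc-4", "doc-1", "doc-3"]
relevant = {"doc-2", "doc-3"}
score = recall_at_k(retrieved, relevant, k=5)
```

A low score here means the generator never saw the context it needed, so the fix belongs in retrieval rather than in the prompt.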
Connect these metrics to traces when possible. LLM observability helps you see the prompt version, model, inputs, retrieved context, tool calls, response, latency, cost, and eval result in one place.
Use a small set of primary and guardrail metrics
A practical eval suite usually has 2 to 4 primary metrics and 3 to 6 guardrail metrics.
Primary metrics decide whether the system solves the task. Guardrail metrics catch unacceptable behavior even when the main task succeeds.
| Metric type | Purpose | Examples | Release use |
|---|---|---|---|
| Primary | Measure product success | Task success rate, resolution rate, test pass rate | Must improve or stay above threshold |
| Quality guardrail | Prevent bad outputs | Hallucination rate, policy violation rate, format failure rate | Must stay below threshold |
| Operational guardrail | Control production performance | P95 latency, cost per request, timeout rate | Must stay within budget |
| Experience guardrail | Protect usability | Refusal rate, clarification rate, verbosity score | Must not regress |
For example, a legal document summarizer might use:
- Primary metric: Key issue coverage score greater than or equal to 4 out of 5.
- Guardrail: Unsupported legal claim rate less than 2%.
- Guardrail: Required disclaimer present in 99% of applicable outputs.
- Guardrail: P95 latency under 8 seconds.
- Guardrail: Average cost under $0.06 per summary.
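The summarizer thresholds above could be encoded as a small release gate. This is a minimal sketch under stated assumptions: the metric names, the `Threshold` class, and the sample results are hypothetical, and the coverage score is assumed to be an average over the eval set.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    kind: str  # "primary" or "guardrail"
    compare: Callable[[float, float], bool]  # called as compare(value, bound)
    bound: float


THRESHOLDS = [
    Threshold("key_issue_coverage", "primary", operator.ge, 4.0),
    Threshold("unsupported_claim_rate", "guardrail", operator.lt, 0.02),
    Threshold("disclaimer_present_rate", "guardrail", operator.ge, 0.99),
    Threshold("p95_latency_seconds", "guardrail", operator.lt, 8.0),
    Threshold("avg_cost_usd", "guardrail", operator.lt, 0.06),
]


def failed_checks(results: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return the metrics that are missing from the results or violate their threshold."""
    failures = []
    for t in thresholds:
        value = results.get(t.metric)
        if value is None or not t.compare(value, t.bound):
            failures.append(t.metric)
    return failures


# Hypothetical eval run: the primary metric passes, one guardrail does not.
run = {
    "key_issue_coverage": 4.2,
    "unsupported_claim_rate": 0.03,
    "disclaimer_present_rate": 0.995,
    "p95_latency_seconds": 6.1,
    "avg_cost_usd": 0.05,
}
failures = failed_checks(run, THRESHOLDS)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a run that silently skips a guardrail should not pass the gate.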
Match the metric to the task type
Different LLM tasks need different scoring methods. Exact match works for classification. It fails for open-ended writing, multi-step agents, and summaries where several valid outputs may exist.
| Task type | Good metric choices | Notes |
|---|---|---|
| Classification | Accuracy, precision, recall, F1 | Report precision, recall, and F1 rather than accuracy alone when classes are imbalanced. Safety violations, for example, may be rare but important. |
| Extraction | Field-level precision and recall, JSON validity, schema compliance | Score each field separately so one bad field does not hide systemic errors. |
| RAG answers | Answer correctness, groundedness, citation accuracy, retrieval recall | Separate retrieval failure from generation failure. |
| Summarization | Coverage, factual consistency, compression ratio, prohibited omission rate | Use rubrics tied to required facts, not generic readability alone. |
| Agents | Task completion, tool call accuracy, step count, recovery rate, cost per completed task | Evaluate intermediate actions, not only the final answer. |
| Code generation | Unit test pass rate, build success, lint errors, security findings | Prefer executable checks when possible. |
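For the extraction row, scoring each field separately might look like the sketch below. It assumes exact-match comparison on string values and uses `None` for "no value"; the field names and records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def field_precision_recall(
    predictions: List[Record], references: List[Record]
) -> Dict[str, Dict[str, float]]:
    """Per-field precision and recall using exact match on field values."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, ref in zip(predictions, references):
        for name in set(pred) | set(ref):
            p, r = pred.get(name), ref.get(name)
            if p is not None and p == r:
                counts[name]["tp"] += 1
            else:
                if p is not None:
                    counts[name]["fp"] += 1  # extracted a wrong or unexpected value
                if r is not None:
                    counts[name]["fn"] += 1  # expected value was missed or wrong
    report = {}
    for name, c in counts.items():
        predicted = c["tp"] + c["fp"]
        expected = c["tp"] + c["fn"]
        report[name] = {
            "precision": c["tp"] / predicted if predicted else 0.0,
            "recall": c["tp"] / expected if expected else 0.0,
        }
    return report


# Hypothetical invoices: IDs are always right, totals are missed once and wrong once.
preds = [
    {"invoice_id": "A1", "total": "10.00"},
    {"invoice_id": "A2", "total": None},
    {"invoice_id": "A3", "total": "7.50"},
]
refs = [
    {"invoice_id": "A1", "total": "10.00"},
    {"invoice_id": "A2", "total": "5.00"},
    {"invoice_id": "A3", "total": "7.05"},
]
report = field_precision_recall(preds, refs)
```

A single record-level accuracy number would report one correct record out of three and hide the fact that every error comes from the `total` field.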
Create rubrics that produce actionable failures
Rubrics should help engineers fix prompts and workflows. A vague 1 to 5 “quality” score does not tell you what went wrong. A useful rubric names the criteria, defines each score, and includes examples.
Here is a compact rubric for grounded RAG answers:
| Score | Groundedness definition | Action |
|---|---|---|
| 5 | All factual claims are directly supported by retrieved sources. | Pass |
| 4 | Minor wording extrapolation, but no unsupported material claim. | Pass with review if high-risk domain |
| 3 | One unsupported claim that does not change the core answer. | Investigate |
| 2 | One or more unsupported claims that could mislead the user. | Fail |
| 1 | Answer mostly ignores or contradicts retrieved sources. | Fail and block release |
A judge prompt should ask for structured output. For example:
```json
{
  "score": 2,
  "passed": false,
  "failure_reason": "The answer states that refunds are available after 60 days, but the provided policy says refunds are limited to 30 days.",
  "unsupported_claims": [
    "Refunds are available after 60 days."
  ]
}
```
Screenshot suggestion: judge rubric
Show a judge configuration screen with criteria for groundedness, policy compliance, and completeness. Include the score scale, pass threshold, and a sample failed judgment with the reason field expanded.
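Structured judge output is only useful if the pipeline rejects malformed responses. The following sketch validates the JSON shape shown above using the standard library; `parse_judgment` and `PASS_THRESHOLD` are hypothetical names, and the pass threshold of 4 is an assumption based on the rubric's "Pass" rows.

```python
import json
from typing import Any, Dict

PASS_THRESHOLD = 4  # assumption: rubric scores 4 and 5 count as a pass


def parse_judgment(raw: str) -> Dict[str, Any]:
    """Parse judge output and reject anything that does not match the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if not isinstance(data.get("failure_reason"), str):
        raise ValueError("failure_reason must be a string")
    claims = data.get("unsupported_claims")
    if not isinstance(claims, list) or not all(isinstance(c, str) for c in claims):
        raise ValueError("unsupported_claims must be a list of strings")
    # Derive pass/fail from the score instead of trusting the judge's own flag.
    data["passed"] = score >= PASS_THRESHOLD
    return data


raw_output = """
{
  "score": 2,
  "passed": false,
  "failure_reason": "The answer states that refunds are available after 60 days, but the provided policy says refunds are limited to 30 days.",
  "unsupported_claims": ["Refunds are available after 60 days."]
}
"""
judgment = parse_judgment(raw_output)
```

Recomputing `passed` from the score keeps the release gate consistent even when the judge returns a flag that contradicts its own score.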
Calibrate LLM judges before trusting them
LLM judges are useful for open-ended outputs, but they are still model calls. They can be inconsistent, biased toward longer answers, overly forgiving, or too harsh on outputs that differ from their preferred wording.
If you use LLM-as-a-judge, calibrate it against a labeled set before using it as a release gate.
A basic calibration process:
1. Collect 50 to 200 representative examples.
2. Have domain reviewers label the expected score or pass/fail outcome.
3. Run the LLM judge on the same examples.
4. Compare agreement rate, false passes, and false fails.
5. Revise the rubric and judge prompt.
6. Freeze a judge version so metric trends stay comparable.
| Calibration metric | Target | What it tells you |
|---|---|---|
| Agreement with reviewer labels | 80% or higher for low-risk tasks, higher for regulated domains | Whether the judge generally matches your standard |
| False pass rate | Less than 5% for high-risk failures | Whether bad outputs slip through |
| False fail rate | Low enough to avoid blocking good releases | Whether the judge creates too much noise |
| Score stability | Same result on repeated runs at least 90% of the time | Whether the judge is consistent enough for trend tracking |
Do not use an uncalibrated judge as your only production quality signal. Pair it with deterministic checks, reviewer labels, user feedback, and operational metrics.
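The first three calibration metrics in the table can be computed from paired pass/fail labels. This sketch assumes one definition among several reasonable ones: false pass rate is the share of reviewer-failed examples that the judge passed, and false fail rate is the share of reviewer-passed examples that the judge failed. The function name and sample labels are hypothetical.

```python
from typing import Dict, List


def calibration_report(reviewer_pass: List[bool], judge_pass: List[bool]) -> Dict[str, float]:
    """Compare judge verdicts against reviewer labels on the same examples."""
    if not reviewer_pass or len(reviewer_pass) != len(judge_pass):
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(reviewer_pass, judge_pass))
    agreements = sum(r == j for r, j in pairs)
    reviewer_fails = sum(not r for r, _ in pairs)
    reviewer_passes = len(pairs) - reviewer_fails
    false_passes = sum((not r) and j for r, j in pairs)  # bad output the judge let through
    false_fails = sum(r and (not j) for r, j in pairs)   # good output the judge blocked
    return {
        "agreement": agreements / len(pairs),
        "false_pass_rate": false_passes / reviewer_fails if reviewer_fails else 0.0,
        "false_fail_rate": false_fails / reviewer_passes if reviewer_passes else 0.0,
    }


# Hypothetical labels for ten examples (True means "pass").
reviewer = [True, True, True, False, False, True, False, True, True, True]
judge = [True, True, False, False, True, True, False, True, True, True]
calibration = calibration_report(reviewer, judge)
```

In this example the judge agrees with reviewers 80% of the time yet passes one of the three outputs reviewers rejected, which is why agreement alone is not enough to approve a judge for high-risk checks.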
Build an eval dataset that reflects production
Your metrics are only as good as the examples they run on. Ten hand-picked examples can catch obvious regressions, but they will not tell you whether a system is ready for real traffic.
A useful eval dataset usually includes:
- Common cases: The top user intents or workflows by traffic.
- Edge cases: Ambiguous prompts, missing context, conflicting instructions, malformed inputs.
- High-risk cases: Refunds, medical advice, legal claims, account access, billing changes, security-sensitive actions.
- Regression cases: Past production failures that should never return.
- Adversarial cases: Prompt injection, policy bypass attempts, irrelevant context, tool misuse triggers.
As a starting point, use 100 to 300 examples for a meaningful offline eval of a single prompt or workflow. For high-risk agents, use more. For quick local iteration, keep a smaller smoke test of 20 to 50 examples that runs on every prompt change.
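One way to see why dataset size matters is to put a confidence interval around a measured pass rate. The sketch below uses the Wilson score interval with z = 1.96 (about 95% confidence). It assumes examples are independent and representative of production, which real eval sets only approximate.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate of successes / n."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# The same observed 85% pass rate at two dataset sizes.
smoke_low, smoke_high = wilson_interval(17, 20)
release_low, release_high = wilson_interval(255, 300)
```

With 20 examples, an observed 85% pass rate is consistent with a true rate anywhere from roughly 64% to 95%. With 300 examples the range narrows to roughly 81% to 89%, which is tight enough to compare against a release threshold.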
Screenshot suggestion: eval dataset
Show a dataset table with columns for input, expected behavior, reference answer, tags, risk level, source document, and past failure ID. Include tags such as “billing,” “policy,” “prompt-injection,” and “regression.”
Set thresholds before comparing versions
Define pass thresholds before you run a model bake-off or prompt comparison. If you set thresholds after seeing the results, you can easily rationalize a risky release.
Use thresholds that match the product risk. A casual brainstorming assistant can tolerate more variability than an agent that changes customer subscriptions.
| Metric | Example threshold | Release rule |
|---|---|---|
| Task success rate | At least 85% | Must improve by at least 3 percentage points, or at minimum not fall more than 1 point below current production |
| Unsupported claim rate | Less than 2% | Any increase above threshold blocks release |
| JSON schema validity | At least 99.5% | Must not regress |
| P95 latency | Under 4 seconds | Must stay under budget for target route |
| Average cost per successful task | Under $0.03 | Must not increase by more than 10% without approval |
For agent workflows, track cost per successful task instead of cost per model call. A cheaper model that needs retries, extra tool calls, or escalation may cost more in practice.
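The arithmetic behind that comparison is simple: divide total spend, including failed attempts, by the number of tasks that succeeded. The sketch below uses made-up costs and a hypothetical `TaskRun` record to illustrate it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskRun:
    succeeded: bool
    cost_usd: float  # total spend for the task, including retries and tool calls


def cost_per_successful_task(runs: Iterable[TaskRun]) -> float:
    """Total spend across all tasks divided by the number of successful tasks."""
    runs = list(runs)
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return float("inf")  # money was spent and nothing succeeded
    return sum(r.cost_usd for r in runs) / successes


# Hypothetical comparison: the smaller model is cheaper per call but retries more
# and fails more often, so each success ends up costing more.
small_model = [TaskRun(True, 0.02), TaskRun(False, 0.03), TaskRun(True, 0.02), TaskRun(False, 0.03)]
large_model = [TaskRun(True, 0.03), TaskRun(True, 0.03), TaskRun(True, 0.03), TaskRun(False, 0.03)]

small_cost = cost_per_successful_task(small_model)
large_cost = cost_per_successful_task(large_model)
```

Here the smaller model comes out at about $0.05 per successful task against about $0.04 for the larger one, even though no single call to the smaller model costs more.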
Track metric trends, not one-off scores
A single eval run tells you whether one version passed on one dataset at one point in time. Production systems need trend tracking. Prompts change, model providers update behavior, traffic shifts, and retrieval corpora drift.
Track metrics by prompt version, model, dataset version, route, customer segment, and release. This makes regressions easier to isolate.
Screenshot suggestion: metric trends dashboard
Show a dashboard with task success rate, groundedness score, P95 latency, cost per request, and judge failure rate over the last 30 days. Include annotations for prompt releases and model changes.
Trend tracking also helps you catch slow degradation. For example, if retrieval recall drops after a documentation migration, answer quality may decline even though the generation prompt did not change.
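Slicing the same eval records by different dimensions is what makes this kind of degradation visible. The sketch below is a minimal illustration: the record fields (`prompt_version`, `corpus_version`, `context_retrieved`) and their values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def rate_by(
    records: Iterable[Mapping[str, object]], dimension: str, flag: str
) -> Dict[str, float]:
    """Share of records where `flag` is truthy, grouped by one dimension."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for rec in records:
        key = str(rec[dimension])
        totals[key] += 1
        if rec[flag]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical records: the prompt never changed, but the document corpus did.
records = [
    {"prompt_version": "v12", "corpus_version": "before-migration", "context_retrieved": True},
    {"prompt_version": "v12", "corpus_version": "before-migration", "context_retrieved": True},
    {"prompt_version": "v12", "corpus_version": "before-migration", "context_retrieved": True},
    {"prompt_version": "v12", "corpus_version": "before-migration", "context_retrieved": False},
    {"prompt_version": "v12", "corpus_version": "after-migration", "context_retrieved": True},
    {"prompt_version": "v12", "corpus_version": "after-migration", "context_retrieved": False},
    {"prompt_version": "v12", "corpus_version": "after-migration", "context_retrieved": False},
    {"prompt_version": "v12", "corpus_version": "after-migration", "context_retrieved": False},
]

by_prompt = rate_by(records, "prompt_version", "context_retrieved")
by_corpus = rate_by(records, "corpus_version", "context_retrieved")
```

Grouped by prompt version, the data shows a single flat number. Grouped by corpus version, the drop after the migration stands out.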
Debug failures with traces
When an eval fails, engineers need the full context behind the score. Store the prompt version, input, expected output, retrieved documents, model response, tool calls, judge reasoning, latency, token usage, and cost.
A trace should answer these questions:
- Was the input routed to the right prompt or agent?
- Did retrieval return the needed context?
- Did the prompt include conflicting instructions?
- Did the model ignore the context or fail to follow the format?
- Did a tool call use the wrong arguments?
- Did retries or fallbacks change the final response?
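Some of these questions can be answered mechanically once the trace is stored as structured data. The sketch below is a rough first-pass triage, not a full diagnosis: the `EvalTrace` fields and the rule order are assumptions, and it only covers routing, retrieval, and generation.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class EvalTrace:
    prompt_version: str
    user_input: str
    expected_route: str
    actual_route: str
    required_doc_ids: List[str]
    retrieved_doc_ids: List[str]
    response: str
    judge_passed: bool
    judge_reason: str = ""


def likely_failure_layer(trace: EvalTrace) -> str:
    """Return the earliest layer that looks responsible for a failed eval."""
    if trace.judge_passed:
        return "none"
    if trace.actual_route != trace.expected_route:
        return "routing"
    if not set(trace.required_doc_ids) <= set(trace.retrieved_doc_ids):
        return "retrieval"
    # Routing and retrieval look fine, so the model had what it needed.
    return "generation"


# Hypothetical trace: the policy was retrieved, but the answer contradicts it.
trace = EvalTrace(
    prompt_version="v12",
    user_input="Can I get a refund after 45 days?",
    expected_route="refund_policy",
    actual_route="refund_policy",
    required_doc_ids=["refund-policy"],
    retrieved_doc_ids=["refund-policy", "shipping-faq"],
    response="Refunds are available after 60 days.",
    judge_passed=False,
    judge_reason="Policy limits refunds to 30 days.",
)
layer = likely_failure_layer(trace)

# Same failure, but this time the policy document was never retrieved.
missing_context = replace(trace, retrieved_doc_ids=["shipping-faq"])
missing_layer = likely_failure_layer(missing_context)
```

Checking layers in pipeline order matters: a generation fix is wasted effort if the required document never reached the prompt.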
Screenshot suggestion: failed response trace
Show a trace where a support assistant gives the wrong refund window. Include the user input, retrieved policy document, generated answer, judge failure reason, latency, token cost, and prompt version diff.
Common mistakes to avoid
Using only accuracy
Accuracy is useful for narrow tasks with clear labels. It is weak for open-ended generation, RAG, agents, and workflows with safety or cost constraints. Add metrics for groundedness, instruction following, latency, cost, and user outcome.
Trusting LLM judges without calibration
An LLM judge can make your eval suite faster, but it needs a rubric and a labeled comparison set. Track false passes carefully. False passes are usually more dangerous than false fails because they let bad behavior ship.
Optimizing for one metric
If you optimize only for answer completeness, the model may become verbose and slow. If you optimize only for cost, quality may drop. Use a small metric portfolio with clear tradeoff rules.
Ignoring latency and cost
A prompt that improves quality by 2 percentage points but doubles P95 latency may hurt the product. Measure latency and cost on every eval run, especially when comparing models or adding agent steps.
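Tail latency is easy to miss when only averages or medians are reported. The sketch below computes percentiles with the nearest-rank method, which is one of several common definitions; the latency values are made up.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in seconds for ten eval requests, including one slow outlier.
latencies = [1.2, 0.9, 1.1, 1.4, 1.0, 1.3, 0.8, 1.2, 1.1, 12.0]
p50 = percentile_nearest_rank(latencies, 50)
p95 = percentile_nearest_rank(latencies, 95)
```

In this sample the median is close to one second while P95 is the 12-second outlier, so a dashboard that tracks only the median would hide the slow request entirely.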
Evaluating on too few examples
Small datasets are good for fast iteration, but they can hide regressions. Keep a smoke test for development and a larger release eval for shipping decisions.
Choosing metrics that do not match user outcomes
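A metric can be technically valid and still useless. BLEU and ROUGE measure n-gram overlap with a reference text, and cosine similarity measures how close two embeddings are. None of them checks policy compliance, correct tool use, or whether the task was completed, so they are poor fits when those are the outcomes users care about.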
A practical metric selection checklist
Use this checklist when adding or revising evals for a prompt, agent, or workflow:
- Write the user outcome in one sentence.
- List the top 5 failure modes seen in production or expected during launch.
- Pick 2 to 4 primary metrics tied to task success.
- Add guardrails for safety, grounding, format, latency, and cost.
- Create a dataset with common, edge, high-risk, and regression examples.
- Define pass thresholds before running comparisons.
- Calibrate any LLM judge against reviewer labels.
- Store traces for every failed eval.
- Track metric trends by prompt version, model, and dataset version.
- Review failures after each release and add new regression cases.
Example metric set for a RAG support assistant
| Metric | Type | Target | Scoring method |
|---|---|---|---|
| Answer resolves user issue | Primary | At least 85% | Reviewer label or calibrated judge |
| Groundedness | Primary | Average score at least 4.3 out of 5 | Calibrated judge with source context |
| Citation correctness | Guardrail | At least 95% | Source matching check plus spot review |
| Policy violation rate | Guardrail | Less than 1% | Rules plus judge review |
| Escalation correctness | Guardrail | At least 90% | Classification against expected route |
| P95 latency | Operational | Under 5 seconds | Trace data |
| Cost per resolved issue | Operational | Under $0.04 | Token and provider cost data |
This metric set is useful because each failure points to a likely fix. Poor groundedness may require retrieval changes or prompt constraints. Bad escalation correctness may require routing changes. High latency may require a smaller model, fewer retrieved chunks, caching, or shorter context.
Final rule: choose metrics you will act on
Every metric should support a decision. If nobody knows what to do when a metric changes, remove it or redefine it.
Good LLM metrics help your team answer practical questions:
- Is this prompt version safer or riskier than the current one?
- Did the new model improve task success enough to justify higher cost?
- Which failures came from retrieval, generation, tools, or routing?
- Can we ship this change, or should we block it?
When metrics match user outcomes, evals become part of the release process instead of a separate research task.
PromptLayer helps AI teams manage prompts, run evaluations, inspect traces, compare versions, and monitor quality over time. To start building a reliable eval workflow for your LLM application, create a PromptLayer account.