How to Debug Failed LLM Evals
Failed LLM evals are useful only if your team can explain them. A red test run should tell you whether the problem came from the prompt, model, retrieval layer, tool call, grader, dataset, threshold, or test setup. If it only tells you “pass” or “fail,” your eval suite is too hard to operate.
This guide gives you a practical debugging workflow for failed evals in LLM applications, agents, and prompt chains. Use it when a prompt change breaks tests, a model migration drops quality, an agent starts choosing the wrong tools, or an automated judge rejects outputs that look fine to your team.
Start by freezing the failed eval run
Before you edit prompts, rerun tests, or change thresholds, preserve the failed run exactly as it happened. You need a stable record to compare against later.
Capture these fields for every failed case:
- Eval run ID: a unique identifier for the full run.
- Dataset version: the exact set of test inputs used.
- Prompt version: system prompt, user template, tool instructions, and any chain-specific messages.
- Model and parameters: model name, temperature, top_p, max tokens, seed if available, and provider.
- Retrieved context: document IDs, chunks, scores, and final context sent to the model.
- Tool calls: arguments, return values, retries, and errors.
- Raw model output: before parsing, post-processing, or response formatting.
- Grader output: score, reasoning, rubric version, and pass or fail decision.
- Expected output or criteria: the reference answer, required fields, or grading rule.
If you cannot reconstruct these fields, fix logging before you debug deeper. A failed eval without trace data usually becomes guesswork. This is where LLM observability becomes part of the eval workflow, not a separate production concern.
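The fields above can be captured in a single frozen record per failed case. The following is a minimal sketch, not a prescribed schema: the class name, field names, and helper functions are all hypothetical, and your logging stack may already provide an equivalent.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class FailedCaseRecord:
    """Hypothetical snapshot of one failed eval case (field names are illustrative)."""
    eval_run_id: str
    dataset_version: str
    prompt_version: str
    model: str
    params: dict[str, Any]
    retrieved_context: list[dict[str, Any]]
    tool_calls: list[dict[str, Any]]
    raw_output: str
    grader_output: dict[str, Any]
    expected: str


def missing_fields(record: FailedCaseRecord) -> list[str]:
    """Return the names of empty fields so logging gaps surface before debugging starts."""
    return [name for name, value in asdict(record).items() if value in ("", None, [], {})]


def freeze_record(record: FailedCaseRecord) -> str:
    """Serialize the record with sorted keys so two snapshots can be compared as text."""
    return json.dumps(asdict(record), sort_keys=True)
```

An empty retrieved_context or tool_calls list is legitimate for a pipeline with no retrieval or tools, so treat the output of missing_fields as a prompt to check, not as an error.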
Classify the failure before fixing it
Do not treat all failed evals the same way. A hallucinated answer, a strict grader, a missing retrieval chunk, and a malformed JSON response need different fixes.
Start with a simple failure taxonomy:
- Incorrect answer: the model gave the wrong fact, recommendation, classification, or action.
- Incomplete answer: the output missed required details, citations, fields, or constraints.
- Format failure: the answer was semantically fine but broke schema, JSON, XML, Markdown, or tool argument requirements.
- Instruction failure: the model ignored a system instruction, policy, style rule, or business constraint.
- Retrieval failure: the right source material was not retrieved or was buried under noisy context.
- Tool failure: the agent selected the wrong tool, passed bad arguments, or failed to recover from tool errors.
- Grader failure: the eval judged a good answer as bad or accepted a bad answer.
- Flaky failure: the same case passes and fails across repeated runs without a code or prompt change.
Add a failure category field to your eval results. After 50 to 100 failed cases, patterns usually become clear. For example, if 70% of failures are format failures, the next fix is probably schema validation and repair, not a new prompt rewrite.
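Once each result carries a category, the pattern check is a simple tally. This sketch assumes each failed result is a dict with a failure_category key; that key name is an assumption, not a standard.

```python
from collections import Counter


def category_shares(failures: list[dict]) -> dict[str, float]:
    """Return each failure category's share of all failures, most common first."""
    counts = Counter(f["failure_category"] for f in failures)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.most_common()}
```

If the first entry dominates, as in the 70% format-failure example above, that category is the natural place to start.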
Reproduce the failure in isolation
Rerun the smallest possible version of the failed case. Remove unrelated dataset rows, unrelated chain steps, and optional features. Your goal is to answer one question: can you make the same failure happen again under controlled conditions?
Run this sequence:
- Run the same input with the exact same prompt, model, parameters, retrieval results, and tools.
- If it fails again, rerun it 3 to 5 times with the same setup to check stability.
- If it passes on rerun, mark it as flaky and test with temperature 0 or the lowest available randomness setting.
- If it still flips between pass and fail, inspect the grader and output boundary cases.
For production-facing evals, repeated runs matter. A case that passes 6 out of 10 times is not stable enough for workflows like support automation, medical intake routing, financial summaries, or legal document review. Track pass rate per test case, not only suite-level score.
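Per-case pass rate can be measured with a small wrapper around whatever function runs and grades a single case. In this sketch, run_case is a hypothetical zero-argument callable that returns True on pass; the labels are illustrative.

```python
from typing import Callable


def pass_rate(run_case: Callable[[], bool], n_runs: int = 10) -> float:
    """Run one eval case n_runs times and return the fraction of passing runs."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    results = [run_case() for _ in range(n_runs)]
    return sum(results) / n_runs


def stability_label(rate: float) -> str:
    """Label a case by its pass rate: anything strictly between 0 and 1 is flaky."""
    if rate == 1.0:
        return "stable_pass"
    if rate == 0.0:
        return "stable_fail"
    return "flaky"
```

Storing the rate alongside the case ID makes a 6-out-of-10 case visible even when the suite-level score looks healthy.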
Inspect the input and expected behavior
Many eval failures start with unclear test cases. Before blaming the model, check whether the input and expected result are fair.
Look for these issues:
- Ambiguous input: the user request allows several valid answers.
- Outdated expected answer: the reference answer no longer matches current product behavior or source data.
- Hidden assumptions: the eval expects knowledge that is not present in the prompt or retrieved context.
- Overly narrow phrasing: the expected answer requires exact wording when multiple phrasings are acceptable.
- Conflicting criteria: the rubric rewards brevity but also requires a long explanation.
- Dataset leakage: the test case includes hints or labels that would never appear in production.
For example, an eval might expect the assistant to say “refunds are available within 30 days,” while the retrieved policy says “refund requests must be submitted within 30 days.” Those are close, but not identical. If the user asks whether they are guaranteed a refund, the stricter answer may be more correct.
Check whether the grader is the problem
Automated graders can fail too. This is especially common when you use an LLM judge for subjective criteria such as helpfulness, completeness, tone, or safety.
If you use an LLM as a judge, inspect the judge prompt with the same rigor you apply to your application prompt. A vague grader creates noisy test results.
Debug the grader with these checks:
- Compare judge decisions against human review: sample 20 to 50 failed cases, have reviewers label them, and measure how often the judge's decision matches theirs.
- Ask for structured judge output: require fields such as score, pass, failure_reason, and missing_requirements.
- Separate criteria: score factual accuracy, format, citation use, and instruction following independently.
- Test positive and negative examples: include clearly good and clearly bad answers to verify grader behavior.
- Watch for verbosity bias: judges often reward longer answers even when the product needs concise output.
- Pin the judge model: changing the judge model can change historical pass rates.
A good grader explains the failure in a way an engineer can act on. “Answer is poor” is not enough. “Missing required citation to the cancellation policy and gives a 14-day window instead of 30 days” is useful.
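Two of the checks above are easy to automate: validating structured judge output and measuring agreement with human review. The sketch below assumes the judge replies with a JSON object containing score, pass, failure_reason, and missing_requirements; the accepted types and the function names are assumptions.

```python
import json

# Field names come from the checklist above; the accepted types are an assumption.
REQUIRED_FIELDS = {
    "score": (int, float),
    "pass": bool,
    "failure_reason": str,
    "missing_requirements": list,
}


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and reject it if a field is missing or mistyped."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"judge output missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject True/False as a score.
        if name == "score" and isinstance(value, bool):
            raise ValueError("score must be a number, not a boolean")
        if not isinstance(value, expected_type):
            raise ValueError(f"judge field {name} has type {type(value).__name__}")
    return data


def judge_agreement(judge_pass: list[bool], human_pass: list[bool]) -> float:
    """Fraction of sampled cases where the judge's decision matches the human label."""
    if not judge_pass or len(judge_pass) != len(human_pass):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_pass, human_pass))
    return matches / len(judge_pass)
```

A judge reply that fails parsing is itself a grader failure and is worth counting separately from application failures.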
Compare the failed run against the last passing run
When an eval starts failing after a change, compare the failed run with the last known passing run. Do this at the trace level.
Check these diffs first:
- Prompt diff: system prompt, developer instructions, examples, and variable interpolation.
- Model diff: provider, model version, context window, token limits, and parameters.
- Retrieval diff: embedding model, chunking, filters, top_k, reranking, and source documents.
- Tool diff: tool descriptions, argument schemas, auth changes, timeout behavior, and error handling.
- Parser diff: schema changes, JSON repair, fallback behavior, and validation rules.
- Dataset diff: added rows, changed expected outputs, removed metadata, or modified rubrics.
Small prompt changes can cause large behavior changes. For example, adding “be concise” to a support agent may cause it to omit required escalation steps. Changing top_k from 8 to 4 may remove the one document that contains the answer. Moving to a cheaper model may preserve average quality while breaking tool selection on edge cases.
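If each run stores its configuration as a nested dict, the diffs above can be listed mechanically. This is a minimal sketch; the config layout and key names such as retrieval.top_k are hypothetical, and it assumes string keys.

```python
MISSING = "<missing>"  # sentinel shown when a key exists in only one run


def diff_configs(passing: dict, failing: dict, prefix: str = "") -> dict[str, tuple]:
    """Return {dotted.path: (passing_value, failing_value)} for every changed setting."""
    changes: dict[str, tuple] = {}
    for key in sorted(set(passing) | set(failing)):
        path = f"{prefix}{key}"
        old = passing.get(key, MISSING)
        new = failing.get(key, MISSING)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse so nested settings are reported by their full path.
            changes.update(diff_configs(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes
```

For long prompt strings, a line-level text diff is more readable than the old and new values side by side; this helper is meant for parameters and settings.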
Debug retrieval failures separately
For RAG systems, do not start by editing the answer prompt. First determine whether the model received the right information.
Ask these questions for each failed case:
- Was the required source document retrieved?
- Was the correct chunk retrieved, or only a nearby chunk?
- Did the relevant chunk rank high enough to survive context trimming?
- Did metadata filters exclude the right document?
- Did the retrieved context contain conflicting or outdated information?
- Did the prompt tell the model how to handle missing evidence?
A practical RAG eval should store retrieval metrics next to answer metrics. Track values such as recall@k, source precision, citation accuracy, and answer correctness. If recall@10 is low, improve retrieval before changing generation. If recall is high but answer correctness is low, inspect prompt instructions and context formatting.
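As an illustration of the rule above, recall@k for one case is the share of required documents that appear in the top k results. The triage helper and its recall_floor parameter are hypothetical; pick a floor that fits your data.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the relevant documents that appear among the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def triage(recall: float, answer_correct: bool, recall_floor: float = 1.0) -> str:
    """Suggest which layer to inspect first for a single RAG eval case."""
    if answer_correct:
        return "ok"
    if recall < recall_floor:
        return "improve retrieval before changing generation"
    return "inspect prompt instructions and context formatting"
```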
Debug agent and tool failures step by step
Agent evals can fail for several reasons before the final answer is generated. The model may choose the wrong tool, call tools in the wrong order, pass malformed arguments, ignore a tool result, or stop too early.
Break agent evals into step-level checks:
- Planning: did the agent identify the correct task?
- Tool selection: did it pick the right tool for the job?
- Argument construction: did it pass valid and complete arguments?
- Tool result handling: did it read the returned data correctly?
- Recovery: did it retry, ask for clarification, or fail safely when a tool returned an error?
- Final response: did it answer the user using the tool result?
For example, if an agent fails a “change shipping address” eval, the final answer may look wrong because the agent never called the account verification tool. A final-output grader will catch the failure, but a step-level eval will tell you where it happened.
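A step-level check can walk the recorded trace and report the first stage that went wrong. The sketch below covers only part of the list above, and it assumes a trace format that is purely illustrative: a list of dicts with tool, args_valid, and error keys. The tool names in the usage example are hypothetical.

```python
from typing import Optional


def first_failing_step(
    trace: list[dict], expected_tools: list[str], final_answer: str
) -> Optional[str]:
    """Return a label for the first failed stage, or None if every check passes."""
    called = [step["tool"] for step in trace]

    # Tool selection: each expected tool must appear, in the expected relative order.
    search_from = 0
    for tool in expected_tools:
        if tool not in called[search_from:]:
            return f"tool_selection: missing or out-of-order call to {tool}"
        search_from = called.index(tool, search_from) + 1

    # Argument construction: any call flagged as having invalid arguments.
    for step in trace:
        if step.get("args_valid") is False:
            return f"argument_construction: invalid arguments for {step['tool']}"

    # Recovery: an error on the last step means nothing was attempted afterwards.
    if trace and trace[-1].get("error"):
        return f"recovery: no step followed the error from {trace[-1]['tool']}"

    # Final response: the agent must actually answer the user.
    if not final_answer.strip():
        return "final_response: empty answer"
    return None
```

For the shipping address example, a trace containing only an update call would be reported as a tool selection failure for the verification tool, which points at the missing step instead of the wrong-looking final answer.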
Separate prompt issues from model issues
When an eval fails, teams often switch models too quickly. First test whether the prompt gives the model enough structure to succeed.
Run a small matrix:
- Current prompt with current model.
- Current prompt with stronger model.
- Improved prompt with current model.
- Improved prompt with stronger model.
If the stronger model passes with the same prompt, your current model may lack reasoning ability, instruction following, or tool-use reliability for that task. If the improved prompt fixes the current model, the issue was likely prompt clarity. If all combinations fail, inspect retrieval, tools, dataset, or grading.
This approach keeps model upgrades honest. A larger model may hide unclear instructions during testing, then still fail on edge cases in production.
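The matrix reading can be written down as a small decision function. This is a sketch of the reasoning above, with hypothetical keys ("current" or "improved" prompt, "current" or "stronger" model) and one added assumption: if only the combined change passes, both layers may be contributing.

```python
def interpret_matrix(passed: dict[tuple[str, str], bool]) -> list[str]:
    """Map pass/fail results for the four (prompt, model) combinations to likely causes."""
    if passed[("current", "current")]:
        return ["baseline passes: the failure was not reproduced"]
    findings = []
    if passed[("current", "stronger")]:
        findings.append("current model may lack capability for this task")
    if passed[("improved", "current")]:
        findings.append("prompt clarity was likely the issue")
    if not any(passed.values()):
        findings.append("inspect retrieval, tools, dataset, or grading")
    if not findings:
        # Assumption: only (improved prompt, stronger model) passed.
        findings.append("prompt and model may both contribute")
    return findings
```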
Look for flakiness and threshold problems
Some eval failures come from unstable outputs or unstable scoring. Treat flakiness as a product reliability problem, not a test annoyance.
Common causes include:
- High temperature on tasks that require deterministic answers.
- LLM judges with vague rubrics.
- Scores close to the pass threshold.
- Multiple valid answers with one narrow reference answer.
- Provider-side model updates.
- Race conditions in multi-step agent workflows.
For numeric evals, inspect the score distribution. If your threshold is 0.80 and many cases score between 0.78 and 0.82, your suite will produce noisy pass and fail results. Consider adding a review band, such as:
- Pass: score is 0.85 or higher.
- Manual review: score is at least 0.75 but below 0.85.
- Fail: score is below 0.75.
This gives your team a cleaner signal and prevents borderline cases from blocking every release.
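The review band is a three-way comparison. The sketch below uses the example thresholds from the list above as defaults; the function name and labels are illustrative.

```python
def score_band(score: float, pass_at: float = 0.85, review_at: float = 0.75) -> str:
    """Map a numeric eval score to pass, manual_review, or fail."""
    if review_at > pass_at:
        raise ValueError("review_at must not exceed pass_at")
    if score >= pass_at:
        return "pass"
    if score >= review_at:
        return "manual_review"
    return "fail"
```

Using half-open ranges means every score lands in exactly one band, including values such as 0.845 that fall between two rounded boundaries.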
Use a debugging checklist for every failed eval
A consistent checklist helps your team avoid random fixes. Use this sequence during triage:
- Confirm the failure: rerun the exact case and check whether it reproduces.
- Classify it: answer, format, instruction, retrieval, tool, grader, dataset, or flaky failure.
- Inspect trace data: prompt, model call, retrieved context, tool calls, raw output, parsed output, and grader result.
- Check the expected behavior: verify the dataset row and rubric are still valid.
- Compare with a passing run: identify prompt, model, retrieval, tool, parser, or dataset changes.
- Test the smallest fix: avoid changing prompt, retrieval, model, and grader at the same time.
- Rerun affected evals: run the failed case, nearby cases, and a regression subset.
- Record the root cause: update the eval result with the failure category and fix.
This process turns failed evals into engineering data. Over time, you can report that 35% of failures come from retrieval gaps, 25% from schema issues, 20% from judge noise, and 20% from prompt behavior. That is much easier to act on than a single aggregate score.
Fix one layer at a time
Do not make five changes and rerun the suite. You may improve the score, but you will not know which change worked. This makes future failures harder to debug.
Use targeted fixes:
- Prompt failure: add explicit requirements, examples, output constraints, or refusal rules.
- Format failure: add schema validation, structured output mode, retries, or a repair step.
- Retrieval failure: adjust chunking, filters, embeddings, reranking, or context assembly.
- Tool failure: clarify tool descriptions, tighten schemas, add argument validation, or improve retry logic.
- Grader failure: rewrite the rubric, separate scoring dimensions, or add human-reviewed calibration examples.
- Dataset failure: update stale expected answers, split ambiguous cases, or add metadata for category-level reporting.
- Model failure: change model, use routing, or reserve a stronger model for high-risk cases.
After each fix, rerun the original failed cases and a regression slice. A good regression slice includes cases that previously failed, cases that are similar, and cases that should not change. For a support bot, that might mean 20 refund questions, 20 cancellation questions, 20 account update questions, and 20 unrelated policy questions.
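A regression slice like the one described above can be assembled from the dataset itself. This sketch assumes each case is a dict with id and category keys, and it treats "similar" as "same category as a failed case", which is a simplification you may want to replace with your own similarity rule.

```python
def build_regression_slice(
    cases: list[dict], failed_ids: set, per_category: int = 20
) -> dict[str, list[dict]]:
    """Split cases into previously failed, similar (same category), and control groups."""
    failed = [case for case in cases if case["id"] in failed_ids]
    affected_categories = {case["category"] for case in failed}

    similar: list[dict] = []
    controls: list[dict] = []
    taken: dict[str, int] = {}
    for case in cases:
        if case["id"] in failed_ids:
            continue
        category = case["category"]
        # Cap each category so the slice stays small enough to rerun often.
        if taken.get(category, 0) >= per_category:
            continue
        taken[category] = taken.get(category, 0) + 1
        if category in affected_categories:
            similar.append(case)
        else:
            controls.append(case)
    return {"failed": failed, "similar": similar, "controls": controls}
```

The controls group plays the role of the cases that should not change: if their results move after a fix, the fix touched more than intended.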
Track eval health over time
Debugging gets easier when your eval system stores enough history. A mature LLM evaluation setup should help you answer these questions quickly:
- Which eval cases fail most often?
- Which prompts or chains cause the most regressions?
- Which failure categories are increasing?
- Which model changes improved quality and which created regressions?
- Which judge versions changed pass rates?
- Which dataset rows are noisy or outdated?
Version your prompts, datasets, graders, and eval runs together. If your team ships agents or multi-step chains, store each step in the trace. For complex pipelines, ideas related to an LLM compiler can help teams think about prompts, tools, and intermediate steps as a structured execution plan instead of one opaque model call.
A practical example
Say your team runs a 300-case eval suite for a customer support agent. The latest prompt version drops the pass rate from 91% to 83%.
A weak debugging process would start rewriting the prompt immediately. A stronger process would look like this:
- Freeze the failed run and compare it to the last passing run.
- Classify the 24 new failures (8 percentage points of 300 cases).
- Find that 14 are missing escalation instructions, 6 are JSON format failures, and 4 are judge disagreements.
- Inspect the prompt diff and find that an example showing escalation behavior was removed.
- Restore a shorter escalation example and add an explicit rule: “If the user reports fraud, account lockout, or unauthorized access, escalate to a human support queue.”
- Add schema validation for the JSON failures.
- Update the judge rubric for the 4 disagreement cases after human review.
- Rerun the full suite and confirm the pass rate recovers to at least 91% without reducing performance on other categories.
The key is that each failure type received a different fix. The team did not rely on one broad prompt change to solve unrelated problems.
What good eval debugging looks like
A strong eval debugging workflow has a few clear traits:
- Every failed case has a trace you can inspect.
- Every failure gets a category.
- Graders are tested and versioned.
- Retrieval and tool behavior are evaluated separately from final answers.
- Prompt, model, dataset, and judge changes are compared against previous runs.
- Flaky cases are measured, not ignored.
- Fixes are small enough to explain.
Failed evals are part of shipping LLM systems. The goal is not to avoid failures during development. The goal is to make each failure specific, reproducible, and useful enough to improve the system.
PromptLayer helps AI teams debug failed evals by connecting prompt versions, datasets, traces, model calls, and evaluation results in one workflow. If you are building LLM applications, agents, or prompt chains, you can create an account at https://dashboard.promptlayer.com/create-account.