How to Build Prompt Evals for LLM Apps

May 29, 2026

Prompt evals are tests that measure whether your LLM application behaves correctly across real inputs, edge cases, model changes, prompt updates, and production traffic. If you are shipping an LLM feature, evals should sit next to your prompts, traces, datasets, and release process.

A good prompt eval answers practical engineering questions:

  • Did this prompt change improve quality or make it worse?
  • Does the app still follow required output formats?
  • Can the model handle common user requests and known failure cases?
  • Does retrieval, tool use, or prompt chaining break under specific inputs?
  • Can we safely upgrade from one model version to another?

Without evals, teams usually rely on manual spot checks. That works for a demo. It breaks down when your app has hundreds of prompts, multiple environments, user-specific context, agents, tools, and frequent model changes.

Start with the behavior you need to protect

Before writing test cases, define the behavior your application must preserve. Do this in plain language first.

For example, a support ticket classifier might need to:

  • Return valid JSON every time.
  • Choose exactly one category from a fixed list.
  • Escalate billing disputes above $500.
  • Avoid making refund promises.
  • Ask for missing account details when needed.

Those requirements become eval criteria. If you skip this step, you risk building evals that score general answer quality but miss the failures that actually hurt your product.
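
As a minimal sketch of turning two of those requirements into named checks, the Python below assumes the classifier's parsed output is a dict with hypothetical `escalate` and `reply` fields, and that each test case records the disputed amount. The phrase list is illustrative only, not a complete policy.

```python
from typing import Optional

# Hypothetical phrase list; a real one would come from your own policy review.
REFUND_PROMISE_PHRASES = (
    "we will refund",
    "you will receive a refund",
    "refund has been approved",
)
ESCALATION_THRESHOLD_USD = 500  # taken from the requirement list above


def escalation_rule_ok(output: dict, dispute_amount: Optional[float]) -> bool:
    """Billing disputes above the threshold must set escalate to true."""
    if dispute_amount is not None and dispute_amount > ESCALATION_THRESHOLD_USD:
        return output.get("escalate") is True
    return True


def no_refund_promise(output: dict) -> bool:
    """The customer-facing reply must not contain a refund promise."""
    reply = str(output.get("reply", "")).lower()
    return not any(phrase in reply for phrase in REFUND_PROMISE_PHRASES)


def evaluate_requirements(output: dict, dispute_amount: Optional[float] = None) -> dict:
    """Return one named pass/fail result per requirement."""
    return {
        "escalation_rule": escalation_rule_ok(output, dispute_amount),
        "no_refund_promise": no_refund_promise(output),
    }


print(evaluate_requirements({"escalate": False, "reply": "We will refund you today."}, 750.0))
```

Keeping one function per requirement means a failing run tells you which rule broke, not just that something did.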

Define the unit you are evaluating

Be specific about what the eval covers. LLM apps often have several layers, and each layer needs a different test strategy.

  • Prompt-level evals: Test one prompt with fixed inputs and expected behavior.
  • Chain-level evals: Test a multi-step workflow where one model call feeds another, as in prompt chaining.
  • Retrieval evals: Test whether the right context was retrieved and whether the model used it correctly.
  • Tool-use evals: Test whether an agent chose the right tool, passed correct arguments, and handled the response.
  • End-to-end evals: Test the full user-facing workflow, including routing, context, tools, and final response.

Start narrow. A prompt-level eval is faster to build and easier to debug than a full end-to-end agent eval. Add broader tests once the core prompt behavior is stable.

Build a test dataset from real usage

Your eval dataset should reflect the inputs your app actually receives. Synthetic examples are useful at the start, but production traces usually expose the hard cases.

A strong eval dataset includes:

  • Happy path examples: Common requests your app should handle easily.
  • Edge cases: Ambiguous, incomplete, long, or unusual inputs.
  • Regression cases: Inputs that caused past bugs.
  • Adversarial cases: Prompt injection attempts, conflicting instructions, or malformed data.
  • High-value cases: Inputs tied to revenue, safety, compliance, or customer trust.

For most teams, a useful starting dataset is 50 to 200 examples. You do not need 10,000 examples to catch obvious regressions. If your app handles many intent types, build smaller datasets per task. For example, use 75 examples for ticket classification, 100 for answer generation, and 50 for escalation detection.

Choose the right eval type

Different LLM behaviors need different scoring methods. You will usually combine several eval types instead of relying on one score.

Exact match evals

Use exact match when the output has a fixed answer.

Good use cases:

  • Classification labels
  • Boolean decisions
  • Enum selection
  • Required routing choices

Example:

  • Input: “I was charged twice for my subscription.”
  • Expected: billing_issue
  • Pass condition: Output equals billing_issue
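
A minimal Python sketch of this check follows. It assumes you want to ignore surrounding whitespace and letter case before comparing; if your downstream code needs a byte-for-byte match, drop the normalization.

```python
def exact_match(output: str, expected: str) -> bool:
    """Compare two labels after trimming whitespace and lowercasing."""
    return output.strip().lower() == expected.strip().lower()


def accuracy(pairs: list) -> float:
    """Fraction of (output, expected) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(exact_match(out, exp) for out, exp in pairs) / len(pairs)


print(exact_match(" billing_issue\n", "billing_issue"))
print(accuracy([("billing_issue", "billing_issue"), ("sales", "billing_issue")]))
```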

Schema validation evals

Use schema validation when your app depends on structured output. This catches broken JSON, missing fields, incorrect types, and invalid enum values.

Example checks:

  • Output is valid JSON.
  • priority is one of low, medium, or high.
  • confidence is a number between 0 and 1.
  • next_action is present for every escalation case.

Schema evals should be table stakes for production LLM workflows. They are cheap, deterministic, and catch failures that can break downstream code.
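
The example checks above can be sketched with the Python standard library alone. The field names come from the list; the `is_escalation` flag is an assumption about how a test case marks escalation cases. A schema library would do the same job with less code, so treat this as an illustration of the idea.

```python
import json

VALID_PRIORITIES = {"low", "medium", "high"}


def validate_output(raw: str, is_escalation: bool = False) -> list:
    """Return a list of schema errors; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    errors = []
    priority = data.get("priority")
    if not isinstance(priority, str) or priority not in VALID_PRIORITIES:
        errors.append("priority must be one of low, medium, high")

    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        errors.append("confidence must be a number between 0 and 1")

    if is_escalation and not data.get("next_action"):
        errors.append("next_action is required for escalation cases")
    return errors


print(validate_output('{"priority": "urgent", "confidence": 1.5}', is_escalation=True))
```

Returning every error, instead of stopping at the first, makes failure reports easier to read when a prompt change breaks several fields at once.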

Reference-based evals

Use reference-based evals when there is a known good answer, but the wording may vary.

For example, a retrieval QA system might need to answer: “What is the refund window for annual plans?” The expected answer may be “30 days,” but the model could phrase it several ways. You can score whether the response contains the correct policy, cites the right source, and avoids unsupported claims.
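
One simple way to score this is to list the acceptable phrasings and the expected source, then check for both. The sketch below assumes the app returns cited source IDs separately from the answer text, and the ID `doc_refunds` is hypothetical. Substring matching is crude (it would also accept "130 days"), and it does not detect unsupported claims, which usually needs a judge or human review.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so phrasing comparisons are stable."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def reference_check(response: str, accepted_phrasings: list,
                    cited_ids: list, expected_source_id: str) -> dict:
    """Check that the response states the reference fact and cites its source."""
    norm = normalize(response)
    return {
        "contains_reference": any(normalize(p) in norm for p in accepted_phrasings),
        "cites_expected_source": expected_source_id in cited_ids,
    }


print(reference_check(
    "Annual plans can be refunded within 30 days of purchase.",
    ["30 days", "thirty days"],
    ["doc_refunds"],
    "doc_refunds",
))
```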

LLM-as-judge evals

Use an LLM judge when quality is subjective or hard to score with rules. This works well for tone, completeness, instruction following, and groundedness, but it needs careful setup.

A judge prompt should include:

  • The task definition
  • The input
  • The model output
  • Any reference answer or retrieved context
  • A scoring rubric
  • A required output format

Use a small numeric scale, such as 1 to 5, and define each score clearly. Avoid vague rubrics like “rate the answer quality.” A better rubric is: “Score 5 if the answer fully resolves the user request using only the provided context. Score 3 if it is partially correct but missing a key condition. Score 1 if it contains unsupported claims or contradicts the provided context.”
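
To make the pieces concrete, here is a sketch that assembles a judge prompt from those parts and parses the judge's reply. It does not call a model: sending the prompt is left to whatever client you use. The section titles and the JSON reply format are assumptions, and the rubric text is the one quoted above.

```python
import json
from typing import Optional

RUBRIC = (
    "Score 5 if the answer fully resolves the user request using only the provided context.\n"
    "Score 3 if it is partially correct but missing a key condition.\n"
    "Score 1 if it contains unsupported claims or contradicts the provided context."
)
# Hypothetical reply format; any strict, machine-readable format would work.
OUTPUT_FORMAT = 'Respond with JSON only, for example: {"score": 3, "reason": "one sentence"}'


def build_judge_prompt(task: str, user_input: str, model_output: str, context: str) -> str:
    """Assemble the judge prompt from the parts listed above."""
    sections = [
        ("Task definition", task),
        ("Input", user_input),
        ("Model output", model_output),
        ("Retrieved context", context),
        ("Scoring rubric", RUBRIC),
        ("Required output format", OUTPUT_FORMAT),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


def parse_judge_response(raw: str) -> Optional[int]:
    """Return the score if the reply is valid JSON with an integer 1 to 5, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    score = data.get("score") if isinstance(data, dict) else None
    if isinstance(score, int) and not isinstance(score, bool) and 1 <= score <= 5:
        return score
    return None
```

Treating an unparseable judge reply as `None`, not as a low score, keeps judge formatting failures separate from real quality failures.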

Pairwise evals

Use pairwise evals when comparing two prompt versions, model versions, or retrieval strategies. The judge chooses which output is better for the same input.

This is useful during prompt iteration because it asks a simpler question: “Is candidate B better than the current production version?” You can run this across 100 examples and inspect win rate, tie rate, and failure patterns.
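
Once the judge has returned a verdict per example, the summary is simple arithmetic. This sketch assumes each verdict is one of the strings `candidate`, `baseline`, or `tie`; those labels are illustrative.

```python
from collections import Counter


def summarize_pairwise(verdicts: list) -> dict:
    """Compute win, tie, and loss rates for the candidate prompt."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    counts = Counter(verdicts)
    total = len(verdicts)
    return {
        "win_rate": counts["candidate"] / total,
        "tie_rate": counts["tie"] / total,
        "loss_rate": counts["baseline"] / total,
    }


verdicts = ["candidate"] * 6 + ["tie"] * 2 + ["baseline"] * 2
print(summarize_pairwise(verdicts))
```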

Write evals around failure modes

Many teams build evals around ideal outputs, but production failures usually come from specific failure modes. Write evals that target those failures directly.

Common LLM app failure modes include:

  • Format drift: The model stops returning valid JSON or adds extra prose.
  • Instruction conflict: The model follows user text instead of system or developer instructions.
  • Ungrounded claims: The model invents details not present in retrieved context.
  • Over-refusal: The model refuses safe requests.
  • Under-refusal: The model answers requests it should reject.
  • Bad tool choice: The agent calls the wrong tool or skips a required tool.
  • Bad tool arguments: The tool is right, but the parameters are wrong.
  • Context misuse: The model ignores important retrieved content or uses stale context.

If you use dynamic context injection, retrieval, or user-specific memory, include evals for prompt augmentation. A prompt that works with clean context may fail when the added context is long, noisy, outdated, or partially conflicting. See prompt augmentation for the core pattern.

Create a scoring plan before you run tests

Decide how you will interpret results before you compare prompt versions. Otherwise, you may cherry-pick metrics that make a weak change look acceptable.

For each eval suite, define:

  • Primary metric: The main pass or quality score.
  • Blocking checks: Failures that prevent release, such as invalid JSON or policy violations.
  • Minimum threshold: The score required to ship.
  • Regression limit: The maximum acceptable drop from production.
  • Segment breakdowns: Scores by intent, language, customer type, or workflow step.

Example scoring plan for a support triage prompt:

  • Classification accuracy must be at least 92%.
  • Escalation recall must be at least 98% for high-risk cases.
  • JSON validity must be 100%.
  • No refund promise violations are allowed.
  • The new prompt must beat or tie production on at least 60% of pairwise comparisons.

This turns evals into a release gate instead of a loose review process.
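
A sketch of the example plan as data, with escalation recall computed from labeled results, might look like the following. Recall here is the share of cases that should have been escalated and actually were. The metric names are assumptions; the thresholds are the ones listed above.

```python
# Minimum values from the example scoring plan above.
SCORING_PLAN = {
    "classification_accuracy": 0.92,
    "escalation_recall": 0.98,
    "json_validity": 1.0,
    "pairwise_win_or_tie_rate": 0.60,
}


def escalation_recall(pairs: list) -> float:
    """pairs holds (should_escalate, did_escalate) booleans, one per test case."""
    outcomes = [did for should, did in pairs if should]
    if not outcomes:
        return 1.0  # no escalation cases in the dataset, so nothing was missed
    return sum(outcomes) / len(outcomes)


def unmet_thresholds(metrics: dict, refund_promise_violations: int) -> list:
    """Return the names of every plan item the run failed to meet."""
    unmet = [name for name, minimum in SCORING_PLAN.items() if metrics[name] < minimum]
    if refund_promise_violations > 0:
        unmet.append("refund_promise_violations")
    return unmet


pairs = [(True, True)] * 49 + [(True, False)] + [(False, False)] * 50
print(escalation_recall(pairs))
```

Writing the plan down as data before the run makes it harder to adjust thresholds after seeing the results.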

Version prompts and eval datasets together

Prompt evals are more useful when you can connect every result to the exact prompt version, model, parameters, dataset, and code path that produced it. If you cannot reproduce a result, it is hard to trust the score.

Track these fields for every eval run:

  • Prompt name and version
  • Model and provider
  • Temperature and other generation parameters
  • Dataset version
  • Evaluator version
  • Application environment
  • Timestamp
  • Trace or request ID

This is where prompt management becomes important. Prompts change often, and small edits can create large behavior shifts. Treat prompts like application code: version them, review them, test them, and connect them to production behavior.
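
One way to capture the tracked fields is a small immutable record that is written alongside every result. The sketch below uses a Python dataclass; all values in the example are placeholders, and the field names are one possible mapping of the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvalRunRecord:
    """Metadata needed to reproduce one eval run."""
    prompt_name: str
    prompt_version: str
    model: str
    provider: str
    temperature: float
    dataset_version: str
    evaluator_version: str
    environment: str
    trace_id: str
    # UTC timestamp in ISO 8601 format, filled in when the record is created.
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


# Placeholder values for illustration only.
record = EvalRunRecord(
    prompt_name="support_triage",
    prompt_version="v12",
    model="example-model",
    provider="example-provider",
    temperature=0.0,
    dataset_version="2026-05-01",
    evaluator_version="judge-v3",
    environment="staging",
    trace_id="trace-0001",
)
print(json.dumps(asdict(record), indent=2))
```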

Use a simple eval loop

A practical prompt eval workflow has five steps:

  1. Collect examples: Pull real inputs from production traces, support logs, and QA sessions, and add synthetic examples where coverage is thin.
  2. Label expected behavior: Add references, rubrics, tags, and known failure types.
  3. Run the current prompt: Establish a baseline against production or the last approved version.
  4. Test a candidate prompt: Compare the new version against the baseline.
  5. Review failures: Inspect bad cases, update the prompt or dataset, and rerun.

Keep the loop fast. If an eval run takes two hours and requires manual spreadsheet work, engineers will avoid it. For many prompt updates, your core eval suite should run in under 10 minutes.
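
The loop above can be sketched in a few lines. Here `baseline_prompt` and `candidate_prompt` are keyword-based stand-ins for real model calls, used only so the example runs without an API; in practice each would render a prompt version and call your model.

```python
from typing import Callable

# Steps 1 and 2: collected inputs with labeled expected behavior (illustrative).
DATASET = [
    {"input": "I was charged twice for my subscription.", "expected": "billing_issue"},
    {"input": "The app crashes when I upload a file.", "expected": "technical_support"},
    {"input": "Can I get a quote for 50 seats?", "expected": "sales"},
    {"input": "My invoice shows the wrong amount.", "expected": "billing_issue"},
]


def baseline_prompt(text: str) -> str:
    """Stand-in for the production prompt plus model call."""
    return "billing_issue" if "charged" in text else "technical_support"


def candidate_prompt(text: str) -> str:
    """Stand-in for the edited prompt plus model call."""
    if "charged" in text or "invoice" in text:
        return "billing_issue"
    if "quote" in text:
        return "sales"
    return "technical_support"


def run_suite(prompt_fn: Callable[[str], str], dataset: list) -> dict:
    """Steps 3 to 5: run one prompt over the dataset and collect failures."""
    if not dataset:
        raise ValueError("dataset is empty")
    failures = []
    for case in dataset:
        output = prompt_fn(case["input"]).strip()
        if output != case["expected"]:
            failures.append({**case, "got": output})
    return {"pass_rate": 1 - len(failures) / len(dataset), "failures": failures}


baseline = run_suite(baseline_prompt, DATASET)
candidate = run_suite(candidate_prompt, DATASET)
print(baseline["pass_rate"], candidate["pass_rate"])
for failure in baseline["failures"]:
    print(failure)
```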

Tag examples so failures are easier to debug

Aggregate scores can hide important regressions. A prompt might improve from 86% to 90% overall while getting worse on enterprise billing cases. Tags help you see that.

Useful tags include:

  • Intent, such as billing, technical_support, or sales
  • Risk level, such as low_risk or high_risk
  • Input shape, such as long_context, missing_info, or multi_intent
  • Language or region
  • Customer segment
  • Known failure mode

After each eval run, review scores by tag. This gives you a clearer view of what changed and where to focus your next prompt edit.
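
A per-tag breakdown is a short aggregation. This sketch assumes each result is a dict with a `tags` list and a boolean `passed` field; both names are illustrative.

```python
from collections import defaultdict


def pass_rate_by_tag(results: list) -> dict:
    """Compute the pass rate for every tag that appears in the results."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        for tag in result["tags"]:
            totals[tag] += 1
            passes[tag] += int(result["passed"])
    return {tag: passes[tag] / totals[tag] for tag in totals}


results = [
    {"tags": ["billing", "high_risk"], "passed": False},
    {"tags": ["billing"], "passed": True},
    {"tags": ["sales"], "passed": True},
]
print(pass_rate_by_tag(results))
```

An example with several tags counts toward each of them, so per-tag totals can add up to more than the dataset size.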

Test prompts against realistic context

A prompt does not run in isolation in production. It receives retrieved documents, user metadata, chat history, tool outputs, system instructions, and sometimes results from earlier model calls. Your evals should include that context.

For a retrieval-based assistant, store these fields in each test case:

  • User question
  • Retrieved passages
  • Expected answer
  • Source document IDs
  • Allowed claims
  • Claims the model must avoid

For an agent workflow, store:

  • User goal
  • Available tools
  • Expected tool calls
  • Expected tool arguments
  • Mocked tool responses
  • Expected final answer

This makes evals closer to your real application and reduces false confidence from isolated prompt tests.
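
For the agent case, the stored fields map naturally onto a small data structure. The sketch below is one possible shape, with hypothetical tool names, and a strict comparison that requires the same tools, arguments, and order. Some workflows may want a looser, order-insensitive match.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class AgentTestCase:
    user_goal: str
    available_tools: list
    expected_tool_calls: list
    mocked_tool_responses: dict
    expected_final_answer: str


def tool_calls_match(expected: list, actual: list) -> bool:
    """Strict check: same tools, same arguments, same order."""
    return expected == actual  # dataclass equality compares name and arguments


# Hypothetical scenario for illustration.
case = AgentTestCase(
    user_goal="Cancel order 1042",
    available_tools=["lookup_order", "cancel_order"],
    expected_tool_calls=[
        ToolCall("lookup_order", {"order_id": "1042"}),
        ToolCall("cancel_order", {"order_id": "1042"}),
    ],
    mocked_tool_responses={
        "lookup_order": {"status": "processing"},
        "cancel_order": {"ok": True},
    },
    expected_final_answer="Order 1042 has been cancelled.",
)

actual = [ToolCall("lookup_order", {"order_id": "1042"}), ToolCall("cancel_order", {"order_id": "1024"})]
print(tool_calls_match(case.expected_tool_calls, actual))
```

Mocking tool responses keeps the eval deterministic: the only thing that varies between runs is the model's behavior.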

Separate deterministic checks from judgment calls

Do not ask an LLM judge to check things that code can check more reliably. Use deterministic checks whenever possible.

Good deterministic checks include:

  • JSON validity
  • Presence of required fields
  • Valid enum values
  • String length limits
  • Forbidden phrases
  • Required citations
  • Tool name and argument validation

Use LLM judges for criteria like helpfulness, groundedness, tone, and completeness. Even then, review samples of judge decisions. Judge prompts can drift, disagree with domain experts, or reward fluent but incorrect answers.
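
One way to review judge decisions is to have a domain expert score a sample of the same outputs and measure agreement. The sketch below assumes both use the same 1 to 5 scale; the "within one point" tolerance is an illustrative choice, not a standard.

```python
def judge_agreement(judge_scores: list, expert_scores: list) -> dict:
    """Compare judge and expert scores for the same sampled outputs."""
    if not judge_scores or len(judge_scores) != len(expert_scores):
        raise ValueError("score lists must be non-empty and the same length")
    pairs = list(zip(judge_scores, expert_scores))
    return {
        "exact_agreement": sum(j == e for j, e in pairs) / len(pairs),
        "within_one": sum(abs(j - e) <= 1 for j, e in pairs) / len(pairs),
    }


print(judge_agreement([5, 4, 3, 5], [5, 3, 1, 5]))
```

If agreement drops after a judge prompt or judge model change, that is a signal to revisit the rubric before trusting new scores.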

Make evals part of your release process

Prompt evals should run before important changes ship. Run them when you:

  • Edit a system prompt or task prompt
  • Change a model or model version
  • Modify retrieval settings
  • Add or change tools
  • Update output schemas
  • Change routing logic
  • Launch a new agent workflow

For high-risk workflows, add evals to CI or a pre-release checklist. For lower-risk workflows, run scheduled evals daily or weekly against sampled production data.

A reasonable release gate might look like this:

  • All schema checks pass.
  • No critical policy cases fail.
  • Overall score does not drop by more than 1%.
  • High-risk segment score stays above 95%.
  • Pairwise win rate is above 55% against production.
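
Expressed as code, that gate compares a candidate run against the production baseline. The sketch below assumes the "1%" drop means one percentage point of absolute score; if you mean a relative drop, change the comparison. The dictionary keys are hypothetical.

```python
TOLERANCE = 1e-9  # absorbs floating-point noise in the drop comparison


def release_gate(baseline: dict, candidate: dict) -> list:
    """Return the reasons a candidate is blocked; an empty list means it can ship."""
    reasons = []
    if candidate["schema_failures"] > 0:
        reasons.append("schema checks failed")
    if candidate["critical_policy_failures"] > 0:
        reasons.append("critical policy cases failed")
    # Assumption: "1%" is read as one percentage point of absolute score.
    if baseline["overall_score"] - candidate["overall_score"] > 0.01 + TOLERANCE:
        reasons.append("overall score dropped by more than 1 point")
    if candidate["high_risk_score"] <= 0.95:
        reasons.append("high-risk segment score is not above 95%")
    if candidate["pairwise_win_rate"] <= 0.55:
        reasons.append("pairwise win rate is not above 55%")
    return reasons


baseline = {"overall_score": 0.90}
candidate = {
    "schema_failures": 0,
    "critical_policy_failures": 0,
    "overall_score": 0.895,
    "high_risk_score": 0.97,
    "pairwise_win_rate": 0.58,
}
print(release_gate(baseline, candidate))
```

In CI, a non-empty list would fail the job and print the reasons.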

Use production traces to keep evals current

Eval datasets get stale. Users change how they ask questions, your product changes, your docs change, and models behave differently over time.

Create a process for adding new examples from production. Each week, review a sample of traces and add cases that meet one of these criteria:

  • The model failed in a new way.
  • The input represents a common user pattern.
  • The case is high-risk or high-value.
  • The output required manual correction.
  • The case exposed a gap in your current dataset.

If you only add failures, your dataset may become too skewed. Keep a balance of common happy-path cases and difficult regression cases.

A concrete example: evals for a customer support answer prompt

Assume you have an LLM feature that answers customer support questions using retrieved help center articles.

Task

Answer the user’s question using only the provided help center context. If the context does not contain the answer, say you do not have enough information and ask a clarifying question.

Dataset fields

  • user_question
  • retrieved_context
  • expected_answer
  • must_include
  • must_not_include
  • source_ids
  • tags

Eval checks

  • Schema: Response includes answer, citations, and confidence.
  • Groundedness: The answer uses only the provided context.
  • Completeness: The answer covers all required conditions.
  • Citation accuracy: Citations refer to the correct source IDs.
  • Refusal behavior: The model says it does not have enough information and asks a clarifying question when context is insufficient.

Release thresholds

  • JSON validity: 100%
  • Citation accuracy: at least 98%
  • Average groundedness score: at least 4.5 out of 5
  • No unsupported pricing or policy claims
  • No regression on top 20 production questions

This setup gives your team a clear signal before changing the prompt, retrieval settings, or model.
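
To show how the dataset fields and the deterministic checks fit together, here is a sketch of one test case and a per-case checker. The case content and source ID are invented for illustration, and the checker assumes the response is already parsed into a dict. Groundedness and completeness would still need a judge or human review. Cases that expect a refusal would need a different citation rule, since an empty citation list fails here.

```python
# Illustrative test case using the dataset fields listed above.
CASE = {
    "user_question": "What is the refund window for annual plans?",
    "retrieved_context": "Annual plans can be refunded within 30 days of purchase.",
    "expected_answer": "30 days",
    "must_include": ["30 days"],
    "must_not_include": ["60 days", "guarantee"],
    "source_ids": ["help_refunds_annual"],
    "tags": ["billing", "refund_policy"],
}


def check_case(case: dict, response: dict) -> dict:
    """Run the deterministic checks for one parsed response."""
    answer = str(response.get("answer", "")).lower()
    citations = response.get("citations") or []
    return {
        "schema": all(key in response for key in ("answer", "citations", "confidence")),
        "must_include": all(p.lower() in answer for p in case["must_include"]),
        "must_not_include": not any(p.lower() in answer for p in case["must_not_include"]),
        "citation_accuracy": bool(citations) and set(citations) <= set(case["source_ids"]),
    }


good_response = {
    "answer": "Annual plans can be refunded within 30 days of purchase.",
    "citations": ["help_refunds_annual"],
    "confidence": 0.9,
}
bad_response = {
    "answer": "We guarantee refunds within 60 days.",
    "citations": ["help_pricing"],
}
print(check_case(CASE, good_response))
print(check_case(CASE, bad_response))
```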

Common mistakes to avoid

  • Only testing easy examples: Your prompt may look strong while failing real production cases.
  • Using one generic quality score: Break quality into concrete checks, such as correctness, format, groundedness, and safety.
  • Ignoring output contracts: If downstream code expects JSON, schema validity is a core quality metric.
  • Mixing too many tasks in one eval: Separate classification, generation, routing, retrieval, and tool-use tests.
  • Changing the prompt and dataset at the same time: You will struggle to know what caused the score change.
  • Trusting LLM judges without review: Sample judge outputs regularly and compare them with expert labels.
  • Not saving failed cases: Every production failure should be a candidate regression test.

How many evals do you need?

The right number depends on risk and complexity. Use these ranges as a starting point:

  • Simple prompt with structured output: 25 to 75 examples
  • Classification or routing prompt: 100 to 300 examples across labels
  • Retrieval QA workflow: 100 to 500 examples across document types
  • Tool-using agent: 50 to 200 scenario-based examples
  • High-risk workflow: 500 or more examples, with strong segment coverage

Coverage matters more than raw count. A 100-example dataset with well-labeled edge cases is usually better than a 1,000-example dataset full of near-duplicates.

Connect evals to prompt development

Prompt evals work best when they are part of how your team builds prompts, not a separate QA step at the end. When an engineer edits a prompt, they should be able to run the relevant eval suite, compare results against production, inspect traces, and decide whether the change is safe.

As your system grows, you may also need compiler-style thinking for LLM workflows: breaking tasks into steps, checking intermediate outputs, and optimizing the full chain instead of one prompt at a time. The LLM compiler concept is useful here because it frames prompts, tools, and intermediate representations as parts of a larger execution plan.

Final checklist for prompt evals

  • Define the behavior your app must protect.
  • Choose the unit under test: prompt, chain, retrieval, tool call, or full workflow.
  • Build a dataset from real usage and known failure cases.
  • Use deterministic checks for formats, schemas, and tool arguments.
  • Use LLM judges for subjective criteria with clear rubrics.
  • Tag examples by intent, risk, and failure mode.
  • Version prompts, datasets, models, and evaluators.
  • Set release thresholds before running comparisons.
  • Review failures, add regression cases, and rerun.
  • Refresh datasets with production traces over time.

Prompt evals turn prompt changes into an engineering process. They help you ship faster because you can see what changed, what broke, and whether a new version is ready for users.


PromptLayer helps AI teams manage prompt versions, run evals, trace LLM behavior, and connect production data back into development. Create an account at https://dashboard.promptlayer.com/create-account to start testing and improving your LLM prompts with a more reliable workflow.