How to Design an LLM Eval Framework
An LLM eval framework should help your team answer a practical release question: is this prompt, model, retrieval setup, or agent workflow better for users than the current version?
If the answer depends on vibes, a few hand-picked examples, or a single aggregate score, your framework will fail in production. Good evals connect model behavior to real product outcomes: support tickets resolved, correct answers returned, workflows completed, harmful actions avoided, and users kept out of dead ends.
This guide covers how to design an eval framework for LLM-powered support bots, RAG search, agents, and workflow automation. It assumes you are building software that ships, changes often, and needs repeatable release checks.
Start with the product decision your eval must support
Before choosing metrics or graders, write down the decision the eval will inform. Good examples:
- Prompt release: Should we ship prompt version 42 for our support bot?
- Model change: Can we move from GPT-4.1 to a cheaper model without lowering answer quality?
- RAG change: Did the new chunking strategy improve retrieval for billing and account questions?
- Agent change: Does the new tool-calling plan complete refund workflows with fewer failed handoffs?
- Safety change: Does the assistant refuse requests that violate policy without blocking valid users?
Then define what “better” means in user terms. For a support bot, “better” may mean fewer reopened tickets and fewer wrong policy answers. For RAG search, it may mean the answer is grounded in the retrieved documents and includes the right citation. For an agent, it may mean the workflow finishes successfully without calling tools in the wrong order.
This is the core of LLM evaluation: measuring behavior against criteria that match the job your application needs to do.
Define eval layers instead of one giant score
A single score is easy to report and hard to trust. Most production LLM systems need multiple eval layers because failures happen at different points.
1. Component evals
Component evals test one part of the system in isolation. They are fast, cheap, and useful in CI.
- Prompt eval: Does the support bot classify the ticket category correctly?
- Retrieval eval: Does the RAG system return the correct policy document in the top 5 results?
- Extraction eval: Does the model extract the customer ID, order ID, and refund amount correctly?
- Tool selection eval: Does the agent choose the right tool for “cancel my subscription”?
2. End-to-end scenario evals
Scenario evals test the full user path. They cost more to run, but they catch problems that component tests miss. For example, a billing scenario might run as follows:
- A user asks a billing question with incomplete context.
- The assistant asks one clarifying question.
- The assistant retrieves the correct billing policy.
- The assistant gives an answer with a citation.
- The assistant escalates if the account status requires manual review.
3. Regression evals
Regression evals protect against failures you have already seen. Every serious production bug should become an eval case.
For example, if your assistant once told users they could get a refund after 90 days when the policy says 30 days, add that exact case and several variants to your regression suite.
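One way to encode the refund example as a regression case is a strict deterministic check on the day counts an answer mentions. This is a minimal sketch: the helper names and sample answers are hypothetical, and a real suite would likely pair a check like this with a rubric-based grader for answers that mention other time periods.

```python
import re

# Assumption: the refund window from the example policy (30 days).
REFUND_WINDOW_DAYS = 30


def stated_day_counts(answer: str) -> set[int]:
    """Collect every 'N days' or 'N-day' figure mentioned in the answer."""
    matches = re.findall(r"\b(\d+)[\s-]*days?\b", answer, flags=re.IGNORECASE)
    return {int(m) for m in matches}


def passes_refund_window_check(answer: str) -> bool:
    """Pass only if the answer states the 30-day window and no other day count."""
    return stated_day_counts(answer) == {REFUND_WINDOW_DAYS}


# Hypothetical variants of the original failure, each with an expected verdict.
regression_cases = [
    ("You can request a refund within 30 days of purchase.", True),
    ("We offer a 30-day refund window.", True),
    ("Refunds are available for 90 days.", False),
    ("Please contact support.", False),
]

for answer, expected in regression_cases:
    assert passes_refund_window_check(answer) == expected
```

The check is deliberately strict. It will flag a correct answer that also mentions an unrelated period, which is usually an acceptable trade for a cheap gate on a known incident.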
4. Production monitoring evals
Offline evals cannot cover every user input. You also need production traces, sampled reviews, drift checks, and failure tracking. This is where LLM observability matters. You need to see prompts, model responses, retrieved documents, tool calls, latency, cost, and user outcomes in one place.
Build datasets that reflect real usage
Your eval framework is only as useful as its datasets. A 20-row spreadsheet can help during early prompt design, but it cannot protect a production system with thousands of daily users.
Use several dataset types:
- Golden set: Carefully reviewed examples with expected behavior. Keep this small enough to maintain, usually 100 to 500 examples per major task.
- Regression set: Real failures from production. Add to it every week.
- Adversarial set: Edge cases, confusing inputs, prompt injection attempts, policy traps, malformed requests, and ambiguous user messages.
- Production sample: Recent real traffic sampled from logs, with sensitive data removed or protected.
- Holdout set: A locked dataset that prompt writers and model tuners do not inspect during daily iteration.
A common mistake is using a stale dataset that no longer matches the product. If your support bot now handles enterprise billing, but your eval set only contains old password reset questions, your score will look stable while the product gets worse.
Set a refresh schedule. For example:
- Review production samples every week.
- Add new regression cases after every incident.
- Refresh 10% to 20% of the golden set each month.
- Audit the holdout set quarterly to remove outdated policies or invalid expected answers.
Choose metrics that match the task
Do not optimize for a metric because it is easy to compute. Optimize for a metric that maps to user success.
Support bot metrics
- Answer correctness: Did the response answer the user’s question accurately?
- Policy compliance: Did the response follow refund, privacy, security, and escalation rules?
- Escalation quality: Did the bot escalate when it lacked enough information?
- Resolution rate: Did the user avoid opening or reopening a ticket?
- Deflection safety: Did the bot avoid pretending to solve issues that need a human support agent?
RAG search metrics
- Retrieval recall@k: Did the correct document appear in the top 3, 5, or 10 results?
- Groundedness: Is the generated answer supported by retrieved content?
- Citation accuracy: Does the citation point to the source that supports the claim?
- Answer completeness: Did the answer include the required constraints, exceptions, or next steps?
- Abstention quality: Did the model say it could not answer when the corpus lacked support?
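Retrieval recall@k is simple enough to compute directly. The sketch below assumes each eval case stores the ranked list of retrieved document IDs and the set of documents labeled as relevant; the field names and document IDs are illustrative only.

```python
from typing import Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(cases: Sequence[dict], k: int) -> float:
    """Average recall@k across all eval cases."""
    return sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / len(cases)


# Hypothetical eval cases: ranked retrieval output plus labeled relevant documents.
cases = [
    {"retrieved": ["doc_refund", "doc_billing", "doc_faq"], "relevant": {"doc_refund"}},
    {"retrieved": ["doc_faq", "doc_login", "doc_billing"], "relevant": {"doc_billing"}},
    {"retrieved": ["doc_faq", "doc_login", "doc_sso"], "relevant": {"doc_invoice"}},
]

print(mean_recall_at_k(cases, k=1))
print(mean_recall_at_k(cases, k=3))
```

When every case has exactly one relevant document, as here, recall@k is the share of questions where that document appears in the top k.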
Agent metrics
- Task completion: Did the agent finish the workflow?
- Tool correctness: Did it call the right tools with the right arguments?
- Step efficiency: Did it avoid unnecessary calls and loops?
- Recovery behavior: Did it handle tool errors, missing fields, or permission failures correctly?
- State safety: Did it avoid making irreversible changes without confirmation?
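Tool correctness and step efficiency can often be checked without a model. The following sketch assumes the agent's trace can be reduced to an ordered list of tool calls; the `ToolCall` type and the tool names are hypothetical.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def tool_calls_match(expected: list[ToolCall], actual: list[ToolCall]) -> bool:
    """True when the agent made the expected calls, in order, with equal arguments."""
    return expected == actual


def extra_steps(expected: list[ToolCall], actual: list[ToolCall]) -> int:
    """Number of calls beyond the expected plan length."""
    return max(0, len(actual) - len(expected))


def max_repeats(actual: list[ToolCall]) -> int:
    """Largest number of times an identical call (same name and arguments) appears."""
    keys = [(c.name, json.dumps(c.args, sort_keys=True)) for c in actual]
    return max(Counter(keys).values(), default=0)


expected = [
    ToolCall("lookup_order", {"order_id": "o-1"}),
    ToolCall("issue_refund", {"order_id": "o-1", "amount": 25.0}),
]
# A trace where the agent repeated the lookup three times before refunding.
looping = [ToolCall("lookup_order", {"order_id": "o-1"})] * 3 + [expected[1]]

print(tool_calls_match(expected, looping), extra_steps(expected, looping), max_repeats(looping))
```

Exact sequence matching is the strictest option. If several tool orders are acceptable for a workflow, store a set of valid plans per case instead of a single expected list.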
For workflow automation, you may also need business-specific checks. An invoice processing agent might require 99% field extraction accuracy for invoice totals, while a sales email assistant may care more about tone, factual accuracy, and CRM field updates.
Use the right grader for each check
No grader works for every eval. Use a mix of deterministic checks, reference comparisons, model-based grading, and expert review.
Deterministic checks
Use code when the answer can be checked exactly.
- JSON schema validity
- Required field presence
- Tool name and argument matching
- Regex checks for forbidden content
- Numeric tolerance checks, such as extracted total within $0.01
These checks are cheap and reliable. Use them wherever possible.
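As an illustration, a single function can cover JSON validity, required fields, and a numeric tolerance for an extraction task. The field names below are assumptions chosen to match the extraction example earlier in this guide, not a fixed schema.

```python
import json

# Assumption: the fields an extraction eval expects in the model's JSON output.
REQUIRED_FIELDS = ("customer_id", "order_id", "refund_amount")


def check_extraction(raw_output: str, expected_amount: float, tolerance: float = 0.01) -> list[str]:
    """Return a list of failure codes; an empty list means every check passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["not_an_object"]

    failures = [f"missing_field:{name}" for name in REQUIRED_FIELDS if name not in data]

    if "refund_amount" in data:
        amount = data["refund_amount"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            failures.append("amount_not_numeric")
        elif abs(amount - expected_amount) > tolerance:
            failures.append("amount_out_of_tolerance")
    return failures


print(check_extraction('{"customer_id": "c1", "order_id": "o1", "refund_amount": 25.0}', 25.0))
print(check_extraction('{"customer_id": "c1", "refund_amount": 30.0}', 25.0))
```

Returning failure codes instead of a single boolean makes it easier to group results by failure type later.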
Reference-based checks
Use reference answers when there is a known correct output or expected behavior. They work well for classification, extraction, routing, and many support tasks.
Be careful with exact string matching for generated answers. Two answers can use different wording and still be correct. In those cases, use rubrics instead of strict text equality.
LLM judges
LLM-as-a-judge can help grade open-ended responses, especially for criteria like helpfulness, groundedness, completeness, and instruction following. Treat judges as test automation, not truth.
Common judge mistakes include:
- Using vague rubrics like “rate the quality from 1 to 10.”
- Letting the judge see metadata that biases the result, such as which prompt version produced the answer.
- Using the same model family as both generator and judge without calibration.
- Trusting judge scores without checking agreement against expert labels.
- Failing to version the judge prompt.
For important evals, calibrate the judge. Take 100 to 200 examples, have domain reviewers label them, then measure judge agreement. If the judge disagrees often on refund policy, citation support, or safety refusal cases, fix the rubric before you trust the score.
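One common way to measure agreement on pass/fail labels is raw agreement alongside Cohen's kappa, which corrects for agreement expected by chance. The sketch below uses made-up labels and is one option among several; the right acceptance threshold depends on the task and is not implied here.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Share of examples where the judge and the expert gave the same label."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


def cohens_kappa(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    observed = agreement_rate(judge, expert)
    judge_counts, expert_counts = Counter(judge), Counter(expert)
    labels = set(judge_counts) | set(expert_counts)
    expected = sum(judge_counts[label] * expert_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight calibration examples.
expert = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

print(agreement_rate(judge, expert))
print(cohens_kappa(judge, expert))
```

Raw agreement can look high when most examples pass. Kappa exposes that case, which is why it helps to report both, and to compute them per slice (refund policy, citation support, safety refusals) instead of only overall.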
Expert review
Use expert review for high-risk or high-value cases. You do not need reviewers to label every production trace. A small, steady sample is often enough.
For example, a fintech support assistant might route 50 conversations per week to compliance review. Those labels can update the golden set, calibrate the LLM judge, and catch policy drift.
Version prompts, datasets, models, and graders together
If you cannot reproduce an eval result, you cannot use it as a release gate.
Every eval run should record:
- Prompt version
- System prompt and developer instructions
- Model name and provider
- Model parameters, including temperature and max tokens
- Retrieval configuration, including index version, chunking strategy, and top-k
- Tool definitions and tool versions
- Eval dataset version
- Judge prompt version
- Code commit or application version
- Run timestamp
This matters when a score changes. If answer correctness drops from 91% to 84%, you need to know whether the prompt changed, the dataset changed, the judge changed, the retrieved documents changed, or the model provider changed behavior.
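A small, immutable run record makes that comparison mechanical. The sketch below covers only a subset of the fields listed above, and every name and value in it is hypothetical.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalRunRecord:
    prompt_version: str
    model: str
    temperature: float
    max_tokens: int
    index_version: str
    top_k: int
    dataset_version: str
    judge_prompt_version: str
    code_commit: str
    run_timestamp: str


def changed_fields(a: EvalRunRecord, b: EvalRunRecord, ignore: tuple = ("run_timestamp",)) -> dict:
    """Map each differing field to its (old, new) values, skipping ignored fields."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if k not in ignore and da[k] != db[k]}


baseline = EvalRunRecord(
    prompt_version="41",
    model="example-model",
    temperature=0.0,
    max_tokens=512,
    index_version="idx-7",
    top_k=5,
    dataset_version="golden-v12",
    judge_prompt_version="j3",
    code_commit="abc1234",
    run_timestamp="2025-01-10T09:00:00Z",
)
candidate = replace(
    baseline,
    prompt_version="42",
    judge_prompt_version="j4",
    run_timestamp="2025-01-11T09:00:00Z",
)

print(changed_fields(baseline, candidate))
```

In this example the diff shows that both the prompt and the judge prompt changed between runs, so a score difference cannot be attributed to the prompt alone.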
For prompt chains and agent workflows, version each step. If your architecture uses a planning step, retrieval step, synthesis step, and final policy check, track them separately. Teams working with compiled LLM workflows may also want to understand the role of an LLM compiler in structuring and optimizing multi-step execution.
Separate experiment data from production data
Do not mix experiments, test prompts, synthetic examples, and production traffic in one unlabeled dataset. Mixing them pollutes your metrics.
At minimum, separate:
- Development evals: Fast iteration while editing prompts.
- CI evals: Automated checks on every pull request or prompt change.
- Release evals: Larger test suites before deployment.
- Production monitoring: Real user traces and sampled labels after release.
- Research experiments: Synthetic data, new models, new tools, and exploratory tests.
Use tags, dataset IDs, and run IDs. If an engineer runs 200 synthetic adversarial examples against a draft prompt, those results should not affect the production quality dashboard.
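A minimal way to enforce that separation is to tag every run with its kind and filter on the tag before anything reaches a dashboard. The types and IDs below are hypothetical and stand in for whatever your eval store records.

```python
from dataclasses import dataclass
from enum import Enum


class RunKind(str, Enum):
    DEVELOPMENT = "development"
    CI = "ci"
    RELEASE = "release"
    PRODUCTION = "production"
    RESEARCH = "research"


@dataclass(frozen=True)
class EvalRun:
    run_id: str
    dataset_id: str
    kind: RunKind
    score: float


def dashboard_runs(runs: list[EvalRun]) -> list[EvalRun]:
    """Keep only runs that should feed the production quality dashboard."""
    return [r for r in runs if r.kind is RunKind.PRODUCTION]


runs = [
    EvalRun("run-001", "prod-sample-week-02", RunKind.PRODUCTION, 0.91),
    EvalRun("run-002", "synthetic-adversarial-draft", RunKind.RESEARCH, 0.42),
    EvalRun("run-003", "golden-v12", RunKind.CI, 0.89),
]

print([r.run_id for r in dashboard_runs(runs)])
```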
Design release gates with thresholds and failure rules
An eval framework should make release decisions easier. Define gates before the eval runs.
Example release gate for a RAG support assistant:
- Answer correctness must be at least 88% on the golden set.
- Groundedness must be at least 95% on policy questions.
- Retrieval recall@5 must not drop by more than 2 percentage points.
- No critical policy failures are allowed in the regression set.
- Latency must stay under 3 seconds at p50 and under 8 seconds at p95.
- Cost per conversation must not increase by more than 15% unless approved.
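Gates like these can be written as code so that the release decision is the same every time. The sketch below hard-codes the example thresholds from the list above; the metric keys and the `cost_increase_approved` flag are assumptions about how your eval runner reports results.

```python
def failed_gates(candidate: dict, baseline: dict) -> list[str]:
    """Return the names of the gates the candidate fails; empty means ship-ready."""
    failures = []
    if candidate["answer_correctness"] < 0.88:
        failures.append("answer_correctness")
    if candidate["groundedness_policy"] < 0.95:
        failures.append("groundedness")
    # Compare in percentage points; rounding avoids float noise at the boundary.
    recall_drop_pp = round((baseline["recall_at_5"] - candidate["recall_at_5"]) * 100, 6)
    if recall_drop_pp > 2:
        failures.append("recall_at_5_regression")
    if candidate["critical_policy_failures"] > 0:
        failures.append("critical_policy_failures")
    if candidate["latency_p50_s"] >= 3.0 or candidate["latency_p95_s"] >= 8.0:
        failures.append("latency")
    cost_ratio = candidate["cost_per_conversation"] / baseline["cost_per_conversation"]
    if cost_ratio > 1.15 and not candidate.get("cost_increase_approved", False):
        failures.append("cost")
    return failures


# Hypothetical metric values for the current release and two candidates.
baseline = {"recall_at_5": 0.92, "cost_per_conversation": 0.040}
good = {
    "answer_correctness": 0.90,
    "groundedness_policy": 0.96,
    "recall_at_5": 0.91,
    "critical_policy_failures": 0,
    "latency_p50_s": 2.1,
    "latency_p95_s": 6.5,
    "cost_per_conversation": 0.042,
}
bad = {**good, "recall_at_5": 0.88, "latency_p95_s": 9.2, "cost_per_conversation": 0.050}

print(failed_gates(good, baseline))
print(failed_gates(bad, baseline))
```

Keeping the thresholds in version control alongside the prompts means a change to a gate is reviewed like any other change.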
For agents, add workflow-specific gates:
- Tool argument validity must be at least 99%.
- Irreversible actions require explicit user confirmation in 100% of test cases.
- Loop rate must stay below 1%.
- Workflow completion must improve without increasing unsafe completions.
Avoid optimizing for a metric that does not match user outcomes. A chatbot can increase response length and score higher on “helpfulness” while making users wait longer and still failing to resolve their issue. A RAG system can improve retrieval recall@10 while still citing the wrong document in the final answer. Your gates should include both system-level and user-level checks.
Inspect failures, not just averages
Aggregate scores hide the failures that hurt users. Break results down by task, customer segment, language, policy area, document type, and workflow step.
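A short example shows how an acceptable overall score can hide a weak slice. The result rows and the `policy_area` key below are invented for illustration.

```python
from collections import defaultdict


def pass_rate_by_slice(results: list[dict], key: str) -> dict[str, float]:
    """Group graded results by a metadata key and compute the pass rate per group."""
    totals = defaultdict(lambda: [0, 0])  # slice value -> [passed, total]
    for row in results:
        bucket = totals[row[key]]
        bucket[0] += int(row["passed"])
        bucket[1] += 1
    return {name: passed / total for name, (passed, total) in totals.items()}


# Invented graded results: five billing cases and three refund cases.
results = (
    [{"policy_area": "billing", "passed": True}] * 5
    + [{"policy_area": "refund", "passed": True}]
    + [{"policy_area": "refund", "passed": False}] * 2
)

overall = sum(r["passed"] for r in results) / len(results)
print(overall)
print(pass_rate_by_slice(results, "policy_area"))
```

Here the overall pass rate is 75%, but every failure sits in the refund slice, which passes only one case in three.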
For a support bot, review slices such as:
- Refund questions
- Account deletion requests
- Enterprise contract questions
- Angry or frustrated users
- Users with missing account context
- Non-English messages
For RAG, inspect:
- Questions where the correct document was retrieved but the answer was wrong
- Questions where the correct document was not retrieved
- Questions with outdated documents
- Questions where citations did not support the answer
For agents, inspect:
- Tool call failures
- Bad plans
- Repeated steps
- State mismatches
- Missing confirmation steps
Turn repeated failure types into new eval cases. This is how your framework improves over time.
Run evals in the development workflow
Evals work best when they run where engineering work happens.
- During prompt editing: Run a small smoke test of 20 to 50 examples.
- On pull request: Run component evals and key regression cases.
- Before release: Run the full golden set, regression set, and scenario tests.
- After release: Monitor production traces, user feedback, cost, latency, and sampled quality labels.
Keep fast evals fast. Developers will ignore a 45-minute check while editing a prompt. Use a small local or staging suite for iteration, then run larger suites before release.
A practical setup might look like this:
- 30-example smoke test in under 2 minutes
- 250-example CI test in under 15 minutes
- 2,000-example release suite before deployment
- 1% to 5% sampled production review after deployment
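For the production sample, hashing a trace ID gives a stable decision about whether a conversation is routed to review, so the same trace is always in or out of the sample. This is a sketch under the assumption that each trace has a unique string ID; the ID format is made up.

```python
import hashlib


def in_review_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces for review."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Roughly 3% of 10,000 hypothetical trace IDs should be selected.
sampled = sum(in_review_sample(f"trace-{i}", 0.03) for i in range(10_000))
print(sampled)
```

Compared with random sampling at request time, a hash-based rule is reproducible, which helps when you need to explain later why a given trace was or was not reviewed.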
Treat evals as a maintained system
An eval framework is not a one-time project. It needs owners, review cycles, and cleanup.
Assign ownership for:
- Dataset quality
- Judge prompts and rubrics
- Release thresholds
- Production sampling
- Failure taxonomy
- Privacy and data handling
Review the framework after major product changes. If your app adds a new agent tool for issuing credits, your evals need cases for credit limits, approvals, permissions, and user confirmation. If your knowledge base changes, your RAG eval references may need updates.
A practical checklist
- Define the product decision each eval supports.
- Map metrics to user outcomes.
- Create separate golden, regression, adversarial, production sample, and holdout datasets.
- Use deterministic checks where possible.
- Calibrate LLM judges against expert labels.
- Version prompts, datasets, models, retrieval settings, tools, and judge prompts.
- Separate development, CI, release, production, and research data.
- Set release gates before running evals.
- Slice results by task and failure type.
- Add production failures back into the regression set.
The goal is not to create a perfect score. The goal is to create a repeatable engineering process that helps you ship LLM changes with fewer surprises.
PromptLayer helps teams manage prompt versions, run evals, trace LLM calls, compare releases, and connect production behavior back to the prompts and datasets that caused it. If you are building an eval framework for prompts, agents, RAG, or AI workflows, you can create a PromptLayer account and start tracking your work in one place.