How to Set Up AI Evaluation for LLM Apps
AI evaluation for LLM apps should tell you whether your application is ready to ship, not whether a single demo looked good. A useful setup gives your team repeatable tests, clear scoring, versioned results, and production feedback that flows back into your development loop.
This guide walks through a practical setup for teams building chatbots, agents, copilots, RAG systems, classifiers, extraction pipelines, and other LLM-powered workflows. If you want a short definition first, PromptLayer also has a glossary entry on LLM evaluation, but this article focuses on implementation.
1. Start with the exact behavior you need to evaluate
Before you create a dataset or choose a scoring method, define the behavior you expect from the application. Keep this concrete. “Answer well” is too vague. “Answer billing questions using only approved policy docs and escalate refund requests over $500” is testable.
Write down the main job of the LLM workflow in one or two sentences:
- Support bot: Answer customer questions using the help center and create a ticket when confidence is low.
- Sales assistant: Draft account-specific outreach using CRM notes and approved product claims.
- Code agent: Modify a repository, run tests, and explain the change in a pull request summary.
- Data extraction pipeline: Convert invoices into a strict JSON schema with line items, totals, and vendor metadata.
Then define what can go wrong. This gives your eval suite a real target. For a RAG support bot, common failure modes might include:
- Inventing policy details that are not in the retrieved context
- Using stale documentation
- Answering when it should escalate
- Missing important constraints in the user question
- Returning a correct answer in a format your product cannot render
- Taking too long or costing too much per request
2. Build a small but realistic evaluation dataset
Your first eval dataset does not need 10,000 examples. It needs enough coverage to catch the failures that would hurt users. A strong starting point is 50 to 200 examples, depending on how complex the workflow is.
For each test case, store the inputs your app would receive at runtime. Depending on your app, that may include:
- User message
- Conversation history
- Retrieved documents or tool results
- User account attributes
- Expected output format
- Reference answer, if one exists
- Tags such as “refund,” “edge case,” “ambiguous,” “high risk,” or “tool required”
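One way to store these fields is a small typed record per test case, written out as JSON Lines. The sketch below is illustrative only: the `EvalCase` class, its field names, and the sample values are assumptions, not a required schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """One evaluation example. All field names are hypothetical."""
    case_id: str
    user_message: str
    conversation_history: list[dict] = field(default_factory=list)
    retrieved_documents: list[str] = field(default_factory=list)
    account_attributes: dict = field(default_factory=dict)
    expected_format: str = "text"
    reference_answer: Optional[str] = None  # used only by the evaluator
    tags: list[str] = field(default_factory=list)


def to_jsonl(cases: list[EvalCase]) -> str:
    """Serialize cases as JSON Lines, one case per line, with stable key order."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


# Made-up example case for a support bot.
case = EvalCase(
    case_id="refund-001",
    user_message="Can I get a refund for my $650 annual plan?",
    retrieved_documents=["Refunds over $500 require staff review."],
    reference_answer="Escalate to a support ticket for staff review.",
    tags=["refund", "high risk"],
)
line = to_jsonl([case])
```

Keeping one case per line makes the dataset easy to diff and version alongside prompts.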
A balanced starter dataset for a support agent might look like this:
- 40 common happy-path questions
- 25 ambiguous questions that require clarification
- 25 questions where the answer is not in the knowledge base
- 20 policy-sensitive questions, such as refunds, cancellations, or account access
- 20 adversarial or unsafe requests
- 20 formatting and integration cases, such as JSON output or ticket creation
Many teams make the mistake of evaluating only happy paths. That produces high scores during development and weak reliability in production. Add difficult cases early, even if the app fails most of them at first. The failures show you where to improve.
3. Separate app prompts from reference answers
Do not leak expected answers into the prompt you are testing. If your app prompt contains the answer, your eval is measuring whether the model can repeat information, not whether the actual workflow works.
Keep these fields separate:
- Runtime input: What the real application receives
- Prompt template: The prompt, system message, tool instructions, and context formatting used by the app
- Reference answer: The expected answer or grading guide used only by the evaluator
- Scoring criteria: The rubric or checks used to decide pass, fail, or partial credit
This separation matters most when you use LLM-based graders. The judge can see the reference answer. The app being tested should not.
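A minimal sketch of this separation, assuming each test case is a dictionary: the app prompt is rendered from runtime fields only, and a coarse guard checks that the reference answer never appears verbatim in it. The field names and the `render_app_prompt` stand-in are hypothetical, and a substring check will not catch paraphrased leaks.

```python
RUNTIME_FIELDS = {"user_message", "retrieved_documents"}
EVALUATOR_FIELDS = {"reference_answer", "scoring_criteria"}


def split_case(case: dict) -> tuple[dict, dict]:
    """Return (app_input, evaluator_input) so the two never mix."""
    app_input = {k: v for k, v in case.items() if k in RUNTIME_FIELDS}
    evaluator_input = {k: v for k, v in case.items() if k in EVALUATOR_FIELDS}
    return app_input, evaluator_input


def render_app_prompt(app_input: dict) -> str:
    """Stand-in for the real prompt template; it only sees runtime fields."""
    context = "\n".join(app_input.get("retrieved_documents", []))
    return f"Context:\n{context}\n\nQuestion: {app_input['user_message']}"


def check_no_leak(prompt: str, evaluator_input: dict) -> None:
    """Raise if the reference answer appears verbatim in the app prompt."""
    reference = evaluator_input.get("reference_answer")
    if reference and reference in prompt:
        raise ValueError("reference answer leaked into the app prompt")


case = {
    "user_message": "Can I get a refund for my $650 annual plan?",
    "retrieved_documents": ["Refunds over $500 require staff review."],
    "reference_answer": "Escalate to a support ticket for staff review.",
    "scoring_criteria": ["escalation", "groundedness"],
}
app_input, evaluator_input = split_case(case)
prompt = render_app_prompt(app_input)
check_no_leak(prompt, evaluator_input)  # passes silently when there is no leak
```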
4. Write scoring criteria that a reviewer could apply consistently
Vague criteria lead to noisy results. “Good answer,” “helpful,” and “accurate” are too broad unless you define them. Use criteria that are specific enough for two reviewers to reach the same conclusion.
For example, replace this:
- Score 1 to 5 based on answer quality.
With this:
- Groundedness: Pass if every factual claim appears in the provided context. Fail if the answer adds unsupported policy details.
- Completeness: Pass if the answer addresses the user’s main question and includes required conditions, dates, limits, or next steps.
- Escalation: Pass if the assistant creates or recommends a ticket when the policy requires staff review.
- Format: Pass if the output matches the required schema and contains no extra text.
- Tone: Pass if the response is concise, professional, and avoids blame or unsupported promises.
For structured tasks, prefer deterministic checks where possible. If you expect JSON, validate the schema. If you expect a category label, compare it to an allowed list. If the answer must include a required field, check for it directly.
5. Use multiple evaluation methods
No single scoring method covers every risk. A practical eval suite usually combines deterministic checks, reference-based comparison, LLM grading, and reviewer sampling.
Deterministic checks
Use deterministic checks for anything that has a clear rule:
- Valid JSON
- Required keys present
- No extra fields
- Response under 300 words
- Tool call selected correctly
- Latency under 2 seconds for a route
- Cost under $0.03 per request
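Several of these rules can be checked with the standard library alone. The sketch below assumes a classification-style output with two required keys. The key names, allowed categories, and word limit are illustrative assumptions.

```python
import json

REQUIRED_KEYS = {"category", "summary"}                    # hypothetical schema
ALLOWED_CATEGORIES = {"billing", "technical", "account"}   # hypothetical labels
MAX_WORDS = 300


def check_output(raw: str) -> dict[str, bool]:
    """Run rule-based checks on one raw model output; every check is pass/fail."""
    results = {
        "valid_json": False,
        "required_keys": False,
        "no_extra_fields": False,
        "allowed_category": False,
        "under_word_limit": False,
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return results  # nothing else can be checked without parseable JSON
    if not isinstance(data, dict):
        return results
    results["valid_json"] = True
    keys = set(data)
    results["required_keys"] = REQUIRED_KEYS <= keys
    results["no_extra_fields"] = keys <= REQUIRED_KEYS
    category = data.get("category")
    results["allowed_category"] = isinstance(category, str) and category in ALLOWED_CATEGORIES
    summary = str(data.get("summary", ""))
    results["under_word_limit"] = len(summary.split()) < MAX_WORDS
    return results


good = check_output('{"category": "billing", "summary": "Customer asks about invoice date."}')
```

Returning one boolean per check, rather than a single pass or fail, keeps the results usable for per-criterion tracking later.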
Reference-based checks
Use reference answers when there is a known correct response. This works well for extraction, classification, routing, and factual Q&A with fixed source material.
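For extraction tasks, a reference check can be as simple as comparing fields after light normalization. This is a sketch under assumptions: the normalization rules (case folding and whitespace collapsing) and the invoice fields are illustrative, and real pipelines often need type-aware comparison for amounts and dates.

```python
def normalize(value) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(str(value).strip().lower().split())


def field_accuracy(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields that the prediction reproduces after normalization."""
    if not reference:
        raise ValueError("reference must contain at least one field")
    matches = sum(
        1
        for key, expected in reference.items()
        if key in predicted and normalize(predicted[key]) == normalize(expected)
    )
    return matches / len(reference)


# Made-up invoice example: the prediction misses the currency field.
reference = {"vendor": "Acme Corp", "total": "120.00", "currency": "USD"}
predicted = {"vendor": " ACME  Corp ", "total": "120.00"}
score = field_accuracy(predicted, reference)  # 2 of 3 fields match
```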
LLM-as-judge grading
An LLM judge can score open-ended answers, but you should treat the score as one signal among several. It can be inconsistent, biased by wording, or too forgiving. If you use an LLM judge, give it a clear rubric, hide implementation details it should not see, and test it against examples your team has already reviewed.
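One way to test a judge against reviewed examples is to compare its labels with human labels on the same cases. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The labels are made-up data, and using kappa here is a suggestion rather than a prescribed method.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Share of cases where the judge and the human reviewer gave the same label."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight reviewed cases.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
raw = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

In this made-up sample the judge agrees with reviewers on 6 of 8 cases, yet kappa is noticeably lower because both raters label most cases "pass" anyway.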
Reviewer sampling
Have engineers, product owners, or domain experts review a sample of outputs each week. This catches issues that automated evals miss, such as awkward phrasing, incomplete business logic, or a mismatch with product expectations.
6. Create a baseline before changing prompts or models
Run your current app against the full eval dataset before making changes. This gives you a baseline. Without it, you cannot tell whether a prompt edit improved the app or merely moved failures from one area to another.
Track at least these fields for each run:
- Prompt version
- Model name and provider
- Model parameters, such as temperature and max tokens
- Dataset version
- Retrieval configuration, if used
- Tool definitions, if used
- Pass or fail by criterion
- Raw model output
- Latency
- Token count and cost
- Trace ID or request ID
Do not look only at the average score. A model that improves average quality while failing refund-policy cases may be worse for your product. Track scores by tag so you can see regressions in specific slices.
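A sketch of slicing by tag, assuming each result row carries a list of tags and a single pass flag (both names are illustrative). It shows how a candidate run can raise the overall pass rate while regressing on one slice.

```python
from collections import defaultdict


def pass_rate_by_tag(results: list[dict]) -> dict[str, float]:
    """Pass rate per tag; a case with several tags counts toward each of them."""
    totals, passed = defaultdict(int), defaultdict(int)
    for row in results:
        for tag in row["tags"]:
            totals[tag] += 1
            passed[tag] += int(row["passed"])
    return {tag: passed[tag] / totals[tag] for tag in totals}


def find_regressions(baseline: list[dict], candidate: list[dict]) -> dict[str, tuple[float, float]]:
    """Tags whose pass rate dropped, mapped to (baseline_rate, candidate_rate)."""
    before, after = pass_rate_by_tag(baseline), pass_rate_by_tag(candidate)
    return {t: (before[t], after[t]) for t in before if t in after and after[t] < before[t]}


def make_rows(outcomes: list[tuple[str, bool]]) -> list[dict]:
    return [{"tags": [tag], "passed": ok} for tag, ok in outcomes]


# Made-up runs: overall pass rate rises from 3/5 to 4/5, but refund cases regress.
baseline = make_rows([("refund", True), ("refund", True),
                      ("happy path", False), ("happy path", False), ("happy path", True)])
candidate = make_rows([("refund", True), ("refund", False),
                       ("happy path", True), ("happy path", True), ("happy path", True)])
regressions = find_regressions(baseline, candidate)
```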
7. Add cost and latency to the eval suite
Quality is not the only release gate. LLM apps often fail in production because they are too slow or too expensive at scale. Add cost and latency checks next to correctness checks.
Useful thresholds might include:
- p50 latency under 1.5 seconds for autocomplete
- p95 latency under 8 seconds for a multi-step agent
- Average cost under $0.02 for support classification
- Average cost under $0.10 for a RAG answer with citations
- Maximum 3 tool calls unless the task requires more
These numbers will vary by product. The important part is to make them explicit. If a prompt change improves answer quality by 2 percent but doubles cost, your team should see that before release.
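Explicit thresholds are straightforward to check once latency and cost are recorded per request. The sketch below uses a nearest-rank percentile. The sample latencies, costs, and limits are made up, and other percentile definitions will give slightly different values on small samples.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_budgets(latencies_s: list[float], costs_usd: list[float],
                  p95_limit_s: float, avg_cost_limit_usd: float) -> dict:
    """Compare one eval run against explicit latency and cost limits."""
    p95 = percentile(latencies_s, 95)
    avg_cost = sum(costs_usd) / len(costs_usd)
    return {
        "p95_latency_s": p95,
        "avg_cost_usd": avg_cost,
        "latency_ok": p95 < p95_limit_s,
        "cost_ok": avg_cost < avg_cost_limit_usd,
    }


# Made-up measurements for a multi-step agent.
latencies = [0.8, 1.1, 1.2, 1.4, 2.0, 2.5, 3.1, 4.0, 6.5, 9.2]
costs = [0.04, 0.05, 0.05, 0.06, 0.05, 0.04, 0.07, 0.05, 0.06, 0.08]
report = check_budgets(latencies, costs, p95_limit_s=8.0, avg_cost_limit_usd=0.10)
```

Here the run stays within the cost limit but fails the p95 latency limit because of a single 9.2-second request, which an average would have hidden.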
8. Version prompts, models, and datasets together
LLM eval results are only useful when you can reproduce them. Version the prompt, model, dataset, and runtime configuration together. If you change the model without recording the dataset version, you cannot tell whether a score change came from the model or from the data. If you update the dataset without recording the prompt version, runs are no longer comparable.
A clean eval record should answer these questions:
- Which prompt template produced this output?
- Which model and parameters were used?
- Which dataset version was tested?
- Which retrieval index or document snapshot was available?
- Which evaluator version scored the output?
- Which commit, release, or experiment created the run?
This is especially important for agentic systems, prompt chains, and workflows that call tools. A small change in one step can affect later steps. If your team works on multi-step generation pipelines, concepts such as an LLM compiler may also be relevant, because they treat prompt execution and workflow structure as one system that is easier to reason about.
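One lightweight way to keep these versions together is to store them in a single run configuration and derive a short fingerprint from it. In the sketch below, the configuration keys and values are hypothetical placeholders. Two runs are comparable only if their fingerprints differ in the way you intended.

```python
import hashlib
import json


def run_fingerprint(config: dict) -> str:
    """Short, stable hash of a run configuration; key order does not affect it."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# All values below are placeholders, not real model or dataset names.
config = {
    "prompt_version": "support-bot-v12",
    "model": "example-model",
    "parameters": {"temperature": 0.2, "max_tokens": 512},
    "dataset_version": "support-evals-2024-06",
    "retrieval_snapshot": "help-center-snapshot-41",
    "evaluator_version": "rubric-v3",
    "commit": "abc1234",
}
fingerprint = run_fingerprint(config)

# Changing any single component produces a different fingerprint.
changed = dict(config, dataset_version="support-evals-2024-07")
changed_fingerprint = run_fingerprint(changed)
```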
9. Connect evals to development and release workflows
Evals should run where your team makes decisions. Add them to the workflow developers already use.
A simple setup can look like this:
- A developer changes a prompt, retrieval setting, tool schema, or model.
- The team runs a small smoke eval of 20 to 50 high-signal cases.
- If it passes, the team runs the full eval dataset.
- The release is blocked if critical criteria fail.
- The team reviews regressions by tag before merging.
- After deployment, production traces feed new examples back into the dataset.
For high-risk workflows, use stricter gates. For example, a healthcare intake assistant might require zero failures on escalation cases. A marketing draft generator might allow some stylistic variance but block unsupported product claims.
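A release gate along these lines can be expressed as a small function: critical criteria allow zero failures, and all other criteria must meet a minimum pass rate. The criterion names, threshold, and result shape are assumptions made for illustration.

```python
def release_gate(results: list[dict], critical: set[str], min_pass_rate: float) -> dict:
    """Block the release on any critical failure or on a non-critical criterion below threshold."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    blocking = []
    for row in results:
        for name, ok in row["criteria"].items():
            totals[name] = totals.get(name, 0) + 1
            passes[name] = passes.get(name, 0) + int(ok)
            if name in critical and not ok:
                blocking.append((row["case_id"], name))
    below = {
        name: passes[name] / totals[name]
        for name in totals
        if name not in critical and passes[name] / totals[name] < min_pass_rate
    }
    return {
        "ship": not blocking and not below,
        "blocking_failures": blocking,
        "below_threshold": below,
    }


# Made-up results: one escalation failure is enough to block the release.
results = [
    {"case_id": "c1", "criteria": {"escalation": True, "tone": True}},
    {"case_id": "c2", "criteria": {"escalation": False, "tone": True}},
    {"case_id": "c3", "criteria": {"escalation": True, "tone": False}},
]
decision = release_gate(results, critical={"escalation"}, min_pass_rate=0.6)
```

The same function can serve the smoke eval and the full eval. Only the input cases change.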
10. Use production traces to improve the dataset
Your first eval dataset will miss real user behavior. Production traffic will show new phrasings, tool failures, retrieval gaps, and edge cases. This is where LLM observability becomes part of the evaluation loop.
Review production traces on a schedule. For many teams, a weekly review is enough at first. Pull examples into your eval dataset when they meet one of these conditions:
- A user gave negative feedback
- The model refused when it should have answered
- The model answered when it should have escalated
- A tool call failed or returned unexpected data
- The response was expensive or slow
- The app produced an output your UI, API, or downstream system could not use
Tag these examples by failure type. Over time, your dataset becomes a regression suite based on real product risk, not guesses made during the first build.
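Part of this triage can be automated. The sketch below assumes each production trace is a dictionary with fields such as `user_feedback`, `tool_error`, `latency_s`, `cost_usd`, and `output_parse_ok`. These names and the thresholds are hypothetical. Judgments such as "should have escalated" still come from a reviewer and are passed through as `reviewer_tags`.

```python
def triage_trace(trace: dict, max_latency_s: float = 8.0, max_cost_usd: float = 0.10) -> list[str]:
    """Return failure-type tags for one trace; an empty list means nothing was flagged."""
    tags = []
    if trace.get("user_feedback") == "negative":
        tags.append("negative feedback")
    if trace.get("tool_error"):
        tags.append("tool failure")
    if trace.get("latency_s", 0.0) > max_latency_s:
        tags.append("slow")
    if trace.get("cost_usd", 0.0) > max_cost_usd:
        tags.append("expensive")
    if trace.get("output_parse_ok") is False:
        tags.append("unusable output")
    # Reviewer judgments cannot be derived automatically, so they are copied as-is.
    tags.extend(trace.get("reviewer_tags", []))
    return tags


def select_for_dataset(traces: list[dict]) -> list[dict]:
    """Keep only flagged traces, in a shape that can be appended to the eval dataset."""
    selected = []
    for trace in traces:
        tags = triage_trace(trace)
        if tags:
            selected.append({"trace_id": trace["trace_id"],
                             "user_message": trace["user_message"],
                             "tags": tags})
    return selected


# Made-up traces: the first is healthy, the other two get pulled into the dataset.
traces = [
    {"trace_id": "t1", "user_message": "Where is my invoice?",
     "latency_s": 1.2, "cost_usd": 0.01, "output_parse_ok": True},
    {"trace_id": "t2", "user_message": "Cancel my plan",
     "user_feedback": "negative", "latency_s": 9.5, "cost_usd": 0.03},
    {"trace_id": "t3", "user_message": "Refund my $900 order",
     "tool_error": True, "reviewer_tags": ["should have escalated"]},
]
selected = select_for_dataset(traces)
```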
Common mistakes to avoid
Evaluating only happy paths
Happy-path tests are useful, but they do not prove readiness. Include ambiguous, adversarial, missing-context, and policy-sensitive examples.
Using vague criteria
If your rubric says “good,” “bad,” or “high quality,” rewrite it. Define what counts as correct, incomplete, unsafe, unsupported, or malformed.
Relying on vibe checks
Manual review helps, but unstructured review does not scale. Convert repeated reviewer comments into criteria and dataset examples.
Leaking expected answers
Keep reference answers out of the app prompt. Store them in the evaluator or dataset record instead.
Overtrusting LLM judge scores
LLM judges can be useful, but they need calibration. Compare judge output against reviewed examples, track judge changes, and avoid using a single score as the whole release decision.
Ignoring cost and latency
A slow, expensive prompt can pass quality tests and still fail in production. Track cost, token usage, latency, and tool-call count in every eval run.
Failing to version everything together
Prompt version, model version, dataset version, evaluator version, and retrieval configuration should travel together. Otherwise, your team will struggle to reproduce results.
A practical first-week setup
If your team is starting from scratch, use this plan:
- Day 1: Define the app behavior, top failure modes, and release-blocking criteria.
- Day 2: Create 50 to 100 evaluation examples, with at least 40 percent edge cases.
- Day 3: Add deterministic checks for schema, routing, tool calls, latency, and cost.
- Day 4: Add reference answers and an LLM judge rubric for open-ended outputs.
- Day 5: Run a baseline, review failures, tag them by failure type, and set release gates.
After that, treat evaluation as part of normal development. Every meaningful prompt, model, retrieval, or tool change should produce a comparable eval run. Every serious production failure should become a future test case.
Final checklist
- You have a realistic dataset with happy paths and hard cases.
- Your scoring criteria are specific and repeatable.
- Your app prompts do not contain hidden reference answers.
- You combine deterministic checks, reference checks, LLM grading, and reviewer sampling.
- You track quality, cost, latency, and tool behavior.
- You version prompts, models, datasets, evaluators, and runtime settings together.
- You connect evals to CI, release review, and production tracing.
- You add real production failures back into the dataset.
Good AI evaluation is a workflow, not a one-time test. Start small, make the criteria clear, and keep tightening the loop between development, evaluation, and production behavior.
PromptLayer helps AI teams manage prompts, run evaluations, trace LLM requests, version datasets, and monitor production behavior in one place. To start building a more reliable evaluation workflow, create a PromptLayer account.