How to Build a DeepEval LLM Eval Suite
A DeepEval eval suite gives your team a repeatable way to test LLM behavior before a prompt, model, retrieval change, or agent update reaches production. It should answer a simple engineering question: did this change make the application better, worse, or riskier?
For teams shipping LLM-powered products, evals need to cover more than “does the answer sound good?” A useful suite checks correctness, retrieval grounding, refusal behavior, tool use, latency, cost, and regressions tied to real production failures. If you are new to the broader concept, this LLM evaluation overview gives useful background.
This guide shows how to structure a DeepEval suite around real test cases, metric configs, CLI execution, and CI failures.
Start with the behavior you need to protect
Do not start by picking metrics. Start by listing the product behaviors that must stay stable.
For a customer support assistant, that might include:
- Answer refund policy questions using retrieved policy text.
- Refuse to invent order status when no order lookup tool result exists.
- Escalate billing disputes above $500.
- Keep answers under 150 words unless the user asks for detail.
- Return structured JSON when another service consumes the response.
- Stay under 3 seconds p95 latency for simple FAQ answers.
Each behavior should map to one or more test cases. Keep those test cases stable across prompt versions. If you rewrite the test cases every time the prompt changes, you are measuring your latest preference, not regression risk.
Install DeepEval and create a basic project
Install DeepEval in the same environment where you run your LLM tests:
pip install deepeval pytest
A small eval suite can live next to your application tests:
app/
  support_agent.py
evals/
  datasets/
    refund_policy_cases.json
  test_refund_policy.py
  test_refusals.py
  test_latency_cost.py
pytest.ini
You can run DeepEval directly through its CLI or through pytest, depending on your team’s workflow. Many teams use both: local DeepEval runs during development, then pytest or DeepEval in CI before merge.
Define test cases as durable fixtures
A good eval test case should include the user input, the model output, expected behavior, and any retrieval context or tool output needed to judge the answer. Keep prompts out of the test case unless the prompt itself is the unit under test.
{
  "id": "refund-policy-001",
  "category": "refund_policy",
  "input": "Can I get a refund after 45 days if the item is unopened?",
  "retrieval_context": [
    "Refunds are available within 30 days of purchase.",
    "Unopened items after 30 days may be eligible for store credit.",
    "Cash refunds are not available after the 30-day refund window."
  ],
  "expected_behavior": {
    "must_include": [
      "30 days",
      "store credit"
    ],
    "must_not_include": [
      "cash refund",
      "full refund after 45 days"
    ],
    "tone": "clear and concise"
  }
}
This structure gives you more than one way to evaluate the result. You can use a DeepEval metric for answer relevance or faithfulness, then add deterministic checks for specific phrases, JSON keys, or tool calls.
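Because these fixtures are edited by hand and reviewed in pull requests, it can help to validate them before any model call runs. The sketch below is one possible approach and is not part of DeepEval: `validate_case` and `REQUIRED_KEYS` are hypothetical names, and the required fields are assumed from the example case above.

```python
# Hypothetical fixture validator; field names are assumed from the example case.
REQUIRED_KEYS = {"id", "category", "input", "retrieval_context", "expected_behavior"}


def validate_case(case: dict) -> list:
    """Return a list of problems; an empty list means the case looks well-formed."""
    problems = []

    missing = REQUIRED_KEYS - case.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    context = case.get("retrieval_context", [])
    if not isinstance(context, list) or not all(isinstance(c, str) for c in context):
        problems.append("retrieval_context must be a list of strings")

    # A phrase cannot be both required and banned.
    expected = case.get("expected_behavior", {})
    overlap = set(expected.get("must_include", [])) & set(expected.get("must_not_include", []))
    if overlap:
        problems.append(f"phrases both required and banned: {sorted(overlap)}")

    return problems


example_case = {
    "id": "refund-policy-001",
    "category": "refund_policy",
    "input": "Can I get a refund after 45 days if the item is unopened?",
    "retrieval_context": ["Refunds are available within 30 days of purchase."],
    "expected_behavior": {"must_include": ["30 days"], "must_not_include": ["cash refund"]},
}
print(validate_case(example_case))   # a well-formed case yields no problems
print(validate_case({"id": "broken"}))
```

Running a check like this at collection time means a malformed fixture fails fast, instead of showing up later as a confusing metric failure.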
Write your first DeepEval test
Here is a minimal test that calls your app, builds a DeepEval test case, and checks answer relevancy plus faithfulness to retrieved context.
# evals/test_refund_policy.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

from app.support_agent import answer_support_question


def test_refund_policy_after_45_days():
    user_input = "Can I get a refund after 45 days if the item is unopened?"
    retrieval_context = [
        "Refunds are available within 30 days of purchase.",
        "Unopened items after 30 days may be eligible for store credit.",
        "Cash refunds are not available after the 30-day refund window.",
    ]

    # Call the application under test with a fixed context.
    actual_output = answer_support_question(
        user_input=user_input,
        context=retrieval_context,
    )

    test_case = LLMTestCase(
        input=user_input,
        actual_output=actual_output,
        retrieval_context=retrieval_context,
    )

    # One metric per concern, so a failure points at a single behavior.
    metrics = [
        AnswerRelevancyMetric(threshold=0.75),
        FaithfulnessMetric(threshold=0.80),
    ]
    assert_test(test_case, metrics)
This test checks two separate concerns:
- Answer relevancy: Did the model answer the user’s question?
- Faithfulness: Did the model stay grounded in the provided context?
Keep those separate. One common mistake is mixing unrelated behaviors into one metric, such as “answer is relevant, polite, short, grounded, and uses no banned words.” When it fails, you will not know what broke.
Configure metrics with clear thresholds
Metric thresholds should come after a baseline run. Run your current production prompt against 50 to 200 representative examples, inspect the score distribution, then set thresholds that catch real regressions without blocking every harmless wording change.
# evals/metrics.py
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCaseParams

# Broad screening metrics: a cheaper judge model is usually enough here.
answer_relevancy = AnswerRelevancyMetric(
    threshold=0.75,
    model="gpt-4o-mini",
    include_reason=True,
)

faithfulness = FaithfulnessMetric(
    threshold=0.80,
    model="gpt-4o-mini",
    include_reason=True,
)

# Policy-heavy rubric: uses a stronger judge model.
support_policy_correctness = GEval(
    name="Support Policy Correctness",
    criteria=(
        "Evaluate whether the answer correctly applies the support policy. "
        "The answer should not promise refunds that are not supported by the context. "
        "The answer should clearly explain the available option."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.70,
    model="gpt-4o",
)
Use stronger judge models for subjective or policy-heavy checks. Use cheaper models for broad screening when the metric is less sensitive. If your evaluation depends on an LLM judge, read up on LLM-as-a-judge patterns and failure modes before you trust the score too much.
Add deterministic assertions where possible
LLM-based metrics are useful, but they should not replace simple checks. If your app must return JSON, validate JSON. If it must call a tool, assert the tool call. If it must stay below a cost limit, measure cost.
# evals/test_structured_output.py
import json

from app.support_agent import classify_ticket


def test_ticket_classifier_returns_required_json_keys():
    output = classify_ticket("I was charged twice for my subscription.")
    parsed = json.loads(output)

    assert set(parsed.keys()) == {"category", "priority", "needs_human_review"}
    assert parsed["category"] == "billing"
    assert parsed["priority"] in {"low", "medium", "high"}
    assert isinstance(parsed["needs_human_review"], bool)
This kind of check is faster, cheaper, and more reliable than asking a judge model whether the JSON “looks valid.”
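The same idea applies to tool use. As a minimal sketch, assume your agent records its tool calls as a list of dicts with `name` and `arguments` keys. That trace shape, the helper names, and the `lookup_order` and `escalate_to_human` tools are all hypothetical; adapt them to whatever your agent actually returns.

```python
def find_tool_call(tool_calls, name):
    """Return the first recorded call to `name`, or None if it was never called."""
    for call in tool_calls:
        if call["name"] == name:
            return call
    return None


def assert_tool_called(tool_calls, name, **expected_args):
    """Fail unless `name` was called with every expected argument value."""
    call = find_tool_call(tool_calls, name)
    called = [c["name"] for c in tool_calls]
    assert call is not None, f"expected a call to {name!r}, got {called}"
    for key, expected in expected_args.items():
        actual = call["arguments"].get(key)
        assert actual == expected, f"{name}.{key}: expected {expected!r}, got {actual!r}"


# Hypothetical trace for a billing dispute above the escalation limit.
recorded_calls = [
    {"name": "lookup_order", "arguments": {"order_id": "A-1001"}},
    {"name": "escalate_to_human", "arguments": {"reason": "billing_dispute", "amount_usd": 650}},
]

assert_tool_called(recorded_calls, "escalate_to_human", reason="billing_dispute")
# The agent must not issue a refund on its own in this flow.
assert find_tool_call(recorded_calls, "issue_refund") is None
```

Checks like these cover the "escalate billing disputes" and "do not invent order status" behaviors without spending judge tokens.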
Build a dataset with coverage, not volume for its own sake
You do not need 10,000 examples to start. You do need enough examples to cover the ways your app can fail.
A practical first suite might include:
- 30 happy-path cases: Common requests your app should handle well.
- 30 edge cases: Ambiguous inputs, missing context, conflicting context, long user messages.
- 20 refusal cases: Requests the app should not answer directly.
- 20 regression cases: Real production failures, support escalations, or bugs found during QA.
- 10 latency and cost cases: Common flows that should stay fast and affordable.
Evaluating only happy paths is one of the fastest ways to get false confidence. Your eval suite should include the user behavior that broke the product last month, not only the demo flow that worked in a meeting.
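One way to keep that mix honest is to count cases per type and flag the gaps. The sketch below assumes each case carries a `case_type` field, which is not in the fixture format shown earlier, and uses the minimum counts from the list above. `coverage_gaps` and `MINIMUM_CASES` are hypothetical names.

```python
from collections import Counter

# Minimums taken from the suggested first suite; tune them to your own risk profile.
MINIMUM_CASES = {
    "happy_path": 30,
    "edge_case": 30,
    "refusal": 20,
    "regression": 20,
    "latency_cost": 10,
}


def coverage_gaps(cases, minimums=MINIMUM_CASES):
    """Return {case_type: number of cases still missing} for under-covered types."""
    counts = Counter(case.get("case_type", "unlabeled") for case in cases)
    return {
        case_type: minimum - counts[case_type]
        for case_type, minimum in minimums.items()
        if counts[case_type] < minimum
    }


# A suite that only covers happy paths and a handful of refusals.
cases = [{"case_type": "happy_path"}] * 30 + [{"case_type": "refusal"}] * 5
print(coverage_gaps(cases))
```

A report like this is cheap to run in CI and makes a happy-path-only suite visible before it creates false confidence.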
Load test cases from a fixture file
Once you have more than a handful of cases, load them from JSON or YAML. This keeps your eval data reviewable in pull requests.
# evals/test_refund_policy_dataset.py
import json
from pathlib import Path

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase

from app.support_agent import answer_support_question
from evals.metrics import faithfulness, support_policy_correctness

# The fixture file is expected to hold a JSON array of case objects.
CASES_PATH = Path(__file__).parent / "datasets" / "refund_policy_cases.json"


@pytest.mark.parametrize("case", json.loads(CASES_PATH.read_text()))
def test_refund_policy_dataset(case):
    actual_output = answer_support_question(
        user_input=case["input"],
        context=case["retrieval_context"],
    )

    test_case = LLMTestCase(
        input=case["input"],
        actual_output=actual_output,
        retrieval_context=case["retrieval_context"],
    )

    # LLM-scored metrics first, then hard phrase checks.
    assert_test(test_case, [faithfulness, support_policy_correctness])

    for phrase in case["expected_behavior"].get("must_include", []):
        assert phrase.lower() in actual_output.lower()

    for phrase in case["expected_behavior"].get("must_not_include", []):
        assert phrase.lower() not in actual_output.lower()
This pattern gives you LLM scoring plus hard assertions in the same test. One caveat: substring checks are blunt. A correct answer such as "Cash refunds are not available after 30 days" contains the banned phrase "cash refund" and would fail, so write each banned string as the specific claim you want to block.
Run the suite locally
Use local eval runs before you open a pull request. Developers should see failures while they still have the prompt and code in their editor.
$ deepeval test run evals/test_refund_policy.py
Running 12 test cases...
✓ refund-policy-001
  Answer Relevancy: 0.89 passed threshold 0.75
  Faithfulness: 0.92 passed threshold 0.80
✗ refund-policy-007
  Answer Relevancy: 0.81 passed threshold 0.75
  Faithfulness: 0.54 failed threshold 0.80
  Reason:
    The answer says the customer can receive a cash refund after 45 days,
    but the provided policy only allows possible store credit after 30 days.
Summary:
  Tests passed: 11
  Tests failed: 1
  Total cost: $0.38
  Total runtime: 48.2s
Make cost and runtime visible. Teams often ignore latency and cost until a prompt change doubles context size or adds extra model calls inside an agent loop. Add budget checks early.
Add latency and cost checks
DeepEval focuses on model behavior, but your eval suite should also protect operational constraints. You can wrap your application call and assert on timing, token usage, or estimated cost.
# evals/test_latency_cost.py
import time

from app.support_agent import answer_support_question_with_usage


def test_simple_faq_stays_under_latency_and_cost_budget():
    # Time a single end-to-end call to the application.
    start = time.perf_counter()
    result = answer_support_question_with_usage(
        user_input="What is your return window?",
        context=["Refunds are available within 30 days of purchase."],
    )
    elapsed_seconds = time.perf_counter() - start

    assert elapsed_seconds < 3.0
    assert result.usage.total_tokens < 1200
    assert result.estimated_cost_usd < 0.02
You can tune these numbers by flow. A simple FAQ answer may need a 3-second limit. A multi-step research agent may need 30 seconds. The point is to make the budget explicit.
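A single timed call is a spot check, not a p95. If the budget is stated as a percentile, one option is to sample the flow several times and compute the percentile yourself. The sketch below uses the nearest-rank method; `measure_latencies` and `p95` are hypothetical helpers, and the stand-in `fake_call` replaces your real application call.

```python
import time


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def measure_latencies(call, runs=20):
    """Run `call` repeatedly and return the elapsed seconds for each run."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_call():
    # Stand-in for the real application call; replace with your own function.
    time.sleep(0.001)


latencies = measure_latencies(fake_call, runs=20)
assert p95(latencies) < 3.0
```

With 20 samples, nearest-rank p95 is the 19th fastest run, so one slow outlier does not fail the check but two do. Repeated runs cost real tokens, so this kind of test probably belongs in the scheduled suite rather than on every pull request.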
Separate prompt evals, retrieval evals, and agent evals
LLM applications fail in different layers. Your eval suite should make those layers visible.
- Prompt evals: Test whether the model follows instructions when given the right context.
- Retrieval evals: Test whether the system retrieves the right documents before generation.
- Agent evals: Test whether the model chooses the right tools and stops at the right time.
- End-to-end evals: Test the full user-visible behavior.
If your answer is wrong because retrieval returned the wrong policy, a generation metric alone will not tell you enough. Pair evals with traces and request logs. This is where LLM observability becomes practical: you need to see the prompt, context, model response, tool calls, latency, and cost for the failed case.
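For the retrieval layer specifically, a deterministic score is often enough to tell a retrieval failure from a generation failure. As a sketch, assume each case lists the document IDs that should be retrieved; recall at k then measures how many of them appear in the top k results. The document IDs and the `recall_at_k` name are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retrieval result for the 45-day refund question.
retrieved = ["policy-refund-30d", "policy-shipping", "policy-store-credit"]
relevant = ["policy-refund-30d", "policy-store-credit"]

print(recall_at_k(retrieved, relevant, k=2))  # store-credit policy is ranked too low
print(recall_at_k(retrieved, relevant, k=3))
```

If recall is low for a failing case, fix retrieval first. A faithfulness score on an answer built from the wrong documents says little about the prompt.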
Use LLM judges carefully
LLM judges can score tone, relevance, groundedness, and policy adherence at scale. They also make mistakes. Treat judge scores as signals, not absolute truth.
Use these guardrails:
- Review at least 20 failed and 20 passed examples when you add a new judge metric.
- Compare judge scores against human labels for a small validation set.
- Keep judge prompts versioned.
- Do not use one vague judge metric for every behavior.
- Use deterministic assertions for schema, tool calls, banned strings, and numeric limits.
- Track score drift when you change the judge model.
Over-trusting LLM-as-judge scores is a common mistake. A score of 0.86 does not prove the answer is safe. It means the judge model scored it 0.86 under a specific rubric, prompt, and model version.
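Comparing a judge against human labels can start as a small script. The sketch below assumes you have judge scores and human pass/fail labels for the same validation cases; `judge_agreement` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
def judge_agreement(judge_scores, human_pass, threshold):
    """Compare thresholded judge scores against human pass/fail labels."""
    if not judge_scores or len(judge_scores) != len(human_pass):
        raise ValueError("need equal-length, non-empty score and label lists")

    judge_pass = [score >= threshold for score in judge_scores]
    pairs = list(zip(judge_pass, human_pass))
    return {
        "agreement": sum(j == h for j, h in pairs) / len(pairs),
        # Judge passed an answer a human rejected: the risky direction.
        "false_pass": sum(j and not h for j, h in pairs),
        # Judge failed an answer a human accepted: the noisy-CI direction.
        "false_fail": sum(h and not j for j, h in pairs),
    }


# Illustrative validation set of five cases.
scores = [0.90, 0.86, 0.60, 0.82, 0.40]
labels = [True, False, False, True, True]
print(judge_agreement(scores, labels, threshold=0.80))
```

In this made-up sample the 0.86 case is a false pass, which is exactly the situation the paragraph above warns about. Tracking false passes and false fails separately shows whether a judge is too lenient or too strict.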
Set thresholds after a baseline
Thresholds without baselines create noisy CI. Before you enforce a threshold, run the suite against your current production prompt and record the distribution.
For example:
- Production prompt average faithfulness: 0.84
- Production prompt p10 faithfulness: 0.76
- Known bad prompt average faithfulness: 0.61
- Human-reviewed acceptable cases usually score above: 0.78
In that case, a faithfulness threshold of 0.80 may be reasonable for policy answers. A threshold of 0.95 would probably block good changes. A threshold of 0.50 would miss real regressions.
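A short script can produce those baseline numbers and show what a candidate threshold would have blocked. This is a sketch with hypothetical helper names and illustrative scores, using a nearest-rank p10.

```python
import statistics


def summarize_baseline(scores):
    """Mean, nearest-rank p10, and minimum of a non-empty list of metric scores."""
    ordered = sorted(scores)
    rank = (10 * len(ordered) + 99) // 100  # integer ceil(0.10 * n)
    return {
        "mean": statistics.mean(ordered),
        "p10": ordered[rank - 1],
        "min": ordered[0],
    }


def fail_rate(scores, threshold):
    """Share of baseline cases that would fail at the candidate threshold."""
    return sum(score < threshold for score in scores) / len(scores)


# Illustrative faithfulness scores from a baseline run.
baseline = [0.90, 0.85, 0.80, 0.75, 0.70, 0.95, 0.88, 0.82, 0.79, 0.91]
print(summarize_baseline(baseline))
for candidate in (0.50, 0.80, 0.95):
    print(candidate, fail_rate(baseline, candidate))
```

If a candidate threshold already fails a large share of the current production baseline, either the threshold is too strict or those baseline cases deserve a manual review before you enforce it.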
Add DeepEval to CI
Run a fast eval subset on every pull request. Run the full suite on a schedule or before larger releases. This keeps CI useful without making every small change wait 20 minutes.
# .github/workflows/evals.yml
name: LLM Evals
on:
  pull_request:
    paths:
      - "app/**"
      - "evals/**"
      - "prompts/**"
  workflow_dispatch:
jobs:
  deepeval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install deepeval pytest
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          deepeval test run evals/test_refund_policy.py
A failing case then shows up in the job log, for example:
Run deepeval test run evals/test_refund_policy.py
✗ refund-policy-007 failed
  Metric: Faithfulness
  Score: 0.54
  Threshold: 0.80
  Input:
    Can I get a refund after 45 days if the item is unopened?
  Actual output:
    Yes, since the item is unopened, you can receive a cash refund after 45 days.
  Retrieval context:
    Refunds are available within 30 days of purchase.
    Unopened items after 30 days may be eligible for store credit.
    Cash refunds are not available after the 30-day refund window.
  Reason:
    The answer contradicts the refund policy by offering a cash refund after 45 days.
Error: Process completed with exit code 1.
A CI failure should tell the developer what changed, which behavior broke, and where to inspect the failing trace. If it only says “eval failed,” developers will ignore it or rerun until it passes.
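If your runner does not print enough context on its own, a small formatter can add it. The sketch below is not a DeepEval feature: `format_failure` is a hypothetical helper, and the commit SHA and trace URL are placeholders for whatever your CI and tracing tools provide.

```python
def format_failure(case_id, metric, score, threshold, reason, commit_sha, trace_url=None):
    """Build a failure message that names the case, the metric, and where to look."""
    lines = [
        f"FAILED {case_id} at commit {commit_sha}",
        f"  Metric: {metric}",
        f"  Score: {score:.2f} (threshold {threshold:.2f})",
        f"  Reason: {reason}",
    ]
    if trace_url:
        lines.append(f"  Trace: {trace_url}")
    return "\n".join(lines)


report = format_failure(
    case_id="refund-policy-007",
    metric="Faithfulness",
    score=0.54,
    threshold=0.80,
    reason="The answer offers a cash refund after 45 days.",
    commit_sha="abc1234",  # placeholder commit
    trace_url="https://example.com/traces/refund-policy-007",  # placeholder URL
)
print(report)
```

Putting the case ID and commit on the first line makes the failure searchable in CI logs and easy to paste into a bug report.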
Version your prompts and eval datasets separately
Your eval cases should outlive individual prompts. A prompt may change weekly. A refund policy regression case should stay until the business rule changes.
Use this rule:
- Change the prompt when the implementation needs to improve.
- Change the eval case when the expected product behavior changes.
- Add a new eval case when production reveals a new failure mode.
This avoids the mistake of rewriting test cases every time a prompt changes. Your suite should create pressure toward better prompts, not adapt itself to every new prompt.
Track results over time
A single eval run tells you whether today’s change passed. A history of eval runs tells you whether the system is improving.
Track these fields for each run:
- Prompt version
- Model name and version
- Dataset version
- Metric names and versions
- Scores by category
- Failed test case IDs
- Latency p50 and p95
- Token usage and estimated cost
- Git commit SHA
This matters when a model provider changes behavior, a retrieval index updates, or a prompt chain starts passing different context between steps. If your application compiles or transforms prompts across steps, it can help to understand the role of an LLM compiler in prompt and workflow execution.
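A flat record per run is usually enough to start tracking history. The sketch below serializes one run as a JSON line that you could append to a file or send to a results store. The `EvalRunRecord` class, its field names, and the sample values are assumptions based on the list above, and the dataset version is derived from a content hash so it changes whenever the fixtures change.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalRunRecord:
    git_sha: str
    prompt_version: str
    model: str
    dataset_version: str
    metric_versions: dict
    scores_by_category: dict
    failed_case_ids: list = field(default_factory=list)
    latency_p50_s: float = 0.0
    latency_p95_s: float = 0.0
    total_tokens: int = 0
    estimated_cost_usd: float = 0.0


def dataset_fingerprint(raw: bytes) -> str:
    """Short content hash that identifies one exact version of the fixture file."""
    return hashlib.sha256(raw).hexdigest()[:12]


def to_jsonl_line(record: EvalRunRecord) -> str:
    """Serialize one run as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only; in practice these come from your eval run.
record = EvalRunRecord(
    git_sha="abc1234",
    prompt_version="support-v12",
    model="gpt-4o-mini",
    dataset_version=dataset_fingerprint(b'[{"id": "refund-policy-001"}]'),
    metric_versions={"faithfulness": "v1"},
    scores_by_category={"refund_policy": 0.84},
    failed_case_ids=["refund-policy-007"],
    latency_p50_s=1.2,
    latency_p95_s=2.7,
    total_tokens=18400,
    estimated_cost_usd=0.38,
)
line = to_jsonl_line(record)
print(line)
```

One line per run keeps the history easy to diff and easy to load into a notebook when you need to compare runs.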
Common mistakes to avoid
Evaluating only happy paths
Happy paths are necessary, but they are not enough. Add ambiguous questions, missing context, adversarial phrasing, policy edge cases, and real production failures.
Using too few examples
Five examples can catch obvious failures, but they cannot represent a product surface. Start with 50 to 100 cases for one important workflow. Expand based on risk and traffic.
Over-trusting LLM-as-judge scores
Review judge decisions manually. Calibrate against human labels. Add deterministic checks wherever possible.
Setting thresholds without baselines
Run the current production system first. Use its score distribution to set thresholds. Otherwise, your CI may block good changes or miss bad ones.
Mixing unrelated behaviors in one metric
Use separate metrics for relevance, faithfulness, tone, format, tool use, and safety. Separate metrics make failures easier to debug.
Ignoring latency and cost
A prompt that improves faithfulness by 2 percent but triples cost may still be a bad release. Add timing and token checks to the suite.
Rewriting test cases every time a prompt changes
Keep eval cases tied to product behavior. Update them only when the expected behavior changes or when you add coverage for a new failure mode.
A practical rollout plan
- Pick one high-risk workflow. Start with support answers, policy QA, code generation, or tool-using agents.
- Create 50 representative cases. Include happy paths, edge cases, refusal cases, and known regressions.
- Add 2 to 4 metrics. Use separate metrics for separate behaviors.
- Run a production baseline. Record scores, latency, cost, and failure examples.
- Set thresholds. Use baseline data, not guesses.
- Add deterministic checks. Validate JSON, tool calls, required phrases, banned claims, latency, and cost.
- Run in CI. Start with a fast subset on pull requests.
- Add failures back into the dataset. Every serious production bug should become a regression test.
Final checklist
- Your test cases represent real user behavior.
- Your eval data is versioned and reviewed.
- Your metrics each test one clear behavior.
- Your thresholds come from baseline runs.
- Your suite includes edge cases and regressions.
- Your tests check latency and cost.
- Your CI output explains failures clearly.
- Your prompt changes do not require rewriting the eval suite.
A DeepEval suite works best when your team treats it as production infrastructure. Keep it close to your prompts, traces, datasets, and release process. The goal is not to get a perfect score. The goal is to catch risky changes before users do.
PromptLayer helps AI teams manage prompts, datasets, evaluations, traces, and production LLM behavior in one place. If you are building a DeepEval suite and want cleaner prompt versioning, observability, and eval workflows around it, create a PromptLayer account at https://dashboard.promptlayer.com/create-account.