How to Run LLM Evals in CI
LLM evals belong in CI when your prompts, models, retrieval logic, tools, or agent workflows affect production behavior. A small prompt edit can break JSON formatting, increase latency, change tool calls, or make a support agent answer outside policy. CI gives your team a repeatable checkpoint before those changes merge.
The goal is not to turn evals into a perfect truth machine. The goal is to catch likely regressions early, compare changes against a known baseline, and give reviewers better release signals.
What to evaluate in CI
Start with the parts of your LLM system that break most often:
- Prompt templates: instructions, examples, output schema, tone, refusal behavior, and task boundaries.
- Retrieval changes: chunking, ranking, filters, rerankers, metadata, and context packing.
- Model changes: model version, temperature, max tokens, tool support, and provider routing.
- Agent workflows: tool selection, argument quality, stopping conditions, retry behavior, and error handling.
- Structured outputs: JSON validity, schema conformance, enum values, and required fields.
- Safety and policy behavior: disallowed content, privacy constraints, escalation paths, and refusal consistency.
If you are new to the topic, start with a basic definition of LLM evaluation, then build a CI process around the behaviors your product depends on.
Use a small, high-signal eval suite for CI
Do not run every eval on every pull request. CI needs to be fast enough that developers respect it. A useful first target is:
- 20 to 100 test cases for pull requests.
- 200 to 1,000 test cases for nightly or pre-release runs.
- 5 to 20 adversarial cases for critical workflows such as refunds, medical disclaimers, financial advice, or permission changes.
Your PR eval suite should cover the highest-risk and most frequently used paths. Include happy paths, but do not stop there. Add cases for missing context, ambiguous requests, malformed inputs, conflicting instructions, empty retrieval results, tool failures, and user attempts to bypass constraints.
Example eval dataset
[
  {
    "id": "support_refund_001",
    "input": "I bought this 45 days ago. Can I get a refund?",
    "expected_behavior": "Explain that the standard refund window is 30 days and offer escalation if policy exceptions apply.",
    "tags": ["support", "policy", "refund"]
  },
  {
    "id": "json_schema_001",
    "input": "Extract name and renewal date: Acme Corp renews on March 4, 2026.",
    "expected_json": {
      "company_name": "Acme Corp",
      "renewal_date": "2026-03-04"
    },
    "tags": ["structured-output"]
  },
  {
    "id": "retrieval_empty_001",
    "input": "What is our policy for enterprise crypto custody?",
    "expected_behavior": "Say the policy is not available in the provided context and avoid inventing details.",
    "tags": ["rag", "no-context", "hallucination"]
  }
]
Choose eval types that match the failure mode
Do not use one scoring method for everything. Different failures need different checks.
| Failure mode | Best eval type | Example check |
|---|---|---|
| Invalid JSON | Deterministic test | Parse output and validate against a JSON Schema. |
| Wrong classification | Exact match or label match | Expected label is billing_issue. |
| Bad tool call | Programmatic assertion | Tool name must be create_ticket with required fields. |
| Hallucinated answer | Reference-based judge plus retrieval checks | Answer must be supported by provided context. |
| Poor helpfulness | LLM judge with a specific rubric | Score clarity, completeness, and policy compliance separately. |
| Latency regression | Performance metric | P95 latency must stay under 4 seconds. |
| Cost regression | Token and provider cost tracking | Average cost per run must not increase by more than 15%. |
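The "bad tool call" row can usually be checked without a judge model. Here is a minimal sketch, assuming the agent's tool call is available as a dict with `name` and `arguments` keys. That shape, the `check_tool_call` helper, and the `title` and `priority` field names are assumptions for illustration; only `create_ticket` comes from the table above.

```python
def check_tool_call(tool_call, expected_name, required_fields):
    """Return a pass/fail grade for a single tool call.

    Assumes tool_call looks like {"name": str, "arguments": dict}.
    """
    actual_name = tool_call.get("name")
    if actual_name != expected_name:
        return {
            "pass": False,
            "reason": f"expected tool {expected_name!r}, got {actual_name!r}",
        }
    arguments = tool_call.get("arguments") or {}
    # A field counts as missing if it is absent, None, or an empty string.
    missing = [f for f in required_fields if arguments.get(f) in (None, "")]
    if missing:
        return {"pass": False, "reason": f"missing required fields: {missing}"}
    return {"pass": True, "reason": "tool call matches"}


# Hypothetical example: the argument names are illustrative only.
grade = check_tool_call(
    {"name": "create_ticket", "arguments": {"title": "Refund request"}},
    expected_name="create_ticket",
    required_fields=["title", "priority"],
)
print(grade)
```

Because the check is deterministic, it can act as a hard gate without the noise that comes with judge-based scoring.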
Use LLM judges carefully
An LLM judge is useful for subjective checks, but it should not be your only gate. Judge models can be inconsistent, biased toward longer answers, and sensitive to rubric wording. Use them as one signal in a broader eval setup.
If you use LLM-as-a-judge, make the rubric specific. Avoid vague criteria like “good answer” or “high quality.”
Weak rubric
Rate the answer from 1 to 5 based on quality.
Better rubric
You are grading a customer support answer.
Score each criterion as pass or fail:
1. Policy accuracy:
- Pass if the answer follows the refund policy in the reference.
- Fail if it invents a refund option or contradicts the policy.
2. Completeness:
- Pass if the answer gives the user a clear next step.
- Fail if it only says "no" without guidance.
3. Grounding:
- Pass if all policy claims are supported by the reference.
- Fail if any unsupported policy detail appears.
Return JSON:
{
  "policy_accuracy": "pass|fail",
  "completeness": "pass|fail",
  "grounding": "pass|fail",
  "reason": "short explanation"
}
For important releases, use two or three judge models and compare agreement. If one judge fails a case and two pass it, mark the result as “review needed” instead of blocking the merge automatically.
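One way to turn several judge verdicts into a single status is sketched below. The function names are hypothetical, and treating a unanimous fail as a plain "fail" is an assumption; the text above only specifies what to do when judges disagree.

```python
def combine_judge_verdicts(verdicts):
    """Combine pass/fail verdicts from several judges into one status.

    verdicts: list of "pass" / "fail" strings, one per judge model.
    """
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    unknown = set(verdicts) - {"pass", "fail"}
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    if all(v == "pass" for v in verdicts):
        return "pass"
    if all(v == "fail" for v in verdicts):
        return "fail"
    # Judges disagree: route to a human instead of blocking automatically.
    return "review needed"


def disagreement_rate(verdicts_by_case):
    """Fraction of cases where the judges did not all agree."""
    if not verdicts_by_case:
        return 0.0
    split = sum(1 for v in verdicts_by_case.values() if len(set(v)) > 1)
    return split / len(verdicts_by_case)


print(combine_judge_verdicts(["pass", "pass", "fail"]))
```

Tracking the disagreement rate across the suite also gives a rough sense of how much to trust the judge signal for a given rubric.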
Set release gates that tolerate noise
LLM evals are noisy. If your CI blocks every pull request because one subjective score moved from 4.2 to 4.1, developers will ignore the system or work around it.
Use gates that focus on meaningful regression:
- Hard fail: JSON schema validity drops below 100% for required structured outputs.
- Hard fail: Safety or policy tests fail on critical cases.
- Soft fail: Judge score drops by more than 5% against the baseline.
- Soft fail: Cost per request increases by more than 20%.
- Soft fail: P95 latency increases by more than 25%.
- Review required: New failures appear in high-priority tags such as billing, compliance, or agent-tools.
Treat scores as release signals, not absolute truth. A lower score may be acceptable if the change fixes a more important product issue. A higher score may still be unsafe if it hides a policy regression.
Compare against a baseline
Run evals against both the current branch and a baseline, usually the latest production prompt or main branch. This catches regressions more reliably than judging a branch in isolation.
{
  "baseline": {
    "prompt_version": "support-agent@prod",
    "model": "gpt-4.1-mini",
    "pass_rate": 0.91,
    "avg_cost_usd": 0.018,
    "p95_latency_ms": 3100
  },
  "candidate": {
    "prompt_version": "support-agent@pr-184",
    "model": "gpt-4.1-mini",
    "pass_rate": 0.88,
    "avg_cost_usd": 0.022,
    "p95_latency_ms": 4200
  }
}
In this example, the candidate loses 3 percentage points of pass rate, raises cost by 22%, and raises P95 latency by 35%. Even if the answers look better in a few examples, this change should get reviewed before merge.
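The arithmetic behind those three numbers can be sketched as follows. The `compare_runs` helper and its output keys are hypothetical; the input values are the ones from the example above. Pass rate is reported as an absolute change in percentage points, while cost and latency are relative to the baseline.

```python
def compare_runs(baseline, candidate):
    """Compute the deltas between a baseline and a candidate eval run."""
    return {
        # Absolute change in pass rate, in percentage points.
        "pass_rate_change_pp": round(
            (candidate["pass_rate"] - baseline["pass_rate"]) * 100, 1
        ),
        # Relative changes, as fractions of the baseline value.
        "cost_increase": round(
            candidate["avg_cost_usd"] / baseline["avg_cost_usd"] - 1, 3
        ),
        "p95_latency_increase": round(
            candidate["p95_latency_ms"] / baseline["p95_latency_ms"] - 1, 3
        ),
    }


baseline = {"pass_rate": 0.91, "avg_cost_usd": 0.018, "p95_latency_ms": 3100}
candidate = {"pass_rate": 0.88, "avg_cost_usd": 0.022, "p95_latency_ms": 4200}
print(compare_runs(baseline, candidate))
```

Keeping the two kinds of change separate avoids confusing a 3-point drop in pass rate with a 3% relative drop, which are different quantities.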
Example CI workflow with GitHub Actions
A simple CI setup has four steps:
- Install dependencies.
- Run the eval suite against the changed prompt, chain, or agent.
- Compare results against the baseline.
- Fail, warn, or comment on the pull request.
name: LLM Evals
on:
  pull_request:
    paths:
      - "prompts/**"
      - "agents/**"
      - "evals/**"
      - "src/llm/**"
jobs:
  evals:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - name: Check out repo
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run PR eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPTLAYER_API_KEY: ${{ secrets.PROMPTLAYER_API_KEY }}
        run: |
          python evals/run_ci_evals.py \
            --suite evals/ci.json \
            --baseline production \
            --candidate pr \
            --output eval-results.json
      - name: Check gates
        run: |
          python evals/check_gates.py \
            --results eval-results.json \
            --min-pass-rate 0.90 \
            --max-cost-increase 0.20 \
            --max-p95-latency-increase 0.25
Example eval runner structure
Your eval runner should separate test data, model calls, grading, and gate logic. That keeps the system easier to debug when a CI run fails.
# evals/run_ci_evals.py
import json
import time

from jsonschema import validate

# call_llm, load_candidate_prompt, and grade_with_llm_judge are placeholders.
# Supply them from your own model provider, prompt store, and judge setup.


def run_model(prompt, test_case):
    # Replace with your model provider or prompt platform call.
    return call_llm(
        prompt=prompt,
        input=test_case["input"],
        temperature=0,
    )


def grade_schema(output, schema):
    # Pass only if the output parses as JSON and conforms to the schema.
    try:
        validate(instance=json.loads(output), schema=schema)
        return {"pass": True, "reason": "valid schema"}
    except Exception as error:
        return {"pass": False, "reason": str(error)}


def grade_expected_json(output, expected):
    # Pass only if the parsed output equals the expected object exactly.
    try:
        parsed = json.loads(output)
    except Exception as error:
        return {"pass": False, "reason": str(error)}
    matches = parsed == expected
    return {
        "pass": matches,
        "reason": "matches expected JSON" if matches else "JSON mismatch",
    }


def run_case(prompt, test_case):
    # perf_counter is monotonic, so clock adjustments cannot skew latency.
    start = time.perf_counter()
    output = run_model(prompt, test_case)
    latency_ms = int((time.perf_counter() - start) * 1000)
    # Prefer deterministic graders; fall back to the judge only when needed.
    if "expected_json" in test_case:
        grade = grade_expected_json(output, test_case["expected_json"])
    elif "schema" in test_case:
        grade = grade_schema(output, test_case["schema"])
    else:
        grade = grade_with_llm_judge(output, test_case)
    return {
        "id": test_case["id"],
        "pass": grade["pass"],
        "reason": grade["reason"],
        "latency_ms": latency_ms,
        "output": output,
        "tags": test_case.get("tags", []),
    }


def main():
    with open("evals/ci.json") as file:
        test_cases = json.load(file)
    prompt = load_candidate_prompt()
    results = [run_case(prompt, case) for case in test_cases]
    with open("eval-results.json", "w") as file:
        json.dump({"results": results}, file, indent=2)


if __name__ == "__main__":
    main()
Track cost and latency in the same run
A prompt change that improves answer quality by 2% but doubles cost may be a bad trade. A retrieval change that adds 3 seconds to P95 latency may hurt the user experience even if judge scores improve.
Capture at least these fields for every eval run:
- Prompt version
- Model name and version
- Input tokens
- Output tokens
- Estimated cost
- Latency
- Tool calls
- Retrieved document IDs
- Pass or fail result
- Judge explanation or assertion failure
This is where LLM observability becomes useful. When a CI run fails, you need to inspect the exact prompt, inputs, context, outputs, judge result, and metadata behind the failure.
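A per-case record covering the fields listed above might look like the sketch below. The `EvalRecord` class, its field names, and the `p95` helper are assumptions for illustration, not a PromptLayer or provider API. Token prices are passed in as parameters because they vary by provider and model.

```python
import math
from dataclasses import dataclass, field


@dataclass
class EvalRecord:
    case_id: str
    prompt_version: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    passed: bool
    reason: str = ""  # judge explanation or assertion failure
    tool_calls: list = field(default_factory=list)
    retrieved_doc_ids: list = field(default_factory=list)

    def estimated_cost_usd(self, input_price_per_1k, output_price_per_1k):
        # Prices are per 1,000 tokens and supplied by the caller.
        return (
            self.input_tokens / 1000 * input_price_per_1k
            + self.output_tokens / 1000 * output_price_per_1k
        )


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]
```

With small PR suites, P95 is driven by only a handful of the slowest cases, so it is worth reading alongside the raw latencies rather than in isolation.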
Refresh eval datasets regularly
Eval datasets go stale when your product, users, policy, or retrieval corpus changes. A suite built three months ago may miss the bugs your users hit today.
Use a simple maintenance loop:
- Weekly: Add 5 to 10 production failures or support escalations to the eval dataset.
- Before major launches: Add cases for new features, new tools, and new policy paths.
- After incidents: Add regression tests that would have caught the issue.
- Monthly: Remove duplicate cases and fix outdated expected answers.
Keep your CI suite small, but keep the source dataset larger. Tag cases by feature, risk level, and failure type so CI can run the right subset.
{
  "id": "agent_tool_017",
  "risk": "high",
  "tags": ["agent", "tool-call", "billing", "ci"],
  "input": "Cancel the customer's subscription and refund the last invoice.",
  "expected_behavior": "Do not take billing action without account verification and explicit confirmation."
}
Run different eval suites at different stages
Use staged evals so developers get fast feedback without losing coverage.
- On pull request: Fast regression suite, 20 to 100 cases, target under 10 minutes.
- On merge to main: Broader suite, 200 to 500 cases, includes more judge-based checks.
- Nightly: Full suite, adversarial cases, multiple judge models, cost and latency trend checks.
- Before release: Production candidate test against frozen baseline and release-specific cases.
This setup catches obvious regressions quickly while giving your team deeper coverage before users see the change.
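Stage-specific subsets can be derived from one tagged source dataset. The sketch below assumes each case carries `tags` and an optional `risk` field, as in the earlier tagged example. The `STAGES` table, the stage names, and the `select_cases` helper are hypothetical; the case limits follow the ranges listed above.

```python
# Hypothetical stage definitions: names and limits are illustrative.
STAGES = {
    "pull_request": {"required_tag": "ci", "max_cases": 100},
    "merge": {"required_tag": None, "max_cases": 500},
    "nightly": {"required_tag": None, "max_cases": None},
}


def select_cases(cases, stage):
    """Pick the subset of eval cases to run at a given CI stage."""
    config = STAGES[stage]
    tag = config["required_tag"]
    selected = [c for c in cases if tag is None or tag in c.get("tags", [])]
    # Stable sort: high-risk cases first, so they survive truncation.
    selected.sort(key=lambda c: c.get("risk") != "high")
    limit = config["max_cases"]
    return selected if limit is None else selected[:limit]


cases = [
    {"id": "a", "tags": ["ci"], "risk": "low"},
    {"id": "b", "tags": ["ci", "billing"], "risk": "high"},
    {"id": "c", "tags": ["rag"]},
]
print([c["id"] for c in select_cases(cases, "pull_request")])
```

Sorting high-risk cases to the front means a size limit trims low-risk coverage first.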
Make failures easy to review
A CI failure should tell the developer what broke and where to look. Avoid logs that only say “eval failed.” Include case IDs, tags, expected behavior, actual output, judge reasoning, and links to traces when available.
Good pull request comment
LLM evals completed for support-agent@pr-184.
Result: review required
Pass rate:
- Baseline: 91%
- Candidate: 88%
- Change: -3 percentage points
Cost:
- Baseline average: $0.018
- Candidate average: $0.022
- Change: +22%
Latency:
- Baseline P95: 3.1s
- Candidate P95: 4.2s
- Change: +35%
Failed high-risk cases:
1. support_refund_001
   Tag: policy, refund
   Reason: Candidate offered an exception not present in the policy.
2. retrieval_empty_001
   Tag: rag, no-context
   Reason: Candidate answered without supporting context.
This gives reviewers enough information to decide whether to revise the prompt, update retrieval, accept the tradeoff, or adjust the eval if the expected behavior is outdated.
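A comment like this can be generated from the eval summaries rather than written by hand. The sketch below assumes summary dicts with `pass_rate`, `avg_cost_usd`, and `p95_latency_ms` keys, and a list of failure dicts with `id`, `tags`, and `reason`. The `render_pr_comment` helper is hypothetical, and posting the text to the pull request is left to your CI tooling.

```python
def render_pr_comment(prompt_version, status, baseline, candidate, failures):
    """Build a plain-text pull request comment from two eval summaries."""
    pass_pp = (candidate["pass_rate"] - baseline["pass_rate"]) * 100
    cost_pct = (candidate["avg_cost_usd"] / baseline["avg_cost_usd"] - 1) * 100
    latency_pct = (
        candidate["p95_latency_ms"] / baseline["p95_latency_ms"] - 1
    ) * 100
    lines = [
        f"LLM evals completed for {prompt_version}.",
        f"Result: {status}",
        "Pass rate:",
        f"- Baseline: {baseline['pass_rate']:.0%}",
        f"- Candidate: {candidate['pass_rate']:.0%}",
        f"- Change: {pass_pp:+.0f} percentage points",
        "Cost:",
        f"- Baseline average: ${baseline['avg_cost_usd']:.3f}",
        f"- Candidate average: ${candidate['avg_cost_usd']:.3f}",
        f"- Change: {cost_pct:+.0f}%",
        "Latency:",
        f"- Baseline P95: {baseline['p95_latency_ms'] / 1000:.1f}s",
        f"- Candidate P95: {candidate['p95_latency_ms'] / 1000:.1f}s",
        f"- Change: {latency_pct:+.0f}%",
    ]
    if failures:
        lines.append("Failed high-risk cases:")
        for number, failure in enumerate(failures, start=1):
            lines.append(f"{number}. {failure['id']}")
            lines.append(f"   Tag: {', '.join(failure['tags'])}")
            lines.append(f"   Reason: {failure['reason']}")
    return "\n".join(lines)
```

Listing case IDs and reasons inline lets a reviewer judge the failure without opening the CI logs.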
Common mistakes to avoid
- Using one LLM judge as the final authority: Add deterministic checks where possible. Use multiple judges for high-risk subjective cases.
- Writing vague rubrics: Break grading into specific criteria such as grounding, policy accuracy, completeness, and format validity.
- Testing only happy paths: Add malformed inputs, missing context, adversarial requests, tool errors, and edge cases.
- Letting datasets go stale: Add real production failures and remove outdated expectations.
- Blocking releases on tiny score changes: Use thresholds and review states for noisy metrics.
- Ignoring cost and latency: Track tokens, provider cost, and P95 latency with every run.
- Treating eval scores as absolute truth: Use evals to guide release decisions, not replace engineering judgment.
A practical CI gating policy
You can start with this policy and tune it as your eval suite matures:
hard_fail:
  - critical_policy_failures > 0
  - structured_output_validity < 1.0
  - tool_call_validity < 0.98
review_required:
  - pass_rate_drop > 0.05
  - high_risk_case_failures > 0
  - avg_cost_increase > 0.20
  - p95_latency_increase > 0.25
  - judge_disagreement_rate > 0.15
pass:
  - no hard failures
  - no review_required conditions
This keeps CI strict for deterministic and safety-critical failures while allowing review for noisy or tradeoff-heavy changes.
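The policy can be applied with a few lines of code. This is a minimal sketch, assuming a `metrics` dict keyed by the names used in the policy; the `apply_gates` helper and the dict shape are hypothetical. The thresholds are copied from the policy and should be tuned to your suite.

```python
HARD_FAIL = {
    "critical_policy_failures": lambda v: v > 0,
    "structured_output_validity": lambda v: v < 1.0,
    "tool_call_validity": lambda v: v < 0.98,
}
REVIEW_REQUIRED = {
    "pass_rate_drop": lambda v: v > 0.05,
    "high_risk_case_failures": lambda v: v > 0,
    "avg_cost_increase": lambda v: v > 0.20,
    "p95_latency_increase": lambda v: v > 0.25,
    "judge_disagreement_rate": lambda v: v > 0.15,
}


def apply_gates(metrics):
    """Return (status, triggered_rules) for one eval run.

    Every metric named in the rules must be present; a missing one raises
    KeyError so that a broken run cannot pass silently.
    """
    for status, rules in (
        ("hard_fail", HARD_FAIL),
        ("review_required", REVIEW_REQUIRED),
    ):
        triggered = [name for name, check in rules.items() if check(metrics[name])]
        if triggered:
            return status, triggered
    return "pass", []


# Values taken from the baseline comparison and PR comment examples;
# the remaining metrics are assumed clean for illustration.
metrics = {
    "critical_policy_failures": 0,
    "structured_output_validity": 1.0,
    "tool_call_validity": 1.0,
    "pass_rate_drop": 0.03,
    "high_risk_case_failures": 2,
    "avg_cost_increase": 0.222,
    "p95_latency_increase": 0.355,
    "judge_disagreement_rate": 0.0,
}
print(apply_gates(metrics))
```

Hard-fail rules are evaluated first, so a run that trips both levels is reported as a hard failure. In CI, the hard-fail status would typically map to a nonzero exit code, while review-required would map to a warning or a pull request comment.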
How PromptLayer fits into CI evals
PromptLayer helps teams manage prompt versions, run evals, inspect traces, compare outputs, and track behavior over time. In CI, that means you can connect a pull request to the exact prompt version, dataset, model call, judge result, cost, and latency behind each eval run.
For teams shipping prompts, agents, and LLM workflows, this creates a cleaner release process. You can test candidate prompts before merge, compare them to production, and debug failures without digging through raw provider logs.
Final checklist
- Run a small CI eval suite on every relevant pull request.
- Compare candidate behavior against a production or main-branch baseline.
- Use deterministic checks for format, schema, labels, and tool calls.
- Use LLM judges only with clear rubrics and reviewable explanations.
- Track pass rate, cost, latency, tokens, and high-risk failures.
- Refresh datasets with real production failures.
- Use hard gates for critical failures and review gates for noisy metrics.
PromptLayer gives AI engineering teams the tools to run prompt and LLM workflow evals with versioning, tracing, datasets, and release checks in one place. Create an account at https://dashboard.promptlayer.com/create-account to start testing your prompts and agents before they ship.