Integrating LLM Evaluations into Your CI Pipeline: Best Practices for AI Teams

How to Run LLM Evals in CI

LLM evals belong in CI when your prompts, models, retrieval logic, tools, or agent workflows affect production behavior. A small prompt edit can break JSON formatting, increase latency, change tool calls, or make a support agent answer outside policy. CI gives your team a repeatable checkpoint before those changes merge.

The goal is not to turn evals into a perfect truth machine. The goal is to catch likely regressions early, compare changes against a known baseline, and give reviewers better release signals.

What to evaluate in CI

Start with the parts of your LLM system that break most often:

Prompt templates: instructions, examples, output schema, tone, refusal behavior, and task boundaries.
Retrieval changes: chunking, ranking, filters, rerankers, metadata, and context packing.
Model changes: model version, temperature, max tokens, tool support, and provider routing.
Agent workflows: tool selection, argument quality, stopping conditions, retry behavior, and error handling.
Structured outputs: JSON validity, schema conformance, enum values, and required fields.
Safety and policy behavior: disallowed content, privacy constraints, escalation paths, and refusal consistency.

If you are new to the topic, start with a basic definition of LLM evaluation, then build a CI process around the behaviors your product depends on.

Use a small, high-signal eval suite for CI

Do not run every eval on every pull request. CI needs to be fast enough that developers respect it. A useful first target is:

20 to 100 test cases for pull requests.
200 to 1,000 test cases for nightly or pre-release runs.
5 to 20 adversarial cases for critical workflows such as refunds, medical disclaimers, financial advice, or permission changes.

Your PR eval suite should cover the highest-risk and most frequently used paths. Include happy paths, but do not stop there. Add cases for missing context, ambiguous requests, malformed inputs, conflicting instructions, empty retrieval results, tool failures, and user attempts to bypass constraints.

Example eval dataset

[
  {
    "id": "support_refund_001",
    "input": "I bought this 45 days ago. Can I get a refund?",
    "expected_behavior": "Explain that the standard refund window is 30 days and offer escalation if policy exceptions apply.",
    "tags": ["support", "policy", "refund"]
  },
  {
    "id": "json_schema_001",
    "input": "Extract name and renewal date: Acme Corp renews on March 4, 2026.",
    "expected_json": {
      "company_name": "Acme Corp",
      "renewal_date": "2026-03-04"
    },
    "tags": ["structured-output"]
  },
  {
    "id": "retrieval_empty_001",
    "input": "What is our policy for enterprise crypto custody?",
    "expected_behavior": "Say the policy is not available in the provided context and avoid inventing details.",
    "tags": ["rag", "no-context", "hallucination"]
  }
]
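
Before a dataset like this reaches CI, it can help to check that every case is well formed, so a typo in the suite does not show up as a model failure. The sketch below is one possible check, not a required format. It assumes each case needs an `id`, a non-empty `input`, and at least one expectation field (`expected_behavior`, `expected_json`, or `schema`); the function name `validate_cases` is hypothetical.

```python
# Assumption: a usable case has "id", a non-empty "input", and at least one expectation field.
EXPECTATION_KEYS = ("expected_behavior", "expected_json", "schema")


def validate_cases(cases):
    """Return a list of human-readable problems; an empty list means the suite looks usable."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        case_id = case.get("id", f"<case #{index}>")
        if "id" not in case:
            problems.append(f"{case_id}: missing 'id'")
        elif case["id"] in seen_ids:
            problems.append(f"{case_id}: duplicate id")
        seen_ids.add(case.get("id"))
        if not case.get("input"):
            problems.append(f"{case_id}: missing or empty 'input'")
        if not any(key in case for key in EXPECTATION_KEYS):
            problems.append(f"{case_id}: no expectation field")
    return problems
```

Running this as the first step of the eval job lets CI stop early with a clear message when the dataset itself is broken.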

Choose eval types that match the failure mode

Do not use one scoring method for everything. Different failures need different checks.

Failure mode	Best eval type	Example check
Invalid JSON	Deterministic test	Parse output and validate against a JSON Schema.
Wrong classification	Exact match or label match	Expected label is `billing_issue`.
Bad tool call	Programmatic assertion	Tool name must be `create_ticket` with required fields.
Hallucinated answer	Reference-based judge plus retrieval checks	Answer must be supported by provided context.
Poor helpfulness	LLM judge with a specific rubric	Score clarity, completeness, and policy compliance separately.
Latency regression	Performance metric	P95 latency must stay under 4 seconds.
Cost regression	Token and provider cost tracking	Average cost per run must not increase by more than 15%.
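
As an illustration of the programmatic assertion row, here is a minimal sketch of a tool-call check. It assumes the tool call has already been parsed into a dict shaped like `{"name": ..., "arguments": {...}}`, which may not match your provider's format, and the field names used in the usage note (`subject`, `priority`) are hypothetical.

```python
def check_tool_call(tool_call, expected_name, required_fields):
    """Deterministic check for a single tool call.

    Assumes tool_call looks like {"name": str, "arguments": dict}.
    """
    actual_name = tool_call.get("name")
    if actual_name != expected_name:
        return {"pass": False, "reason": f"expected tool {expected_name!r}, got {actual_name!r}"}
    arguments = tool_call.get("arguments") or {}
    # Treat absent, None, and empty-string values as missing.
    missing = [field for field in required_fields if arguments.get(field) in (None, "")]
    if missing:
        return {"pass": False, "reason": f"missing required fields: {', '.join(missing)}"}
    return {"pass": True, "reason": "tool call matches"}
```

For example, `check_tool_call(call, "create_ticket", ["subject", "priority"])` fails with a readable reason when the model picks the wrong tool or omits an argument, and it needs no judge model to do so.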

Use LLM judges carefully

An LLM judge is useful for subjective checks, but it should not be your only gate. Judge models can be inconsistent, biased toward longer answers, and sensitive to rubric wording. Use them as one signal in a broader eval setup.

If you use LLM-as-a-judge, make the rubric specific. Avoid vague criteria like “good answer” or “high quality.”

Weak rubric

Rate the answer from 1 to 5 based on quality.

Better rubric

You are grading a customer support answer.

Score each criterion as pass or fail:

1. Policy accuracy:
- Pass if the answer follows the refund policy in the reference.
- Fail if it invents a refund option or contradicts the policy.

2. Completeness:
- Pass if the answer gives the user a clear next step.
- Fail if it only says "no" without guidance.

3. Grounding:
- Pass if all policy claims are supported by the reference.
- Fail if any unsupported policy detail appears.

Return JSON:
{
  "policy_accuracy": "pass|fail",
  "completeness": "pass|fail",
  "grounding": "pass|fail",
  "reason": "short explanation"
}

For important releases, use two or three judge models and compare agreement. If one judge fails a case and two pass it, mark the result as “review needed” instead of blocking the merge automatically.
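
A minimal sketch of that agreement rule follows. It assumes each judge's result has been reduced to a boolean, and it treats only unanimous verdicts as decisive; the unanimous-fail behavior and the function names are assumptions for illustration, not a fixed recipe.

```python
def combine_judges(verdicts):
    """Combine pass/fail verdicts from several judge models.

    verdicts: list of booleans, one per judge (True means pass).
    Returns "pass" or "fail" only when the judges are unanimous, else "review_needed".
    """
    if not verdicts:
        raise ValueError("need at least one judge verdict")
    if all(verdicts):
        return "pass"
    if not any(verdicts):
        return "fail"
    return "review_needed"


def disagreement_rate(per_case_verdicts):
    """Fraction of cases where the judges did not all agree."""
    if not per_case_verdicts:
        return 0.0
    split = sum(1 for verdicts in per_case_verdicts if combine_judges(verdicts) == "review_needed")
    return split / len(per_case_verdicts)
```

Tracking the disagreement rate over time is a cheap way to notice when a rubric has become ambiguous.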

Set release gates that tolerate noise

LLM evals are noisy. If your CI blocks every pull request because one subjective score moved from 4.2 to 4.1, developers will ignore the system or work around it.

Use gates that focus on meaningful regression:

Hard fail: JSON schema validity drops below 100% for required structured outputs.
Hard fail: Safety or policy tests fail on critical cases.
Soft fail: Judge score drops by more than 5% against the baseline.
Soft fail: Cost per request increases by more than 20%.
Soft fail: P95 latency increases by more than 25%.
Review required: New failures appear in high-priority tags such as billing, compliance, or agent-tools.

Treat scores as release signals, not absolute truth. A lower score may be acceptable if the change fixes a more important product issue. A higher score may still be unsafe if it hides a policy regression.

Compare against a baseline

Run evals against both the current branch and a baseline, usually the latest production prompt or main branch. This catches regressions more reliably than judging a branch in isolation.

{
  "baseline": {
    "prompt_version": "support-agent@prod",
    "model": "gpt-4.1-mini",
    "pass_rate": 0.91,
    "avg_cost_usd": 0.018,
    "p95_latency_ms": 3100
  },
  "candidate": {
    "prompt_version": "support-agent@pr-184",
    "model": "gpt-4.1-mini",
    "pass_rate": 0.88,
    "avg_cost_usd": 0.022,
    "p95_latency_ms": 4200
  }
}

In this example, the candidate loses 3 percentage points of pass rate, raises cost by 22%, and raises P95 latency by 35%. Even if the answers look better in a few examples, this change should get reviewed before merge.
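
The arithmetic behind those numbers can be sketched as a small helper. It assumes both summaries use the keys from the JSON above; `compare_runs` is a hypothetical name. Pass rate is reported in percentage points, while cost and latency are reported as relative change.

```python
def compare_runs(baseline, candidate):
    """Compute pass-rate, cost, and latency deltas from two summary dicts."""

    def relative_change_pct(key):
        return (candidate[key] - baseline[key]) / baseline[key] * 100

    return {
        # Absolute difference between two rates, in percentage points.
        "pass_rate_change_pp": round((candidate["pass_rate"] - baseline["pass_rate"]) * 100, 1),
        # Relative change against the baseline value, in percent.
        "cost_change_pct": round(relative_change_pct("avg_cost_usd"), 1),
        "p95_latency_change_pct": round(relative_change_pct("p95_latency_ms"), 1),
    }


baseline = {"pass_rate": 0.91, "avg_cost_usd": 0.018, "p95_latency_ms": 3100}
candidate = {"pass_rate": 0.88, "avg_cost_usd": 0.022, "p95_latency_ms": 4200}
print(compare_runs(baseline, candidate))
# {'pass_rate_change_pp': -3.0, 'cost_change_pct': 22.2, 'p95_latency_change_pct': 35.5}
```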

Example CI workflow with GitHub Actions

A simple CI setup has four steps:

1. Install dependencies.
2. Run the eval suite against the changed prompt, chain, or agent.
3. Compare results against the baseline.
4. Fail, warn, or comment on the pull request.

name: LLM Evals

on:
  pull_request:
    paths:
      - "prompts/**"
      - "agents/**"
      - "evals/**"
      - "src/llm/**"

jobs:
  evals:
    runs-on: ubuntu-latest
    timeout-minutes: 20

    steps:
      - name: Check out repo
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run PR eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPTLAYER_API_KEY: ${{ secrets.PROMPTLAYER_API_KEY }}
        run: |
          python evals/run_ci_evals.py \
            --suite evals/ci.json \
            --baseline production \
            --candidate pr \
            --output eval-results.json

      - name: Check gates
        run: |
          python evals/check_gates.py \
            --results eval-results.json \
            --min-pass-rate 0.90 \
            --max-cost-increase 0.20 \
            --max-p95-latency-increase 0.25

Example eval runner structure

Your eval runner should separate test data, model calls, grading, and gate logic. That keeps the system easier to debug when a CI run fails.

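# evals/run_ci_evals.py

import argparse
import json
import time

from jsonschema import validate

def call_llm(prompt, input, temperature):
    # Placeholder: replace with your model provider or prompt platform call.
    raise NotImplementedError("connect call_llm to your LLM client")

def grade_with_llm_judge(output, test_case):
    # Placeholder: call your judge model and return {"pass": bool, "reason": str}.
    raise NotImplementedError("connect grade_with_llm_judge to your judge")

def load_candidate_prompt(version):
    # Placeholder: load the prompt for the given version label (for example "pr").
    raise NotImplementedError("connect load_candidate_prompt to your prompt store")

def run_model(prompt, test_case):
    # Temperature 0 reduces run-to-run variation but does not guarantee identical outputs.
    return call_llm(
        prompt=prompt,
        input=test_case["input"],
        temperature=0
    )

def grade_schema(output, schema):
    # Pass only if the output parses as JSON and conforms to the schema.
    try:
        validate(instance=json.loads(output), schema=schema)
        return {"pass": True, "reason": "valid schema"}
    except Exception as error:
        return {"pass": False, "reason": str(error)}

def grade_expected_json(output, expected):
    # Pass only if the parsed output equals the expected object exactly.
    try:
        matches = json.loads(output) == expected
        return {
            "pass": matches,
            "reason": "matches expected JSON" if matches else "JSON mismatch"
        }
    except Exception as error:
        return {"pass": False, "reason": str(error)}

def run_case(prompt, test_case):
    # perf_counter is monotonic, so it is safer than time.time() for measuring durations.
    start = time.perf_counter()
    output = run_model(prompt, test_case)
    latency_ms = int((time.perf_counter() - start) * 1000)

    # Prefer deterministic graders; fall back to the LLM judge only when none applies.
    if "expected_json" in test_case:
        grade = grade_expected_json(output, test_case["expected_json"])
    elif "schema" in test_case:
        grade = grade_schema(output, test_case["schema"])
    else:
        grade = grade_with_llm_judge(output, test_case)

    return {
        "id": test_case["id"],
        "pass": grade["pass"],
        "reason": grade["reason"],
        "latency_ms": latency_ms,
        "output": output,
        "tags": test_case.get("tags", [])
    }

def main():
    # Accept the same flags the CI workflow passes, with the original paths as defaults.
    parser = argparse.ArgumentParser(description="Run the CI eval suite.")
    parser.add_argument("--suite", default="evals/ci.json")
    parser.add_argument("--baseline", default="production")
    parser.add_argument("--candidate", default="pr")
    parser.add_argument("--output", default="eval-results.json")
    args = parser.parse_args()

    with open(args.suite) as file:
        test_cases = json.load(file)

    # Only the candidate is run here; the baseline label is recorded for a later comparison step.
    prompt = load_candidate_prompt(args.candidate)
    results = [run_case(prompt, case) for case in test_cases]

    with open(args.output, "w") as file:
        json.dump(
            {"baseline": args.baseline, "candidate": args.candidate, "results": results},
            file,
            indent=2
        )

if __name__ == "__main__":
    main()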
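
The per-case results still need to be rolled up into the numbers a gate can compare. The sketch below shows one way to do that, assuming each result carries the `pass` and `latency_ms` fields from the runner above. The `summarize` name is hypothetical, and P95 is computed with the nearest-rank method, which is only one of several common percentile definitions.

```python
def summarize(results):
    """Summarize per-case results into a pass rate and a P95 latency."""
    if not results:
        raise ValueError("no results to summarize")
    passed = sum(1 for result in results if result["pass"])
    latencies = sorted(result["latency_ms"] for result in results)
    # Nearest-rank P95: 1-based rank = ceil(0.95 * n), done in integer math to avoid float error.
    rank = (95 * len(latencies) + 99) // 100
    return {
        "pass_rate": passed / len(results),
        "p95_latency_ms": latencies[rank - 1],
    }
```

With only 20 to 100 cases, P95 is driven by a handful of slow calls, so treat small movements in it as noise rather than a regression.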

Track cost and latency in the same run

A prompt change that improves answer quality by 2% but doubles cost may be a bad trade. A retrieval change that adds 3 seconds to P95 latency may hurt the user experience even if judge scores improve.

Capture at least these fields for every eval run:

Prompt version
Model name and version
Input tokens
Output tokens
Estimated cost
Latency
Tool calls
Retrieved document IDs
Pass or fail result
Judge explanation or assertion failure
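
One way to keep these fields together is a small record type, sketched below. The class name, field names, and per-1,000-token prices are assumptions for illustration; real prices come from your provider's price sheet and are passed in by the caller.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class EvalRecord:
    """One eval run with the metadata needed to debug it later."""

    prompt_version: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    passed: bool
    explanation: str  # judge explanation or assertion failure message
    tool_calls: list = field(default_factory=list)
    retrieved_doc_ids: list = field(default_factory=list)

    def estimated_cost_usd(self, input_price_per_1k, output_price_per_1k):
        # Prices are caller-supplied assumptions, expressed in USD per 1,000 tokens.
        return (
            self.input_tokens * input_price_per_1k
            + self.output_tokens * output_price_per_1k
        ) / 1000
```

Calling `asdict(record)` turns a record into a plain dict that can be written to the results file alongside the pass or fail outcome.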

This is where LLM observability becomes useful. When a CI run fails, you need to inspect the exact prompt, inputs, context, outputs, judge result, and metadata behind the failure.

Refresh eval datasets regularly

Eval datasets go stale when your product, users, policy, or retrieval corpus changes. A suite built three months ago may miss the bugs your users hit today.

Use a simple maintenance loop:

Weekly: Add 5 to 10 production failures or support escalations to the eval dataset.
Before major launches: Add cases for new features, new tools, and new policy paths.
After incidents: Add regression tests that would have caught the issue.
Monthly: Remove duplicate cases and fix outdated expected answers.

Keep your CI suite small, but keep the source dataset larger. Tag cases by feature, risk level, and failure type so CI can run the right subset.

{
  "id": "agent_tool_017",
  "risk": "high",
  "tags": ["agent", "tool-call", "billing", "ci"],
  "input": "Cancel the customer's subscription and refund the last invoice.",
  "expected_behavior": "Do not take billing action without account verification and explicit confirmation."
}
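
Selecting the CI subset from a larger tagged dataset can be as simple as the filter sketched here. It assumes cases carry the `tags` and `risk` fields shown above; the `select_cases` name and its parameters are hypothetical.

```python
def select_cases(cases, required_tags=(), risk_levels=None):
    """Return cases that carry every required tag and, if given, an allowed risk level."""
    wanted = set(required_tags)
    selected = []
    for case in cases:
        if not wanted <= set(case.get("tags", [])):
            continue  # at least one required tag is missing
        if risk_levels is not None and case.get("risk") not in risk_levels:
            continue  # risk level is not in the allowed set
        selected.append(case)
    return selected
```

For example, `select_cases(all_cases, required_tags=["ci"])` would build the pull request suite, while `select_cases(all_cases, risk_levels={"high"})` would pull every high-risk case for a pre-release run.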

Run different eval suites at different stages

Use staged evals so developers get fast feedback without losing coverage.

On pull request: Fast regression suite, 20 to 100 cases, target under 10 minutes.
On merge to main: Broader suite, 200 to 500 cases, includes more judge-based checks.
Nightly: Full suite, adversarial cases, multiple judge models, cost and latency trend checks.
Before release: Production candidate test against frozen baseline and release-specific cases.

This setup catches obvious regressions quickly while giving your team deeper coverage before users see the change.

Make failures easy to review

A CI failure should tell the developer what broke and where to look. Avoid logs that only say “eval failed.” Include case IDs, tags, expected behavior, actual output, judge reasoning, and links to traces when available.

Good pull request comment

LLM evals completed for support-agent@pr-184.

Result: review required

Pass rate:
- Baseline: 91%
- Candidate: 88%
- Change: -3 percentage points

Cost:
- Baseline average: $0.018
- Candidate average: $0.022
- Change: +22%

Latency:
- Baseline P95: 3.1s
- Candidate P95: 4.2s
- Change: +35%

Failed high-risk cases:
1. support_refund_001
   Tag: policy, refund
   Reason: Candidate offered an exception not present in the policy.

2. retrieval_empty_001
   Tag: rag, no-context
   Reason: Candidate answered without supporting context.

This gives reviewers enough information to decide whether to revise the prompt, update retrieval, accept the tradeoff, or adjust the eval if the expected behavior is outdated.
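
The failure section of a comment like this can be generated from the eval results rather than written by hand. The sketch below assumes each result is a dict with `id`, `pass`, `reason`, and optional `tags`; the `format_failures` name and the `limit` parameter are hypothetical.

```python
def format_failures(results, limit=10):
    """Render failed cases as plain text for a pull request comment."""
    failed = [result for result in results if not result["pass"]]
    if not failed:
        return "No failed cases."
    lines = [f"Failed cases ({len(failed)}):"]
    for number, result in enumerate(failed[:limit], start=1):
        lines.append(f"{number}. {result['id']}")
        lines.append(f"   Tag: {', '.join(result.get('tags', []))}")
        lines.append(f"   Reason: {result['reason']}")
    if len(failed) > limit:
        # Keep the comment short; the full list stays in the results file.
        lines.append(f"... and {len(failed) - limit} more")
    return "\n".join(lines)
```

Capping the list keeps the comment readable when a bad change fails dozens of cases at once.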

Common mistakes to avoid

Using one LLM judge as the final authority: Add deterministic checks where possible. Use multiple judges for high-risk subjective cases.
Writing vague rubrics: Break grading into specific criteria such as grounding, policy accuracy, completeness, and format validity.
Testing only happy paths: Add malformed inputs, missing context, adversarial requests, tool errors, and edge cases.
Letting datasets go stale: Add real production failures and remove outdated expectations.
Blocking releases on tiny score changes: Use thresholds and review states for noisy metrics.
Ignoring cost and latency: Track tokens, provider cost, and P95 latency with every run.
Treating eval scores as absolute truth: Use evals to guide release decisions, not replace engineering judgment.

A practical CI gating policy

You can start with this policy and tune it as your eval suite matures:

hard_fail:
  - critical_policy_failures > 0
  - structured_output_validity < 1.0
  - tool_call_validity < 0.98

review_required:
  - pass_rate_drop > 0.05
  - high_risk_case_failures > 0
  - avg_cost_increase > 0.20
  - p95_latency_increase > 0.25
  - judge_disagreement_rate > 0.15

pass:
  - no hard failures
  - no review_required conditions

This keeps CI strict for deterministic and safety-critical failures while allowing review for noisy or tradeoff-heavy changes.
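
A policy like this could be applied in code roughly as follows. This is a sketch under the assumption that an earlier step has already computed one summary value per metric name used in the policy; the `decide` function and the rule tables are hypothetical, and the thresholds are simply copied from the policy above.

```python
# Each rule returns True when the metric breaches its threshold.
HARD_FAIL_RULES = {
    "critical_policy_failures": lambda value: value > 0,
    "structured_output_validity": lambda value: value < 1.0,
    "tool_call_validity": lambda value: value < 0.98,
}
REVIEW_RULES = {
    "pass_rate_drop": lambda value: value > 0.05,
    "high_risk_case_failures": lambda value: value > 0,
    "avg_cost_increase": lambda value: value > 0.20,
    "p95_latency_increase": lambda value: value > 0.25,
    "judge_disagreement_rate": lambda value: value > 0.15,
}


def decide(metrics):
    """Return (decision, triggered_rule_names) for a dict of summary metrics."""
    hard = [name for name, breached in HARD_FAIL_RULES.items() if breached(metrics[name])]
    if hard:
        return "hard_fail", hard
    review = [name for name, breached in REVIEW_RULES.items() if breached(metrics[name])]
    if review:
        return "review_required", review
    return "pass", []
```

Returning the triggered rule names alongside the decision lets the CI step print exactly which threshold was crossed. A CI script would typically exit non-zero only on `hard_fail` and post a comment on `review_required`.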

How PromptLayer fits into CI evals

PromptLayer helps teams manage prompt versions, run evals, inspect traces, compare outputs, and track behavior over time. In CI, that means you can connect a pull request to the exact prompt version, dataset, model call, judge result, cost, and latency behind each eval run.

For teams shipping prompts, agents, and LLM workflows, this creates a cleaner release process. You can test candidate prompts before merge, compare them to production, and debug failures without digging through raw provider logs.

Final checklist

Run a small CI eval suite on every relevant pull request.
Compare candidate behavior against a production or main-branch baseline.
Use deterministic checks for format, schema, labels, and tool calls.
Use LLM judges only with clear rubrics and reviewable explanations.
Track pass rate, cost, latency, tokens, and high-risk failures.
Refresh datasets with real production failures.
Use hard gates for critical failures and review gates for noisy metrics.

PromptLayer gives AI engineering teams the tools to run prompt and LLM workflow evals with versioning, tracing, datasets, and release checks in one place. Create an account at https://dashboard.promptlayer.com/create-account to start testing your prompts and agents before they ship.
