How to Compare LLM Outputs in CI

Jun 06, 2026

LLM output comparison in CI should answer one question: did this change make the model behavior better, worse, or risky enough to block the merge?

A normal unit test usually checks one expected value. LLM applications need a different setup because outputs can vary, and a harmless wording change should not fail the build. Your CI should compare behavior across a fixed set of examples, score the results with the right evaluation method, and fail only when the change crosses a clear threshold.

This is useful when you change:

  • A prompt or system message
  • A model version, such as moving from one GPT or Claude release to another
  • A retrieval strategy or chunking pipeline
  • An agent tool definition
  • A chain step that changes downstream context
  • Structured output schemas

Start with a fixed eval dataset

Do not compare LLM outputs against random ad hoc prompts in CI. Use a versioned dataset that represents the behavior your application must preserve.

A good CI eval dataset usually has 50 to 500 examples. Start smaller if your app is early, but keep the examples real. Pull them from production logs, support tickets, user queries, failed cases, or manually written edge cases.

Each row should include:

  • Input: the user message or task
  • Context: retrieved documents, metadata, user profile, tool responses, or previous messages
  • Expected behavior: what a correct answer must do
  • Comparison method: exact match, schema validation, rubric score, pairwise judge, or custom code
  • Tags: billing, safety, citation, refusal, routing, extraction, summarization, and similar categories

For example, a support bot eval row might look like this:

{
  "id": "refund_policy_017",
  "input": "Can I get a refund if my annual plan renewed yesterday?",
  "context": {
    "plan": "annual",
    "renewed_days_ago": 1,
    "policy_doc": "Annual plans can be refunded within 3 days of renewal."
  },
  "expected": {
    "must_include": ["refund eligible", "within 3 days"],
    "must_not_include": ["no refunds"],
    "tone": "clear and helpful"
  },
  "tags": ["support", "policy", "refund"]
}
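
As a rough sketch of how a row like this could be checked in code, the snippet below applies the must_include and must_not_include lists as case-insensitive substring tests. The function name check_phrases is hypothetical, and the tone field is left out because it needs a rubric or judge rather than string matching.

```python
import json

# Subset of the example row above, inlined so the sketch is self-contained.
ROW = json.loads("""
{
  "id": "refund_policy_017",
  "expected": {
    "must_include": ["refund eligible", "within 3 days"],
    "must_not_include": ["no refunds"]
  }
}
""")


def check_phrases(output: str, expected: dict) -> list:
    """Return human-readable failures; an empty list means the row passed."""
    text = output.lower()
    failures = []
    for phrase in expected.get("must_include", []):
        if phrase.lower() not in text:
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in expected.get("must_not_include", []):
        if phrase.lower() in text:
            failures.append(f"contains forbidden phrase: {phrase!r}")
    return failures


good = "Your plan is refund eligible because it renewed within 3 days."
bad = "Sorry, we offer no refunds on annual plans."
print(check_phrases(good, ROW["expected"]))
print(check_phrases(bad, ROW["expected"]))
```

Substring checks like this are brittle for open-ended answers, so they fit best as a cheap first pass before a rubric or judge.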

If you need a broader framework for scoring model behavior, read the PromptLayer guide to LLM evaluation.

Decide what you are comparing against

In CI, you usually compare the pull request output against one of three baselines.

1. Golden expected outputs

Use this when the answer has a known correct form. This works well for extraction, classification, routing, code generation with tests, and structured JSON responses.

Example:

{
  "input": "Book a demo for next Tuesday at 2pm with sam@example.com",
  "expected": {
    "intent": "book_demo",
    "email": "sam@example.com",
    "date": "next Tuesday",
    "time": "2pm"
  }
}

2. Current main branch outputs

Use this when you want to catch regressions caused by a prompt, chain, retrieval, or model change. Your CI runs the same dataset on the main branch version and on the PR version, then compares scores.

This is the most common setup for prompt changes. It lets you ask: did this PR improve the eval set, or did it break important cases?

3. Production reference outputs

Use this when production behavior is already trusted. You store approved traces or outputs from real traffic, then compare new changes against them.

This works well for mature systems, but you need to avoid freezing bad behavior. Mark known-bad examples clearly and exclude them from pass criteria until you fix them.

Use the right comparison method for each task

Plain text diff is usually too strict. If the old answer says “You are eligible for a refund” and the new answer says “Your renewal is still within the refund window,” a string diff may report a large change even though the behavior is correct.

Use different comparison methods depending on the task.

Exact match

Use exact match for labels, enum values, routing decisions, and short deterministic outputs.

Good examples:

  • Intent classification: cancel_subscription
  • Priority routing: high, medium, low
  • Binary decisions: approve or reject

Bad examples:

  • Open-ended support responses
  • Long summaries
  • Agent plans with multiple valid paths
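
Even for labels, models sometimes add stray whitespace, quotes, or a trailing period. A minimal sketch of an exact-match check with light normalization might look like the following; the normalization rules here are an assumption and should match whatever your application actually tolerates.

```python
def normalize_label(raw: str) -> str:
    """Strip surrounding whitespace, quotes, and periods, then lowercase."""
    return raw.strip(" \t\r\n\"'.").lower()


def exact_match(output: str, expected: str) -> bool:
    """Compare two labels after normalization."""
    return normalize_label(output) == normalize_label(expected)


print(exact_match(" Cancel_Subscription.\n", "cancel_subscription"))
print(exact_match("cancel", "cancel_subscription"))
```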

Schema validation

Use schema validation when your application depends on machine-readable output. Validate that the response parses, matches the schema, and contains values in the correct format.

For JSON outputs, fail the test before judging quality if the response cannot be parsed. A beautiful answer is still a production bug if your application expects valid JSON.

{
  "type": "object",
  "required": ["intent", "confidence", "entities"],
  "properties": {
    "intent": {
      "type": "string",
      "enum": ["book_demo", "cancel_demo", "reschedule_demo", "unknown"]
    },
    "confidence": {
      "type": "number",
      "minimum": 0,
      "maximum": 1
    },
    "entities": {
      "type": "object"
    }
  }
}
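
The parse-first, validate-second order can be sketched as below. To stay dependency-free, this version hand-codes the rules from the schema above instead of using a JSON Schema library, which is what a real project would more likely do. The function name validate_response is hypothetical.

```python
import json

ALLOWED_INTENTS = {"book_demo", "cancel_demo", "reschedule_demo", "unknown"}


def validate_response(raw: str) -> list:
    """Return schema errors for one raw model response; empty means valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Parsing failed, so there is nothing further to judge.
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = [f"missing required field: {key}"
              for key in ("intent", "confidence", "entities") if key not in data]
    if "intent" in data:
        intent = data["intent"]
        if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
            errors.append(f"intent not in enum: {intent!r}")
    if "confidence" in data:
        conf = data["confidence"]
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not (is_number and 0 <= conf <= 1):
            errors.append("confidence must be a number between 0 and 1")
    if "entities" in data and not isinstance(data["entities"], dict):
        errors.append("entities must be an object")
    return errors


print(validate_response('{"intent": "book_demo", "confidence": 0.93, "entities": {}}'))
print(validate_response("Sure! Here is the JSON you asked for."))
```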

For more on reliable JSON and typed responses, see PromptLayer’s glossary entry on structured outputs.

Field-level comparison

Use field-level comparison when some fields matter more than others.

For example, in a lead qualification workflow, you may require an exact match on company_domain and lead_status, but allow a small difference in confidence.

{
  "checks": [
    {
      "field": "company_domain",
      "type": "exact"
    },
    {
      "field": "lead_status",
      "type": "exact"
    },
    {
      "field": "confidence",
      "type": "numeric_tolerance",
      "tolerance": 0.10
    },
    {
      "field": "summary",
      "type": "rubric",
      "minimum_score": 4
    }
  ]
}
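
One way to interpret a checks list like this is a small dispatcher that applies a different comparison per field. The sketch below assumes expected and actual are plain dicts and that rubric_scorer is a hypothetical callable returning a 1 to 5 score; a real scorer would call a judge model or custom grader.

```python
def run_field_checks(checks, expected, actual, rubric_scorer):
    """Apply each check to one example and return a list of failure messages."""
    failures = []
    for check in checks:
        field, kind = check["field"], check["type"]
        want, got = expected.get(field), actual.get(field)
        if kind == "exact":
            ok = got == want
        elif kind == "numeric_tolerance":
            both_numbers = all(isinstance(v, (int, float)) for v in (want, got))
            ok = both_numbers and abs(got - want) <= check["tolerance"]
        elif kind == "rubric":
            # The scorer only sees the candidate value, not a golden answer.
            ok = rubric_scorer(got) >= check["minimum_score"]
        else:
            raise ValueError(f"unknown check type: {kind}")
        if not ok:
            failures.append(f"{field}: {kind} check failed (got {got!r})")
    return failures


CHECKS = [
    {"field": "lead_status", "type": "exact"},
    {"field": "confidence", "type": "numeric_tolerance", "tolerance": 0.10},
    {"field": "summary", "type": "rubric", "minimum_score": 4},
]
expected = {"lead_status": "qualified", "confidence": 0.90}
actual = {"lead_status": "qualified", "confidence": 0.85,
          "summary": "Series B fintech, 200 seats."}

# Stand-in scorer for the demo only; it always returns 4.
print(run_field_checks(CHECKS, expected, actual, rubric_scorer=lambda text: 4))
```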

Rubric scoring

Use rubric scoring for open-ended answers. Define what a good answer must include, what it must avoid, and how strict the score should be.

A support answer rubric might score each output from 1 to 5:

  • 5: Correct, complete, cites the right policy, and gives a clear next step
  • 4: Correct and clear, but missing a minor detail
  • 3: Partially correct, but incomplete or vague
  • 2: Misleading or missing the main policy detail
  • 1: Incorrect, unsafe, or directly contradicts policy

You can score rubrics with a judge model, a custom grader, or manual review for a smaller sample.

Pairwise comparison

Use pairwise comparison when you want to know whether the PR output is better than the baseline output. This works well for summarization, customer support, sales assistants, and content generation.

The judge sees the input, the baseline answer, and the candidate answer. It returns one of:

  • candidate_better
  • baseline_better
  • tie

Pairwise comparison is often more stable than asking a judge to assign an absolute score, because the judge only needs to compare two outputs for the same input.

If you use a model to grade another model, define the judge prompt carefully and test it like any other production component. PromptLayer has a separate guide to LLM-as-a-judge workflows.
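
To show how the three verdicts turn into numbers you can gate on, here is a minimal sketch that runs a judge over a set of rows and reports the share of each verdict. The judge is passed in as a plain callable; the length_judge below is a stand-in for the demo only and is not a sensible judging strategy.

```python
from collections import Counter

VERDICTS = ("candidate_better", "baseline_better", "tie")


def tally_pairwise(rows, judge):
    """Run the judge on every row and return the share of each verdict."""
    counts = Counter()
    for row in rows:
        verdict = judge(row["input"], row["baseline"], row["candidate"])
        if verdict not in VERDICTS:
            raise ValueError(f"unexpected verdict {verdict!r} for {row['id']}")
        counts[verdict] += 1
    total = len(rows)
    return {v: (counts[v] / total if total else 0.0) for v in VERDICTS}


def length_judge(_input, baseline, candidate):
    """Demo-only judge: prefers the shorter answer, ties on equal length."""
    if len(candidate) < len(baseline):
        return "candidate_better"
    if len(candidate) > len(baseline):
        return "baseline_better"
    return "tie"


rows = [
    {"id": "a", "input": "q1", "baseline": "long answer here", "candidate": "short"},
    {"id": "b", "input": "q2", "baseline": "ok", "candidate": "much longer"},
    {"id": "c", "input": "q3", "baseline": "same", "candidate": "same"},
    {"id": "d", "input": "q4", "baseline": "wordy reply", "candidate": "brief"},
]
print(tally_pairwise(rows, length_judge))
```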

Make CI comparisons deterministic enough

Temperature 0 helps, but it does not guarantee identical output across providers, model versions, regions, or time. Treat LLM tests as measured evaluations, not pure deterministic unit tests.

To reduce noise:

  • Set temperature to 0 or the lowest useful value for CI
  • Pin the exact model version when the provider supports it
  • Set a seed if the provider supports seeded generation
  • Mock tool calls, API responses, timestamps, and user metadata
  • Use fixed retrieval snapshots instead of a changing live index
  • Cache baseline outputs when possible
  • Run flaky examples more than once before blocking a merge
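
The last item, re-running a flaky example before blocking, can be sketched as a small majority rule. The defaults of 3 attempts and 2 required passes are assumptions, and run_once stands in for whatever function generates and scores one example.

```python
from typing import Callable


def passes_majority(run_once: Callable[[], bool],
                    attempts: int = 3, required: int = 2) -> bool:
    """Re-run one eval example and pass it if enough attempts succeed."""
    passes = 0
    for attempt in range(attempts):
        if run_once():
            passes += 1
        remaining = attempts - attempt - 1
        if passes >= required:
            return True
        if passes + remaining < required:
            return False  # the required count can no longer be reached
    return False


# Demo with canned results instead of real model calls.
flaky = iter([False, True, True])
print(passes_majority(lambda: next(flaky)))
```

Stopping early in both directions keeps the extra model calls to the minimum needed to decide.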

For RAG applications, snapshot the retrieved context. If your CI test hits a live vector database that changes daily, you will not know whether a failure came from the prompt, model, retriever, embedding model, or document set.

Compare traces, not only final answers

For agents and prompt chains, the final answer may look acceptable while the system took a broken path. CI should compare intermediate behavior when it affects reliability.

Useful trace-level checks include:

  • Which tools were called
  • Whether required tools were called before answering
  • Tool call arguments
  • Number of model calls
  • Total token usage
  • Retrieved document IDs
  • Latency by step
  • Whether the model followed the expected chain path

Example agent check:

{
  "input": "What is the status of invoice INV-1042?",
  "expected_trace": {
    "must_call_tools": ["lookup_invoice"],
    "must_not_call_tools": ["issue_refund"],
    "max_model_calls": 3
  }
}

This prevents a PR from passing because the final text sounds good while the agent skipped a required lookup.
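
A check like this could be enforced with a few lines of code. The trace shape below (a model_calls count plus a tool_calls list with a name per call) is an assumption for illustration; adapt it to whatever your tracing system records.

```python
EXPECTED_TRACE = {
    "must_call_tools": ["lookup_invoice"],
    "must_not_call_tools": ["issue_refund"],
    "max_model_calls": 3,
}


def check_trace(trace: dict, expected: dict) -> list:
    """Compare one recorded trace against expected_trace; return failures."""
    called = [call["name"] for call in trace.get("tool_calls", [])]
    failures = []
    for tool in expected.get("must_call_tools", []):
        if tool not in called:
            failures.append(f"required tool not called: {tool}")
    for tool in expected.get("must_not_call_tools", []):
        if tool in called:
            failures.append(f"forbidden tool called: {tool}")
    limit = expected.get("max_model_calls")
    if limit is not None and trace.get("model_calls", 0) > limit:
        failures.append(f"too many model calls: {trace['model_calls']} > {limit}")
    return failures


good_trace = {
    "model_calls": 2,
    "tool_calls": [{"name": "lookup_invoice", "arguments": {"invoice_id": "INV-1042"}}],
}
bad_trace = {"model_calls": 5, "tool_calls": [{"name": "issue_refund", "arguments": {}}]}
print(check_trace(good_trace, EXPECTED_TRACE))
print(check_trace(bad_trace, EXPECTED_TRACE))
```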

Set pass and fail thresholds

Your CI should not fail because one subjective judge score moved from 4 to 3 on a borderline example. It should fail when the change creates a meaningful regression.

Use thresholds such as:

  • Schema validity must stay at 100%
  • Critical safety tests must have 0 failures
  • Intent classification accuracy must not drop by more than 1 percentage point
  • Average rubric score must not drop by more than 0.2
  • Pairwise results must have at least 60% candidate wins and less than 10% baseline wins
  • P95 latency must not increase by more than 20%
  • Cost per run must not increase by more than 15%

Split thresholds by tag. A one-point drop on a low-risk content rewrite may be fine. A single failure on medical advice, legal disclaimers, payment actions, or account deletion should block the merge.
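
The thresholds above translate fairly directly into code. The sketch below hard-codes them for one set of aggregate metrics; the metric names are hypothetical, and to split by tag you could call the same function once per tag with tag-specific metrics and limits.

```python
def threshold_failures(before: dict, after: dict) -> list:
    """Compare PR metrics (after) with baseline metrics (before)."""
    failures = []
    if after["schema_validity"] < 1.0:
        failures.append("schema validity below 100%")
    if after["critical_safety_failures"] > 0:
        failures.append("critical safety test failed")
    if before["intent_accuracy"] - after["intent_accuracy"] > 0.01:
        failures.append("intent accuracy dropped by more than 1 point")
    if before["avg_rubric_score"] - after["avg_rubric_score"] > 0.2:
        failures.append("average rubric score dropped by more than 0.2")
    if after["candidate_win_rate"] < 0.60 or after["baseline_win_rate"] >= 0.10:
        failures.append("pairwise win rates outside limits")
    if after["p95_latency_ms"] > before["p95_latency_ms"] * 1.20:
        failures.append("P95 latency increased by more than 20%")
    if after["cost_usd"] > before["cost_usd"] * 1.15:
        failures.append("cost per run increased by more than 15%")
    return failures


before = {"intent_accuracy": 0.95, "avg_rubric_score": 4.31,
          "p95_latency_ms": 2000, "cost_usd": 1.00}
after = {"schema_validity": 1.0, "critical_safety_failures": 0,
         "intent_accuracy": 0.95, "avg_rubric_score": 4.08,
         "candidate_win_rate": 0.35, "baseline_win_rate": 0.16,
         "p95_latency_ms": 2100, "cost_usd": 1.05}
for failure in threshold_failures(before, after):
    print(failure)
```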

Use a CI workflow that separates hard and soft gates

Hard gates should block a merge. Soft gates should report changes and ask for review.

Good hard gates:

  • Invalid JSON
  • Missing required fields
  • Wrong tool called for a critical action
  • Failed safety refusal
  • Wrong classification label for a high-risk route
  • Major regression against a golden answer

Good soft gates:

  • Small wording differences
  • Minor summary quality changes
  • Moderate cost increases
  • Judge disagreement on subjective examples
  • Lower score on a known flaky example

This keeps CI useful. If every harmless wording change blocks a merge, engineers will start bypassing the eval suite.
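
A minimal sketch of that split: each failure carries a kind, hard kinds set a non-zero exit code, and everything else is reported as a warning. The kind names are hypothetical labels for the hard gates listed above.

```python
HARD_KINDS = {
    "invalid_json",
    "missing_required_field",
    "wrong_critical_tool",
    "failed_safety_refusal",
    "wrong_high_risk_label",
    "golden_regression",
}


def apply_gates(failures: list) -> dict:
    """Split failures into blocking and warning lists and pick an exit code."""
    blocking = [f for f in failures if f["kind"] in HARD_KINDS]
    warnings = [f for f in failures if f["kind"] not in HARD_KINDS]
    return {
        "exit_code": 1 if blocking else 0,  # non-zero fails the CI job
        "blocking": blocking,
        "warnings": warnings,
    }


failures = [
    {"id": "summary_004", "kind": "minor_quality_change"},
    {"id": "refund_policy_017", "kind": "golden_regression"},
]
report = apply_gates(failures)
print(report["exit_code"], [f["id"] for f in report["blocking"]])
```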

Example GitHub Actions setup

A simple GitHub Actions workflow can run your eval suite on every pull request.

name: LLM evals

on:
  pull_request:
    paths:
      - "prompts/**"
      - "chains/**"
      - "src/ai/**"
      - "evals/**"

jobs:
  compare-llm-outputs:
    runs-on: ubuntu-latest

    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Set up Node
        uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Install dependencies
        run: npm ci

      - name: Run LLM output comparison
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: npm run eval:ci

Your eval:ci command should produce a machine-readable result, such as:

{
  "status": "failed",
  "summary": {
    "total": 120,
    "passed": 113,
    "failed": 7,
    "schema_validity": 0.992,
    "avg_score_before": 4.31,
    "avg_score_after": 4.08,
    "candidate_wins": 42,
    "baseline_wins": 19,
    "ties": 59
  },
  "blocking_failures": [
    {
      "id": "refund_policy_017",
      "reason": "Candidate answer incorrectly said annual renewals are not refundable."
    }
  ]
}

Post the summary as a PR comment so reviewers can inspect failures without opening CI logs.
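
As a sketch of the formatting step, the function below turns a result object shaped like the one above into a short text summary. It is written in Python for illustration even though the workflow above uses Node; posting the comment is left to whatever mechanism your CI already has.

```python
def format_pr_comment(result: dict) -> str:
    """Render the eval result as a short plain-text summary for reviewers."""
    s = result["summary"]
    lines = [
        f"LLM evals: {result['status'].upper()}",
        f"Passed {s['passed']} of {s['total']} examples ({s['failed']} failed)",
        f"Average score: {s['avg_score_before']:.2f} -> {s['avg_score_after']:.2f}",
        f"Pairwise: {s['candidate_wins']} candidate wins, "
        f"{s['baseline_wins']} baseline wins, {s['ties']} ties",
    ]
    if result.get("blocking_failures"):
        lines.append("Blocking failures:")
        lines += [f"- {f['id']}: {f['reason']}" for f in result["blocking_failures"]]
    return "\n".join(lines)


result = {
    "status": "failed",
    "summary": {"total": 120, "passed": 113, "failed": 7,
                "avg_score_before": 4.31, "avg_score_after": 4.08,
                "candidate_wins": 42, "baseline_wins": 19, "ties": 59},
    "blocking_failures": [
        {"id": "refund_policy_017",
         "reason": "Candidate answer incorrectly said annual renewals are not refundable."}
    ],
}
comment = format_pr_comment(result)
print(comment)
```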

Store artifacts for review

When a CI eval fails, the reviewer needs to see the input, baseline output, candidate output, score, judge reasoning, and trace. Store these as artifacts or send them to your LLM observability system.

For each failed example, capture:

  • Prompt version
  • Model name and version
  • Input and context
  • Baseline output
  • Candidate output
  • Judge result and judge prompt
  • Tool calls and retrieved documents
  • Token usage and latency
  • Error messages or parse failures

This is where LLM observability becomes part of the CI loop. You need enough detail to debug the failure, not just a red build.
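
One simple way to capture these fields is a small record written out as one JSON file per failed example, which CI can then upload as an artifact. The field names and the one-file-per-example layout are assumptions for illustration.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class FailedExample:
    example_id: str
    prompt_version: str
    model: str
    input: str
    baseline_output: str
    candidate_output: str
    judge_result: str
    tool_calls: list = field(default_factory=list)
    error: Optional[str] = None


def write_artifact(example: FailedExample, out_dir: Path) -> Path:
    """Write one failed example as pretty-printed JSON and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{example.example_id}.json"
    path.write_text(json.dumps(asdict(example), indent=2), encoding="utf-8")
    return path


example = FailedExample(
    example_id="refund_policy_017",
    prompt_version="support-v12",        # hypothetical version label
    model="example-model-2026-01",       # hypothetical model identifier
    input="Can I get a refund if my annual plan renewed yesterday?",
    baseline_output="Yes, you are refund eligible within 3 days of renewal.",
    candidate_output="Annual renewals are not refundable.",
    judge_result="baseline_better",
)
with tempfile.TemporaryDirectory() as tmp:
    saved = write_artifact(example, Path(tmp) / "eval-artifacts")
    loaded = json.loads(saved.read_text(encoding="utf-8"))
    print(saved.name, loaded["judge_result"])
```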

Handle flaky evals directly

Some evals will be unstable. Do not ignore that. Track flakiness as a property of the test.

Common causes include:

  • The expected answer is ambiguous
  • The judge rubric is vague
  • The model is sensitive to small wording changes
  • The retrieved context changes between runs
  • The test depends on current dates or external APIs
  • The task has multiple valid answers, but the comparison method expects one

Fix flaky evals by tightening the expected behavior, changing the comparison method, or tagging them as non-blocking until they are stable.

A practical rule: if an example fails on main more than 5% of the time, do not use it as a hard CI gate. Keep it in the report, but fix the test before it blocks merges.
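
That rule is easy to automate if you keep a pass/fail history per example from runs on main. In the sketch below, the 5% limit comes from the rule above, while the min_runs guard is an added assumption so that an example with too little history is not promoted to a hard gate.

```python
def hard_gate_eligible(main_results: list,
                       max_failure_rate: float = 0.05,
                       min_runs: int = 20) -> bool:
    """Decide whether an example is stable enough on main to block merges."""
    if len(main_results) < min_runs:
        return False  # not enough history to judge stability
    failure_rate = main_results.count(False) / len(main_results)
    return failure_rate <= max_failure_rate


stable = [True] * 19 + [False]        # 5% failures: still eligible
flaky = [True] * 18 + [False] * 2     # 10% failures: report only
print(hard_gate_eligible(stable), hard_gate_eligible(flaky))
```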

Keep cost and runtime under control

LLM CI can get expensive if every pull request runs hundreds of examples through multiple models and judges. Use tiers.

A common setup:

  • PR smoke eval: 20 to 50 examples, runs on every AI-related PR
  • Full regression eval: 200 to 1,000 examples, runs before release or nightly
  • Production replay eval: sampled real traces, runs before major prompt or model changes

You can also route tests by changed files. If a PR only changes the billing support prompt, run the billing support evals first. Run the full suite later or on demand.
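
Routing by changed files can be sketched as a mapping from path patterns to eval tags. The paths and tags below are hypothetical, and falling back to the full set when nothing matches is a design choice that errs on the side of running too much.

```python
from fnmatch import fnmatch

# Hypothetical mapping from changed paths to the eval tags they affect.
ROUTES = {
    "prompts/billing/*": {"billing"},
    "prompts/support/*": {"support", "refund"},
    "src/ai/retrieval/*": {"citation"},
}


def select_rows(changed_files: list, rows: list) -> list:
    """Pick eval rows whose tags match the files changed in the PR."""
    tags = set()
    for path in changed_files:
        for pattern, route_tags in ROUTES.items():
            if fnmatch(path, pattern):
                tags |= route_tags
    if not tags:
        return rows  # unknown change: run everything
    return [row for row in rows if tags & set(row["tags"])]


rows = [
    {"id": "refund_policy_017", "tags": ["support", "policy", "refund"]},
    {"id": "invoice_status_003", "tags": ["billing"]},
    {"id": "cite_sources_009", "tags": ["citation"]},
]
selected = select_rows(["prompts/billing/system.txt"], rows)
print([row["id"] for row in selected])
```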

Review failures like code changes

LLM output changes should be reviewed with the same care as backend changes. A good PR should include:

  • What prompt, chain, model, or retrieval logic changed
  • Which eval suite ran
  • Before and after scores
  • Blocking failures, if any
  • Accepted regressions, with a short reason
  • Examples where the new output is clearly better

If the PR accepts a regression, record it. Future reviewers need to know whether the team made a conscious tradeoff or missed the failure.

A practical CI comparison checklist

  • Use a fixed, versioned eval dataset
  • Tag examples by feature, risk, and behavior type
  • Compare against golden outputs, main branch, or production references
  • Use exact match only when exact match fits the task
  • Validate schemas before judging content quality
  • Use rubric or pairwise judging for open-ended answers
  • Pin model versions where possible
  • Snapshot retrieval context for RAG tests
  • Compare traces for agents and chains
  • Set separate hard and soft gates
  • Track flaky tests and keep them out of hard gates
  • Store failed examples, judge outputs, and traces for review
  • Run smaller evals on PRs and larger evals before releases

Final thoughts

Comparing LLM outputs in CI is less about finding identical text and more about protecting product behavior. The strongest setup combines deterministic checks, schema validation, trace checks, rubric scoring, and pairwise comparison. Start with a small eval set, make the results visible in PRs, and tighten the gates as your application matures.

If your team is already shipping prompts, agents, or LLM workflows, build CI around the behaviors users depend on. That gives you a safer path to change prompts, switch models, and improve your system without guessing whether a release made things worse.


PromptLayer helps AI teams manage prompts, run evaluations, compare outputs, and inspect traces across development and production. If you want a cleaner workflow for LLM CI and prompt review, create a PromptLayer account.
