Understanding Linearity of Variance in LLM Evaluations: A Guide for AI Engineers

How to Apply Linearity of Variance to Evals

LLM evals usually produce noisy measurements. A prompt can score 82% on one run and 77% on the next. A judge model can rate the same answer differently. A small test set can make a weak prompt look strong. If you ship prompt changes, agents, RAG workflows, or model upgrades, you need to understand how much uncertainty sits behind your eval score.

Linearity of variance is one tool for doing that well. It helps you estimate the uncertainty of aggregate eval metrics, compare prompts with more care, and design better quality gates for production releases.

The main catch: people often say “variance is linear” too loosely. Variances add cleanly only when the terms are uncorrelated, and a constant factor scales variance by its square, not linearly. In real eval datasets, test cases are often correlated, judge scores may drift, and similar examples can move together. If you ignore that, your confidence intervals will look tighter than they should.

The core idea

For a random variable X, variance measures how much X tends to move around its mean:

Var(X) = E[(X - E[X])²]

If your eval metric is an average across test cases, you are usually working with something like:

M = (X₁ + X₂ + ... + Xₙ) / n

Here, Xᵢ could be a binary pass or fail result, a judge score, a human rating, a latency value, or a cost measurement for test case i.

If all test case scores are independent, the variance of the average is:

Var(M) = (Var(X₁) + Var(X₂) + ... + Var(Xₙ)) / n²

If each test case has the same variance σ², this becomes:

Var(M) = σ² / n

This is why larger eval sets tend to give more stable aggregate scores. If you quadruple the number of independent test cases, the standard error is cut in half:

SE(M) = σ / √n

That formula is useful, but it depends on a key assumption: independence.
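As a quick check of the scaling above, here is a minimal Python sketch. The function name `standard_error` and the value σ = 0.4 are illustrative assumptions, not part of any particular eval framework.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent scores that share std dev sigma."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)


# Assumed per-case standard deviation of 0.4 (illustrative only).
se_100 = standard_error(0.4, 100)
se_400 = standard_error(0.4, 400)

# Quadrupling n halves the standard error, but only under independence.
print(f"n=100: SE={se_100:.3f}, n=400: SE={se_400:.3f}")
```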

The part teams often miss: covariance

Eval examples are rarely fully independent. If you test 20 slight variations of the same customer support refund question, those examples will likely move together. A prompt change that improves refund policy reasoning may improve all 20. A model regression that hurts policy extraction may hurt all 20.

When test cases move together, covariance matters.

For two variables:

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

For an average across many eval cases:

Var(M) = (1 / n²) × [Σ Var(Xᵢ) + 2Σ Cov(Xᵢ, Xⱼ)], where the second sum runs over all pairs i < j

The covariance terms can be large. If many examples are similar, your eval set may behave like a much smaller set than the row count suggests.
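The formula above is the sum of every entry in the covariance matrix, divided by n². The sketch below computes it directly. The helper `equicorrelated_cov`, which gives every pair of cases the same correlation, is a simplifying assumption for illustration; real eval sets will have messier covariance structure.

```python
def variance_of_mean(cov: list[list[float]]) -> float:
    """Var(M) for M = mean of X_1..X_n, given the full n x n covariance matrix.

    Summing all entries counts each Var(X_i) once and each Cov(X_i, X_j) twice,
    which matches the formula with the 2 * sum over pairs i < j.
    """
    n = len(cov)
    if n == 0 or any(len(row) != n for row in cov):
        raise ValueError("cov must be a non-empty square matrix")
    return sum(sum(row) for row in cov) / (n * n)


def equicorrelated_cov(n: int, sigma2: float, rho: float) -> list[list[float]]:
    """Hypothetical covariance matrix: common variance sigma2, common correlation rho."""
    return [[sigma2 if i == j else rho * sigma2 for j in range(n)] for i in range(n)]


# 20 cases with per-case variance 0.16 (a pass rate of 0.8 gives 0.8 * 0.2).
independent = variance_of_mean(equicorrelated_cov(20, 0.16, 0.0))
correlated = variance_of_mean(equicorrelated_cov(20, 0.16, 0.5))

print(f"independent: {independent:.4f}, correlated (rho=0.5): {correlated:.4f}")
```

With a correlation of 0.5 between every pair, the variance of the mean is roughly ten times larger than the independent case, even though the row count is the same.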

A simple example

Say you have 100 eval cases and a pass/fail metric. The average pass rate is 80%. If you treat every case as independent, the estimated variance of the mean is roughly:

p(1 - p) / n = 0.8 × 0.2 / 100 = 0.0016

The standard error is:

√0.0016 = 0.04

Multiplying by 1.96 gives a rough 95% interval of about 80% ± 8%.

But if those 100 cases are really 10 clusters of near-duplicate examples that pass or fail together, your effective sample size may be closer to 10 than 100. Then:

0.8 × 0.2 / 10 = 0.016

The standard error becomes:

√0.016 = 0.126

Now your rough 95% interval is closer to 80% ± 25%. That is a very different release decision.
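The arithmetic in this example can be reproduced with a few lines of Python. The function name `pass_rate_interval` and the parameter `n_eff` (the effective sample size you believe you have) are illustrative assumptions.

```python
import math


def pass_rate_interval(p: float, n_eff: float, z: float = 1.96) -> tuple[float, float]:
    """Return (standard error, rough margin of error) for a pass rate p.

    Uses the normal approximation, with n_eff as the effective number of
    independent cases.
    """
    if not 0.0 <= p <= 1.0 or n_eff <= 0:
        raise ValueError("need 0 <= p <= 1 and n_eff > 0")
    se = math.sqrt(p * (1 - p) / n_eff)
    return se, z * se


se_rows, margin_rows = pass_rate_interval(0.8, 100)        # treat all 100 rows as independent
se_clusters, margin_clusters = pass_rate_interval(0.8, 10)  # treat as 10 independent clusters

print(f"100 independent rows: SE={se_rows:.3f}, margin=±{margin_rows:.3f}")
print(f"10 effective cases:   SE={se_clusters:.3f}, margin=±{margin_clusters:.3f}")
```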

How this applies to LLM eval workflows

Most AI teams use evals in a few recurring places:

Prompt releases: Decide whether a new prompt version is safer, cheaper, faster, or more accurate than the current version.
Regression testing: Catch behavior changes before shipping code, prompt, retrieval, or model updates.
Agent testing: Measure task completion, tool-use correctness, failure recovery, and policy compliance.
RAG evaluation: Check retrieval quality, answer faithfulness, citation accuracy, and refusal behavior.
Production quality gates: Block releases when core metrics fall below a threshold or when uncertainty is too high.

Variance tells you whether your metric is stable enough to trust. Linearity of variance tells you how uncertainty combines when you average scores, combine subsets, or compare prompt versions.

Use case 1: estimating uncertainty for a mean eval score

Suppose your LLM judge returns scores from 1 to 5. You run 200 eval cases and get an average score of 4.2.

That number alone is not enough. You also need to know how spread out the scores are. A mean of 4.2 with most scores between 4 and 5 is different from a mean of 4.2 with many 1s and many 5s.

Use the sample variance:

s² = Σ(xᵢ - x̄)² / (n - 1)

Then estimate the standard error of the mean:

SE = s / √n

For example:

n = 200
mean = 4.2
sample standard deviation = 0.9
SE = 0.9 / √200 = 0.064

A rough 95% confidence interval is:

mean ± 1.96 × SE = 4.2 ± 0.13

So you would report this as approximately 4.07 to 4.33, assuming the cases are reasonably independent.

If your dataset has clusters, calculate uncertainty by cluster or use bootstrapping at the cluster level. Do not pretend 200 similar examples give you the same certainty as 200 independent examples.
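For the row-level calculation, a minimal sketch using only Python's standard library might look like this. The function name `mean_with_ci` and the eight sample scores are made up for illustration, and the interval assumes reasonably independent cases.

```python
import math
import statistics


def mean_with_ci(scores: list[float], z: float = 1.96) -> dict:
    """Mean, sample std dev (n - 1 denominator), standard error, and rough CI."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate variance")
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation, divides by n - 1
    se = sd / math.sqrt(n)
    return {"n": n, "mean": mean, "sd": sd, "se": se, "ci": (mean - z * se, mean + z * se)}


# Hypothetical judge scores on a 1-to-5 scale.
summary = mean_with_ci([5, 4, 4, 5, 3, 4, 5, 4])
print(summary)
```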

Use case 2: comparing two prompt versions

A common mistake is comparing prompts by mean score alone.

Example:

Prompt A average score: 4.18
Prompt B average score: 4.24

Prompt B looks better by 0.06. But is that real, or noise?

If both prompts run on the same test cases, compare them using paired differences:

Dᵢ = ScoreBᵢ - ScoreAᵢ

Then compute:

mean(D)

and:

SE(D) = sD / √n

This works better than comparing two independent means because the same test case often has shared difficulty across prompts. Pairing accounts for that.
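The reason pairing helps follows from the two-variable rule given earlier. Writing Aᵢ and Bᵢ for the two prompts' scores on case i (notation assumed here for brevity), the variance of each difference is:

```latex
\begin{align*}
\mathrm{Var}(D_i) &= \mathrm{Var}(B_i - A_i) \\
                  &= \mathrm{Var}(A_i) + \mathrm{Var}(B_i) - 2\,\mathrm{Cov}(A_i, B_i)
\end{align*}
```

When hard cases tend to be hard for both prompts, the covariance is positive, so the differences vary less than the two sets of scores would if treated as unrelated samples.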

Example

Say you run both prompts on 300 cases:

Mean difference: +0.06
Standard deviation of differences: 0.50
Standard error: 0.50 / √300 = 0.029

A rough 95% interval for the difference is:

0.06 ± 1.96 × 0.029 = 0.06 ± 0.057

The interval is approximately 0.003 to 0.117. Prompt B is likely better, but the margin is small. You may ship it if the change is low risk, but you would not want to claim a major improvement.

If the interval were -0.02 to 0.14, the result would be inconclusive. In that case, run more cases, segment by task type, or inspect failures before release.
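A paired comparison along these lines can be sketched as follows. The function name `paired_difference_ci` and the six pairs of scores are hypothetical, and the interval uses the same normal approximation as the worked example.

```python
import math
import statistics


def paired_difference_ci(scores_a: list[float], scores_b: list[float], z: float = 1.96) -> dict:
    """Rough CI for mean(B - A) when both prompts ran on the same cases, in the same order."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs the same cases for both prompts")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired cases")
    mean_d = statistics.fmean(diffs)
    se_d = statistics.stdev(diffs) / math.sqrt(n)
    low, high = mean_d - z * se_d, mean_d + z * se_d
    # If the interval contains zero, the data do not separate the two prompts.
    return {"mean_diff": mean_d, "se": se_d, "ci": (low, high), "inconclusive": low <= 0 <= high}


# Hypothetical scores for the same six cases under prompt A and prompt B.
result = paired_difference_ci([4, 3, 5, 4, 2, 4], [4, 4, 5, 5, 3, 4])
print(result)
```

With only six cases the normal approximation is crude; the sample is kept small here so the numbers are easy to check by hand.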

Use case 3: combining multiple eval metrics

Teams often create weighted scores like this:

Total = 0.5 × Correctness + 0.3 × Faithfulness + 0.2 × Style

The expected value of a weighted sum is straightforward:

E[Total] = 0.5E[Correctness] + 0.3E[Faithfulness] + 0.2E[Style]

The variance needs more care:

Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X, Y)

For three metrics, you also need covariance between each pair.

This matters because eval metrics are often correlated. Correctness and faithfulness may move together. Style and policy compliance may conflict in some refusal tasks. If you ignore covariance, your total score can look more stable than it is.

At minimum, track the variance of each metric and inspect correlations between metrics. If two judge scores are highly correlated, your combined score may not contain as much independent signal as you think.
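For more than two metrics, the variance of a weighted sum is the double sum of wᵢ × wⱼ × Cov(i, j) over all pairs, including i = j. The sketch below applies that to the weights above. The covariance values are invented for illustration; in practice you would estimate them from per-case scores.

```python
def weighted_sum_variance(weights: list[float], cov: list[list[float]]) -> float:
    """Var(sum_i w_i * X_i) = sum over i, j of w_i * w_j * Cov(X_i, X_j)."""
    k = len(weights)
    if len(cov) != k or any(len(row) != k for row in cov):
        raise ValueError("cov must be a k x k matrix matching the weights")
    return sum(weights[i] * weights[j] * cov[i][j] for i in range(k) for j in range(k))


weights = [0.5, 0.3, 0.2]  # Correctness, Faithfulness, Style

# Hypothetical covariance matrix: correctness and faithfulness move together.
cov_with = [
    [0.04, 0.03, 0.00],
    [0.03, 0.09, 0.00],
    [0.00, 0.00, 0.01],
]
# Same variances, but pretending the metrics are uncorrelated.
cov_without = [
    [0.04, 0.00, 0.00],
    [0.00, 0.09, 0.00],
    [0.00, 0.00, 0.01],
]

var_with = weighted_sum_variance(weights, cov_with)
var_without = weighted_sum_variance(weights, cov_without)
print(f"with covariance: {var_with:.4f}, ignoring covariance: {var_without:.4f}")
```

In this made-up case, ignoring the single positive covariance understates the variance of the total by about a third.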

Use case 4: setting production quality gates

A quality gate should account for uncertainty. A gate like this is fragile:

Ship if score > 0.85

If your eval score is 0.86 with a standard error of 0.04, you do not have strong evidence that the prompt clears the bar. A safer gate might be:

Ship if lower confidence bound > 0.85

For a rough lower bound, take the lower end of a two-sided 95% interval:

Lower bound = mean - 1.96 × SE

If:

mean = 0.90
SE = 0.015

Then:

lower bound = 0.90 - 1.96 × 0.015 = 0.871

This clears a threshold of 0.85.

If:

mean = 0.86
SE = 0.04

Then:

lower bound = 0.86 - 1.96 × 0.04 = 0.782

This should not clear the same gate, even though the mean is above 0.85.
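A gate of this kind is a one-line check. The function name `passes_gate` is an illustrative assumption, and the default z of 1.96 mirrors the rough bound used above.

```python
def passes_gate(mean: float, se: float, threshold: float, z: float = 1.96) -> bool:
    """Ship only if the rough lower confidence bound clears the threshold."""
    lower_bound = mean - z * se
    return lower_bound > threshold


# The two scenarios from the text, both against a threshold of 0.85.
tight = passes_gate(mean=0.90, se=0.015, threshold=0.85)
noisy = passes_gate(mean=0.86, se=0.04, threshold=0.85)
print(f"mean 0.90, SE 0.015: {tight}; mean 0.86, SE 0.04: {noisy}")
```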

Common mistakes when applying variance to evals

1. Saying variance is linear without assumptions

Variance is additive for independent variables. In general, covariance terms matter. If your eval cases are similar, generated from the same templates, or grouped by customer workflow, independence is a shaky assumption.

Use this rule:

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

If Cov(X, Y) = 0, the clean additive version applies. Otherwise, it does not.

2. Ignoring covariance between similar test cases

Near-duplicate eval cases make metrics look more precise than they are. If 50 examples test the same behavior, they may act like 5 independent examples.

Group evals by task family, intent, source, or failure mode. Then estimate uncertainty across groups, not only across rows.

3. Averaging judge scores without checking variance

An average judge score can hide unstable behavior. A prompt with scores [4, 4, 4, 4] is different from one with [1, 5, 5, 5], even though both average 4.0.

Track:

Mean score
Standard deviation
Pass rate by threshold
Worst-case slices
Disagreement between judges, if you use multiple judges
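A small summary helper makes the difference between those two score lists visible. The function name `summarize_scores` and the pass threshold of 4 are assumptions chosen for illustration.

```python
import statistics


def summarize_scores(scores: list[float], pass_threshold: float = 4) -> dict:
    """Mean, sample std dev, pass rate at a threshold, and the worst score."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    return {
        "mean": statistics.fmean(scores),
        "sd": statistics.stdev(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        "worst": min(scores),
    }


steady = summarize_scores([4, 4, 4, 4])
spiky = summarize_scores([1, 5, 5, 5])
print(steady)
print(spiky)
```

Both lists have the same mean, but the second has a standard deviation of 2.0, a lower pass rate, and a worst-case score of 1.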

4. Treating one eval run as definitive

LLM evals can vary because of sampling, model changes, judge instability, retrieved context, tool responses, and infrastructure differences. One run gives you one measurement. It does not prove the true quality level.

For high-risk releases, run repeated trials or fix randomness where possible. Store the exact prompt, model, parameters, dataset version, retrieved context, and judge configuration.

5. Comparing prompts by mean score alone

A prompt with a slightly higher average may have worse tail behavior. For example, it may improve easy cases while failing safety-critical or high-value cases.

Compare prompts by:

Mean difference on paired cases
Confidence interval for the difference
Regression count on important cases
Performance by slice, such as language, customer type, task type, or policy area
Latency and cost variance, not only average latency and cost

A practical workflow for applying this in your eval suite

Step 1: Store per-case results

Do not only store aggregate scores. Save one row per eval case with:

Dataset version
Prompt version
Model name and parameters
Input
Output
Reference answer, if available
Judge score or pass/fail result
Latency
Cost
Tags, such as task type, difficulty, customer segment, or risk level

You need per-case data to compute variance, paired differences, clusters, and regressions.
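One possible shape for such a row is sketched below as a Python dataclass. Every field name, the millisecond and USD units, and the example values are hypothetical; adapt them to whatever your eval store already uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCaseResult:
    """One row per eval case per run (hypothetical schema)."""

    case_id: str
    dataset_version: str
    prompt_version: str
    model_name: str
    model_params: dict
    input_text: str
    output_text: str
    score: float               # judge score, or 1.0 / 0.0 for pass / fail
    latency_ms: float          # assumed unit: milliseconds
    cost_usd: float            # assumed unit: US dollars
    reference_answer: Optional[str] = None
    tags: dict = field(default_factory=dict)  # e.g. task type, difficulty, risk level


row = EvalCaseResult(
    case_id="case-001",
    dataset_version="v3",
    prompt_version="prompt-12",
    model_name="example-model",
    model_params={"temperature": 0.0},
    input_text="Can I get a refund after 30 days?",
    output_text="Refunds are available within 30 days of purchase.",
    score=1.0,
    latency_ms=820.0,
    cost_usd=0.0021,
    tags={"task_type": "refund_policy", "risk": "high"},
)
print(row.case_id, row.score, row.tags["task_type"])
```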

Step 2: Compute mean, variance, and standard error

For each metric, compute:

n
mean
sample variance
standard deviation
standard error
confidence interval

For binary pass/fail metrics, you can start with:

Var(X) = p(1 - p)

and:

SE = √(p(1 - p) / n)

For small samples or pass rates near 0 or 1, use Wilson intervals or bootstrap intervals instead of relying only on the normal approximation.
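For reference, a Wilson score interval needs only a few lines. This is a sketch of the standard formula; the function name `wilson_interval` is an assumption, and z = 1.96 corresponds to a two-sided 95% interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 20 passes out of 20: the normal approximation gives a zero-width interval
# here, while the Wilson interval still reports meaningful uncertainty.
low, high = wilson_interval(20, 20)
print(f"Wilson interval for 20/20: ({low:.3f}, {high:.3f})")
```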

Step 3: Use paired comparisons for prompt releases

When comparing a candidate prompt against production, run both on the same eval cases. Compute per-case differences. This reduces noise and helps you see where the new prompt wins or regresses.

Track these values:

mean(new - old)
standard deviation of differences
standard error of differences
confidence interval for the difference
number of critical regressions

Step 4: Account for clusters

If your eval set contains groups of related cases, add a cluster label. Examples:

refund_policy
billing_dispute
medical_refusal
contract_summarization
tool_call_retry

Then compute metrics by cluster and inspect variation across clusters. For high-stakes gates, bootstrap clusters instead of individual rows. That means you resample groups, preserving the fact that related examples move together.
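A cluster-level bootstrap can be sketched with the standard library alone. This version uses a simple percentile interval. The function name `cluster_bootstrap_ci`, the resample count, the fixed seed, and the toy data are assumptions for illustration.

```python
import random
import statistics
from collections import defaultdict


def cluster_bootstrap_ci(
    scores: list[float],
    clusters: list[str],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean score, resampling whole clusters."""
    if len(scores) != len(clusters) or not scores:
        raise ValueError("scores and clusters must be non-empty and the same length")
    by_cluster: dict[str, list[float]] = defaultdict(list)
    for score, label in zip(scores, clusters):
        by_cluster[label].append(score)
    groups = list(by_cluster.values())

    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw clusters with replacement; every row in a drawn cluster comes along.
        sampled = rng.choices(groups, k=len(groups))
        means.append(statistics.fmean(s for group in sampled for s in group))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


# Toy data: 4 clusters of 5 near-duplicate cases; one whole cluster fails.
labels = ["refund_policy"] * 5 + ["billing_dispute"] * 5 + ["medical_refusal"] * 5 + ["tool_call_retry"] * 5
results = [1.0] * 5 + [1.0] * 5 + [0.0] * 5 + [1.0] * 5

low, high = cluster_bootstrap_ci(results, labels)
print(f"pass rate 0.75, cluster bootstrap interval: ({low:.2f}, {high:.2f})")
```

With only four clusters the interval is very wide, which is the point: 20 rows that behave like 4 independent observations do not support a tight estimate.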

Step 5: Report uncertainty in release reviews

A release note should not say:

New prompt improved score from 84% to 87%.

A better version is:

New prompt improved paired pass rate by 3.1 percentage points, with a 95% interval of 0.8 to 5.4 points. No regressions were found in the 42 high-risk policy cases. Latency increased by 6%.

This gives your team enough information to make a release decision.

How many eval cases do you need?

You can estimate sample size using the standard error formula.

For a binary metric:

SE = √(p(1 - p) / n)

If you want a rough margin of error m at 95% confidence:

n ≈ 1.96² × p(1 - p) / m²

If you do not know p, use 0.5. The product p(1 - p) is largest there, so it gives the most conservative estimate.

Examples:

For a margin of error of ±10%: n ≈ 97
For ±5%: n ≈ 385
For ±2%: n ≈ 2401

These numbers assume independent cases. If your eval set has strong clusters, you need more cases or better coverage across clusters.
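The sample-size figures above can be reproduced with a short helper. The function name `required_cases` is an assumption, and the result applies only to independent cases under the normal approximation.

```python
import math


def required_cases(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin, for independent cases."""
    if not 0 < margin < 1 or not 0 <= p <= 1:
        raise ValueError("need 0 < margin < 1 and 0 <= p <= 1")
    raw = z * z * p * (1 - p) / (margin * margin)
    # Round first so floating-point noise (e.g. 2401.0000000001) does not add a case.
    return math.ceil(round(raw, 6))


for margin in (0.10, 0.05, 0.02):
    print(f"±{margin:.0%}: n ≈ {required_cases(margin)}")
```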

What to do when evals are expensive

LLM evals can be costly, especially with judge models, multi-turn agents, and RAG traces. If you cannot run thousands of cases every time, split your evals into tiers:

Fast smoke evals: 20 to 50 cases that run on every pull request. These catch obvious prompt and code failures.
Release evals: 200 to 500 cases that run before shipping a prompt, model, retrieval, or agent change.
Full regression suites: 1,000 or more cases that run nightly or before major releases.
Production monitoring: Ongoing sampling from real traffic, with offline labels or judge-based scoring.

Use variance to decide where to spend eval budget. If one task slice has high variance, add more examples there. If another slice is stable and low risk, you may not need as many repeated checks.

A release checklist

Before you ship an LLM change, ask these questions:

Did we compare the new version against the current version on the same cases?
Did we calculate variance or standard error, rather than only reporting the mean?
Are any test cases near-duplicates that create hidden covariance?
Did we inspect high-risk slices separately?
Did the new version regress on any critical cases?
Is the lower confidence bound above our production threshold?
Did we store prompt, model, dataset, judge, and parameter versions?
Are latency and cost changes acceptable, including their variance?

Key takeaways

Variance helps you measure uncertainty in eval scores.
Variance adds cleanly only when variables are independent or covariance is zero.
Similar eval cases often have positive covariance, which makes your effective sample size smaller.
Prompt comparisons should use paired differences when both versions run on the same cases.
Quality gates should account for uncertainty, not only mean score.
Store per-case results so you can compute variance, regressions, slices, and confidence intervals.

Linearity of variance is not extra math for its own sake. It helps you avoid false confidence when shipping LLM systems. In production evals, the question is rarely “Which prompt has the highest average?” The better question is “Do we have enough evidence that this change improves quality without adding unacceptable risk?”

PromptLayer helps AI teams manage prompts, datasets, evals, traces, and release workflows in one place. You can track prompt versions, run evaluations, inspect regressions, and build quality gates before changes reach production. Create a PromptLayer account to start testing and shipping LLM applications with more confidence.
