How to Apply Linearity of Variance to Evals
LLM evals usually produce noisy measurements. A prompt can score 82% on one run and 77% on the next. A judge model can rate the same answer differently. A small test set can make a weak prompt look strong. If you ship prompt changes, agents, RAG workflows, or model upgrades, you need to understand how much uncertainty sits behind your eval score.
Linearity of variance is one tool for doing that well. It helps you estimate the uncertainty of aggregate eval metrics, compare prompts with more care, and design better quality gates for production releases.
The main catch: people often say “variance is linear” too loosely. Variance only adds cleanly under specific assumptions. In real eval datasets, test cases are often correlated, judge scores may drift, and similar examples can move together. If you ignore that, your confidence intervals will look tighter than they should.
The core idea
For a random variable X, variance measures how much X tends to move around its mean:
Var(X) = E[(X - E[X])²]
If your eval metric is an average across test cases, you are usually working with something like:
M = (X₁ + X₂ + ... + Xₙ) / n
Here, Xᵢ could be a binary pass or fail result, a judge score, a human rating, a latency value, or a cost measurement for test case i.
If all test case scores are independent, the variance of the average is:
Var(M) = (Var(X₁) + Var(X₂) + ... + Var(Xₙ)) / n²
If each test case has the same variance σ², this becomes:
Var(M) = σ² / n
This is why larger eval sets tend to give more stable aggregate scores. If you quadruple the number of independent test cases, the standard error is cut in half:
SE(M) = σ / √n
That formula is useful, but it depends on a key assumption: independence.
The part teams often miss: covariance
Eval examples are rarely fully independent. If you test 20 slight variations of the same customer support refund question, those examples will likely move together. A prompt change that improves refund policy reasoning may improve all 20. A model regression that hurts policy extraction may hurt all 20.
When test cases move together, covariance matters.
For two variables:
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
For an average across many eval cases:
Var(M) = (1 / n²) × [Σ Var(Xᵢ) + 2Σ Cov(Xᵢ, Xⱼ)]
Here the covariance sum runs over every distinct pair of cases with i < j.
The covariance terms can be large. If many examples are similar, your eval set may behave like a much smaller set than the row count suggests.
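To see how much the covariance terms can matter, here is a minimal sketch that evaluates the formula above directly. It assumes a toy model in which every case has the same variance and every pair of cases has the same correlation; the helper names `variance_of_mean` and `equicorrelated_cov` are illustrative, not part of any library.

```python
def variance_of_mean(cov):
    """Variance of the average of n scores, given their n x n covariance matrix.

    cov[i][j] is Cov(X_i, X_j) and the diagonal holds the variances, so the
    sum of all entries equals sum(Var) + 2 * sum of pairwise covariances.
    """
    n = len(cov)
    total = sum(cov[i][j] for i in range(n) for j in range(n))
    return total / n**2


def equicorrelated_cov(n, sigma2, rho):
    """Assumed toy model: equal variance sigma2, equal pairwise correlation rho."""
    return [[sigma2 if i == j else rho * sigma2 for j in range(n)] for i in range(n)]


# 20 pass/fail cases with p = 0.8, so each case has variance 0.8 * 0.2 = 0.16.
independent = variance_of_mean(equicorrelated_cov(20, 0.16, 0.0))
correlated = variance_of_mean(equicorrelated_cov(20, 0.16, 0.5))

print(f"{independent:.4f}")  # 0.0080, which is sigma^2 / n
print(f"{correlated:.4f}")   # 0.0840, roughly ten times larger
```

In this toy setup, a pairwise correlation of 0.5 makes 20 cases about as informative as 2 independent ones. Real eval sets will not be this uniform, but the direction of the effect is the same.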
A simple example
Say you have 100 eval cases and a pass/fail metric. The average pass rate is 80%. If you treat every case as independent, the estimated variance of the mean is roughly:
p(1 - p) / n = 0.8 × 0.2 / 100 = 0.0016
The standard error is:
√0.0016 = 0.04
A rough 95% interval spans 1.96 × 0.04 ≈ 0.08 on either side, so your pass rate is about 80% ± 8%.
But if those 100 cases are really 10 clusters of near-duplicate examples, your effective sample size may be closer to 10 than 100. Then:
0.8 × 0.2 / 10 = 0.016
The standard error becomes:
√0.016 = 0.126
Now your rough 95% interval is closer to 80% ± 25%, since 1.96 × 0.126 ≈ 0.25. That is a very different release decision.
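The arithmetic above can be reproduced with a short sketch. The effective sample size here uses the standard design-effect approximation for equal-sized clusters, 1 + (m - 1)ρ, which is an assumption added for illustration; the example above corresponds to clusters of 10 whose members are perfectly correlated (ρ = 1).

```python
import math


def effective_sample_size(n, cluster_size, rho):
    """Approximate effective n for equal-sized clusters (assumed model).

    rho is the correlation between cases inside the same cluster.
    """
    design_effect = 1 + (cluster_size - 1) * rho
    return n / design_effect


def pass_rate_se(p, n_effective):
    """Standard error of a pass rate under the binomial approximation."""
    return math.sqrt(p * (1 - p) / n_effective)


n_independent = effective_sample_size(100, cluster_size=10, rho=0.0)
n_duplicated = effective_sample_size(100, cluster_size=10, rho=1.0)

for n_eff in (n_independent, n_duplicated):
    se = pass_rate_se(0.8, n_eff)
    print(f"n_eff={n_eff:.0f}  SE={se:.3f}  95% margin={1.96 * se:.3f}")
```

Most real datasets sit somewhere between these two extremes, so treat the two results as bounds rather than as an estimate.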
How this applies to LLM eval workflows
Most AI teams use evals in a few recurring places:
- Prompt releases: Decide whether a new prompt version is safer, cheaper, faster, or more accurate than the current version.
- Regression testing: Catch behavior changes before shipping code, prompt, retrieval, or model updates.
- Agent testing: Measure task completion, tool-use correctness, failure recovery, and policy compliance.
- RAG evaluation: Check retrieval quality, answer faithfulness, citation accuracy, and refusal behavior.
- Production quality gates: Block releases when core metrics fall below a threshold or when uncertainty is too high.
Variance tells you whether your metric is stable enough to trust. Linearity of variance tells you how uncertainty combines when you average scores, combine subsets, or compare prompt versions.
Use case 1: estimating uncertainty for a mean eval score
Suppose your LLM judge returns scores from 1 to 5. You run 200 eval cases and get an average score of 4.2.
That number alone is not enough. You also need to know how spread out the scores are. A mean of 4.2 with most scores between 4 and 5 is different from a mean of 4.2 with many 1s and many 5s.
Use the sample variance:
s² = Σ(xᵢ - x̄)² / (n - 1)
Then estimate the standard error of the mean:
SE = s / √n
For example:
n = 200
mean = 4.2
sample standard deviation = 0.9
SE = 0.9 / √200 ≈ 0.064
A rough 95% confidence interval is:
mean ± 1.96 × SE = 4.2 ± 0.13
So you would report this as approximately 4.07 to 4.33, assuming the cases are reasonably independent.
If your dataset has clusters, calculate uncertainty by cluster or use bootstrapping at the cluster level. Do not pretend 200 similar examples give you the same certainty as 200 independent examples.
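The mean, standard error, and rough 95% interval described above can be computed from raw scores in a few lines. This is a minimal sketch that assumes independent cases and a normal approximation; `mean_confidence_interval` is an illustrative name, and the scores are made up.

```python
import math
import statistics


def mean_confidence_interval(scores, z=1.96):
    """Return (mean, standard error, (low, high)) for a list of scores.

    Uses the sample standard deviation (n - 1 denominator) and assumes
    the cases are independent.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate variance")
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)


# Hypothetical judge scores on a 1 to 5 scale.
scores = [4, 5, 4, 3, 5, 4, 4, 5]
mean, se, (low, high) = mean_confidence_interval(scores)
print(f"mean={mean:.2f}  SE={se:.2f}  95% CI=({low:.2f}, {high:.2f})")
```

With only eight cases the interval is wide, which is the point: the same mean from 200 cases would be far more trustworthy.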
Use case 2: comparing two prompt versions
A common mistake is comparing prompts by mean score alone.
Example:
- Prompt A average score: 4.18
- Prompt B average score: 4.24
Prompt B looks better by 0.06. But is that real, or noise?
If both prompts run on the same test cases, compare them using paired differences:
Dᵢ = ScoreBᵢ - ScoreAᵢ
Then compute:
mean(D)
and:
SE(D) = sD / √n
This works better than comparing two independent means because the same test case often has shared difficulty across prompts. Pairing accounts for that.
Example
Say you run both prompts on 300 cases:
- Mean difference: +0.06
- Standard deviation of differences: 0.50
- Standard error: 0.50 / √300 ≈ 0.029
A rough 95% interval for the difference is:
0.06 ± 1.96 × 0.029 = 0.06 ± 0.057
The interval is approximately 0.003 to 0.117. Prompt B is likely better, but the margin is small. You may ship it if the change is low risk, but you would not want to claim a major improvement.
If the interval were -0.02 to 0.14, the result would be inconclusive. In that case, run more cases, segment by task type, or inspect failures before release.
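A paired comparison like the one above can be sketched as follows. The function name and the scores are hypothetical; the sketch assumes both prompts were scored on the same cases in the same order, and that the cases themselves are reasonably independent.

```python
import math
import statistics


def paired_difference_interval(scores_a, scores_b, z=1.96):
    """Return (mean difference, standard error, (low, high)) for B minus A."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both prompts must be scored on the same cases")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, se, (mean_diff - z * se, mean_diff + z * se)


# Hypothetical judge scores for the same six cases under each prompt.
prompt_a = [4, 3, 5, 4, 2, 4]
prompt_b = [5, 3, 5, 4, 3, 5]

mean_diff, se, (low, high) = paired_difference_interval(prompt_a, prompt_b)
verdict = "inconclusive" if low <= 0 <= high else "difference is likely real"
print(f"diff={mean_diff:+.2f}  SE={se:.3f}  CI=({low:.2f}, {high:.2f})  {verdict}")
```

If the interval includes zero, treat the comparison as inconclusive rather than as evidence that the prompts are equivalent.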
Use case 3: combining multiple eval metrics
Teams often create weighted scores like this:
Total = 0.5 × Correctness + 0.3 × Faithfulness + 0.2 × Style
The expected value of a weighted sum is straightforward:
E[Total] = 0.5E[Correctness] + 0.3E[Faithfulness] + 0.2E[Style]
The variance needs more care:
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X, Y)
For three metrics, you also need covariance between each pair.
This matters because eval metrics are often correlated. Correctness and faithfulness may move together. Style and policy compliance may conflict in some refusal tasks. If you ignore covariance, your total score can look more stable than it is.
At minimum, track the variance of each metric and inspect correlations between metrics. If two judge scores are highly correlated, your combined score may not contain as much independent signal as you think.
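For three or more metrics, the weighted-sum variance is easiest to compute from a covariance matrix. The sketch below uses made-up per-case scores and illustrative helper names; it computes the per-case variance of Total both with and without the covariance terms so you can see how much they contribute.

```python
def sample_covariance(xs, ys):
    """Sample covariance with an n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def covariance_matrix(columns):
    """Covariance matrix for a list of equal-length score columns."""
    return [[sample_covariance(a, b) for b in columns] for a in columns]


def weighted_sum_variance(weights, cov):
    """Var(sum of w_i * X_i) = sum over all i, j of w_i * w_j * Cov(X_i, X_j)."""
    k = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j] for i in range(k) for j in range(k))


# Hypothetical per-case scores for each metric.
correctness = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
faithfulness = [1.0, 0.0, 1.0, 0.5, 0.0, 1.0]
style = [0.5, 1.0, 0.5, 1.0, 1.0, 0.5]
weights = [0.5, 0.3, 0.2]

cov = covariance_matrix([correctness, faithfulness, style])
with_covariance = weighted_sum_variance(weights, cov)
variances_only = sum(w * w * cov[i][i] for i, w in enumerate(weights))

print(f"with covariance terms: {with_covariance:.4f}")
print(f"variances only:        {variances_only:.4f}")
```

This gives the variance of Total for a single case. To get the standard error of the mean Total across n independent cases, divide by n and take the square root.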
Use case 4: setting production quality gates
A quality gate should account for uncertainty. A gate like this is fragile:
Ship if score > 0.85
If your eval score is 0.86 with a standard error of 0.04, you do not have strong evidence that the prompt clears the bar. A safer gate might be:
Ship if lower confidence bound > 0.85
For a conservative lower bound, use the lower end of a rough two-sided 95% interval:
Lower bound = mean - 1.96 × SE
If:
mean = 0.90
SE = 0.015
Then:
lower bound = 0.90 - 1.96 × 0.015 = 0.871
This clears a threshold of 0.85.
If:
mean = 0.86
SE = 0.04
Then:
lower bound = 0.86 - 1.96 × 0.04 = 0.782
This should not clear the same gate, even though the mean is above 0.85.
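A gate like this is simple to encode. The sketch below is one possible shape, with a hypothetical function name; it reuses the two examples above and assumes the standard error was computed appropriately for your dataset.

```python
def passes_gate(mean, se, threshold, z=1.96):
    """Return (decision, lower bound) where decision is True only if
    the lower bound mean - z * se is strictly above the threshold."""
    lower_bound = mean - z * se
    return lower_bound > threshold, lower_bound


for mean, se in [(0.90, 0.015), (0.86, 0.04)]:
    ok, lower = passes_gate(mean, se, threshold=0.85)
    print(f"mean={mean:.2f}  SE={se:.3f}  lower={lower:.3f}  ship={ok}")
```

The multiplier `z` is a policy choice. A larger value makes the gate stricter, and a smaller one trades safety for fewer blocked releases.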
Common mistakes when applying variance to evals
1. Saying variance is linear without assumptions
Variance is additive for independent variables. In general, covariance terms matter. If your eval cases are similar, generated from the same templates, or grouped by customer workflow, independence is a weak assumption.
Use this rule:
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
If Cov(X, Y) = 0, the clean additive version applies. Otherwise, it does not.
2. Ignoring covariance between similar test cases
Near-duplicate eval cases make metrics look more precise than they are. If 50 examples test the same behavior, they may act like 5 independent examples.
Group evals by task family, intent, source, or failure mode. Then estimate uncertainty across groups, not only across rows.
3. Averaging judge scores without checking variance
An average judge score can hide unstable behavior. A prompt with scores [4, 4, 4, 4] is different from one with [1, 5, 5, 5], even though both average 4.0.
Track:
- Mean score
- Standard deviation
- Pass rate by threshold
- Worst-case slices
- Disagreement between judges, if you use multiple judges
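As a small illustration of the first few items in that list, the sketch below summarizes the two score sets from the example. The `summarize` helper and the pass threshold of 4 are assumptions chosen for the example.

```python
import statistics


def summarize(scores, pass_threshold=4):
    """Mean, spread, pass rate at a threshold, and worst score for one run."""
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        "worst": min(scores),
    }


steady = summarize([4, 4, 4, 4])
erratic = summarize([1, 5, 5, 5])
print(steady)
print(erratic)
```

Both runs share the same mean, but the second has a standard deviation of 2.0, a lower pass rate, and a worst-case score of 1.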
4. Treating one eval run as definitive
LLM evals can vary because of sampling, model changes, judge instability, retrieved context, tool responses, and infrastructure differences. One run gives you one measurement. It does not prove the true quality level.
For high-risk releases, run repeated trials or fix randomness where possible. Store the exact prompt, model, parameters, dataset version, retrieved context, and judge configuration.
5. Comparing prompts by mean score alone
A prompt with a slightly higher average may have worse tail behavior. For example, it may improve easy cases while failing safety-critical or high-value cases.
Compare prompts by:
- Mean difference on paired cases
- Confidence interval for the difference
- Regression count on important cases
- Performance by slice, such as language, customer type, task type, or policy area
- Latency and cost variance, not only average latency and cost
A practical workflow for applying this in your eval suite
Step 1: Store per-case results
Do not only store aggregate scores. Save one row per eval case with:
- Dataset version
- Prompt version
- Model name and parameters
- Input
- Output
- Reference answer, if available
- Judge score or pass/fail result
- Latency
- Cost
- Tags, such as task type, difficulty, customer segment, or risk level
You need per-case data to compute variance, paired differences, clusters, and regressions.
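One way to represent such a row is a small dataclass. The field names below are hypothetical and simply mirror the list above; adapt them to whatever storage format you already use.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCaseResult:
    """One row per eval case (illustrative schema, not a standard)."""

    case_id: str
    dataset_version: str
    prompt_version: str
    model: str
    model_params: dict
    input_text: str
    output_text: str
    score: float  # judge score, or 1.0 / 0.0 for pass / fail
    latency_ms: float
    cost_usd: float
    reference_answer: Optional[str] = None
    tags: dict = field(default_factory=dict)


# Example row with made-up values.
row = EvalCaseResult(
    case_id="case-001",
    dataset_version="v3",
    prompt_version="candidate-12",
    model="example-model",
    model_params={"temperature": 0.0},
    input_text="Can I get a refund after 30 days?",
    output_text="Refunds are available within 30 days of purchase.",
    score=1.0,
    latency_ms=840.0,
    cost_usd=0.0021,
    tags={"task_type": "refund_policy", "risk": "high"},
)
print(row.tags["task_type"], row.score)
```

Keeping a stable `case_id` is what makes paired comparisons and cluster labels possible later.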
Step 2: Compute mean, variance, and standard error
For each metric, compute:
- n
- mean
- sample variance
- standard deviation
- standard error
- confidence interval
For binary pass/fail metrics, you can start with:
Var(X) = p(1 - p)
and:
SE = √(p(1 - p) / n)
For small samples or pass rates near 0 or 1, use Wilson intervals or bootstrap intervals instead of relying only on the normal approximation.
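For reference, here is a minimal sketch of the Wilson score interval for a pass rate. The function name is illustrative, and like the normal approximation it still assumes independent cases.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 8 passes out of 10 cases: the normal approximation would give 0.8 +/- 0.25.
low, high = wilson_interval(8, 10)
print(f"Wilson 95% interval: ({low:.3f}, {high:.3f})")
```

Unlike the normal approximation, this interval stays inside the range 0 to 1 and is not symmetric around the observed pass rate, which tends to behave better when n is small.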
Step 3: Use paired comparisons for prompt releases
When comparing a candidate prompt against production, run both on the same eval cases. Compute per-case differences. This reduces noise and helps you see where the new prompt wins or regresses.
Track these values:
- mean(new - old)
- standard deviation of differences
- standard error of differences
- confidence interval for the difference
- number of critical regressions
Step 4: Account for clusters
If your eval set contains groups of related cases, add a cluster label. Examples:
- refund_policy
- billing_dispute
- medical_refusal
- contract_summarization
- tool_call_retry
Then compute metrics by cluster and inspect variation across clusters. For high-stakes gates, bootstrap clusters instead of individual rows. That means you resample groups, preserving the fact that related examples move together.
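A cluster-level bootstrap can be sketched as follows. This is one simple variant (a percentile bootstrap over whole clusters) using made-up pass/fail results; the function name, the number of resamples, and the fixed seed are assumptions for the example.

```python
import random
import statistics


def cluster_bootstrap_interval(results, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean score, resampling clusters.

    results is a list of (cluster_label, score) pairs. Whole clusters are
    drawn with replacement, so related cases always stay together.
    """
    by_cluster = {}
    for label, score in results:
        by_cluster.setdefault(label, []).append(score)
    clusters = list(by_cluster.values())

    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sampled = rng.choices(clusters, k=len(clusters))
        means.append(statistics.fmean(s for cluster in sampled for s in cluster))

    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical pass/fail results: 4 clusters of 5 cases each, overall pass rate 0.70.
results = (
    [("refund_policy", s) for s in [1, 1, 1, 1, 1]]
    + [("billing_dispute", s) for s in [1, 1, 1, 0, 1]]
    + [("medical_refusal", s) for s in [0, 0, 1, 0, 0]]
    + [("tool_call_retry", s) for s in [1, 1, 1, 1, 0]]
)
low, high = cluster_bootstrap_interval(results)
print(f"cluster bootstrap 95% interval: ({low:.2f}, {high:.2f})")
```

With only four clusters the interval is very wide, which reflects how little independent evidence 20 related rows contain. Bootstrap intervals are themselves unreliable with very few clusters, so treat this as a warning sign rather than a precise estimate.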
Step 5: Report uncertainty in release reviews
A release note should not say:
New prompt improved score from 84% to 87%.
A better version is:
New prompt improved paired pass rate by 3.1 percentage points, with a 95% interval of 0.8 to 5.4 points. No regressions were found in the 42 high-risk policy cases. Latency increased by 6%.
This gives your team enough information to make a release decision.
How many eval cases do you need?
You can estimate sample size using the standard error formula.
For a binary metric:
SE = √(p(1 - p) / n)
If you want a rough margin of error m at 95% confidence:
n ≈ 1.96² × p(1 - p) / m²
If you do not know p, use 0.5 because it gives the most conservative estimate.
Examples:
- For a margin of error of ±10%: n ≈ 97
- For ±5%: n ≈ 385
- For ±2%: n ≈ 2401
These numbers assume independent cases. If your eval set has strong clusters, you need more cases or better coverage across clusters.
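The sample size formula translates directly into a small helper. This sketch assumes independent cases and the normal approximation; `required_sample_size` is an illustrative name.

```python
import math


def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin.

    p = 0.5 is the most conservative choice when the pass rate is unknown.
    """
    n = z**2 * p * (1 - p) / margin**2
    # Round first so floating-point noise cannot push an exact integer upward.
    return math.ceil(round(n, 9))


for margin in (0.10, 0.05, 0.02):
    print(f"margin=±{margin:.0%}  n={required_sample_size(margin)}")
```

If you already have a rough estimate of the pass rate, passing it as `p` gives a smaller requirement. For example, a pass rate near 0.9 needs far fewer cases than the worst-case 0.5 for the same margin.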
What to do when evals are expensive
LLM evals can be costly, especially with judge models, multi-turn agents, and RAG traces. If you cannot run thousands of cases every time, split your evals into tiers:
- Fast smoke evals: 20 to 50 cases that run on every pull request. These catch obvious prompt and code failures.
- Release evals: 200 to 500 cases that run before shipping a prompt, model, retrieval, or agent change.
- Full regression suites: 1,000 or more cases that run nightly or before major releases.
- Production monitoring: Ongoing sampling from real traffic, with offline labels or judge-based scoring.
Use variance to decide where to spend eval budget. If one task slice has high variance, add more examples there. If another slice is stable and low risk, you may not need as many repeated checks.
A release checklist
Before you ship an LLM change, ask these questions:
- Did we compare the new version against the current version on the same cases?
- Did we calculate variance or standard error, rather than only reporting the mean?
- Are any test cases near-duplicates that create hidden covariance?
- Did we inspect high-risk slices separately?
- Did the new version regress on any critical cases?
- Is the lower confidence bound above our production threshold?
- Did we store prompt, model, dataset, judge, and parameter versions?
- Are latency and cost changes acceptable, including their variance?
Key takeaways
- Variance helps you measure uncertainty in eval scores.
- Variance adds cleanly only when variables are independent or covariance is zero.
- Similar eval cases often have positive covariance, which makes your effective sample size smaller.
- Prompt comparisons should use paired differences when both versions run on the same cases.
- Quality gates should account for uncertainty, not only mean score.
- Store per-case results so you can compute variance, regressions, slices, and confidence intervals.
Linearity of variance is not extra math for its own sake. It helps you avoid false confidence when shipping LLM systems. In production evals, the question is rarely “Which prompt has the highest average?” The better question is “Do we have enough evidence that this change improves quality without adding unacceptable risk?”