How to Use Total Variance in LLM Evals
Total variance helps you understand how much your eval scores move across test cases, repeated runs, model calls, and judge decisions. For LLM apps, this matters because a prompt can look better after one eval run and fail after the next run with the same dataset.
If you ship agents, prompt chains, RAG flows, structured extraction, or support automation, you need more than a single average score. You need to know whether that score is stable enough to trust.
In practical terms, total variance answers this question:
“How much of my eval result is signal, and how much is noise?”
What total variance means in LLM evals
Assume you have an eval dataset with test cases. You run each test case multiple times because LLM output can vary. Each output receives a score from a rule-based check, a human, or an LLM judge.
Use this lightweight notation:
- i = test case number
- j = repeated run number
- score(i, j) = score for one test case on one run
Total observed variance is the variance across all scores in the eval run. It includes several sources:
- Between-test-case variance: some examples are easier than others.
- Within-test-case variance: the same prompt and input produce different results across repeated runs.
- Judge variance: an LLM judge may score the same output differently across runs.
- System variance: retrieval results, tools, context windows, timeouts, or agent paths may change.
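A useful simplified formula is:
Total observed variance ≈ between-test-case variance + within-test-case variance + judge variance + system variance
Treat this as a mental model rather than an exact identity. In a plain repeated-run eval, judge and system noise show up inside the within-test-case term. They become separate terms only when you measure them on their own, for example by re-scoring the same output several times.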
You do not need a complex statistics setup to start. You need repeated runs, clean dataset separation, and per-test-case reporting.
A small eval example
Imagine you are testing a support chatbot prompt. The bot must answer billing questions using retrieved policy documents. The eval score is binary:
- 1 = correct answer with the right policy reference
- 0 = incorrect, missing policy reference, or unsupported claim
You test Prompt A on five cases and run each case five times.
Eval run table
| Test case | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Mean score | Variance |
|---|---|---|---|---|---|---|---|
| Refund policy for annual plan | 1 | 1 | 1 | 1 | 1 | 1.00 | 0.00 |
| Cancel trial before renewal | 1 | 1 | 0 | 1 | 1 | 0.80 | 0.16 |
| Invoice address change | 1 | 0 | 1 | 0 | 1 | 0.60 | 0.24 |
| Refund after account deletion | 0 | 0 | 1 | 0 | 0 | 0.20 | 0.16 |
| Tax exemption document request | 1 | 1 | 1 | 0 | 1 | 0.80 | 0.16 |
The overall average score is 0.68. The average per-test-case variance is 0.144. The Variance column uses population variance, which divides by the number of runs (5 here).
The average score tells you the prompt is not strong enough. The variance tells you something more specific: several cases are unstable. The “Invoice address change” case has a mean of 0.60 and variance of 0.24, which means the model flips between correct and incorrect behavior for the same input.
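The table values can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the dictionary layout and variable names are illustrative choices, not part of any particular eval tool.

```python
from statistics import mean, pvariance

# Scores copied from the eval run table: one list of five runs per test case.
scores = {
    "Refund policy for annual plan": [1, 1, 1, 1, 1],
    "Cancel trial before renewal": [1, 1, 0, 1, 1],
    "Invoice address change": [1, 0, 1, 0, 1],
    "Refund after account deletion": [0, 0, 1, 0, 0],
    "Tax exemption document request": [1, 1, 1, 0, 1],
}

# Population variance (divide by the number of runs) matches the Variance column.
per_case = {
    name: {"mean": mean(runs), "variance": pvariance(runs)}
    for name, runs in scores.items()
}

overall_mean = mean(s for runs in scores.values() for s in runs)
avg_within_variance = mean(stats["variance"] for stats in per_case.values())

for name, stats in per_case.items():
    print(f"{name}: mean={stats['mean']:.2f} variance={stats['variance']:.2f}")
print(f"overall mean={overall_mean:.2f}, avg within-case variance={avg_within_variance:.3f}")
```

For a binary score, a case's variance is p × (1 − p), where p is its pass rate, so 0.25 is the ceiling. A value of 0.24 means the case is close to a coin flip.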
How to calculate total variance without overcomplicating it
Start with two numbers:
- Overall mean score: average of all scores across all cases and repeated runs.
- Total observed variance: variance of all scores across all cases and repeated runs.
Then split variance into two practical views:
- Per-test-case variance: tells you which examples are unstable.
- Dataset-level variance: tells you how noisy the full eval result is.
Variance calculation table
| Metric | Example value | How to read it |
|---|---|---|
| Overall mean | 0.68 | The prompt passes 68% of scored attempts. |
| Total observed variance | 0.218 | Scores vary heavily across cases and repeated runs. |
| Average within-case variance | 0.144 | The same input often gets different outcomes. |
| Highest case variance | 0.24 | At least one test case is especially unstable. |
For many production evals, per-test-case variance is more useful than a single total variance number. It points you to the failing slice: a policy edge case, a retrieval miss, a tool call branch, or a vague instruction.
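The numbers in the variance calculation table can be checked with the sketch below, which also splits the total into a between-case part and a within-case part. It assumes the five-case, five-run example from earlier, with every case run the same number of times. In that setting, population variance splits exactly into the two parts. Judge and system noise are not separated here; they sit inside the within-case term.

```python
from statistics import mean, pvariance

# Binary scores from the eval run table: five cases, five runs each.
runs_by_case = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
]

# Total observed variance: pool every score from every case and run.
all_scores = [s for runs in runs_by_case for s in runs]
total_variance = pvariance(all_scores)

# Between-case variance: how much the per-case means differ from each other.
case_means = [mean(runs) for runs in runs_by_case]
between_case_variance = pvariance(case_means)

# Within-case variance: average spread across repeated runs of the same case.
avg_within_variance = mean(pvariance(runs) for runs in runs_by_case)

print(f"total={total_variance:.4f}")
print(f"between={between_case_variance:.4f} within={avg_within_variance:.4f}")
print(f"between + within={between_case_variance + avg_within_variance:.4f}")
```

If the between-case share dominates, the dataset mixes easy and hard cases. If the within-case share dominates, the same input is producing different outcomes, which is the part repeated runs are meant to expose.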
Use repeated runs before comparing prompts
A common mistake is running Prompt A once, running Prompt B once, then picking the higher average. That is risky because one run can be lucky.
Here is a better comparison:
| Prompt | Mean score | Total observed variance | Average within-case variance | Decision |
|---|---|---|---|---|
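| Prompt A | 0.72 | 0.202 | 0.151 | Higher score, unstable |
| Prompt B | 0.70 | 0.210 | 0.064 | Slightly lower score, more stable |
If this prompt handles billing answers in production, Prompt B may be the safer choice. A two-point score gap is small, and Prompt B's within-case variance is less than half of Prompt A's, which can matter more if users need consistent answers. With binary scores, total observed variance is fixed by the mean (p × (1 − p)), so the two totals are nearly identical. The within-case column carries the stability signal.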
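One way to run this kind of comparison is to score both prompts on the same frozen cases and look at the per-case differences. The sketch below uses made-up binary scores for two hypothetical prompts (four cases, five runs each). The standard error is a rough guide to how much the gap could move, not a formal significance test.

```python
from math import sqrt
from statistics import mean, pvariance, stdev

# Hypothetical binary scores: same four frozen cases, five runs each, two prompts.
prompt_a = [[1, 1, 1, 1, 1], [1, 0, 1, 0, 1], [1, 1, 0, 1, 0], [1, 1, 1, 1, 0]]
prompt_b = [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [0, 0, 0, 0, 0]]


def summarize(runs_by_case):
    """Return the overall mean and the average within-case population variance."""
    all_scores = [s for runs in runs_by_case for s in runs]
    return {
        "mean": mean(all_scores),
        "avg_within_variance": mean(pvariance(runs) for runs in runs_by_case),
    }


summary_a = summarize(prompt_a)
summary_b = summarize(prompt_b)

# Paired comparison: difference in per-case mean score on identical cases.
case_diffs = [mean(a) - mean(b) for a, b in zip(prompt_a, prompt_b)]
mean_diff = mean(case_diffs)
# Rough standard error of the mean difference across cases.
std_error = stdev(case_diffs) / sqrt(len(case_diffs))

print(f"A: {summary_a}")
print(f"B: {summary_b}")
print(f"mean difference (A - B) = {mean_diff:.3f}, standard error = {std_error:.3f}")
```

In this made-up data the gap between the prompts is much smaller than its standard error, so the higher average alone does not justify picking Prompt A. Prompt B's low variance also includes one case that fails on every run. That is a stable failure, which is usually easier to debug than a flaky one.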
For a broader eval setup, define your test cases, scoring methods, and pass thresholds as part of a repeatable LLM evaluation process.
Separate model quality from variance
Do not read total variance as model quality. High variance can come from several places:
- The prompt is underspecified.
- The retrieval layer returns different documents.
- The model samples different valid answers.
- The judge has a vague rubric.
- The dataset mixes easy, hard, and unrelated cases.
- The agent takes different tool paths across runs.
A strong model can still produce high variance if your task has ambiguous inputs or nondeterministic retrieval. A weaker model can show low variance if it consistently gives the same wrong answer.
Track quality and stability separately:
- Quality: mean score, pass rate, task success rate, human preference rate.
- Stability: variance, standard deviation, disagreement rate, repeated-run failure rate.
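The two groups of metrics can be computed side by side. This sketch reuses the binary scores from the eval run table. The definitions of disagreement rate and repeated-run failure rate are assumptions made for illustration, since teams define them differently.

```python
from statistics import mean, pstdev, pvariance

# Binary scores from the eval run table: five cases, five runs each.
runs_by_case = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
]
all_scores = [s for runs in runs_by_case for s in runs]

# Quality: how often the prompt succeeds.
pass_rate = mean(all_scores)

# Stability: how much the scores move.
total_variance = pvariance(all_scores)
std_dev = pstdev(all_scores)
# Assumed definition: share of cases whose repeated runs do not all agree.
disagreement_rate = mean(1 if len(set(runs)) > 1 else 0 for runs in runs_by_case)
# Assumed definition: share of cases that fail on at least one repeated run.
repeated_run_failure_rate = mean(1 if 0 in runs else 0 for runs in runs_by_case)

print(f"pass rate={pass_rate:.2f}")
print(f"variance={total_variance:.4f} std dev={std_dev:.4f}")
print(f"disagreement rate={disagreement_rate:.2f}")
print(f"repeated-run failure rate={repeated_run_failure_rate:.2f}")
```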
Debugging with total variance
Total variance becomes useful when it changes how you debug. Start with the highest-variance test cases instead of only the lowest-scoring cases.
Before debugging
| Test case | Symptom | Mean score | Variance | Likely cause |
|---|---|---|---|---|
| Invoice address change | Sometimes asks user to contact support, sometimes gives correct self-serve steps | 0.60 | 0.24 | Prompt does not clearly rank policy source over generic support fallback |
| Refund after account deletion | Mostly wrong, occasionally correct | 0.20 | 0.16 | Retrieved context lacks account deletion refund policy |
For the “Invoice address change” case, you inspect traces and see that the retrieved document is correct every time. The output still changes. That points to prompt ambiguity, not retrieval.
You update the instruction:
Before: “Answer billing questions using the policy context when available.”
After: “If the policy context contains a self-serve billing action, give the self-serve steps first. Do not recommend contacting support unless the policy explicitly says support is required.”
After debugging
| Test case | Runs | Mean score before | Variance before | Mean score after | Variance after |
|---|---|---|---|---|---|
| Invoice address change | 5 | 0.60 | 0.24 | 1.00 | 0.00 |
| Cancel trial before renewal | 5 | 0.80 | 0.16 | 1.00 | 0.00 |
This is the pattern you want: the mean score improves and variance drops. The prompt became more correct and more stable.
Watch for judge randomness
If you use an LLM judge, your eval score may vary even when the app output stays the same. This is especially common when the judge rubric is vague, uses a broad 1 to 10 scale, or asks for subjective quality ratings.
Reduce judge randomness with these steps:
- Use a narrow rubric with clear pass and fail criteria.
- Prefer binary or 0 to 2 scoring when possible.
- Pin the judge model version.
- Set judge temperature as low as the provider allows.
- Run the judge multiple times on a sample of outputs to estimate judge variance.
- Log judge reasoning separately from the final score.
If your judge gives different scores for the same output, measure that separately. Do not blame the app prompt before checking the scorer. For more detail on this pattern, see LLM-as-a-judge.
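A simple way to measure judge variance is to hold the app outputs fixed and score each one several times. The sketch below assumes the judge is any callable that takes an output string and returns a number. The two judges shown are offline stand-ins, not real model calls, and `judge_score_variance` is a hypothetical helper name.

```python
from itertools import cycle
from statistics import pvariance


def judge_score_variance(outputs, judge, repeats=5):
    """Score each fixed output `repeats` times and return the variance per output."""
    return {
        output: pvariance([judge(output) for _ in range(repeats)])
        for output in outputs
    }


# Stand-in judges so the sketch runs offline; replace with your real judge call.
def rule_based_judge(output):
    # Deterministic check: the same output always gets the same score.
    return 1 if "policy" in output.lower() else 0


_flaky_scores = cycle([1, 1, 0])


def flaky_judge(output):
    # Simulates a judge that changes its mind about identical text.
    return next(_flaky_scores)


outputs = [
    "You can change the invoice address in Billing settings, per the billing policy.",
    "Please contact support.",
]

stable = judge_score_variance(outputs, rule_based_judge)
noisy = judge_score_variance(outputs, flaky_judge)
print("rule-based judge:", list(stable.values()))
print("flaky judge:", list(noisy.values()))
```

Any nonzero variance here belongs to the scorer, because the outputs never changed. Subtracting it mentally from your within-case variance gives a better picture of how unstable the app itself is.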
Do not mix datasets when measuring variance
Dataset mixing is one of the easiest ways to produce misleading variance numbers. If one eval run contains billing questions and the next contains security questions, your variance may reflect dataset composition instead of prompt behavior.
Keep datasets separated by task type and risk level:
- Golden regression set: stable cases that every prompt version must pass.
- Edge-case set: rare or difficult cases you expect to be noisy.
- Adversarial set: prompt injection, policy conflict, malformed inputs, and unsafe requests.
- Fresh production sample: recent real user traffic, reviewed before use.
Report variance per dataset. A prompt with high variance on adversarial cases may still be acceptable if your golden regression set is stable. A prompt with high variance on common support questions needs work before release.
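Per-dataset reporting can be as simple as grouping scored runs by dataset name before computing anything. The dataset names, case IDs, and scores below are hypothetical and only illustrate the grouping.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical records: (dataset name, case id, binary scores across repeated runs).
records = [
    ("golden-regression", "refund-annual", [1, 1, 1, 1, 1]),
    ("golden-regression", "cancel-trial", [1, 1, 1, 1, 0]),
    ("adversarial", "prompt-injection", [1, 0, 0, 1, 0]),
    ("adversarial", "policy-conflict", [0, 1, 1, 0, 1]),
]

# Group runs by dataset so the sets are never pooled into one number.
by_dataset = defaultdict(list)
for dataset, _case_id, runs in records:
    by_dataset[dataset].append(runs)

report = {}
for dataset, cases in by_dataset.items():
    report[dataset] = {
        "mean": mean(s for runs in cases for s in runs),
        "avg_within_variance": mean(pvariance(runs) for runs in cases),
    }

for dataset, stats in report.items():
    print(f"{dataset}: mean={stats['mean']:.2f} "
          f"avg within-case variance={stats['avg_within_variance']:.3f}")
```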
Use traces to explain variance
A variance number tells you where to look. Traces tell you what happened.
When a test case flips between pass and fail, inspect:
- The exact prompt version.
- The model and model version.
- Input variables.
- Retrieved documents and ranking order.
- Tool calls and tool responses.
- Agent steps.
- Output schema validation results.
- Judge prompt, judge model, and judge score.
This is where LLM observability becomes part of eval work. You need trace-level evidence to know whether variance came from the prompt, model, retrieval layer, tools, or judge.
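If traces are stored as structured records, comparing a passing run with a failing run of the same test case becomes a field-by-field diff. The `Trace` fields below are a hypothetical schema based on the inspection list above, not the format of any specific observability tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Trace:
    # Hypothetical trace schema; adapt the fields to what your tooling records.
    prompt_version: str
    model: str
    retrieved_doc_ids: tuple
    tool_calls: tuple
    output: str
    output_valid: bool
    judge_model: str
    judge_score: int


def differing_fields(a, b):
    """Return the names of fields whose values differ between two traces."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


passing = Trace(
    prompt_version="v12", model="model-x", retrieved_doc_ids=("billing-policy-3",),
    tool_calls=(), output="Go to Billing settings and edit the invoice address.",
    output_valid=True, judge_model="judge-x", judge_score=1,
)
failing = Trace(
    prompt_version="v12", model="model-x", retrieved_doc_ids=("billing-policy-3",),
    tool_calls=(), output="Please contact support to change your address.",
    output_valid=True, judge_model="judge-x", judge_score=0,
)

print(differing_fields(passing, failing))
```

Here the retrieved documents and tool calls match across both runs and only the output and score differ, which points at generation rather than retrieval or tools.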
A practical workflow for using total variance
- Freeze the dataset. Use the same test cases when comparing prompt versions.
- Run each case multiple times. Start with 3 to 5 runs per case. Use 10 or more for high-risk workflows.
- Score each output consistently. Use deterministic checks where possible. Use a clear judge rubric when needed.
- Calculate mean score and variance. Report both at the dataset level and test-case level.
- Sort by variance. Debug high-variance cases first, especially when the mean score is near your pass threshold.
- Inspect traces. Identify whether the instability comes from generation, retrieval, tools, agent routing, or judging.
- Change one thing at a time. Do not update the prompt, model, retriever, and judge in the same experiment unless you are doing a broad system test.
- Rerun the same eval. Compare mean, variance, and per-case behavior before declaring a win.
Common mistakes to avoid
Using one run per prompt
One run hides instability. If Prompt A scores 0.82 once and Prompt B scores 0.79 once, you do not know which prompt is better. Run repeated trials before making a release decision.
Comparing averages without variance
Two prompts can have the same average and very different production behavior. A prompt that scores 0.80 with low variance is usually safer than a prompt that swings between perfect and broken.
Mixing datasets
Do not compare one prompt on a clean regression set and another prompt on recent messy production samples. Keep the dataset fixed when comparing prompt versions.
Ignoring judge randomness
If an LLM judge is noisy, your app score will look noisy. Test the judge by scoring the same outputs multiple times.
Treating total variance as model quality
Total variance measures spread in your eval scores. It does not prove that one model is smarter, safer, or more capable. Use it with mean score, failure categories, traces, and human review for high-risk cases.
What good reporting looks like
A useful eval report should show enough detail for an engineer to act. At minimum, include:
- Prompt version
- Model name and version
- Dataset name and version
- Number of test cases
- Number of repeated runs per case
- Mean score
- Total observed variance
- Average within-case variance
- Top failing cases
- Top high-variance cases
- Judge configuration
Example release table
| Prompt version | Dataset | Cases | Runs per case | Mean | Total variance | Avg within-case variance | Release decision |
|---|---|---|---|---|---|---|---|
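| billing-assistant-v12 | billing-regression-v4 | 120 | 5 | 0.91 | 0.082 | 0.031 | Pass |
| billing-assistant-v13 | billing-regression-v4 | 120 | 5 | 0.93 | 0.065 | 0.052 | Hold, debug high-variance cases |
Prompt v13 has a higher average, but its within-case variance is higher, so it is less stable across repeated runs. With binary pass/fail scores, total variance is fixed by the mean (p × (1 − p)), so read stability from the within-case column. You should inspect the high-variance cases before shipping it.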
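A release decision like the one in the table can be encoded as a small gate that checks quality and stability separately. The function name, thresholds, and scores below are hypothetical. Pick thresholds that fit your own risk tolerance.

```python
from statistics import mean, pvariance


def release_decision(runs_by_case, min_mean=0.85, max_avg_within_variance=0.05):
    """Gate a release on quality (mean) and stability (within-case variance)."""
    all_scores = [s for runs in runs_by_case.values() for s in runs]
    overall_mean = mean(all_scores)
    avg_within = mean(pvariance(runs) for runs in runs_by_case.values())
    if overall_mean < min_mean:
        return "Fail"
    if avg_within > max_avg_within_variance:
        return "Hold, debug high-variance cases"
    return "Pass"


# Hypothetical binary scores: five cases, five runs each.
stable_prompt = {
    "case-1": [1, 1, 1, 1, 1],
    "case-2": [1, 1, 1, 1, 1],
    "case-3": [1, 1, 1, 1, 1],
    "case-4": [1, 1, 1, 1, 1],
    "case-5": [1, 1, 1, 1, 0],
}
flaky_prompt = {
    "case-1": [1, 1, 1, 1, 1],
    "case-2": [1, 1, 1, 1, 1],
    "case-3": [1, 1, 1, 1, 0],
    "case-4": [1, 0, 1, 1, 1],
    "case-5": [0, 1, 1, 1, 1],
}

print("stable prompt:", release_decision(stable_prompt))
print("flaky prompt:", release_decision(flaky_prompt))
```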
How many repeated runs do you need?
Use the risk of the workflow to choose the number of repeats:
- Prototype prompt: 3 runs per case can expose obvious instability.
- Internal tool: 5 runs per case is a reasonable default.
- Customer-facing support or sales workflow: 5 to 10 runs per case.
- High-risk workflow: 10 or more runs per case, plus human review and slice-based analysis.
You do not need to run every eval at maximum depth every time. Use a smaller repeated-run eval during prompt development, then run a larger stability eval before release.
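To see why more repeats help, look at the standard error of a single case's estimated pass rate. The sketch below assumes binary scores and independent runs, in which case the standard error is sqrt(p × (1 − p) / n). The 0.8 pass rate is only an example value.

```python
from math import sqrt


def pass_rate_standard_error(pass_rate, n_runs):
    """Standard error of an estimated pass rate from n independent binary runs."""
    return sqrt(pass_rate * (1 - pass_rate) / n_runs)


# How uncertain is a per-case pass rate of 0.8 at different repeat counts?
for n_runs in (3, 5, 10, 20):
    error = pass_rate_standard_error(0.8, n_runs)
    print(f"{n_runs} runs: standard error ≈ {error:.3f}")
```

The error shrinks with the square root of the run count, so going from 5 to 20 runs only halves it. That is why a small repeat count is enough to expose obvious instability, while high-risk workflows justify paying for more.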
Final checklist
- Run each test case more than once.
- Report mean score and variance together.
- Review per-test-case variance, not only dataset-level variance.
- Keep datasets fixed when comparing prompt versions.
- Measure judge variance if you use an LLM judge.
- Use traces to explain high-variance cases.
- Do not treat total variance as a direct measure of model quality.
Total variance gives your team a practical way to separate real prompt improvements from noisy eval results. Used well, it takes guesswork out of prompt releases and gives engineers a clear path for debugging unstable behavior.
PromptLayer helps teams manage prompt versions, run evals, inspect traces, compare scores, and debug LLM behavior across datasets and workflows. To start tracking eval quality and variance in one place, create a PromptLayer account.