How to Evaluate Summary Relevance
Summary relevance measures whether a generated summary includes the information the user actually needed. For LLM applications, this is one of the most important summary quality checks because a fluent summary can still fail the task.
If your app summarizes support tickets, research papers, legal clauses, meeting transcripts, customer calls, or retrieved documents, relevance should be evaluated separately from factuality, tone, length, and writing quality. Otherwise, your scores will hide the specific failure mode you need to fix.
What summary relevance means
A relevant summary includes the source details that matter for the requested use case and excludes details that do not help the user complete their task.
For example, a sales manager and a support engineer may need different summaries of the same customer call:
- Sales manager need: buying intent, objections, timeline, stakeholders, next steps.
- Support engineer need: reported bug, reproduction steps, environment, severity, workaround.
The source text can be identical, but the relevant summary changes based on the user’s information need.
Separate relevance from other summary metrics
Do not use one generic “summary quality” score unless you are doing a rough manual review. In production evals, split the dimensions:
- Relevance: Does the summary include the information needed for the task and leave out details that do not help?
- Factuality: Are the included claims supported by the source?
- Completeness: Does the summary cover all required points?
- Conciseness: Does it avoid unnecessary detail?
- Style: Does it match the requested tone, format, or reading level?
A summary can be relevant but factually wrong. It can be factual but irrelevant. It can be well-written and still useless. Label these metrics separately so your team knows what to fix.
Start with the user’s information need
Before writing a relevance rubric, define the user’s goal in concrete terms. This should be part of each eval case, not an assumption hidden in someone’s head.
A good eval case includes:
- Source text: The document, transcript, ticket, email, or retrieved context being summarized.
- User task: What the user wants to know or do.
- Required information: Details the summary must include to be useful.
- Irrelevant information: Details that should be omitted or deprioritized.
- Generated summary: The model output being evaluated.
- Score: The relevance score assigned by a human or calibrated LLM judge.
- Rationale: A short explanation for the score.
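The fields above can be captured in a small record type so that every eval case carries its user task explicitly. The following is a minimal sketch, not a required schema: the class name, the field names, the `case_id` field, and the `validate` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RelevanceEvalCase:
    """One summary-relevance eval case. Field names are illustrative."""
    case_id: str                      # hypothetical identifier for the case
    source_text: str                  # document, transcript, ticket, or context
    user_task: str                    # what the user wants to know or do
    required_information: list[str]   # details the summary must include
    irrelevant_information: list[str] = field(default_factory=list)
    generated_summary: str = ""       # model output being evaluated
    score: Optional[int] = None       # 1 to 5, from a human or calibrated judge
    rationale: str = ""               # short explanation for the score

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the case is usable."""
        problems = []
        if not self.user_task.strip():
            problems.append("user_task is empty")
        if not self.required_information:
            problems.append("required_information is empty")
        if self.score is not None and not 1 <= self.score <= 5:
            problems.append("score must be between 1 and 5")
        return problems


case = RelevanceEvalCase(
    case_id="support_export_bug_001",
    source_text="Customer transcript describing CSV export failures.",
    user_task="Summarize for a support engineer who needs to triage the bug.",
    required_information=["version 4.8", "504 error", "over 10,000 rows"],
    irrelevant_information=["customer emotion"],
)
print(case.validate())
```

Rejecting cases with an empty user task or no required details is a design choice in this sketch: it keeps the information need from living only in a reviewer's head.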
Example: source text and generated summaries
Here is a small example using a customer support transcript.
Source text
Customer: We upgraded to version 4.8 yesterday. Since then, CSV exports from the analytics page fail for reports with more than 10,000 rows. Smaller exports work. The request spins for about 90 seconds, then returns a 504. This is blocking our finance team’s month-end reporting. We are on the Enterprise plan and use the EU region. We tried Chrome and Firefox. Same result.
Support agent: Thanks. Can you confirm whether the issue happens for scheduled exports too?
Customer: Scheduled exports still work. It is only manual exports from the analytics page.
User task
A support engineer needs a summary to triage the bug and reproduce the issue.
Generated summary A
The customer is experiencing export failures after upgrading to version 4.8. Manual CSV exports from the analytics page fail for reports over 10,000 rows after about 90 seconds with a 504 error. Smaller exports and scheduled exports still work. The customer is on the Enterprise plan in the EU region, and the issue occurs in both Chrome and Firefox. The problem is blocking finance month-end reporting.
Generated summary B
The customer is unhappy because finance reporting is blocked after a recent upgrade. They are an Enterprise customer and need help quickly.
Generated summary C
The customer reports that all exports are failing after upgrading to version 4.8. The issue affects their finance team and happens in Chrome.
Summary A is the most relevant for the support engineer. It includes the failure condition, threshold, page, export type, timeout behavior, error code, region, browsers tested, and scheduled export exception. Summary B is fluent but misses reproduction details. Summary C includes some relevant details but distorts the scope by saying all exports fail.
Use a rubric with clear score levels
A useful relevance rubric should define each score level in terms of task success. Avoid vague criteria like “good,” “mostly relevant,” or “captures the main idea.” Those labels create inconsistent scoring across reviewers and LLM judges.
Use a 1 to 5 scale when you need enough detail to track improvements without making scoring slow.
| Score | Label | Criteria | Example judgment |
|---|---|---|---|
| 5 | Highly relevant | Includes all task-critical details and omits distractions. The user can act on the summary without rereading the source. | Includes export type, page, row threshold, error code, timeout, version, region, browser tests, and scheduled export exception. |
| 4 | Relevant | Includes most task-critical details. Any missing detail causes minor friction but does not block the user. | Includes the 504 error and row threshold but omits browser tests. |
| 3 | Partially relevant | Includes some useful information but misses one or more details the user needs, so they must return to the source to complete the task. | Mentions export failure after upgrade but omits threshold, manual-only scope, and region. |
| 2 | Low relevance | Focuses on secondary details while missing most task-critical information. | Focuses on customer urgency and plan tier but misses reproduction conditions. |
| 1 | Irrelevant | Does not answer the user’s information need or provides details unrelated to the task. | Summarizes customer sentiment without describing the bug. |
For a blog post, internal doc, or eval report, include a screenshot or table of your rubric. Teams make fewer scoring mistakes when the score definitions are visible next to each scored output.
Score example summaries with the rubric
Using the source text above, the scored rows might look like this:
| Summary | Relevance score | Rationale |
|---|---|---|
| Summary A | 5 | Contains all details needed for support triage: version, manual CSV exports, analytics page, row threshold, timeout, 504 error, Enterprise plan, EU region, browser tests, and scheduled export exception. |
| Summary B | 2 | Mentions urgency and customer tier but omits the core reproduction details needed by the engineer. |
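| Summary C | 3 | Includes the upgrade, export failure, finance impact, and one browser, but misses the row threshold, error code, region, and manual-only scope. Saying all exports fail hides the scope the engineer needs; the inaccuracy itself should also be logged as a factuality issue. |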
This format works well in an eval UI because reviewers can scan the output, score, and rationale in one row. If you use LLM-as-judge scoring, save the judge rationale too. It helps you debug weak criteria and spot judge drift.
Build an eval dataset for summary relevance
Your eval dataset should reflect the summaries your application actually generates. Include short, medium, and long source texts. Include easy cases and edge cases. If your app uses retrieval, include cases where the retrieved context contains distractors.
Here is an example dataset structure:
| Field | Example value |
|---|---|
| case_id | support_export_bug_001 |
| source_text | Customer transcript describing CSV export failures after version 4.8 upgrade. |
| user_task | Summarize for a support engineer who needs to reproduce and triage the bug. |
| required_details | Version 4.8, manual CSV exports, analytics page, reports over 10,000 rows, 90-second timeout, 504 error, EU region, scheduled exports still work. |
| distractor_details | Customer emotion, generic urgency, browser names if token budget is very tight. |
| candidate_summary | The generated model output. |
| human_relevance_score | 1 to 5 |
| human_rationale | Short explanation of missing or included task-critical details. |
Start with 30 to 50 examples for early prompt iteration. Move toward 100 to 300 examples when you need stable regression testing across prompt versions, model changes, retrieval changes, or agent workflow updates.
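To check that a dataset actually mixes short, medium, and long sources and includes distractor cases, you can profile it before running any scoring. This is a rough sketch: the word-count cutoffs and the `dataset_profile` helper are assumptions, and the rows are expected to be dicts keyed by the field names in the table above.

```python
from collections import Counter

# Hypothetical word-count cutoffs; tune them to your own source texts.
SHORT_MAX_WORDS = 150
MEDIUM_MAX_WORDS = 1000


def length_bucket(source_text: str) -> str:
    """Classify a source text as short, medium, or long by word count."""
    words = len(source_text.split())
    if words <= SHORT_MAX_WORDS:
        return "short"
    if words <= MEDIUM_MAX_WORDS:
        return "medium"
    return "long"


def dataset_profile(rows: list[dict]) -> dict:
    """Count cases per length bucket and cases that list distractors."""
    buckets = Counter(length_bucket(r["source_text"]) for r in rows)
    with_distractors = sum(1 for r in rows if r.get("distractor_details"))
    return {
        "total": len(rows),
        "short": buckets["short"],
        "medium": buckets["medium"],
        "long": buckets["long"],
        "with_distractors": with_distractors,
    }


# Made-up rows, used only to show the shape of the output.
rows = [
    {"case_id": "a", "source_text": "word " * 50, "distractor_details": "Customer emotion"},
    {"case_id": "b", "source_text": "word " * 400, "distractor_details": ""},
    {"case_id": "c", "source_text": "word " * 2000},
]
profile = dataset_profile(rows)
print(profile)
```

A profile with an empty bucket or zero distractor cases is a hint that the dataset does not yet reflect the range of inputs described above.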
Use LLM judges carefully
LLM judges can score summary relevance quickly, but you need calibration. Do not trust judge scores by default.
A practical setup looks like this:
- Create a small gold set of 30 to 50 examples scored by humans.
- Write a judge prompt that includes the user task, source text, candidate summary, rubric, and output schema.
- Compare LLM judge scores against human scores.
- Review disagreements by score band and failure type.
- Revise the rubric or judge prompt where the judge makes repeatable mistakes.
- Run periodic spot checks after model or prompt changes.
For many teams, a judge is useful when it agrees with human labels within 1 point on at least 80% of cases. If your judge disagrees heavily on high-impact cases, keep humans in the review loop for those cases or narrow the judge’s scope.
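The within-1-point agreement check can be computed directly from paired scores. The sketch below assumes human and judge scores are stored as two lists in the same case order; the function names and the sample scores are made up for illustration.

```python
from collections import defaultdict


def within_one_agreement(human: list[int], judge: list[int]) -> float:
    """Fraction of cases where the judge is within 1 point of the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    hits = sum(1 for h, j in zip(human, judge) if abs(h - j) <= 1)
    return hits / len(human)


def disagreements_by_band(human: list[int], judge: list[int]) -> dict[int, list[int]]:
    """Group judge-minus-human gaps larger than 1 point by human score."""
    bands = defaultdict(list)
    for h, j in zip(human, judge):
        if abs(h - j) > 1:
            bands[h].append(j - h)
    return dict(bands)


# Illustrative scores only, not real eval results.
human_scores = [5, 4, 2, 3, 1, 5, 2, 4, 3, 5]
judge_scores = [5, 5, 4, 3, 2, 4, 2, 4, 5, 5]

agreement = within_one_agreement(human_scores, judge_scores)
bands = disagreements_by_band(human_scores, judge_scores)
print(f"within-1 agreement: {agreement:.0%}")
print(f"large disagreements by human score: {bands}")
```

Positive gaps in a band mean the judge scored higher than the human. A cluster of positive gaps in the low bands would suggest the judge is too lenient on weak summaries.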
Example LLM judge prompt
You can adapt this structure for automated relevance scoring:
You are evaluating summary relevance.
User task:
{{user_task}}
Source text:
{{source_text}}
Candidate summary:
{{candidate_summary}}
Rubric:
5 = Includes all task-critical details and omits distractions.
4 = Includes most task-critical details. Missing detail causes minor friction.
3 = Includes some useful information but misses details needed for the task.
2 = Focuses on secondary details while missing most task-critical information.
1 = Does not answer the user's information need.
Evaluate relevance only. Do not score factuality, style, grammar, or conciseness unless those issues affect relevance to the user task.
Return JSON:
{
"score": 1-5,
"rationale": "Brief explanation",
"missing_critical_details": [],
"irrelevant_details": []
}
Keep the judge focused. If you want factuality, style, or conciseness, create separate judge calls or separate fields in the same judge output.
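Judge output should be validated before it enters your score table, because a model can return malformed JSON or an out-of-range score. Here is a minimal sketch that fills the `{{placeholder}}` slots and checks the returned JSON. The helper names and the shortened template are assumptions, and the call to the model itself is deliberately left out.

```python
import json
import re

# Shortened stand-in for the full judge prompt shown above.
JUDGE_TEMPLATE = (
    "You are evaluating summary relevance.\n"
    "User task:\n{{user_task}}\n"
    "Source text:\n{{source_text}}\n"
    "Candidate summary:\n{{candidate_summary}}\n"
)


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each {{name}} slot in one pass; unknown names raise KeyError."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and check it against the output schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if not isinstance(data.get("rationale"), str):
        raise ValueError("rationale must be a string")
    for key in ("missing_critical_details", "irrelevant_details"):
        if not isinstance(data.setdefault(key, []), list):
            raise ValueError(f"{key} must be a list")
    return data


prompt = fill_template(JUDGE_TEMPLATE, {
    "user_task": "Triage the bug and reproduce the issue.",
    "source_text": "Customer reports CSV export failures.",
    "candidate_summary": "Exports over 10,000 rows fail with a 504.",
})

# A hand-written reply standing in for a real judge response.
reply = '{"score": 4, "rationale": "Omits browser tests.", "missing_critical_details": ["browsers tested"]}'
result = parse_judge_output(reply)
print(result["score"], result["missing_critical_details"])
```

Replacing placeholders in a single pass means placeholder-like text inside the source document is never substituted a second time.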
Track before and after scores
Summary relevance evals are most useful when they guide prompt and workflow changes. Always compare scores before and after a change.
For example, suppose your current prompt says:
Summarize this customer conversation in 3 bullets.
You might change it to:
Summarize this customer conversation for a support engineer triaging a bug.
Include:
- Product area
- Triggering action
- Error message or code
- Scope of affected users or records
- Environment details
- Any known workaround or exception
Omit generic sentiment unless it affects severity.
Your before and after comparison might look like this:
| Eval set | Prompt version | Average relevance score | % scored 4 or 5 | % scored 1 or 2 |
|---|---|---|---|---|
| Support triage summaries, 80 cases | v1 generic summary | 3.1 | 42% | 18% |
| Support triage summaries, 80 cases | v2 task-specific summary | 4.2 | 76% | 5% |
Include this kind of before and after table in your internal eval report. It gives reviewers, product managers, and engineering leads a concrete view of whether the change improved task performance.
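The three columns in a before and after table are simple aggregates over per-case scores. As a sketch, assuming each prompt version's scores are available as a list of integers from 1 to 5 (the function name and sample scores are illustrative, not the figures from the table):

```python
def relevance_summary(scores: list[int]) -> dict[str, float]:
    """Average score plus the share of strong (4-5) and weak (1-2) outputs."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    return {
        "average": round(sum(scores) / n, 2),
        "pct_strong": round(100 * sum(1 for s in scores if s >= 4) / n, 1),
        "pct_weak": round(100 * sum(1 for s in scores if s <= 2) / n, 1),
    }


# Made-up scores for two prompt versions on the same ten cases.
v1_scores = [3, 2, 4, 3, 5, 1, 3, 4, 2, 3]
v2_scores = [4, 4, 5, 3, 5, 2, 4, 5, 4, 4]

v1 = relevance_summary(v1_scores)
v2 = relevance_summary(v2_scores)
print("v1:", v1)
print("v2:", v2)
```

Run both versions on the same eval cases. Otherwise a change in the averages may reflect a change in the dataset instead of the prompt.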
Common mistakes when evaluating summary relevance
Judging fluency instead of relevance
A polished summary can miss the user’s actual need. If the task is bug triage, a beautiful executive summary is still a bad output when it omits reproduction steps.
Using vague criteria
Criteria like “captures important points” are too broad. Important to whom? For what task? Replace vague criteria with required information tied to the user’s goal.
Ignoring the user’s information need
The same source text can support many summary types. A compliance reviewer, support engineer, sales rep, and executive may need different details. Put the user task directly in each eval case.
Over-trusting LLM judges without calibration
LLM judges can prefer longer summaries, fluent wording, or their own assumptions about importance. Compare judge scores with human labels before using them as a release gate.
Mixing relevance with factuality or style without labeling them
If a summary includes the right details but invents one number, that is a factuality failure. If it includes all required details but sounds too casual, that is a style issue. Track these separately so your fixes are precise.
How to operationalize relevance evals
For production LLM systems, run relevance evals at the points where quality can change:
- Prompt edits: Check whether new instructions improve or reduce task-specific relevance.
- Model changes: Compare relevance scores before moving to a new model or model version.
- Retrieval changes: Test whether new chunks, ranking logic, or filters affect summary relevance.
- Agent workflow changes: Verify that upstream tool calls provide the right source context for the final summary.
- Production monitoring: Sample real outputs and score them against the same rubric.
A good release gate might require no drop in average relevance score, no increase in 1 or 2 scores, and manual review for any critical case that drops by 2 or more points.
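A gate like this can be expressed as a small function over per-case scores. The sketch below assumes scores are keyed by case ID and that critical cases are tracked as a set of IDs. The function name and return shape are hypothetical, and the thresholds simply mirror the example gate described above.

```python
def release_gate(
    before: dict[str, int],
    after: dict[str, int],
    critical_ids: set[str],
) -> dict:
    """Compare per-case relevance scores for cases present in both runs."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("no shared cases to compare")

    avg_before = sum(before[c] for c in shared) / len(shared)
    avg_after = sum(after[c] for c in shared) / len(shared)
    weak_before = sum(1 for c in shared if before[c] <= 2)
    weak_after = sum(1 for c in shared if after[c] <= 2)

    # Critical cases that dropped by 2 or more points need a human look.
    needs_review = [
        c for c in shared if c in critical_ids and before[c] - after[c] >= 2
    ]
    return {
        "passed": avg_after >= avg_before and weak_after <= weak_before,
        "needs_manual_review": needs_review,
    }


# Illustrative scores only.
before_scores = {"c1": 4, "c2": 3, "c3": 5, "c4": 2}
after_scores = {"c1": 5, "c2": 4, "c3": 3, "c4": 3}

gate = release_gate(before_scores, after_scores, critical_ids={"c3"})
print(gate)
```

In this example the averages improve and the weak count falls, so the gate passes, but case `c3` is still flagged because it is critical and dropped by 2 points. Keeping the review list separate from the pass flag lets a team decide whether a flagged case should block the release.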
Recommended artifacts for your eval report
When you share relevance eval results, include these artifacts:
- Rubric table: Show score levels, labels, and criteria.
- Eval dataset example: Include source text, user task, required details, candidate summary, score, and rationale.
- Scored output table: Show several summaries with scores and reviewer notes.
- Before and after comparison: Show average score, percentage of strong outputs, and percentage of weak outputs.
- Failure categories: Track patterns such as missing required details, over-including distractors, wrong audience, or generic summary behavior.
Screenshots are useful when your eval tool shows traces, prompt versions, model settings, retrieved context, and judge outputs together. They help engineers connect a bad relevance score to the specific prompt, context, or workflow step that caused it.
Summary relevance evaluation checklist
- Define the user task for every eval case.
- List required details and distractor details.
- Use a clear 1 to 5 relevance rubric.
- Score relevance separately from factuality, style, and conciseness.
- Include rationales with scores.
- Calibrate LLM judges against human labels.
- Track before and after scores for prompt, model, retrieval, and workflow changes.
- Review low-scoring cases and group them by failure type.
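Grouping low-scoring cases by failure type can be a simple tally once reviewers tag each case. A minimal sketch, assuming each scored case is a dict with a `score` and a list of reviewer-assigned `failure_types`; both field names and the category labels are hypothetical.

```python
from collections import Counter


def failure_counts(scored_cases: list[dict], max_score: int = 2) -> list[tuple[str, int]]:
    """Tally failure types across cases scored at or below max_score."""
    counts = Counter()
    for case in scored_cases:
        if case["score"] <= max_score:
            counts.update(case.get("failure_types", []))
    return counts.most_common()


# Made-up review results for illustration.
scored_cases = [
    {"case_id": "a", "score": 2, "failure_types": ["missing_required_details", "generic_summary"]},
    {"case_id": "b", "score": 1, "failure_types": ["missing_required_details"]},
    {"case_id": "c", "score": 5, "failure_types": []},
    {"case_id": "d", "score": 2, "failure_types": ["wrong_audience"]},
    {"case_id": "e", "score": 4, "failure_types": ["over_including_distractors"]},
]

top_failures = failure_counts(scored_cases)
for failure_type, count in top_failures:
    print(f"{failure_type}: {count}")
```

The most frequent category is usually the best place to start when revising a prompt or retrieval step.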
Summary relevance is a task-level metric. It tells you whether the summary gave the user the information they needed. When you define the user need clearly, score with a concrete rubric, and track changes over time, relevance evals become a practical tool for improving LLM applications.
PromptLayer helps AI teams manage prompts, run evaluations, inspect traces, compare versions, and track quality changes across LLM workflows. If you are building summary evals or regression tests for production prompts, create a PromptLayer account to start organizing and evaluating your LLM outputs.