Defining x_i in LLM Evaluations: Expert Guide for AI Teams

How to Define x_i for LLM Evals

In LLM evaluation, x_i is the input for a single test case. It is the thing your system receives, processes, and turns into an output that you score.

That sounds simple until you evaluate a real LLM application. In production, the model rarely receives only a raw user message. It may receive a system prompt, retrieved documents, user profile data, conversation history, tool outputs, feature flags, model settings, and hidden routing decisions.

If your eval defines x_i as only “the user query,” your test may measure a different system than the one users experience.

A strong definition of x_i should capture the full input state needed to reproduce the model behavior you want to evaluate. For AI teams, that usually means treating x_i as an engineering artifact, not a math symbol.

The practical definition

For most LLM evals, define x_i like this:

x_i is the complete input bundle needed to run one evaluation case through the same prompt, workflow, retrieval path, tools, and model configuration used by the application.

That bundle may include:

The visible user input
System and developer messages
Prompt template variables
Conversation history
Retrieved documents or chunks
Tool schemas and tool responses
User, account, or tenant context
Locale, timezone, permissions, or plan tier
Model name and inference settings
Workflow or agent state

You do not need every field for every eval. You do need every field that can change the output or affect the score.
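One way to treat x_i as an engineering artifact is to give it an explicit type. The sketch below is a minimal illustration, not a required shape: the class name `EvalInput` and its field names are assumptions chosen to mirror the list above, and it assumes Python 3.9 or later.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class EvalInput:
    """Hypothetical container for one x_i. Drop fields your app never reads."""

    user_query: str
    system_prompt: str = ""
    template_vars: dict[str, Any] = field(default_factory=dict)
    conversation_history: list[dict[str, str]] = field(default_factory=list)
    retrieved_context: list[dict[str, Any]] = field(default_factory=list)
    user_context: dict[str, Any] = field(default_factory=dict)
    model: str = ""
    temperature: float = 0.0


x_i = EvalInput(
    user_query="Why was I charged twice?",
    user_context={"plan": "Pro", "account_status": "active"},
    model="gpt-4.1",
    temperature=0.2,
)

# asdict() gives a plain dict that can be stored next to the eval results.
record = asdict(x_i)
```

Freezing the dataclass is a design choice: fields cannot be reassigned after creation, which makes it harder to swap out an input by accident between runs.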

A simple example

Suppose you are evaluating a support chatbot for a billing product. A weak x_i might look like this:

{
  "user_query": "Why was I charged twice?"
}

That input is incomplete if the real application depends on account status, invoices, plan type, refund policy, and conversation history.

A better x_i might look like this:

{
  "user_query": "Why was I charged twice?",
  "conversation_history": [
    {
      "role": "assistant",
      "content": "I can help with billing questions. Can you confirm the charge date?"
    },
    {
      "role": "user",
      "content": "It happened yesterday."
    }
  ],
  "user_context": {
    "plan": "Pro",
    "billing_timezone": "America/New_York",
    "account_status": "active"
  },
  "retrieved_context": [
    {
      "doc_id": "billing_policy_v4",
      "chunk_id": "refunds_duplicate_charges_02",
      "text": "Duplicate card authorizations usually clear within 3 business days..."
    },
    {
      "doc_id": "invoice_9831",
      "text": "Invoice paid on 2026-05-19. Card authorization pending on 2026-05-20."
    }
  ],
  "prompt_version": "billing_agent_v12",
  "model": "gpt-4.1",
  "temperature": 0.2
}

This version gives your eval a real chance to reproduce the system behavior. It also lets you debug failures. If the answer is wrong, you can inspect whether the model misunderstood the policy, retrieval returned the wrong invoice, or the prompt failed to instruct the model to distinguish charges from authorizations.

Start with the decision your eval needs to support

Before you define x_i, decide what question the eval should answer. Common examples:

Did the new prompt improve support answer accuracy?
Did the model upgrade increase tool-call success?
Does the RAG pipeline answer questions using the right source documents?
Does the agent complete the task without unsafe actions?
Did a retrieval change cause regressions for enterprise users?

Each question implies a different input shape.

If you are testing prompt wording only, x_i may freeze retrieved context and tool responses. If you are testing retrieval quality, x_i may include the user query and corpus state, while retrieved chunks become part of the generated run data. If you are testing an agent, x_i may need the starting state, available tools, and permissions, with the target outcome recorded next to it for scoring.

This is where many evals go wrong. Teams build one generic spreadsheet with a “prompt” column and an “expected answer” column, then use it for every system change. That may work for a simple summarization prompt. It breaks quickly for RAG, agents, routing, and multi-step workflows.

Define x_i at the right system boundary

The most important choice is the boundary of the system under test.

Prompt-level eval

Use this when you want to test a prompt template with fixed variables.

x_i should include:

Template variables
System and developer instructions
Any fixed context passed into the prompt
Model settings

This is useful when you are comparing prompt versions and want to keep everything else stable.

RAG eval

Use this when retrieval quality and answer quality both matter.

x_i may include:

User question
Corpus version
Retriever configuration
Filters, tenant IDs, or permissions
Expected source documents, if you are scoring retrieval

If you are scoring the final answer, preserve the retrieved chunks that were actually sent to the model. Without them, you cannot reproduce the result or tell whether the model failed or retrieval failed.

Agent eval

Use this when the model can plan, call tools, read results, and continue.

x_i should include:

User request
Initial environment state
Available tools and tool schemas
Permissions and policy constraints
Mocked or recorded tool responses
Success criteria (kept as reference data for scoring, not passed to the agent)

For agent workflows, x_i is often closer to a scenario than a single message. If an agent books a meeting, the input needs calendar state, attendee availability, timezone, and constraints such as “do not schedule outside business hours.”
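The three boundaries can be made explicit with a small check that runs before a case enters a dataset. This is a sketch under assumptions: the field names in `REQUIRED_FIELDS`, the tool names, and the sample meeting request are all hypothetical, loosely following the lists above.

```python
# Hypothetical required fields per system boundary.
REQUIRED_FIELDS = {
    "prompt": {"template_vars", "instructions", "model_settings"},
    "rag": {"user_question", "corpus_version", "retriever_config"},
    "agent": {"user_request", "initial_state", "tools", "permissions", "tool_responses"},
}


def missing_fields(boundary, x_i):
    """List required fields that this x_i does not provide for the chosen boundary."""
    return sorted(REQUIRED_FIELDS[boundary] - set(x_i))


# Made-up agent case that forgot permissions and recorded tool responses.
meeting_case = {
    "user_request": "Book 30 minutes with Dana next week.",
    "initial_state": {"timezone": "America/New_York"},
    "tools": ["calendar.search", "calendar.create"],
}

print(missing_fields("agent", meeting_case))  # ['permissions', 'tool_responses']
```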

Separate inputs from expected outputs

Do not leak expected answers into x_i.

This mistake is common in classification, extraction, and support-answer evals. A test case may include a field named “gold_answer,” “correct_category,” or “ideal_response.” If that field gets passed into the prompt by accident, the eval result becomes meaningless.

Keep your eval record structured into separate fields:

Input: the data passed into the system
Reference: the expected answer, rubric, labels, or acceptable criteria
Output: the model or workflow response
Score: metric results, judge decision, or pass/fail status
Trace: intermediate calls, retrieval results, and tool use

A clean separation helps prevent accidental leakage. It also makes your evals easier to review in code and easier to run through an LLM evaluation workflow.
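A cheap guard against leakage is to scan the model input for reference-style keys before each run. The sketch below assumes the key names mentioned above plus a generic `reference` key. The function name `find_leaked_keys` is hypothetical.

```python
# Key names that should never appear inside the data passed to the model.
REFERENCE_KEYS = {"gold_answer", "correct_category", "ideal_response", "reference"}


def find_leaked_keys(model_input):
    """Return dotted paths of reference-style keys found in the nested input."""
    leaked = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                here = f"{path}.{key}" if path else key
                if key in REFERENCE_KEYS:
                    leaked.append(here)
                walk(value, here)
        elif isinstance(node, list):
            for index, item in enumerate(node):
                walk(item, f"{path}[{index}]")

    walk(model_input, "")
    return leaked


clean = {"user_query": "Why was I charged twice?", "user_context": {"plan": "Pro"}}
leaky = {"user_query": "Why was I charged twice?", "extras": {"gold_answer": "..."}}

print(find_leaked_keys(clean))  # []
print(find_leaked_keys(leaky))  # ['extras.gold_answer']
```

A key-name scan only catches labeled leaks. It will not notice an expected answer pasted into a free-text field, so it complements a structured record rather than replacing one.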

Do not mix unrelated tasks in one eval set

A good eval set has a clear job. Avoid mixing unrelated tasks such as summarization, billing support, SQL generation, and safety refusal into one score.

Mixed eval sets create noisy results. A prompt change may improve SQL generation and hurt support tone, yet the average score looks flat. Your team may ship a regression because the eval hides the failure.

Split eval sets by task and decision:

billing_support_accuracy
refund_policy_refusal
sql_generation_read_only
ticket_summarization_concise
agent_tool_call_success

Each set should have its own x_i schema, scoring method, and release threshold. For example, a billing support eval may require factual accuracy above 95%, while a summarization eval may allow more variation and use a rubric-based judge.
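To see how a pooled average can hide a regression, consider the toy results below. The scores are made up for illustration. Only the task names come from the list above.

```python
from collections import defaultdict
from statistics import mean


def pooled_score(results):
    """Average every score together, ignoring the task."""
    return mean([r["score"] for r in results])


def scores_by_task(results):
    """Average scores separately for each task."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r["task"]].append(r["score"])
    return {task: mean(scores) for task, scores in grouped.items()}


# Invented scores before and after a prompt change.
before = [
    {"task": "sql_generation_read_only", "score": 0.5},
    {"task": "sql_generation_read_only", "score": 0.5},
    {"task": "billing_support_accuracy", "score": 1.0},
    {"task": "billing_support_accuracy", "score": 1.0},
]
after = [
    {"task": "sql_generation_read_only", "score": 1.0},
    {"task": "sql_generation_read_only", "score": 1.0},
    {"task": "billing_support_accuracy", "score": 0.5},
    {"task": "billing_support_accuracy", "score": 0.5},
]

print(pooled_score(before), pooled_score(after))  # 0.75 0.75
print(scores_by_task(after)["billing_support_accuracy"])  # 0.5
```

The pooled number does not move, while the per-task view shows billing accuracy cut in half.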

Include hard cases, not only happy paths

If x_i only contains clean, common, easy examples, your eval will overestimate production quality.

Add cases that represent real failure modes:

Ambiguous user requests
Missing context
Conflicting retrieved documents
Stale policy documents
Users asking for actions they are not allowed to take
Multi-turn corrections
Long inputs near context limits
Tool failures and empty tool responses
Edge-case formatting requirements

For a RAG support bot, include cases where the right answer is “I do not have enough information.” For an agent, include tool timeouts and invalid tool returns. For extraction, include partial records, unusual formatting, and null values.
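If each case carries a case-type tag, a small coverage check can flag sets that contain only happy paths. This is a sketch: it assumes a `metadata.case_type` field on every case, and the tag names in `REQUIRED_CASE_TYPES` are hypothetical.

```python
from collections import Counter

# Hypothetical tags a team might require in every eval set.
REQUIRED_CASE_TYPES = [
    "happy_path",
    "missing_context",
    "conflicting_documents",
    "tool_timeout",
]


def coverage_gaps(cases, required=REQUIRED_CASE_TYPES):
    """Return required case types that have no example in the set."""
    counts = Counter(case["metadata"]["case_type"] for case in cases)
    return [case_type for case_type in required if counts[case_type] == 0]


cases = [
    {"id": "c1", "metadata": {"case_type": "happy_path"}},
    {"id": "c2", "metadata": {"case_type": "happy_path"}},
    {"id": "c3", "metadata": {"case_type": "missing_context"}},
]

print(coverage_gaps(cases))  # ['conflicting_documents', 'tool_timeout']
```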

A useful eval set should make your system fail before your users do.

Version prompts and datasets together

x_i is tied to the prompt and workflow that consume it. If you change the prompt template but keep the same dataset without tracking compatibility, you can create silent eval drift.

For example, prompt version 4 may expect this input:

{
  "question": "...",
  "context": "..."
}

Prompt version 7 may expect this input:

{
  "user_question": "...",
  "retrieved_chunks": [...],
  "customer_plan": "..."
}

If your eval runner maps fields loosely, both may run without errors. The results may still be invalid.
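A strict compatibility check turns this silent mismatch into a loud one. The sketch below assumes a hand-maintained mapping from prompt version to expected input keys. The name `PROMPT_INPUT_KEYS` and the version labels are illustrative.

```python
# Hypothetical registry of the input keys each prompt version expects.
PROMPT_INPUT_KEYS = {
    "v4": {"question", "context"},
    "v7": {"user_question", "retrieved_chunks", "customer_plan"},
}


def check_compatible(prompt_version, x_i):
    """Raise ValueError unless x_i has exactly the keys the prompt expects."""
    expected = PROMPT_INPUT_KEYS[prompt_version]
    actual = set(x_i)
    if actual != expected:
        raise ValueError(
            f"x_i does not match prompt {prompt_version}: "
            f"missing {sorted(expected - actual)}, "
            f"unexpected {sorted(actual - expected)}"
        )


v4_case = {"question": "Why was I charged twice?", "context": "Billing policy text"}

check_compatible("v4", v4_case)  # passes silently

try:
    check_compatible("v7", v4_case)
except ValueError as error:
    print(error)
```

Requiring an exact key match is deliberately strict. A runner that fills in missing fields with defaults would let the version 4 case run against version 7 without complaint.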

Track these together:

Prompt version
Dataset version
x_i schema version
Model and inference settings
Retriever or tool configuration
Scoring code or judge prompt version

This matters most when multiple engineers edit prompts, datasets, and eval logic at the same time. Treat the eval dataset as part of your application contract.
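One lightweight way to track these versions together is to hash them into a single fingerprint stored with every run. This is a sketch, not a standard: the key names, the `x_i_v2` schema label, and the judge prompt label are assumptions.

```python
import hashlib
import json


def run_fingerprint(versions):
    """Hash a dict of version identifiers into a short, stable fingerprint."""
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(versions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


versions = {
    "prompt_version": "billing_agent_v12",
    "dataset_version": "billing_support_eval_v3",
    "schema_version": "x_i_v2",  # hypothetical label
    "model": "gpt-4.1",
    "temperature": 0.2,
    "corpus_version": "billing_docs_2026_05_18",
    "judge_prompt_version": "faithfulness_v1",  # hypothetical label
}

baseline = run_fingerprint(versions)
changed = run_fingerprint({**versions, "prompt_version": "billing_agent_v13"})
print(baseline != changed)  # True
```

Two runs with the same fingerprint used the same recorded versions. A changed fingerprint says that at least one tracked version differs, so score differences should not be attributed to a single change without checking which one.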

Preserve retrieval and tool context for reproducibility

Many LLM failures cannot be explained from the final prompt alone. If your app uses retrieval or tools, preserve the intermediate context.

For RAG, store:

Query used for retrieval
Retriever version and settings
Corpus or index version
Returned document IDs
Chunk text sent to the model
Chunk rank and scores, when available

For tool-using agents, store:

Tool schemas available at runtime
Tool calls requested by the model
Tool arguments
Tool responses
Errors, retries, and timeouts
Final state after tool execution

This data makes failures debuggable. It also helps you compare runs over time with LLM observability tooling, especially when model behavior changes or retrieval indexes get updated.
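As a sketch of what preserving retrieval context can look like, the record below stores the RAG fields listed above and compares two runs of the same case. The class names, the retriever version strings, and the second invoice ID are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RetrievedChunk:
    doc_id: str
    text: str
    rank: int
    score: Optional[float] = None  # not every retriever exposes scores


@dataclass
class RetrievalTrace:
    query: str
    retriever_version: str
    corpus_version: str
    chunks: List[RetrievedChunk] = field(default_factory=list)


def changed_documents(old, new):
    """Report which document IDs were dropped or added between two runs."""
    old_ids = {chunk.doc_id for chunk in old.chunks}
    new_ids = {chunk.doc_id for chunk in new.chunks}
    return {"dropped": sorted(old_ids - new_ids), "added": sorted(new_ids - old_ids)}


may_run = RetrievalTrace(
    query="Why was I charged twice?",
    retriever_version="retriever_a",  # hypothetical
    corpus_version="billing_docs_2026_05_18",
    chunks=[
        RetrievedChunk("billing_policy_v4", "Duplicate card authorizations...", rank=1),
        RetrievedChunk("invoice_9831", "Invoice paid on 2026-05-19...", rank=2),
    ],
)
later_run = RetrievalTrace(
    query="Why was I charged twice?",
    retriever_version="retriever_b",  # hypothetical
    corpus_version="billing_docs_2026_05_18",
    chunks=[
        RetrievedChunk("billing_policy_v4", "Duplicate card authorizations...", rank=1),
        RetrievedChunk("invoice_1200", "Unrelated invoice text", rank=2),  # made-up ID
    ],
)

print(changed_documents(may_run, later_run))
# {'dropped': ['invoice_9831'], 'added': ['invoice_1200']}
```

With both traces stored, a wrong answer in the later run can be traced to the missing invoice rather than blamed on the model.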

Choose the right scoring method for the x_i shape

The way you define x_i affects how you score the output.

Use exact match or deterministic checks when the task has a narrow correct answer:

JSON schema validation
Required fields present
SQL query uses only approved tables
Classification label equals expected label

Use rubric-based scoring when the answer has valid variation:

Support answer correctness
Instruction following
Conciseness
Policy compliance
Source faithfulness

For open-ended answers, an LLM-as-a-judge setup can work well if the rubric is specific and the judge receives the right evidence. For example, a faithfulness judge should see the final answer and the retrieved context. It should not see the expected answer unless your scoring design calls for it.
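Deterministic checks are usually a few lines of code. The sketch below covers two of the cases listed above, valid JSON with required fields and label equality. The function names and the sample field names are assumptions.

```python
import json


def check_json_output(raw, required):
    """Check that raw parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "missing": sorted(required)}
    if not isinstance(parsed, dict):
        # Valid JSON, but not an object, so no required field can be present.
        return {"valid_json": True, "missing": sorted(required)}
    return {"valid_json": True, "missing": sorted(set(required) - set(parsed))}


def label_matches(predicted, expected):
    """Exact-match check that ignores case and surrounding whitespace."""
    return predicted.strip().lower() == expected.strip().lower()


required = {"category", "confidence"}
print(check_json_output('{"category": "billing"}', required))
# {'valid_json': True, 'missing': ['confidence']}
print(check_json_output("not json", required))
# {'valid_json': False, 'missing': ['category', 'confidence']}
print(label_matches(" Billing ", "billing"))  # True
```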

A practical x_i checklist

Before you add an eval case, ask these questions:

What system boundary does this eval test?
Does x_i include every input the app uses to produce the output?
Is hidden context captured, such as user plan, permissions, locale, or conversation state?
Are expected answers stored outside the model input?
Can the run be reproduced 30 days later?
Are prompt, dataset, model, retrieval, and judge versions recorded?
Does this case belong in this eval set, or should it be split into another task?
Does the set include realistic failure cases?
Can an engineer inspect the trace and understand why the score changed?

If the answer to several of these is no, the eval may still run, but it will give a weak release signal.

Recommended schema for an eval case

You can adapt this structure for many LLM applications:

{
  "id": "billing_duplicate_charge_001",
  "task": "billing_support_accuracy",
  "input": {
    "user_query": "Why was I charged twice?",
    "conversation_history": [],
    "user_context": {
      "plan": "Pro",
      "timezone": "America/New_York"
    },
    "retrieval": {
      "corpus_version": "billing_docs_2026_05_18",
      "filters": {
        "tenant": "public_docs"
      }
    }
  },
  "reference": {
    "expected_facts": [
      "One transaction is a pending authorization",
      "Pending authorizations usually clear within 3 business days",
      "The user should contact support if both charges post"
    ],
    "must_not_say": [
      "A refund has already been issued"
    ]
  },
  "run_config": {
    "prompt_version": "billing_agent_v12",
    "model": "gpt-4.1",
    "temperature": 0.2,
    "tools_enabled": false
  },
  "metadata": {
    "difficulty": "medium",
    "case_type": "duplicate_charge",
    "created_by": "evals_team",
    "dataset_version": "billing_support_eval_v3"
  }
}

The key pattern is separation. The model input lives under input. The expected behavior lives under reference. Runtime settings live under run_config. Search, filtering, slicing, and ownership fields live under metadata.
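The separation pays off in the runner. The sketch below assumes a case shaped like the record above: only `input` and `run_config` are handed to the system, and `reference` is used after the fact for a simple `must_not_say` check. The helper names are hypothetical, and substring matching is a deliberately crude stand-in for a real scorer.

```python
def split_case(case):
    """Separate what the system may see from what only the scorer may see."""
    model_payload = {"input": case["input"], "run_config": case["run_config"]}
    return model_payload, case["reference"]


def violated_phrases(output, reference):
    """Return must_not_say phrases that appear in the output, ignoring case."""
    lowered = output.lower()
    return [p for p in reference.get("must_not_say", []) if p.lower() in lowered]


case = {
    "id": "billing_duplicate_charge_001",
    "input": {"user_query": "Why was I charged twice?"},
    "reference": {"must_not_say": ["A refund has already been issued"]},
    "run_config": {"prompt_version": "billing_agent_v12", "model": "gpt-4.1"},
}

payload, reference = split_case(case)
print("reference" in payload)  # False

bad_output = "Good news: a refund has already been issued to your card."
print(violated_phrases(bad_output, reference))
# ['A refund has already been issued']
```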

How x_i changes for prompt chains

Prompt chains and compiled workflows need extra care. A single user request may trigger multiple prompts, intermediate transformations, and tool calls. If you evaluate only the first input and final output, you may miss the failing step.

For chained systems, define x_i as the starting state for the chain, then preserve each intermediate step in the trace. If your workflow uses planning, retrieval, drafting, validation, and final response generation, store the inputs and outputs for each stage.

This becomes especially useful when teams work with structured prompt programs or an LLM compiler. Small changes in one step can change downstream behavior, so the eval case should make the execution path inspectable.
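A minimal sketch of a traced chain follows. It assumes each stage is a plain function from state to state. The two lambda stages are toy stand-ins for real LLM or tool calls, and `run_chain` is a hypothetical helper, not part of any library.

```python
def run_chain(x_i, stages):
    """Run stages in order, recording each stage's input and output."""
    trace = []
    state = x_i
    for name, step in stages:
        output = step(state)
        trace.append({"stage": name, "input": state, "output": output})
        state = output
    return state, trace


# Toy stages. Each returns a new dict so earlier trace entries stay unchanged.
stages = [
    ("plan", lambda s: {**s, "plan": "look up latest invoice"}),
    ("draft", lambda s: {**s, "draft": f"Draft for: {s['user_query']}"}),
]

x_i = {"user_query": "Why was I charged twice?"}
final_state, trace = run_chain(x_i, stages)

print([entry["stage"] for entry in trace])  # ['plan', 'draft']
print(final_state["draft"])  # Draft for: Why was I charged twice?
```

Because every stage's input and output is stored, a bad final answer can be traced to the first stage whose output looks wrong. Stages that mutate state in place would corrupt earlier trace entries, so returning a new object per stage is part of the assumption here.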

Good x_i definitions make evals useful

LLM evals fail when the test input does not match the production input. The model gets blamed for retrieval bugs. Prompt changes get credited for dataset leakage. Average scores hide task-specific regressions. Teams rerun old evals and cannot reproduce the context that created the original result.

Define x_i as the full input bundle for the system boundary you are testing. Keep references separate. Preserve retrieval and tool context. Version prompts and datasets together. Split evals by task. Add hard cases that match real production risk.

When you do that, your evals become useful release checks instead of loose demos.

PromptLayer helps AI teams manage prompts, datasets, evals, traces, and model runs in one workflow. If you are building LLM features and want cleaner evals with versioned inputs and reproducible runs, create a PromptLayer account.
