
How to Define x_i for LLM Evals

Jun 03, 2026

In LLM evaluation, xi is the input for a single test case. It is the thing your system receives, processes, and turns into an output that you score.

That sounds simple until you evaluate a real LLM application. In production, the model rarely receives only a raw user message. It may receive a system prompt, retrieved documents, user profile data, conversation history, tool outputs, feature flags, model settings, and hidden routing decisions.

If your eval defines xi as only “the user query,” your test may measure a different system than the one users experience.

A strong definition of xi should capture the full input state needed to reproduce the model behavior you want to evaluate. For AI teams, that usually means treating xi as an engineering artifact, not a math symbol.

The practical definition

For most LLM evals, define xi like this:

xi is the complete input bundle needed to run one evaluation case through the same prompt, workflow, retrieval path, tools, and model configuration used by the application.

That bundle may include:

  • The visible user input
  • System and developer messages
  • Prompt template variables
  • Conversation history
  • Retrieved documents or chunks
  • Tool schemas and tool responses
  • User, account, or tenant context
  • Locale, timezone, permissions, or plan tier
  • Model name and inference settings
  • Workflow or agent state

You do not need every field for every eval. You do need every field that can change the output or affect the score.
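
One way to make the bundle concrete is a typed record plus a check for fields a given eval requires. The sketch below is a minimal illustration, not a standard schema: the class name `EvalInput`, its field names, and the helper `missing_fields` are all assumptions made for this example.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class EvalInput:
    """One xi: the input bundle for a single eval case (hypothetical schema)."""
    user_query: str
    system_messages: Tuple[str, ...] = ()
    template_variables: Dict[str, Any] = field(default_factory=dict)
    conversation_history: Tuple[Dict[str, str], ...] = ()
    retrieved_chunks: Tuple[Dict[str, Any], ...] = ()
    user_context: Dict[str, Any] = field(default_factory=dict)
    model: str = ""
    temperature: Optional[float] = None


def missing_fields(xi: EvalInput, required: List[str]) -> List[str]:
    """Return the required field names that are empty or unset on this case."""
    values = asdict(xi)
    empty = ("", None, (), [], {})
    return [name for name in required if values.get(name) in empty]


xi = EvalInput(user_query="Why was I charged twice?", model="gpt-4.1", temperature=0.2)
# The billing eval in this sketch requires account context and retrieved chunks.
print(missing_fields(xi, ["user_query", "user_context", "retrieved_chunks", "model"]))
```

Each eval set can pass its own `required` list, which matches the rule above: not every field for every eval, but every field that can change the output or the score.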

A simple example

Suppose you are evaluating a support chatbot for a billing product. A weak xi might look like this:

{
  "user_query": "Why was I charged twice?"
}

That input is incomplete if the real application depends on account status, invoices, plan type, refund policy, and conversation history.

A better xi might look like this:

{
  "user_query": "Why was I charged twice?",
  "conversation_history": [
    {
      "role": "assistant",
      "content": "I can help with billing questions. Can you confirm the charge date?"
    },
    {
      "role": "user",
      "content": "It happened yesterday."
    }
  ],
  "user_context": {
    "plan": "Pro",
    "billing_timezone": "America/New_York",
    "account_status": "active"
  },
  "retrieved_context": [
    {
      "doc_id": "billing_policy_v4",
      "chunk_id": "refunds_duplicate_charges_02",
      "text": "Duplicate card authorizations usually clear within 3 business days..."
    },
    {
      "doc_id": "invoice_9831",
      "text": "Invoice paid on 2026-05-19. Card authorization pending on 2026-05-20."
    }
  ],
  "prompt_version": "billing_agent_v12",
  "model": "gpt-4.1",
  "temperature": 0.2
}

This version gives your eval a real chance to reproduce the system behavior. It also lets you debug failures. If the answer is wrong, you can inspect whether the model misunderstood the policy, retrieval returned the wrong invoice, or the prompt failed to instruct the model to distinguish charges from authorizations.

Start with the decision your eval needs to support

Before you define xi, decide what question the eval should answer. Common examples:

  • Did the new prompt improve support answer accuracy?
  • Did the model upgrade increase tool-call success?
  • Does the RAG pipeline answer questions using the right source documents?
  • Does the agent complete the task without unsafe actions?
  • Did a retrieval change cause regressions for enterprise users?

Each question implies a different input shape.

If you are testing prompt wording only, xi may freeze retrieved context and tool responses. If you are testing retrieval quality, xi may include the user query and corpus state, while retrieved chunks become part of the generated run data. If you are testing an agent, xi may need the starting state, available tools, permissions, and target outcome.

This is where many evals go wrong. Teams build one generic spreadsheet with a “prompt” column and an “expected answer” column, then use it for every system change. That may work for a simple summarization prompt. It breaks quickly for RAG, agents, routing, and multi-step workflows.

Define xi at the right system boundary

The most important choice is the boundary of the system under test.

Prompt-level eval

Use this when you want to test a prompt template with fixed variables.

xi should include:

  • Template variables
  • System and developer instructions
  • Any fixed context passed into the prompt
  • Model settings

This is useful when you are comparing prompt versions and want to keep everything else stable.
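
As a rough sketch of a prompt-level comparison, the snippet below renders two prompt versions from the same frozen template variables. It uses Python's standard `string.Template`. The prompt texts, the version labels `v1` and `v2`, and the helper `render_all` are invented for illustration.

```python
from string import Template

# Two hypothetical prompt versions that consume the same template variables.
PROMPTS = {
    "v1": Template("Answer using the policy.\nPolicy: $policy\nQuestion: $question"),
    "v2": Template(
        "You are a billing assistant. Quote the policy when relevant.\n"
        "Policy: $policy\nCustomer question: $question"
    ),
}


def render_all(variables):
    """Render every prompt version from one frozen set of variables.

    Template.substitute raises KeyError when a variable is missing, so an
    incomplete xi fails loudly instead of rendering a partial prompt.
    """
    return {version: tpl.substitute(variables) for version, tpl in PROMPTS.items()}


variables = {
    "policy": "Duplicate card authorizations usually clear within 3 business days.",
    "question": "Why was I charged twice?",
}
rendered = render_all(variables)
```

Because both versions are rendered from the same `variables`, any difference in the scored output comes from the prompt wording or from model sampling, not from the input.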

RAG eval

Use this when retrieval quality and answer quality both matter.

xi may include:

  • User question
  • Corpus version
  • Retriever configuration
  • Filters, tenant IDs, or permissions
  • Expected source documents, if you are scoring retrieval

If you are scoring the final answer, preserve the retrieved chunks that were actually sent to the model. Without them, you cannot reproduce the result or tell whether the model failed or retrieval failed.

Agent eval

Use this when the model can plan, call tools, read results, and continue.

xi should include:

  • User request
  • Initial environment state
  • Available tools and tool schemas
  • Permissions and policy constraints
  • Mocked or recorded tool responses
  • Success criteria

For agent workflows, xi is often closer to a scenario than a single message. If an agent books a meeting, the input needs calendar state, attendee availability, timezone, and constraints such as “do not schedule outside business hours.”
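
To show what "recorded tool responses" can look like in practice, here is a small sketch of a replay object that serves responses stored in the eval case instead of calling live tools. The class `RecordedTools`, the tool name `get_availability`, and the scenario values are hypothetical and stand in for whatever your agent framework uses.

```python
import json
from typing import Any, Dict, List


class RecordedTools:
    """Replays tool responses stored in the scenario (illustrative helper)."""

    def __init__(self, recorded: Dict[str, Any]):
        self._recorded = recorded
        self.calls: List[Dict[str, Any]] = []  # every call made, kept for the trace

    @staticmethod
    def key(name: str, args: Dict[str, Any]) -> str:
        # Sort keys so argument order does not change the lookup key.
        return name + ":" + json.dumps(args, sort_keys=True)

    def call(self, name: str, args: Dict[str, Any]) -> Any:
        self.calls.append({"tool": name, "args": args})
        lookup = self.key(name, args)
        if lookup not in self._recorded:
            # An unrecorded call means the agent left the scripted scenario.
            raise KeyError(f"No recorded response for {lookup}")
        return self._recorded[lookup]


scenario = {
    "user_request": "Book 30 minutes with Dana tomorrow.",
    "constraints": ["do not schedule outside business hours"],
    "recorded_tool_responses": {
        RecordedTools.key("get_availability", {"attendee": "dana", "date": "2026-06-04"}): {
            "free": ["10:00", "14:30"]
        },
    },
}

tools = RecordedTools(scenario["recorded_tool_responses"])
slots = tools.call("get_availability", {"date": "2026-06-04", "attendee": "dana"})
```

Raising on an unrecorded call is a design choice in this sketch. It turns "the agent called a tool the scenario did not anticipate" into a visible failure rather than a silent live call.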

Separate inputs from expected outputs

Do not leak expected answers into xi.

This mistake is common in classification, extraction, and support-answer evals. A test case may include a field named “gold_answer,” “correct_category,” or “ideal_response.” If that field gets passed into the prompt by accident, the eval result becomes meaningless.

Keep your eval record structured into separate fields:

  • Input: the data passed into the system
  • Reference: the expected answer, rubric, labels, or acceptable criteria
  • Output: the model or workflow response
  • Score: metric results, judge decision, or pass/fail status
  • Trace: intermediate calls, retrieval results, and tool use

A clean separation helps prevent accidental leakage. It also makes your evals easier to review in code and easier to run through an LLM evaluation workflow.
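
A cheap guard is to scan the model input for keys that look like reference fields before a case runs. The sketch below assumes the input is a nested dict or list. The key names come from the examples above, and the function name `find_leaked_keys` is made up for this illustration.

```python
from typing import Any, List

# Field names that should only ever appear under the reference, never the input.
REFERENCE_KEYS = {"gold_answer", "correct_category", "ideal_response"}


def find_leaked_keys(value: Any, path: str = "input") -> List[str]:
    """Recursively collect paths in the model input whose key is a reference field."""
    leaks: List[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            if key in REFERENCE_KEYS:
                leaks.append(child_path)
            leaks.extend(find_leaked_keys(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            leaks.extend(find_leaked_keys(child, f"{path}[{index}]"))
    return leaks


case_input = {
    "user_query": "Why was I charged twice?",
    "template_variables": {"ideal_response": "One charge is a pending authorization."},
}
print(find_leaked_keys(case_input))
```

A key-name check only catches leaks that keep their original field name. It will not notice a reference answer pasted into a free-text field, so treat it as a first line of defense.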

Do not mix unrelated tasks in one eval set

A good eval set has a clear job. Avoid mixing unrelated tasks such as summarization, billing support, SQL generation, and safety refusal into one score.

Mixed eval sets create noisy results. A prompt change may improve SQL generation and hurt support tone, yet the average score looks flat. Your team may ship a regression because the eval hides the failure.

Split eval sets by task and decision:

  • billing_support_accuracy
  • refund_policy_refusal
  • sql_generation_read_only
  • ticket_summarization_concise
  • agent_tool_call_success

Each set should have its own xi schema, scoring method, and release threshold. For example, a billing support eval may require factual accuracy above 95%, while a summarization eval may allow more variation and use a rubric-based judge.
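
The flat-average problem is easy to reproduce with a few lines of arithmetic. In this sketch the result rows and the SQL threshold are invented, the 0.95 billing threshold mirrors the example above, and the function names are assumptions.

```python
from collections import defaultdict


def pass_rate_by_task(results):
    """Compute the pass rate separately for each task."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for row in results:
        total[row["task"]] += 1
        if row["passed"]:
            passed[row["task"]] += 1
    return {task: passed[task] / total[task] for task in total}


def failing_tasks(rates, thresholds):
    """Return tasks whose pass rate is below that task's release threshold."""
    return sorted(task for task, rate in rates.items() if rate < thresholds[task])


# Hypothetical results after a prompt change: SQL improved, billing regressed.
after = (
    [{"task": "sql_generation_read_only", "passed": True}] * 4
    + [{"task": "billing_support_accuracy", "passed": True}] * 2
    + [{"task": "billing_support_accuracy", "passed": False}] * 2
)
thresholds = {"billing_support_accuracy": 0.95, "sql_generation_read_only": 0.90}

overall = sum(row["passed"] for row in after) / len(after)  # 6 of 8 pass
rates = pass_rate_by_task(after)
blocked = failing_tasks(rates, thresholds)
```

Here the overall pass rate is 0.75, which could match the previous run exactly, while the per-task view shows billing support at 0.5 and blocks the release.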

Include hard cases, not only happy paths

If xi only contains clean, common, easy examples, your eval will overestimate production quality.

Add cases that represent real failure modes:

  • Ambiguous user requests
  • Missing context
  • Conflicting retrieved documents
  • Stale policy documents
  • Users asking for actions they are not allowed to take
  • Multi-turn corrections
  • Long inputs near context limits
  • Tool failures and empty tool responses
  • Edge-case formatting requirements

For a RAG support bot, include cases where the right answer is “I do not have enough information.” For an agent, include tool timeouts and invalid tool returns. For extraction, include partial records, unusual formatting, and null values.

A useful eval set should make your system fail before your users do.
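
If each case carries a case-type tag, you can check coverage mechanically. This is a minimal sketch that assumes a `metadata.case_type` field on every case. The tag names and the `coverage_gaps` helper are hypothetical.

```python
from collections import Counter

# Failure modes this (hypothetical) eval set is expected to cover.
REQUIRED_CASE_TYPES = {
    "ambiguous_request",
    "missing_context",
    "conflicting_documents",
    "tool_failure",
}


def coverage_gaps(cases, required=REQUIRED_CASE_TYPES, min_per_type=1):
    """Return required case types that have fewer than min_per_type cases."""
    counts = Counter(case["metadata"]["case_type"] for case in cases)
    return sorted(t for t in required if counts[t] < min_per_type)


cases = [
    {"id": "c1", "metadata": {"case_type": "happy_path"}},
    {"id": "c2", "metadata": {"case_type": "ambiguous_request"}},
    {"id": "c3", "metadata": {"case_type": "tool_failure"}},
]
print(coverage_gaps(cases))
```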

Version prompts and datasets together

xi is tied to the prompt and workflow that consume it. If you change the prompt template but keep the same dataset without tracking compatibility, you can create silent eval drift.

For example, prompt version 4 may expect this input:

{
  "question": "...",
  "context": "..."
}

Prompt version 7 may expect this input:

{
  "user_question": "...",
  "retrieved_chunks": [...],
  "customer_plan": "..."
}

If your eval runner maps fields loosely, both may run without errors. The results may still be invalid.

Track these together:

  • Prompt version
  • Dataset version
  • xi schema version
  • Model and inference settings
  • Retriever or tool configuration
  • Scoring code or judge prompt version

This matters most when multiple engineers edit prompts, datasets, and eval logic at the same time. Treat the eval dataset as part of your application contract.
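
One way to stop loose field mapping from hiding a mismatch is to declare the fields each prompt version expects and fail before the run. The sketch below reuses the two input shapes shown above. The registry `PROMPT_INPUT_SCHEMAS` and the function `check_compatibility` are assumptions made for illustration.

```python
# Expected top-level input fields per prompt version (from the shapes above).
PROMPT_INPUT_SCHEMAS = {
    "v4": {"question", "context"},
    "v7": {"user_question", "retrieved_chunks", "customer_plan"},
}


def check_compatibility(prompt_version, xi):
    """Raise ValueError unless xi has exactly the fields the prompt version expects."""
    expected = PROMPT_INPUT_SCHEMAS[prompt_version]
    actual = set(xi)
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    if missing or unexpected:
        raise ValueError(
            f"xi does not match prompt {prompt_version}: "
            f"missing={missing}, unexpected={unexpected}"
        )


old_case = {"question": "Why was I charged twice?", "context": "Refund policy text"}
check_compatibility("v4", old_case)  # passes silently

try:
    check_compatibility("v7", old_case)
except ValueError as err:
    print(err)
```

An exact-match check is deliberately strict. If your prompts have optional variables, you would need a separate set of optional fields, which this sketch does not model.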

Preserve retrieval and tool context for reproducibility

Many LLM failures cannot be explained from the final prompt alone. If your app uses retrieval or tools, preserve the intermediate context.

For RAG, store:

  • Query used for retrieval
  • Retriever version and settings
  • Corpus or index version
  • Returned document IDs
  • Chunk text sent to the model
  • Chunk rank and scores, when available

For tool-using agents, store:

  • Tool schemas available at runtime
  • Tool calls requested by the model
  • Tool arguments
  • Tool responses
  • Errors, retries, and timeouts
  • Final state after tool execution

This data makes failures debuggable. It also helps you compare runs over time through LLM observability, especially when model behavior changes or retrieval indexes get updated.
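
With retrieval context stored per run, comparing two runs of the same case becomes a small diff. The sketch below assumes each trace stores a ranked `chunks` list with a `chunk_id` per entry. The chunk IDs and the `retrieval_diff` helper are invented for this example.

```python
def retrieval_diff(old_trace, new_trace):
    """Compare the ranked chunk IDs sent to the model in two runs of one case."""
    old_ids = [chunk["chunk_id"] for chunk in old_trace["chunks"]]
    new_ids = [chunk["chunk_id"] for chunk in new_trace["chunks"]]
    return {
        "dropped": [cid for cid in old_ids if cid not in new_ids],
        "added": [cid for cid in new_ids if cid not in old_ids],
        # Same set of chunks, different rank order.
        "reordered": set(old_ids) == set(new_ids) and old_ids != new_ids,
    }


old_trace = {"chunks": [{"chunk_id": "refunds_02"}, {"chunk_id": "invoices_07"}]}
new_trace = {"chunks": [{"chunk_id": "invoices_07"}, {"chunk_id": "pricing_01"}]}
print(retrieval_diff(old_trace, new_trace))
```

If a score drops and the diff shows a dropped policy chunk, the likely cause is the index or retriever rather than the prompt or model.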

Choose the right scoring method for the xi shape

The way you define xi affects how you score the output.

Use exact match or deterministic checks when the task has a narrow correct answer:

  • JSON schema validation
  • Required fields present
  • SQL query uses only approved tables
  • Classification label equals expected label
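
Deterministic checks like these usually fit in a few lines. The sketch below covers two of them using only the standard library. The function names and the result shape are assumptions, and the label comparison ignores case and surrounding whitespace, which may or may not suit your task.

```python
import json


def check_required_fields(raw_output, required):
    """Parse the model output as JSON and verify the required fields are present."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "invalid_json"}
    if not isinstance(parsed, dict):
        return {"passed": False, "reason": "not_an_object"}
    missing = [name for name in required if name not in parsed]
    if missing:
        return {"passed": False, "reason": "missing:" + ",".join(missing)}
    return {"passed": True, "reason": "ok"}


def check_label(output_label, expected_label):
    """Compare classification labels, ignoring case and surrounding whitespace."""
    return output_label.strip().lower() == expected_label.strip().lower()


result = check_required_fields('{"category": "billing"}', ["category", "priority"])
```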

Use rubric-based scoring when the answer has valid variation:

  • Support answer correctness
  • Instruction following
  • Conciseness
  • Policy compliance
  • Source faithfulness

For open-ended answers, an LLM judge can work well if the rubric is specific and the judge receives the right evidence. For example, a faithfulness judge should see the final answer and the retrieved context, but not the expected answer, unless your scoring design calls for it.
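
To make "the right evidence" concrete, here is a sketch of a function that assembles a faithfulness judge prompt from the answer and the stored chunks only. The prompt wording, the PASS or FAIL convention, and the function name are illustrative assumptions. Note that the reference answer is not a parameter at all.

```python
def build_faithfulness_judge_input(output_text, trace, rubric):
    """Assemble judge evidence from the answer and the chunks the model saw.

    The expected answer is intentionally not accepted as an argument, so it
    cannot reach the judge by accident.
    """
    context = "\n\n".join(
        f"[{chunk['doc_id']}] {chunk['text']}" for chunk in trace["chunks"]
    )
    return (
        f"Rubric:\n{rubric}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Answer to grade:\n{output_text}\n\n"
        "Does every claim in the answer follow from the retrieved context? "
        "Reply PASS or FAIL."
    )


trace = {
    "chunks": [
        {
            "doc_id": "billing_policy_v4",
            "text": "Duplicate card authorizations usually clear within 3 business days...",
        }
    ]
}
judge_prompt = build_faithfulness_judge_input(
    "One of the two charges is a pending authorization.",
    trace,
    "Fail the answer if it states facts that are absent from the context.",
)
```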

A practical xi checklist

Before you add an eval case, ask these questions:

  • What system boundary does this eval test?
  • Does xi include every input the app uses to produce the output?
  • Is hidden context captured, such as user plan, permissions, locale, or conversation state?
  • Are expected answers stored outside the model input?
  • Can the run be reproduced 30 days later?
  • Are prompt, dataset, model, retrieval, and judge versions recorded?
  • Does this case belong in this eval set, or should it be split into another task?
  • Does the set include realistic failure cases?
  • Can an engineer inspect the trace and understand why the score changed?

If the answer to several of these is no, the eval may still run, but it will give only a weak release signal.

You can adapt the following eval record structure for many LLM applications:

{
  "id": "billing_duplicate_charge_001",
  "task": "billing_support_accuracy",
  "input": {
    "user_query": "Why was I charged twice?",
    "conversation_history": [],
    "user_context": {
      "plan": "Pro",
      "timezone": "America/New_York"
    },
    "retrieval": {
      "corpus_version": "billing_docs_2026_05_18",
      "filters": {
        "tenant": "public_docs"
      }
    }
  },
  "reference": {
    "expected_facts": [
      "One transaction is a pending authorization",
      "Pending authorizations usually clear within 3 business days",
      "The user should contact support if both charges post"
    ],
    "must_not_say": [
      "A refund has already been issued"
    ]
  },
  "run_config": {
    "prompt_version": "billing_agent_v12",
    "model": "gpt-4.1",
    "temperature": 0.2,
    "tools_enabled": false
  },
  "metadata": {
    "difficulty": "medium",
    "case_type": "duplicate_charge",
    "created_by": "evals_team",
    "dataset_version": "billing_support_eval_v3"
  }
}

The key pattern is separation. The model input lives under input. The expected behavior lives under reference. Runtime settings live under run_config. Search, filtering, slicing, and ownership fields live under metadata.
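
A record shaped this way can be checked and fingerprinted with the standard library. The sketch below hashes only the `input` and `run_config` sections, so two runs share a fingerprint exactly when the model saw the same input under the same settings. The function names and the choice of SHA-256 over sorted JSON are assumptions.

```python
import hashlib
import json

REQUIRED_SECTIONS = ("input", "reference", "run_config", "metadata")


def missing_sections(record):
    """Return the top-level sections the record is missing."""
    return [name for name in REQUIRED_SECTIONS if name not in record]


def run_fingerprint(record):
    """Hash the parts of the record that determine what the model receives."""
    payload = json.dumps(
        {"input": record["input"], "run_config": record["run_config"]},
        sort_keys=True,  # stable key order, so equal content gives an equal hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


record = {
    "id": "billing_duplicate_charge_001",
    "input": {"user_query": "Why was I charged twice?", "conversation_history": []},
    "reference": {"must_not_say": ["A refund has already been issued"]},
    "run_config": {"prompt_version": "billing_agent_v12", "temperature": 0.2},
    "metadata": {"dataset_version": "billing_support_eval_v3"},
}
fingerprint = run_fingerprint(record)
```

Editing the reference or metadata leaves the fingerprint unchanged, while changing the temperature or the conversation history produces a new one.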

How xi changes for prompt chains

Prompt chains and compiled workflows need extra care. A single user request may trigger multiple prompts, intermediate transformations, and tool calls. If you evaluate only the first input and final output, you may miss the failing step.

For chained systems, define xi as the starting state for the chain, then preserve each intermediate step in the trace. If your workflow uses planning, retrieval, drafting, validation, and final response generation, store the inputs and outputs for each stage.
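
A minimal sketch of that idea: run the stages in order and record each stage's input and output. The toy stages `plan` and `draft` stand in for real prompt calls, and the `run_chain` helper is hypothetical.

```python
import copy


def run_chain(xi, stages):
    """Run stages in order, recording each stage's input and output in a trace."""
    trace = []
    state = xi
    for name, stage_fn in stages:
        # Copy before the call so a stage that mutates its input cannot
        # rewrite what the trace says it received.
        stage_input = copy.deepcopy(state)
        output = stage_fn(state)
        trace.append(
            {"stage": name, "input": stage_input, "output": copy.deepcopy(output)}
        )
        state = output
    return state, trace


# Toy stages standing in for real prompt or tool calls.
def plan(state):
    return {**state, "plan": "look up refund policy"}


def draft(state):
    return {**state, "draft": f"Per policy: {state['plan']}"}


final_state, trace = run_chain(
    {"user_query": "Why was I charged twice?"},
    [("plan", plan), ("draft", draft)],
)
```

With the trace in hand, a failed case can be narrowed to the first stage whose output looks wrong, instead of comparing only the starting input with the final response.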

This becomes especially useful when teams work with structured prompt programs or an LLM compiler. Small changes in one step can change downstream behavior, so the eval case should make the execution path inspectable.

Good xi definitions make evals useful

LLM evals fail when the test input does not match the production input. The model gets blamed for retrieval bugs. Prompt changes get credited for dataset leakage. Average scores hide task-specific regressions. Teams rerun old evals and cannot reproduce the context that created the original result.

Define xi as the full input bundle for the system boundary you are testing. Keep references separate. Preserve retrieval and tool context. Version prompts and datasets together. Split evals by task. Add hard cases that match real production risk.

When you do that, your evals become useful release checks instead of loose demos.


PromptLayer helps AI teams manage prompts, datasets, evals, traces, and model runs in one workflow. If you are building LLM features and want cleaner evals with versioned inputs and reproducible runs, create a PromptLayer account.
