How to Run Your First LLM Eval
An LLM eval is a repeatable test that checks whether a prompt, model, agent, or workflow behaves the way you expect. It gives you a way to compare versions, catch regressions, and decide when a change is safe to ship. If you want a short definition first, see this overview of LLM evaluation.
Your first eval does not need to be complex. Start with 20 to 50 realistic examples, clear pass and fail rules, and a simple way to review results. You can make it more automated later.
1. Choose the behavior you want to evaluate
Do not start by trying to evaluate “quality.” That is too broad. Pick one behavior that matters for the feature you are shipping.
Good first eval targets include:
- Instruction following: Does the model follow the requested format and constraints?
- Factual accuracy: Does the answer match source documents or known facts?
- Classification quality: Does the model choose the right label?
- Tool usage: Does the agent call the right tool with valid arguments?
- Refusal behavior: Does the system reject unsafe or unsupported requests?
- Latency: Does the response arrive fast enough for the user experience?
For example, if you are building a support agent, your first eval might test whether the agent answers refund questions using the current refund policy and refuses to invent exceptions.
2. Define pass and fail criteria
Before you run the eval, write down what counts as a pass. This prevents you from moving the goalposts after seeing the output.
Use criteria that a developer or reviewer can apply consistently. For a support policy answer, your pass criteria might look like this:
- The answer cites the correct policy section.
- The answer does not promise a refund outside the stated window.
- The answer asks for missing required information when needed.
- The answer uses a helpful tone without adding unsupported details.
Your fail criteria might include:
- The model invents a policy rule.
- The model gives the opposite of the correct answer.
- The model ignores a required output schema.
- The agent calls the wrong tool or passes malformed tool arguments.
For your first eval, a binary pass or fail score is enough. You can add graded scores later, such as 1 to 5 for helpfulness or a separate score for citation accuracy.
3. Create a small golden dataset
A golden dataset is a curated set of test cases with expected behavior. Keep it small at first. A 30-case dataset you trust is better than a 500-case dataset nobody trusts.
Each row should include enough information to run the prompt or workflow and judge the result. A simple structure works well:
- id: A stable test case ID, such as refund_001.
- input: The user request or conversation state.
- context: Retrieved docs, account metadata, tool results, or other required inputs.
- expected_behavior: A short description of what the model should do.
- tags: Categories such as refunds, edge_case, or schema.
Here is a compact example:
{
  "id": "refund_001",
  "input": "I bought this 45 days ago. Can I get a refund?",
  "context": "Refunds are available within 30 days of purchase unless required by law.",
  "expected_behavior": "Explain that the purchase is outside the standard refund window and avoid promising a refund.",
  "tags": ["refunds", "policy", "boundary_case"]
}
Use production-like examples when possible. Pull anonymized requests from logs, support tickets, QA notes, or bug reports. Include common cases and failure-prone edge cases. A useful first split is 70% common cases and 30% edge cases.
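Before running anything, it can help to sanity-check the dataset file itself. The sketch below assumes the cases are stored as JSON Lines (one JSON object per line) and uses the field names from the example above. The helper names load_cases and validate_cases are illustrative and not part of any library.

```python
import json

# Field names taken from the example row; adjust to your own schema.
REQUIRED_FIELDS = {"id", "input", "context", "expected_behavior", "tags"}


def load_cases(jsonl_text):
    """Parse one JSON object per non-empty line (JSON Lines)."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def validate_cases(cases):
    """Return a list of problems found in the dataset; an empty list means it looks valid."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            problems.append(f"row {index}: missing fields {sorted(missing)}")
        case_id = case.get("id")
        if case_id in seen_ids:
            problems.append(f"row {index}: duplicate id {case_id!r}")
        seen_ids.add(case_id)
        if not isinstance(case.get("tags", []), list):
            problems.append(f"row {index}: tags must be a list")
    return problems


# The second row is deliberately broken: duplicate id, missing fields, tags as a string.
sample = (
    '{"id": "refund_001", "input": "I bought this 45 days ago. Can I get a refund?", '
    '"context": "Refunds are available within 30 days of purchase unless required by law.", '
    '"expected_behavior": "Explain that the purchase is outside the standard refund window.", '
    '"tags": ["refunds", "policy", "boundary_case"]}\n'
    '{"id": "refund_001", "input": "Duplicate id on purpose", "tags": "refunds"}'
)

cases = load_cases(sample)
problems = validate_cases(cases)
for problem in problems:
    print(problem)
```

Running a check like this before every eval keeps stable IDs stable, which matters later when you compare runs case by case.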
4. Run a baseline
Run your current prompt, model, or agent against the dataset before making changes. This gives you a baseline score.
At minimum, capture:
- The prompt version or agent version.
- The model name and settings, such as temperature and max tokens.
- The input and context used for each test case.
- The model output.
- Latency and token usage.
- Pass or fail result.
For latency-sensitive apps, track first-token latency as well as total response time. If the user sees a streaming interface, time to first token often matters more than total completion time.
Do not tune anything yet. The first run tells you where you are starting.
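A baseline run can be as simple as a loop that records one row per test case. The sketch below is a minimal illustration, not a definitive harness: fake_streaming_model is a hypothetical stand-in for your real model client, the config keys are assumptions, and token usage is left out because how you read it depends on your provider.

```python
import time


def fake_streaming_model(prompt):
    """Hypothetical stand-in for a streaming model call; yields text chunks."""
    for chunk in ["The purchase is ", "outside the 30-day ", "refund window."]:
        yield chunk


def run_case(case, model_fn, run_config):
    """Run one test case and capture the fields needed for a baseline record."""
    prompt = f"Context: {case['context']}\nUser: {case['input']}"
    start = time.perf_counter()
    first_token_at = None
    chunks = []
    for chunk in model_fn(prompt):
        if first_token_at is None:
            # Time at which the first streamed chunk arrived.
            first_token_at = time.perf_counter()
        chunks.append(chunk)
    end = time.perf_counter()
    return {
        "case_id": case["id"],
        "prompt_version": run_config["prompt_version"],
        "model": run_config["model"],
        "temperature": run_config["temperature"],
        "max_tokens": run_config["max_tokens"],
        "input": case["input"],
        "context": case["context"],
        "output": "".join(chunks),
        "first_token_seconds": None if first_token_at is None else first_token_at - start,
        "total_seconds": end - start,
        "passed": None,  # filled in later by a reviewer or grader
    }


config = {"prompt_version": "v1", "model": "example-model", "temperature": 0.0, "max_tokens": 256}
case = {
    "id": "refund_001",
    "input": "I bought this 45 days ago. Can I get a refund?",
    "context": "Refunds are available within 30 days of purchase unless required by law.",
}
record = run_case(case, fake_streaming_model, config)
print(record["case_id"], record["output"])
```

Storing the run settings on every record makes it possible to tell later which prompt version and model produced a given output.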
5. Choose how you will grade outputs
You have three common grading options: manual review, code-based checks, and model-based judging. Most teams use a mix.
Manual review
Manual review is the best starting point for subjective behavior. A reviewer reads each output and marks pass or fail using your criteria.
Use manual review when you are evaluating tone, policy compliance, user helpfulness, or reasoning quality. It is slower, but it helps you understand the failure modes.
Code-based checks
Code-based checks work well when the expected output has structure.
Examples include:
- Valid JSON schema.
- Required fields are present.
- Classification label matches one of the allowed labels.
- Tool arguments pass validation.
- The answer includes a citation ID.
These checks are fast, cheap, and deterministic. Use them whenever possible.
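As an illustration, several of the checks listed above fit in one small function. This sketch assumes the model is asked to return a JSON object with label and citation_id fields. Those field names and the allowed label set are hypothetical and should be replaced with your own schema.

```python
import json

# Hypothetical label set for a ticket classifier.
ALLOWED_LABELS = {"refund", "shipping", "other"}


def check_output(raw_output):
    """Return (passed, reason) for a model output expected to be a JSON object."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as error:
        return False, f"invalid JSON: {error.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    for field in ("label", "citation_id"):
        if field not in data:
            return False, f"missing field: {field}"
    label = data["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return False, f"label not allowed: {label!r}"
    return True, "ok"


print(check_output('{"label": "refund", "citation_id": "policy-3"}'))
print(check_output('{"label": "billing", "citation_id": "policy-3"}'))
print(check_output("Sure! Here is the JSON you asked for."))
```

Returning a reason alongside the boolean makes failed cases easier to group by cause later.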
Model-based judging
A model-based judge uses another LLM to grade the output. This can work well for semantic checks, such as “does this answer contradict the source document?” or “does this response follow the refund policy?”
If you use LLM-as-a-judge, give the judge a strict rubric and ask for structured output. For example:
{
  "pass": true,
  "reason": "The answer correctly states that the purchase is outside the 30-day refund window and does not promise an exception."
}
Spot-check judge decisions. If the judge is inconsistent, simplify the rubric or split one broad eval into several smaller checks.
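One way to keep a judge honest is to parse its reply strictly and treat anything malformed as unusable rather than as a pass. The sketch below assumes the judge is asked for the pass and reason fields shown above. The rubric wording and the function names are illustrative, and the call to the judge model itself is omitted because it depends on your provider.

```python
import json

# Illustrative rubric; tighten the wording for your own policy.
JUDGE_INSTRUCTIONS = (
    "You are grading a support answer against a policy excerpt.\n"
    "Pass only if the answer agrees with the policy and promises no exception.\n"
    'Reply with JSON only, in the form {"pass": true or false, "reason": "..."}.'
)


def build_judge_prompt(policy, answer):
    """Combine the rubric, the source policy, and the answer under review."""
    return f"{JUDGE_INSTRUCTIONS}\n\nPolicy:\n{policy}\n\nAnswer:\n{answer}"


def parse_verdict(raw_reply):
    """Return {'pass': bool, 'reason': str}, or None if the reply is malformed."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not isinstance(data.get("pass"), bool) or not isinstance(data.get("reason"), str):
        return None
    return {"pass": data["pass"], "reason": data["reason"]}


print(parse_verdict('{"pass": true, "reason": "Matches the 30-day window."}'))
print(parse_verdict('{"pass": "yes", "reason": "Looks fine."}'))
print(parse_verdict("Looks good to me!"))
```

Cases where parse_verdict returns None are good candidates for manual review, since they often point to a rubric the judge could not follow.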
6. Calculate a score you can act on
For a first eval, start with pass rate:
pass_rate = passing_cases / total_cases
If 24 out of 30 cases pass, your pass rate is 80%.
Then break results down by tag. A single aggregate score can hide serious problems. For example:
- Refund policy: 95% pass rate.
- Shipping questions: 90% pass rate.
- Edge cases: 55% pass rate.
- JSON schema: 100% pass rate.
This tells you where to work next. In this example, the prompt is mostly fine on common cases but weak on edge cases.
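The per-tag breakdown takes only a few lines to compute. This sketch assumes each graded result carries the tags from its test case plus a boolean passed field. The sample data is made up for illustration. A case with several tags counts toward each of them.

```python
from collections import defaultdict


def pass_rate_by_tag(results):
    """Return {tag: pass rate} where a multi-tag case counts once per tag."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        for tag in result["tags"]:
            totals[tag] += 1
            if result["passed"]:
                passes[tag] += 1
    return {tag: passes[tag] / totals[tag] for tag in totals}


# Made-up graded results for illustration.
results = [
    {"id": "refund_001", "tags": ["refunds", "edge_case"], "passed": False},
    {"id": "refund_002", "tags": ["refunds"], "passed": True},
    {"id": "ship_001", "tags": ["shipping"], "passed": True},
    {"id": "ship_002", "tags": ["shipping", "edge_case"], "passed": True},
]

overall = sum(r["passed"] for r in results) / len(results)
by_tag = pass_rate_by_tag(results)

print(f"overall: {overall:.0%}")
for tag, rate in sorted(by_tag.items()):
    print(f"{tag}: {rate:.0%}")
```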
7. Inspect failures before changing the prompt
Read every failed case in your first eval. Group failures by cause before editing anything.
Common failure categories include:
- Ambiguous instruction: The prompt does not say what to do in a boundary case.
- Missing context: The model cannot answer because the retrieval step did not provide the right document.
- Conflicting context: The prompt, retrieved text, or tool output contains inconsistent information.
- Schema weakness: The prompt asks for JSON but does not define required fields or allowed values.
- Model limitation: The selected model struggles even when the prompt and context are clear.
- Agent planning error: The agent chooses the wrong step or skips a required tool call.
This step keeps you from overfitting the prompt. If retrieval is the real problem, prompt edits will only mask it for a few examples while the underlying cause remains.
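Once a reviewer has labeled each failure with a cause, a simple tally shows where to look first. In this sketch the cause labels mirror the categories above, and the failure records are invented for illustration.

```python
from collections import Counter

# Invented failure records; "cause" is assigned by a human reviewer.
failures = [
    {"id": "refund_007", "cause": "missing_context"},
    {"id": "refund_012", "cause": "ambiguous_instruction"},
    {"id": "ship_004", "cause": "missing_context"},
    {"id": "schema_002", "cause": "schema_weakness"},
]

cause_counts = Counter(failure["cause"] for failure in failures)

# Most frequent causes first.
for cause, count in cause_counts.most_common():
    print(f"{cause}: {count}")
```

If one cause dominates, as missing_context does in this sample, that is usually the place to make the first change.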
8. Make one change at a time
Change one variable, rerun the eval, and compare against the baseline. If you change the prompt, model, retrieval logic, and tool schema at the same time, you will not know what caused the score to move.
Common changes to test include:
- Adding clearer decision rules to the system prompt.
- Adding examples for boundary cases.
- Changing the output schema.
- Switching models.
- Reducing temperature for more consistent behavior.
- Improving retrieval filters or document chunking.
- Adding tool argument validation.
Keep a simple changelog. Record what changed, when you ran the eval, and the before-and-after score.
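Comparing runs case by case is more informative than comparing two pass rates. The sketch below assumes each run is stored as a mapping from test case ID to a pass or fail boolean. The function name and sample data are illustrative.

```python
def compare_runs(baseline, candidate):
    """Compare two runs given as {case_id: passed}; only shared case IDs are counted."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no test cases in common")
    regressions = [cid for cid in shared if baseline[cid] and not candidate[cid]]
    fixes = [cid for cid in shared if not baseline[cid] and candidate[cid]]
    return {
        "baseline_pass_rate": sum(baseline[cid] for cid in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[cid] for cid in shared) / len(shared),
        "regressions": regressions,
        "fixes": fixes,
    }


baseline = {"refund_001": True, "refund_002": False, "ship_001": True}
candidate = {"refund_001": True, "refund_002": True, "ship_001": False}

report = compare_runs(baseline, candidate)
print(report)
```

In this sample both runs pass 2 of 3 cases, yet one case was fixed and another regressed. An identical score can hide that kind of trade, which is why the list of regressions belongs in the changelog next to the numbers.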
9. Set a shipping threshold
Decide what score is good enough before shipping. The threshold depends on the risk of the feature.
Example thresholds:
- Internal summarization tool: 85% pass rate may be acceptable if users can verify outputs.
- Customer support draft assistant: 90% to 95% pass rate may be needed, with review before sending.
- Autonomous refund approval agent: You may need 99%+ on policy compliance, strong tool validation, and narrow permissions.
Also define blocker failures. A model might have a 94% pass rate, but still fail release if it exposes private data, invents policy exceptions, or calls a destructive tool incorrectly.
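A release gate that combines a threshold with blocker failures might look like the sketch below. It assumes each graded result has a passed boolean and a list of failure_tags assigned during review. The tag names and the release_decision function are hypothetical.

```python
def release_decision(results, threshold, blocker_tags):
    """Ship only if the pass rate meets the threshold and no failure is a blocker."""
    if not results:
        raise ValueError("no results to evaluate")
    pass_rate = sum(r["passed"] for r in results) / len(results)
    blockers = sorted(
        r["id"]
        for r in results
        if not r["passed"] and blocker_tags & set(r.get("failure_tags", []))
    )
    return {
        "ship": pass_rate >= threshold and not blockers,
        "pass_rate": pass_rate,
        "blockers": blockers,
    }


# 19 passing cases and one failure that invents a policy exception.
results = [{"id": f"case_{i:02d}", "passed": True, "failure_tags": []} for i in range(19)]
results.append({"id": "case_19", "passed": False, "failure_tags": ["invented_policy_exception"]})

decision = release_decision(
    results,
    threshold=0.90,
    blocker_tags={"invented_policy_exception", "private_data_exposure"},
)
print(decision)
```

Here the pass rate of 95% clears the 90% threshold, but the single blocker failure still stops the release.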
10. Add the eval to your release workflow
Once the eval is useful, make it part of your development process. Run it when someone changes a prompt, swaps a model, updates retrieval logic, or edits an agent tool.
A practical release workflow looks like this:
- Developer edits the prompt or workflow.
- Eval runs against the golden dataset.
- Results are compared with the previous production version.
- Regressions are reviewed by tag and severity.
- The team ships only if the change meets the threshold.
You can run this manually at first. Later, add it to CI or your prompt deployment process.
11. Connect evals to production traces
Your golden dataset should improve over time. Production failures are one of the best sources of new eval cases.
When users report bad answers, convert those examples into test cases. When an agent calls the wrong tool, add that trace to the dataset. When a new product policy launches, add several cases for the new behavior.
This is where LLM observability becomes important. You need to see the prompt version, retrieved context, tool calls, model output, latency, and feedback for each production request. Without that trail, you will struggle to reproduce failures.
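Turning a production trace into a test case is mostly a matter of copying the right fields. The sketch below assumes a trace is available as a dictionary. The trace field names (trace_id, user_input, retrieved_context) are hypothetical and will differ depending on your logging or observability tool. The expected behavior still has to be written by a person.

```python
def trace_to_case(trace, case_id, expected_behavior, tags):
    """Build a golden dataset row from a production trace (field names are assumptions)."""
    return {
        "id": case_id,
        "input": trace["user_input"],
        "context": trace["retrieved_context"],
        "expected_behavior": expected_behavior,
        "tags": list(tags) + ["from_production"],
        "source_trace_id": trace["trace_id"],  # keeps a link back to the original request
    }


# Invented trace for illustration; anonymize real traces before storing them.
trace = {
    "trace_id": "trace-123",
    "user_input": "Can I return an opened item after 40 days?",
    "retrieved_context": "Refunds are available within 30 days of purchase unless required by law.",
    "model_output": "Yes, we can make an exception for you.",
}

new_case = trace_to_case(
    trace,
    case_id="refund_031",
    expected_behavior="Explain that the request is outside the 30-day window and do not promise an exception.",
    tags=["refunds", "edge_case"],
)
print(new_case["id"], new_case["tags"])
```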
A simple first eval plan
If you want to start this week, use this plan:
- Pick one feature, such as a support answer generator or ticket classifier.
- Write 5 pass criteria and 5 fail criteria.
- Create 30 test cases from real or realistic inputs.
- Run your current prompt against all 30 cases.
- Manually grade every output as pass or fail.
- Group failures by cause.
- Make one prompt or workflow change.
- Rerun the eval and compare pass rate by tag.
- Set a release threshold.
- Add failed production examples back into the dataset.
This gives you a working eval loop without heavy infrastructure. The important part is repeatability. You should be able to answer, “Did this change make the system better, worse, or only different?”
Common mistakes to avoid
- Using only easy examples: Your eval should include common cases and known edge cases.
- Changing the dataset after every run: Keep a stable core set so scores remain comparable.
- Relying only on average score: Review failures by tag and severity.
- Letting the judge decide vague criteria: Give model-based judges narrow rubrics and structured output.
- Ignoring latency and cost: A more accurate prompt may still be a poor release if it doubles cost or feels slow.
- Testing prompts without testing the full workflow: For agents and RAG systems, retrieval, tools, and state handling can fail even when the prompt looks good.
What to measure after your first eval
After your first eval is running, add more measurements based on your application.
- Accuracy: Did the answer or action match the expected result?
- Faithfulness: Did the output stay grounded in the provided context?
- Format validity: Did the output match the required schema?
- Tool correctness: Did the agent call the right tool with valid arguments?
- Safety: Did the system refuse disallowed requests?
- Latency: How long did users wait for the first token and full response?
- Cost: How many input and output tokens did each run use?
Do not measure everything on day one. Add metrics when they help you make a release decision.
Final takeaway
Your first LLM eval should be small, specific, and repeatable. Pick one behavior, define pass and fail criteria, build a trusted dataset, run a baseline, inspect failures, and rerun the same tests after each change.
This process turns prompt iteration into engineering work you can review, compare, and ship with more confidence.
PromptLayer helps AI teams manage prompts, datasets, evals, traces, and production feedback in one workflow. If you are ready to build your first reliable eval loop, create a PromptLayer account.