How to Run an OpenAI Eval
An OpenAI eval should answer one practical question: did this prompt, model, or agent change make the system better or worse on the cases you care about?
For LLM apps, that usually means testing real tasks: support routing, SQL generation, extraction, summarization, tool use, refusal behavior, or agent handoffs. A useful eval has saved inputs, expected behavior, a scoring method, and run artifacts your team can inspect after a failure.
This guide shows a compact workflow using an OpenAI eval-style project structure. You can adapt it to the OpenAI Evals API, the open-source OpenAI evals runner, or your internal eval harness.
1. Define the behavior you want to test
Start with one workflow. Avoid starting with “evaluate the chatbot.” That is too broad.
Use a concrete target like:
- Classify support tickets into billing, technical, security, or other.
- Extract invoice fields as valid JSON.
- Answer policy questions using only provided context.
- Call the correct tool in an agent workflow.
- Refuse requests that violate your product rules.
For this example, assume you are testing a support ticket classifier.
2. Create an eval folder
Keep evals close to the code they protect. A simple structure works well:
evals/
  support_classifier/
    README.md
    eval.yaml
    samples.jsonl
    rubric.md
    run.sh
    results/
A screenshot helps here if you are documenting this for your team. Capture the eval folder structure in your repo so reviewers can see where samples, rubrics, and results live.
Use names that match the production workflow. Six months later, support_classifier will be easier to understand than eval_001.
3. Write representative test cases
Your eval is only as good as the cases inside it. Use real production examples when you can, with sensitive data removed. Include easy cases, edge cases, and cases that previously failed.
Example samples.jsonl:
{"input":[{"role":"system","content":"Classify the support ticket into one category: billing, technical, security, or other. Return only the category."},{"role":"user","content":"I was charged twice for my Pro subscription this month."}],"ideal":"billing","metadata":{"case_id":"billing_001","source":"prod_redacted","difficulty":"easy"}}
{"input":[{"role":"system","content":"Classify the support ticket into one category: billing, technical, security, or other. Return only the category."},{"role":"user","content":"Your API returns 502 errors when I upload files larger than 20MB."}],"ideal":"technical","metadata":{"case_id":"technical_001","source":"prod_redacted","difficulty":"medium"}}
{"input":[{"role":"system","content":"Classify the support ticket into one category: billing, technical, security, or other. Return only the category."},{"role":"user","content":"I found a way to access another customer's workspace by changing the URL."}],"ideal":"security","metadata":{"case_id":"security_001","source":"prod_redacted","difficulty":"critical"}}
{"input":[{"role":"system","content":"Classify the support ticket into one category: billing, technical, security, or other. Return only the category."},{"role":"user","content":"Can I get a copy of your SOC 2 report?"}],"ideal":"security","metadata":{"case_id":"security_002","source":"synthetic","difficulty":"medium"}}
Do not run an eval with 5 hand-picked examples and trust the score. For a narrow classifier, start with at least 50 to 100 examples. For a high-risk workflow, use several hundred examples split by category and failure mode.
A good dataset usually includes:
- Common cases: the 20 to 50 inputs users send most often.
- Regression cases: prompts or inputs that broke before.
- Boundary cases: ambiguous tickets, missing fields, malformed input, long context.
- Safety cases: requests that should trigger refusal, escalation, or restricted output.
- Production samples: redacted examples pulled from real traces.
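Before running anything, it can help to check that the dataset itself is well formed. The sketch below is one way to do that with the standard library only. The function name `validate_samples` and the specific checks are assumptions for illustration, not part of any OpenAI tooling; it expects the `input`, `ideal`, and `metadata.case_id` fields used in the sample rows above.

```python
import json
from collections import Counter

# Labels taken from the classifier prompt in the example samples.
ALLOWED_LABELS = {"billing", "technical", "security", "other"}


def validate_samples(lines):
    """Return (label_counts, errors) for an iterable of JSONL lines."""
    counts = Counter()
    errors = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            errors.append(f"line {line_no}: expected a JSON object")
            continue
        ideal = row.get("ideal")
        if ideal in ALLOWED_LABELS:
            counts[ideal] += 1
        else:
            errors.append(f"line {line_no}: unexpected label {ideal!r}")
        metadata = row.get("metadata")
        case_id = metadata.get("case_id") if isinstance(metadata, dict) else None
        if not case_id:
            errors.append(f"line {line_no}: missing metadata.case_id")
        elif case_id in seen_ids:
            errors.append(f"line {line_no}: duplicate case_id {case_id!r}")
        else:
            seen_ids.add(case_id)
        if not row.get("input"):
            errors.append(f"line {line_no}: missing or empty input")
    return counts, errors


# Small in-memory example; in practice, pass an open samples.jsonl file.
example_lines = [
    '{"input":[{"role":"user","content":"I was charged twice."}],"ideal":"billing","metadata":{"case_id":"billing_001"}}',
    '{"input":[],"ideal":"refund","metadata":{"case_id":"billing_001"}}',
    "not json",
]
label_counts, problems = validate_samples(example_lines)
print(dict(label_counts))
for problem in problems:
    print(problem)
```

The per-label counts make it easy to spot a category with too few examples before you trust a pass rate.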
4. Choose deterministic checks before model-graded checks
Use deterministic scoring whenever possible. It is cheaper, faster, and easier to debug.
For the support classifier, exact match works:
expected: "billing"
actual: "billing"
score: 1
For JSON extraction, validate the schema first:
{
  "type": "object",
  "required": ["invoice_id", "total", "currency", "due_date"],
  "properties": {
    "invoice_id": {"type": "string"},
    "total": {"type": "number"},
    "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    "due_date": {"type": "string"}
  }
}
Use model-graded evals when correctness is semantic, such as answer quality, groundedness, or policy compliance. Even then, pair the judge with clear rules and spot checks.
Example judge rubric:
Grade the model answer against the expected behavior.
Return PASS only if:
1. The answer uses only the provided context.
2. The answer directly addresses the user's question.
3. The answer does not invent facts.
4. The answer includes a refusal if the context does not contain enough information.
Return FAIL if:
1. The answer contains unsupported claims.
2. The answer skips the main question.
3. The answer exposes private or restricted information.
4. The answer is vague enough that a user could not act on it.
Return JSON only:
{"score": 0 or 1, "reason": "short explanation"}
Do not rely only on model-graded scores. Judges can drift between model versions, miss subtle errors, or reward fluent wrong answers. For important workflows, review failed and passed examples manually during setup.
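A judge that is asked to return JSON will occasionally return something else, so the harness should parse its reply defensively. The sketch below assumes the `{"score": ..., "reason": ...}` shape from the rubric above. The function name `parse_judge_verdict` and the choice to treat malformed replies as failures are assumptions; you may prefer to flag such replies for manual review instead.

```python
import json


def parse_judge_verdict(raw):
    """Parse a judge reply into (score, reason); malformed replies score 0."""
    try:
        data = json.loads(raw.strip())
    except (json.JSONDecodeError, AttributeError):
        return 0, "judge output was not valid JSON"
    if not isinstance(data, dict):
        return 0, "judge output was not a JSON object"
    score = data.get("score")
    # bool is a subclass of int in Python, so reject true/false explicitly.
    if isinstance(score, bool) or score not in (0, 1):
        return 0, f"judge returned an unexpected score: {score!r}"
    return int(score), str(data.get("reason", ""))


print(parse_judge_verdict('{"score": 1, "reason": "grounded in context"}'))
print(parse_judge_verdict("PASS"))
```

Keeping the parser strict means a change in the judge's output format shows up as visible failures rather than as silently inflated scores.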
5. Add an eval config
Your config should make the eval reproducible. Track the task, dataset path, model, prompt version, temperature, and scorer.
Example eval.yaml:
name: support_classifier_v1
description: Classifies support tickets into billing, technical, security, or other.
dataset: samples.jsonl
model:
  provider: openai
  name: gpt-4o-mini
  temperature: 0
  max_output_tokens: 20
scoring:
  type: exact_match
  normalize:
    trim: true
    lowercase: true
metadata:
  owner: ai-platform
  prompt_name: support_classifier
  prompt_version: 12
  minimum_pass_rate: 0.92
If your app already calls OpenAI through PromptLayer, you can keep prompt versions and request history tied to each run through the OpenAI integration. That makes it easier to compare eval results against the exact prompt and model settings used in production.
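The `scoring` block in the example config asks for an exact match after trimming and lowercasing. In a custom harness, that scorer might look like the sketch below. The function names and keyword arguments are assumptions that mirror the `trim` and `lowercase` keys; they are not an API of the OpenAI evals runner.

```python
def normalize(text, trim=True, lowercase=True):
    """Apply the normalization steps named in the config's scoring block."""
    if trim:
        text = text.strip()
    if lowercase:
        text = text.lower()
    return text


def exact_match(expected, actual, trim=True, lowercase=True):
    """Return 1 if the normalized strings are identical, otherwise 0."""
    same = normalize(expected, trim, lowercase) == normalize(actual, trim, lowercase)
    return int(same)


print(exact_match("billing", " Billing\n"))           # whitespace and case are ignored
print(exact_match("billing", "Category: billing"))    # extra text still fails
```

Note that normalization forgives whitespace and casing only. A reply with extra words around the label still fails, which is usually what you want when the prompt says to return only the category.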
6. Run the eval
If you are using the open-source OpenAI evals runner and have registered the eval under this name in its registry, your command may look like this:
export OPENAI_API_KEY="sk-..."
oaieval gpt-4o-mini support_classifier_v1
If you use a custom harness, keep the command explicit:
python -m evals.run \
  --config evals/support_classifier/eval.yaml \
  --samples evals/support_classifier/samples.jsonl \
  --output evals/support_classifier/results/2026-06-05.jsonl
Your output should include pass rate, counts, cost, latency, and failed case IDs:
Eval: support_classifier_v1
Model: gpt-4o-mini
Prompt version: 12
Samples: 100
Passed: 91
Failed: 9
Pass rate: 0.91
Mean latency: 742ms
Estimated cost: $0.18
Failed cases:
- security_002 expected=security actual=other
- billing_014 expected=billing actual=other
- technical_031 expected=technical actual=other
A screenshot helps here too. Capture the command output in the pull request when a prompt change affects quality, cost, or latency.
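A summary like the one above can be derived from per-case result rows. The following is a minimal sketch, assuming each row carries `case_id`, `expected`, `actual`, `score`, and `latency_ms`; the function name `summarize` is hypothetical, and cost is left out because it depends on your pricing data.

```python
from statistics import mean


def summarize(rows):
    """Aggregate per-case result rows into a run summary."""
    total = len(rows)
    failed_rows = [r for r in rows if r["score"] != 1]
    passed = total - len(failed_rows)
    return {
        "samples": total,
        "passed": passed,
        "failed": len(failed_rows),
        "pass_rate": passed / total if total else 0.0,
        "mean_latency_ms": mean(r["latency_ms"] for r in rows) if rows else 0.0,
        "failed_cases": [
            f'{r["case_id"]} expected={r["expected"]} actual={r["actual"]}'
            for r in failed_rows
        ],
    }


example_rows = [
    {"case_id": "billing_001", "expected": "billing", "actual": "billing", "score": 1, "latency_ms": 600},
    {"case_id": "technical_001", "expected": "technical", "actual": "technical", "score": 1, "latency_ms": 700},
    {"case_id": "security_002", "expected": "security", "actual": "other", "score": 0, "latency_ms": 800},
]
summary = summarize(example_rows)
for key, value in summary.items():
    print(f"{key}: {value}")
```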
7. Save every input and output
Do not keep only the final score. Save the raw request, response, expected answer, scorer output, prompt version, model, and run ID.
Example result row:
{
  "run_id": "support_classifier_2026_06_05_001",
  "case_id": "security_002",
  "prompt_version": 12,
  "model": "gpt-4o-mini",
  "temperature": 0,
  "input": "Can I get a copy of your SOC 2 report?",
  "expected": "security",
  "actual": "other",
  "score": 0,
  "latency_ms": 681,
  "usage": {
    "input_tokens": 42,
    "output_tokens": 3
  },
  "request_id": "req_abc123"
}
This matters when a score drops. Without saved inputs and outputs, your team has to rerun the eval and hope the failure repeats. That wastes time and can hide nondeterministic behavior.
If you run on Azure OpenAI, keep deployment names and API versions in the result metadata. PromptLayer supports this setup through the Azure OpenAI integration.
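Saving one row per case can be as simple as appending JSON lines to the results file. The sketch below is one possible shape, not a prescribed format: `append_result`, `load_results`, and the `REQUIRED_FIELDS` tuple are assumptions chosen to match the example row above.

```python
import json
from pathlib import Path

# Fields every row must carry so a failure can be traced later (assumed list).
REQUIRED_FIELDS = (
    "run_id", "case_id", "prompt_version", "model",
    "input", "expected", "actual", "score",
)


def append_result(path, row):
    """Append one result row to a JSONL file, refusing incomplete rows."""
    missing = [field for field in REQUIRED_FIELDS if field not in row]
    if missing:
        raise ValueError(f"result row is missing fields: {missing}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_results(path):
    """Read all result rows back from a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Rejecting incomplete rows at write time is a deliberate choice here: it is cheaper to fail the run than to discover weeks later that the prompt version was never recorded.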
8. Inspect failed examples before changing the prompt
Do not edit the prompt after seeing only the aggregate score. Read the failed cases first.
Group failures by cause:
- Instruction issue: the prompt does not define the category clearly.
- Dataset issue: the expected answer is wrong or inconsistent.
- Ambiguous input: the case needs an escalation label or tie-breaking rule.
- Output format issue: the model returns extra text around the answer.
- Model capability issue: the task needs a stronger model or different context.
For example, “Can I get a copy of your SOC 2 report?” may fail because the rubric never says security and compliance requests belong in security. Fix the label definition, then rerun the eval.
A screenshot helps when reviewing failed examples. Show the input, expected output, actual output, prompt version, and scorer reason in one view.
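Assigning a cause to each failure is a manual read, but sorting failures by their (expected, actual) pair shows where to start. The sketch below assumes result rows with `case_id`, `expected`, `actual`, and `score` fields; the name `confusion_pairs` is hypothetical.

```python
from collections import defaultdict


def confusion_pairs(rows):
    """Group failed case IDs by (expected, actual), most frequent pair first."""
    pairs = defaultdict(list)
    for row in rows:
        if row["score"] == 1:
            continue  # only failures are interesting here
        pairs[(row["expected"], row["actual"])].append(row["case_id"])
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(pairs.items(), key=lambda item: (-len(item[1]), item[0]))


failed_examples = [
    {"case_id": "security_002", "expected": "security", "actual": "other", "score": 0},
    {"case_id": "billing_014", "expected": "billing", "actual": "other", "score": 0},
    {"case_id": "security_009", "expected": "security", "actual": "other", "score": 0},
    {"case_id": "billing_001", "expected": "billing", "actual": "billing", "score": 1},
]
for (expected, actual), case_ids in confusion_pairs(failed_examples):
    print(f"{expected} -> {actual}: {len(case_ids)} case(s) {case_ids}")
```

A cluster of failures on one pair often points to a label definition problem, while scattered single failures are more likely to be dataset or ambiguity issues.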
9. Change one variable at a time
A common mistake is changing the prompt and model in the same eval run. If the score improves, you will not know what caused it.
Use a small comparison matrix:
| Run | Prompt version | Model | Pass rate | Mean latency |
|---|---|---|---|---|
| baseline | 12 | gpt-4o-mini | 91% | 742ms |
| prompt change only | 13 | gpt-4o-mini | 94% | 751ms |
| model change only | 12 | gpt-4o | 96% | 1180ms |
This makes the tradeoff clear. A stronger model may score higher but cost more. A prompt change may get most of the gain without increasing latency.
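Aggregate pass rates can hide the fact that a change fixed some cases while breaking others. A per-case diff between two runs makes that visible. This is a sketch under the assumption that both runs use the same `case_id` values and a 0 or 1 `score`; the function name `compare_runs` is hypothetical.

```python
def compare_runs(baseline_rows, candidate_rows):
    """Return case IDs that were fixed or regressed between two runs."""
    base = {row["case_id"]: row["score"] for row in baseline_rows}
    cand = {row["case_id"]: row["score"] for row in candidate_rows}
    shared = base.keys() & cand.keys()
    return {
        "fixed": sorted(c for c in shared if base[c] == 0 and cand[c] == 1),
        "regressed": sorted(c for c in shared if base[c] == 1 and cand[c] == 0),
        # Cases present in only one run mean the datasets differ.
        "only_in_one_run": sorted(base.keys() ^ cand.keys()),
    }


baseline = [
    {"case_id": "security_002", "score": 0},
    {"case_id": "billing_014", "score": 0},
    {"case_id": "technical_001", "score": 1},
]
candidate = [
    {"case_id": "security_002", "score": 1},
    {"case_id": "billing_014", "score": 0},
    {"case_id": "technical_001", "score": 0},
]
print(compare_runs(baseline, candidate))
```

If `only_in_one_run` is not empty, the dataset changed between runs and the comparison no longer isolates a single variable.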
10. Add evals to CI
Once the eval is stable, run it on every prompt or agent change. You do not need to run the full suite on every commit. Use tiers:
- Smoke eval: 10 to 25 cases, runs in under 2 minutes.
- PR eval: 50 to 150 cases, blocks risky prompt changes.
- Release eval: full dataset, includes cost and latency checks.
- Regression eval: failed production cases added over time.
Example CI step:
name: Support classifier eval
on:
  pull_request:
    paths:
      - "prompts/support_classifier/**"
      - "evals/support_classifier/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m evals.run \
            --config evals/support_classifier/eval.yaml \
            --samples evals/support_classifier/samples.jsonl \
            --output evals/support_classifier/results/pr-${{ github.event.number }}.jsonl
      - name: Check threshold
        run: python -m evals.check_threshold --file evals/support_classifier/results/pr-${{ github.event.number }}.jsonl --min-pass-rate 0.92
Do not make every eval a hard blocker at first. Start by reporting results in the pull request. Once the dataset and scoring are trusted, enforce thresholds for critical workflows.
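The workflow above calls `evals.check_threshold`, which is part of your own harness rather than an OpenAI tool. One way it could be written is sketched below. It assumes the results file is JSONL with a 0 or 1 `score` per row, and it accepts the same `--file` and `--min-pass-rate` flags used in the CI step.

```python
import argparse
import json


def pass_rate(rows):
    """Fraction of rows with score == 1; an empty run counts as 0.0."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get("score") == 1) / len(rows)


def main(argv=None):
    """Return exit code 0 if the pass rate meets the threshold, else 1."""
    parser = argparse.ArgumentParser(
        description="Fail when the eval pass rate is below a threshold."
    )
    parser.add_argument("--file", required=True)
    parser.add_argument("--min-pass-rate", type=float, required=True)
    args = parser.parse_args(argv)

    with open(args.file, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]

    rate = pass_rate(rows)
    print(f"pass rate {rate:.2%} over {len(rows)} cases "
          f"(minimum {args.min_pass_rate:.2%})")
    return 0 if rate >= args.min_pass_rate else 1


# In a real module, finish with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

A nonzero exit code fails the CI step. While the eval is still report-only, you could keep the printed summary and ignore the exit code in the workflow.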
11. Use separate evals for agents
Agent evals need more than final-answer scoring. You often need to test tool calls, step order, retrieval quality, and termination behavior.
Track fields like:
- Which tool the agent called.
- Whether the tool arguments were valid.
- How many steps the agent used.
- Whether the agent stopped at the right time.
- Whether the final answer matches the tool result.
Example agent result:
{
  "case_id": "refund_agent_018",
  "expected_tool": "lookup_invoice",
  "actual_tools": ["lookup_customer", "lookup_invoice"],
  "valid_tool_args": true,
  "max_steps_allowed": 4,
  "actual_steps": 3,
  "final_answer_score": 1,
  "score": 1
}
If you use OpenAI Agents, connect traces to eval runs so you can inspect tool calls and intermediate messages. PromptLayer has support for the OpenAI Agents SDK.
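The overall `score` in an agent result can be derived from the individual checks, so a failure tells you which check broke. The sketch below uses the field names from the example agent result. The function name `score_agent_case` and the rule that every check must pass are assumptions; your workflow may weight checks differently.

```python
def score_agent_case(result):
    """Return (score, checks) where score is 1 only if every check passes."""
    checks = {
        "called_expected_tool": result["expected_tool"] in result["actual_tools"],
        "valid_tool_args": bool(result["valid_tool_args"]),
        "within_step_budget": result["actual_steps"] <= result["max_steps_allowed"],
        "final_answer_ok": result["final_answer_score"] == 1,
    }
    return int(all(checks.values())), checks


agent_result = {
    "case_id": "refund_agent_018",
    "expected_tool": "lookup_invoice",
    "actual_tools": ["lookup_customer", "lookup_invoice"],
    "valid_tool_args": True,
    "max_steps_allowed": 4,
    "actual_steps": 3,
    "final_answer_score": 1,
}
score, checks = score_agent_case(agent_result)
print(score, checks)
```

Storing the `checks` dictionary alongside the score lets you separate a wrong tool choice from an agent that simply ran out of steps.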
Common mistakes to avoid
Using too few test cases
A 95% score on 20 examples means one failure. That does not tell you much. Use enough examples to cover real categories and known failure modes.
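To see how little a small sample says, you can put an interval around the pass rate. The sketch below uses the Wilson score interval, a standard formula for a binomial proportion; treating eval cases as independent trials is an assumption, and the function name `wilson_interval` is chosen here for illustration.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(19, 20)
print(f"19/20 passed: plausible pass rate between {low:.2f} and {high:.2f}")
```

For 19 passes out of 20, the interval runs from roughly 76% to 99%, so the same result is consistent with a system that fails about one case in four.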
Using unclear rubrics
“Good answer” is not a rubric. Write pass and fail criteria that another engineer can apply without guessing.
Trusting only model-graded scores
Use deterministic checks for format, exact values, schema validity, tool names, citations, and refusal triggers. Add model grading only where semantic judgment is required.
Mixing prompt and model changes
Change one variable per run. Track prompt version, model name, temperature, tools, retrieval config, and dataset version.
Not saving eval inputs and outputs
The final score is not enough. Save the full artifact so you can debug failures without rerunning the suite.
Treating an OpenAI Eval score as production monitoring
An eval run is an offline test. It does not replace production tracing, request logging, latency alerts, cost tracking, or drift detection. Use evals before release, then monitor live traffic after release.
A practical eval checklist
- Define one workflow and one success metric.
- Create at least 50 representative examples for the first version.
- Include real redacted production cases where possible.
- Use deterministic scoring before model grading.
- Write a clear rubric for any judge model.
- Save inputs, outputs, prompt version, model, scorer result, cost, and latency.
- Review failed examples before changing prompts.
- Change one variable per run.
- Add smoke evals to CI.
- Do not treat eval scores as live production monitoring.
Final take
Running an OpenAI eval is straightforward. Making it useful takes discipline. Use representative cases, clear scoring, saved artifacts, and prompt version tracking. The goal is not to get a clean score once. The goal is to catch regressions before users do.
PromptLayer helps AI teams track prompt versions, run evaluations, inspect failures, and connect eval results to real OpenAI requests. If you are building or shipping LLM applications, create a PromptLayer account and start tracking your prompts and evals in one place.