How to Test an LLM App Before Launch

Jun 05, 2026
Pre-launch testing for an LLM app should answer one question: can this workflow behave correctly when real users, messy inputs, changing context, and model variance hit it at the same time?

If your test plan only checks five polished examples in a notebook, you are not testing the app. You are testing a demo. Production LLM systems need regression coverage, trace replay, rubric-based scoring, and release gates tied to real failure modes.

Start by defining what “correct” means

Before you write evals, define the contract for your LLM app. This contract should describe what the system must do, what it must refuse, what tools it may call, and what output shape downstream code expects.

For example, a support triage agent might have this contract:

  • Classify the ticket into one of 12 allowed categories.
  • Extract account ID, urgency, product area, and requested action when present.
  • Never invent policy details when the knowledge base does not contain an answer.
  • Call the refund eligibility tool before suggesting a refund.
  • Return valid JSON that matches the production schema.
  • Escalate if the user mentions legal action, self-harm, payment fraud, or account takeover.

This turns vague quality feedback into testable behavior. It also prevents the team from arguing over whether an output “feels good” after a prompt change.
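
A contract like this can be turned into an automated check. The following is a minimal sketch, not a reference implementation: the category names, the field names (`category`, `urgency`, `escalate`), and the keyword-based escalation rule are all assumptions made for illustration, and a real system would likely use the production schema and a more robust escalation detector.

```python
import json

# Hypothetical values for illustration; the example contract above allows 12 categories.
ALLOWED_CATEGORIES = {"billing", "refund", "login", "shipping"}
REQUIRED_FIELDS = {"category", "urgency", "escalate"}
ESCALATION_TERMS = ("legal action", "self-harm", "payment fraud", "account takeover")


def check_contract(user_message: str, raw_output: str) -> list[str]:
    """Return a list of contract violations; an empty list means the output passes."""
    try:
        output = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(output, dict):
        return ["output is not a JSON object"]

    violations = []
    missing = REQUIRED_FIELDS - set(output)
    if missing:
        violations.append(f"missing fields: {sorted(missing)}")
    if output.get("category") not in ALLOWED_CATEGORIES:
        violations.append(f"category not allowed: {output.get('category')!r}")
    # Naive keyword match, only to show the shape of an escalation rule.
    must_escalate = any(term in user_message.lower() for term in ESCALATION_TERMS)
    if must_escalate and output.get("escalate") is not True:
        violations.append("escalation required but not requested")
    return violations
```

Returning a list of violations instead of a single boolean keeps each failure tied to the contract clause it broke, which makes review faster.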

Version the prompt, model settings, and workflow before testing

Freeze the candidate build before you run launch tests. Store the prompt, system message, model, temperature, retrieval settings, tool definitions, routing logic, and output parser together.

If you change the prompt halfway through testing, restart the relevant evals. Otherwise, your results mix multiple candidates and tell you very little.

At minimum, record:

  • Prompt version or commit hash
  • Model name and provider
  • Temperature, max tokens, top-p, and seed if supported
  • Tool schemas and tool descriptions
  • Retrieval index version and chunking settings
  • Eval dataset version
  • Evaluator version, including rubric changes

This matters when a launch candidate fails and you need to know whether the regression came from the prompt, model, retrieval layer, or evaluator.
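
One way to keep these settings together is a single frozen record with a fingerprint that changes whenever any field changes. This is a sketch under stated assumptions: the class name, field names, and every example value below are hypothetical, and the fields only mirror the list above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class CandidateManifest:
    prompt_version: str
    model: str
    provider: str
    temperature: float
    max_tokens: int
    top_p: float
    tool_schema_version: str
    retrieval_index_version: str
    eval_dataset_version: str
    evaluator_version: str

    def fingerprint(self) -> str:
        """Short, stable hash of every setting; differs when any field differs."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical candidate; every value here is a placeholder.
candidate = CandidateManifest(
    prompt_version="a1b2c3d", model="example-model", provider="example-provider",
    temperature=0.2, max_tokens=512, top_p=1.0,
    tool_schema_version="tools-v3", retrieval_index_version="kb-2026-05",
    eval_dataset_version="evals-v7", evaluator_version="judge-v2",
)

# Changing one setting produces a different candidate with a different fingerprint.
retuned = replace(candidate, temperature=0.7)
```

Storing the fingerprint next to every eval result makes it obvious when two runs came from different candidates.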

Build an eval dataset that looks like production

A useful pre-launch dataset includes normal cases, edge cases, policy-sensitive cases, and examples pulled from production-like traces. Do not rely on a tiny set of handpicked happy paths.

A practical starting size for many teams:

  • 20 to 50 smoke tests: fast checks for schema validity, basic routing, and obvious failures.
  • 100 to 300 regression examples: common user requests, known bugs, and core product flows.
  • 50 to 150 edge cases: ambiguous inputs, missing context, adversarial phrasing, policy boundaries, and low-quality documents.
  • 500 or more trace replay examples: sampled from beta users, internal dogfooding, or historical logs when available.

For each row, store more than the input. Include the expected behavior, unacceptable behavior, metadata, and any context fixture needed to run the example repeatably.

Example eval dataset schema

  • id: unique test case ID, such as refund_policy_042
  • input: user message or conversation state
  • context_fixture: documents, retrieved chunks, user profile, or tool responses
  • expected_behavior: what a good answer must do
  • must_not_do: hallucinations, forbidden claims, unsafe actions, or invalid tool calls
  • tags: billing, RAG, tool-call, escalation, multilingual, adversarial
  • severity: P0, P1, P2, or P3
  • golden_output: optional, useful for extraction and classification tasks

For generative answers, avoid strict string matching unless the task requires exact output. Score the behavior instead. A correct answer can use different wording.
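
The schema above could be represented as a small typed record. The sketch below assumes Python dataclasses and a hypothetical `exact_match` helper; it applies strict comparison only when a row carries a golden output and otherwise signals that the case needs rubric-based scoring.

```python
from dataclasses import dataclass, field
from typing import Optional

SEVERITIES = {"P0", "P1", "P2", "P3"}


@dataclass
class EvalCase:
    id: str
    input: str
    expected_behavior: str
    must_not_do: list[str]
    tags: list[str]
    severity: str
    context_fixture: dict = field(default_factory=dict)
    golden_output: Optional[str] = None

    def __post_init__(self):
        # Reject rows with an unknown severity before they enter the dataset.
        if self.severity not in SEVERITIES:
            raise ValueError(f"{self.id}: unknown severity {self.severity!r}")


def exact_match(case: EvalCase, output: str) -> Optional[bool]:
    """Strict comparison, only meaningful when the case has a golden output."""
    if case.golden_output is None:
        return None  # generative case: score the behavior with a rubric instead
    return output.strip() == case.golden_output.strip()
```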

Recommended visual: include a screenshot of a sample eval dataset with 8 to 10 rows. Show columns for input, expected behavior, must-not-do behavior, tags, severity, and pass status.

If your team is formalizing this process, use a shared definition of LLM evaluation so prompt changes, model upgrades, and release decisions use the same language.

Write rubrics that penalize real failures

Rubrics should map directly to your product contract. A generic “quality: 1 to 5” score is weak because different reviewers will score the same output differently.

Use criteria that force a decision. For a RAG support assistant, the rubric might score:

  • Grounding: answer uses only provided context and does not invent policy.
  • Completeness: answer addresses every user request.
  • Refusal behavior: answer refuses or escalates when policy requires it.
  • Tool correctness: tool calls use the right tool, arguments, and order.
  • Output validity: response matches the required JSON schema or UI contract.
  • Tone constraints: answer follows your product’s support voice without adding unsafe promises.

Example rubric

  • 5: Fully correct. Uses supported facts, follows all instructions, and returns valid output.
  • 4: Minor issue that does not affect user trust, safety, or downstream execution.
  • 3: Partially correct but missing a required detail or has unclear reasoning.
  • 2: Incorrect or incomplete in a way that could confuse the user or break the workflow.
  • 1: Unsafe, unsupported, policy-violating, invalid, or wrong in a way the user is likely to act on.

Then define the pass rule. For example: “Pass if score is 4 or 5, unless the output violates must-not-do behavior. Any must-not-do violation is an automatic fail.”
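
Written as code, the example pass rule is only a few lines. This sketch assumes the rubric score is an integer from 1 to 5 and that must-not-do violations arrive as a list of strings; the function name `passes` is hypothetical.

```python
def passes(score: int, must_not_do_violations: list[str]) -> bool:
    """Pass if the score is 4 or 5, unless any must-not-do behavior was violated."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    if must_not_do_violations:
        return False  # automatic fail, regardless of score
    return score >= 4
```

Checking violations before the score keeps a high-scoring but forbidden output from slipping through.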

Recommended visual: include a rubric screenshot with the scoring scale, pass rule, automatic failure conditions, and 2 example outputs with reviewer notes.

Use LLM-as-a-judge carefully

An evaluator model can grade large datasets faster than manual review, but you need calibration. Do not set thresholds based on vibes.

Start with 50 to 100 examples reviewed by engineers, product owners, or domain experts. Include clear passes, clear failures, and borderline outputs. Run your evaluator on the same set and compare.

Track:

  • False passes, where the evaluator approves an output your reviewers reject
  • False fails, where the evaluator rejects an acceptable output
  • Agreement rate by tag, such as RAG, tool calling, safety, or formatting
  • Agreement rate by severity, especially P0 and P1 cases

If the evaluator passes unsafe or factually wrong outputs, tighten the rubric and add examples. If it fails acceptable variants, clarify the expected behavior and remove over-specific wording.
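
The tracking described above can be computed from paired verdicts. The sketch below assumes each calibration row is a dict with the hypothetical keys `human_pass`, `judge_pass`, and `tags`; grouping by severity would follow the same pattern as grouping by tag.

```python
from collections import defaultdict


def calibration_report(rows: list[dict]) -> dict:
    """Compare evaluator verdicts with reviewer verdicts on the calibration set."""
    if not rows:
        raise ValueError("calibration set is empty")

    # False pass: evaluator approves an output the reviewers rejected.
    false_passes = sum(1 for r in rows if r["judge_pass"] and not r["human_pass"])
    # False fail: evaluator rejects an output the reviewers accepted.
    false_fails = sum(1 for r in rows if not r["judge_pass"] and r["human_pass"])

    per_tag = defaultdict(lambda: [0, 0])  # tag -> [agreements, total]
    for r in rows:
        agree = r["judge_pass"] == r["human_pass"]
        for tag in r["tags"]:
            per_tag[tag][0] += int(agree)
            per_tag[tag][1] += 1

    return {
        "false_passes": false_passes,
        "false_fails": false_fails,
        "agreement": (len(rows) - false_passes - false_fails) / len(rows),
        "agreement_by_tag": {tag: a / n for tag, (a, n) in per_tag.items()},
    }
```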

For high-risk flows, require manual review for P0 failures and borderline cases. For lower-risk flows, an evaluator can run as a fast regression check. The key is to calibrate the judge before you trust the score.

Teams using LLM-as-a-judge should treat the judge prompt like production code. Version it, test it, and review changes before using it as a release signal.

Test the full workflow, not the final answer only

LLM apps often fail before the final response. A user sees a bad answer, but the root cause may be retrieval, tool arguments, routing, memory, or output parsing.

For each workflow, assert intermediate behavior:

  • Routing: Did the request go to the correct chain, agent, or model?
  • Retrieval: Did the system retrieve the right documents or chunks?
  • Tool selection: Did the model call the correct tool?
  • Tool arguments: Were IDs, dates, filters, and amounts correct?
  • State updates: Did the app write the right memory or database fields?
  • Output parsing: Did the response match the schema expected by your app?
  • Fallbacks: Did the app recover when context was missing or a tool failed?

For example, a travel booking agent may produce a friendly final message while calling the booking API with the wrong passenger count. Final-answer grading alone will miss that. Tool-call assertions will catch it.
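
A tool-call assertion for the travel booking case might look like the sketch below. The trace format, the `book_flight` tool name, and its arguments are hypothetical; adapt them to whatever your tracing layer actually records.

```python
def check_tool_call(trace: list[dict], tool: str, expected_args: dict) -> list[str]:
    """Return mismatches between the last call to `tool` and the expected arguments."""
    calls = [s for s in trace if s.get("type") == "tool_call" and s.get("name") == tool]
    if not calls:
        return [f"expected a call to {tool!r}, found none"]
    actual_args = calls[-1].get("args", {})
    return [
        f"{tool}.{key}: expected {expected!r}, got {actual_args.get(key)!r}"
        for key, expected in expected_args.items()
        if actual_args.get(key) != expected
    ]


# Hypothetical trace: the final message sounds right, but the tool call is wrong.
trace = [
    {"type": "tool_call", "name": "book_flight",
     "args": {"passengers": 1, "date": "2026-07-01"}},
    {"type": "final_answer", "text": "All set! Your trip for two is booked."},
]

failures = check_tool_call(trace, "book_flight", {"passengers": 2, "date": "2026-07-01"})
print(failures)  # reports the passenger-count mismatch
```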

If your app uses prompt chains, agent routing, or generated execution plans, document the intermediate steps. Concepts like an LLM compiler can be useful when teams need to reason about structured execution instead of a single prompt-response pair.

Replay real traces before launch

Production traces are one of the best sources of eval cases. They contain typos, missing information, strange user phrasing, long context, repeated requests, and tool errors that synthetic examples often miss.

If you have beta traffic, dogfood sessions, sales engineer demos, or historical chat logs, sample them into your eval set. Remove sensitive data before storing them. Keep the original structure when possible, including multi-turn state and tool responses.

Trace replay should answer:

  • Does the new prompt regress on cases the old prompt handled well?
  • Do latency and cost stay within your launch budget?
  • Do retrieved documents change in ways that affect answer quality?
  • Do tool calls remain stable after schema or prompt changes?
  • Do failures cluster around specific tags, users, or workflows?
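
The first and last questions can be answered with a simple comparison of replay results. This sketch assumes pass/fail verdicts keyed by trace ID and a hypothetical mapping from trace ID to tags; both function names are illustrative.

```python
from collections import Counter


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """IDs of traces that passed under the baseline but fail under the candidate.

    Traces missing from the candidate run are ignored here; a real harness
    would probably report them separately.
    """
    return sorted(
        trace_id
        for trace_id, passed_before in baseline.items()
        if passed_before and candidate.get(trace_id) is False
    )


def cluster_by_tag(trace_ids: list[str], tags_by_id: dict[str, list[str]]) -> Counter:
    """Count how often each tag appears among the given traces."""
    return Counter(tag for tid in trace_ids for tag in tags_by_id.get(tid, []))
```

Calling `cluster_by_tag(...).most_common()` on the regressed IDs shows which workflows to investigate first.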

Recommended visual: include a failing trace screenshot that shows the user input, prompt version, retrieved context, tool calls, model output, evaluator score, and failure reason.

Good LLM observability makes this practical. You need enough detail to reproduce the failure, assign ownership, and rerun the case after a fix.

Set release gates before you see the results

Define launch criteria before the team starts tuning prompts against the test set. Otherwise, thresholds will move whenever a favorite prompt version gets close.

A simple release gate might look like this:

  • P0 failures: 0 allowed
  • P1 pass rate: at least 98%
  • Overall regression pass rate: at least 95%
  • Schema validity: 99.5% or higher
  • Tool-call accuracy: 97% or higher on tagged tool cases
  • Median latency: under 2.5 seconds for non-streaming flows
  • p95 latency: under 8 seconds
  • Average cost per request: no more than 10% over budget
  • Evaluator calibration: agreement with human reviewers of 90% or higher on the calibration set

Adjust these numbers for your product risk. A writing assistant can tolerate more subjective variation than an agent that updates billing settings or files support tickets.
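
Encoding the gate as data makes it harder to move thresholds quietly. The sketch below uses the example numbers from the list above; the metric names are hypothetical, and the cost gate reflects one reading of the budget criterion (average cost at most 10% over budget).

```python
# kind "min": value must be >= limit; "max": <= limit; "below": strictly < limit.
GATES = {
    "p0_failures": ("max", 0),
    "p1_pass_rate": ("min", 0.98),
    "regression_pass_rate": ("min", 0.95),
    "schema_validity": ("min", 0.995),
    "tool_call_accuracy": ("min", 0.97),
    "median_latency_s": ("below", 2.5),
    "p95_latency_s": ("below", 8.0),
    "cost_vs_budget": ("max", 1.10),  # assumption: ratio of average cost to budget
    "judge_agreement": ("min", 0.90),
}


def evaluate_gates(results: dict[str, float]) -> dict[str, bool]:
    """Return pass/fail per gate; a missing metric counts as a failure."""
    outcome = {}
    for name, (kind, limit) in GATES.items():
        value = results.get(name)
        if value is None:
            outcome[name] = False
        elif kind == "min":
            outcome[name] = value >= limit
        elif kind == "max":
            outcome[name] = value <= limit
        else:  # "below"
            outcome[name] = value < limit
    return outcome


def can_ship(results: dict[str, float]) -> bool:
    """A candidate ships only if every gate passes."""
    return all(evaluate_gates(results).values())
```

Treating a missing metric as a failure means a skipped test cannot accidentally open the gate.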

Recommended visual: include a release-gate screenshot with each criterion, current result, pass/fail status, owner, and link to failing cases.

Run adversarial and messy-input tests

Users will not follow your ideal path. Add tests for broken formatting, prompt injection, conflicting instructions, unclear requests, and unsupported languages.

Useful cases include:

  • A user asks the model to ignore the system prompt.
  • A retrieved document contains malicious instructions.
  • A user provides two conflicting account IDs.
  • A tool returns a timeout, empty response, or malformed JSON.
  • A user asks for policy exceptions that your company does not allow.
  • A conversation exceeds the normal context length.
  • A user changes intent halfway through a multi-turn flow.

For agents, test action boundaries. If the agent can create, update, delete, purchase, refund, or send messages, add explicit cases where it must ask for confirmation or refuse.
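
Tool failures are among the easiest messy inputs to automate. The sketch below assumes a hypothetical `parse_tool_response` helper in which `None` stands for a timeout; it pairs each failure mode from the list above with the error the app should surface so that a fallback can run.

```python
import json
from typing import Any, Optional


def parse_tool_response(raw: Optional[str]) -> tuple[Any, Optional[str]]:
    """Normalize a raw tool response into (data, error) so the caller can fall back."""
    if raw is None:  # assumption: the caller passes None when the tool timed out
        return None, "timeout"
    if not raw.strip():
        return None, "empty response"
    try:
        return json.loads(raw), None
    except json.JSONDecodeError:
        return None, "malformed JSON"


# Each case pairs a raw tool response with the error we expect to be reported.
MESSY_TOOL_CASES = [
    (None, "timeout"),
    ("", "empty response"),
    ('{"eligible": tru', "malformed JSON"),
    ('{"eligible": true}', None),
]

for raw, expected_error in MESSY_TOOL_CASES:
    _, error = parse_tool_response(raw)
    assert error == expected_error, f"{raw!r}: expected {expected_error!r}, got {error!r}"
```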

Check latency, cost, and rate limits under realistic load

A launch candidate can pass quality evals and still fail operationally. Run load tests against the same path users will hit, including retrieval, tool calls, model calls, streaming, retries, and logging.

Measure:

  • Median, p90, p95, and p99 latency
  • Timeout rate
  • Retry rate
  • Provider error rate
  • Token usage per request
  • Cost per successful task
  • Queue time if requests run asynchronously

Use realistic concurrency. If you expect a peak of 200 concurrent users during launch week, test bursts at 1x, 2x, and 3x that level. Include worst-case prompts and long-context cases, not only short examples.
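
The latency percentiles and burst levels can be derived with the standard library. This is a sketch, assuming latencies are collected in seconds; the function names are hypothetical, and `statistics.quantiles` needs at least two samples.

```python
import statistics


def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """Median, p90, p95, and p99 from a list of request latencies in seconds."""
    # quantiles(n=100) returns 99 cut points; index k - 1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {
        "median": statistics.median(latencies_s),
        "p90": cuts[89],
        "p95": cuts[94],
        "p99": cuts[98],
    }


def burst_levels(expected_users: int, multipliers=(1, 2, 3)) -> list[int]:
    """Concurrency levels to test, as multiples of expected traffic."""
    return [expected_users * m for m in multipliers]
```

For example, `burst_levels(200)` yields 200, 400, and 600 simulated users.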

Compare against a baseline

Every launch candidate needs a baseline. The baseline can be the previous prompt, a simpler deterministic implementation, a human-reviewed answer set, or your current production model.

Track deltas, not only absolute scores:

  • Which tags improved?
  • Which tags regressed?
  • Did cost increase?
  • Did latency change?
  • Did the system become more verbose or less grounded?
  • Did tool-call behavior become less stable?

A prompt that improves overall score by 2% may still be unsafe if it introduces one P0 failure. A model upgrade that improves reasoning may fail your JSON contract more often. Baselines help you catch those tradeoffs before users do.
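
Per-tag deltas make those tradeoffs visible. The sketch below assumes each result row is a dict with the hypothetical keys `tags` and `passed`, and compares only tags that appear in both runs.

```python
from collections import defaultdict


def pass_rate_by_tag(results: list[dict]) -> dict[str, float]:
    """Pass rate per tag; a row with several tags counts toward each of them."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passes, total]
    for row in results:
        for tag in row["tags"]:
            totals[tag][0] += int(row["passed"])
            totals[tag][1] += 1
    return {tag: passed / total for tag, (passed, total) in totals.items()}


def tag_deltas(baseline: list[dict], candidate: list[dict]) -> dict[str, float]:
    """Candidate pass rate minus baseline pass rate, for tags present in both runs."""
    before = pass_rate_by_tag(baseline)
    after = pass_rate_by_tag(candidate)
    return {tag: after[tag] - before[tag] for tag in before.keys() & after.keys()}
```

A positive delta means the tag improved and a negative delta means it regressed, even when the overall score moved the other way.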

Common mistakes to avoid

  • Testing only happy paths: polished examples hide retrieval misses, refusals, tool errors, and ambiguous inputs.
  • Using tiny datasets: ten examples cannot cover the surface area of a production app.
  • Relying on vibes: “This answer seems better” does not work as a release process.
  • Changing prompts without regression tests: prompt edits often fix one case and break another.
  • Ignoring production traces: real user behavior should feed your eval set.
  • Setting thresholds without reviewer calibration: an 85% eval score means little if the evaluator disagrees with your reviewers on high-risk cases.
  • Scoring only the final answer: agents and RAG apps fail inside retrieval, routing, and tool calls.
  • Mixing dataset versions: changing the test set without recording it makes results hard to compare.

A practical pre-launch checklist

  1. Define the product contract for the LLM workflow.
  2. Freeze the candidate prompt, model settings, retrieval config, and tool schemas.
  3. Create a versioned eval dataset with normal, edge, adversarial, and trace replay cases.
  4. Write rubrics tied to real failure modes.
  5. Calibrate evaluator scores against 50 to 100 reviewed examples.
  6. Assert intermediate workflow behavior, including retrieval and tool calls.
  7. Replay production-like traces and investigate clustered failures.
  8. Run latency, cost, and rate-limit tests under expected traffic.
  9. Compare results against a baseline.
  10. Apply a release gate with clear pass and fail criteria.
  11. Add failing cases to the regression set after each fix.

The goal is not to make the LLM perfect. The goal is to make failure visible, repeatable, and bounded before launch. When you can reproduce failures, measure regressions, and block risky changes, you can ship faster without guessing.


PromptLayer helps AI teams manage prompts, eval datasets, traces, and release decisions in one workflow. If you are testing an LLM app before launch, create a PromptLayer account at https://dashboard.promptlayer.com/create-account.
