How to Run AI Software Development for LLM Apps

May 29, 2026

AI software development for LLM apps is software development with extra runtime uncertainty. Your code can pass unit tests while the model still gives a weak answer, calls the wrong tool, exceeds your budget, or fails on a user case your team never tried.

Teams that ship reliable LLM features treat prompts, model settings, context, tools, evals, traces, and releases as part of the engineering workflow. They version them. They test them. They review them. They monitor them after release.

This guide covers a practical operating model for teams building LLM-powered products, agents, prompt chains, and AI workflows.

1. Treat the LLM app as a system, not a prompt

A production LLM feature usually has more moving parts than the prompt itself:

  • Prompt templates and system instructions
  • Model choice, temperature, max tokens, and other parameters
  • Retrieved context, documents, memories, or user state
  • Tool schemas, function calls, APIs, and permissions
  • Output parsers and validators
  • Fallback behavior when the model or a tool fails
  • Cost limits, latency targets, and rate limits
  • Evaluation datasets and test cases
  • Traces, logs, and production feedback

If your team edits a prompt string inside application code and ships it like a copy change, you will lose track of what changed, why it changed, and whether it made the app better.

Run the LLM layer with the same discipline you use for application code: version control, review, testing, staging, release gates, and rollback.

2. Define the product behavior before you tune prompts

Before changing prompts, define what the feature should do in concrete terms. A vague goal like “answer support questions well” is hard to test. A better requirement gives your team expected behavior, failure behavior, and boundaries.

Example behavior spec for a support assistant

  • Primary task: Answer customer questions using only approved help center articles and account metadata.
  • Allowed actions: Search docs, check subscription plan, create support ticket.
  • Disallowed actions: Give legal advice, promise refunds, change account settings without confirmation.
  • Escalation rule: If confidence is low or the user asks about billing disputes, create a support ticket.
  • Latency target: First response under 4 seconds for 95% of requests.
  • Cost target: Average LLM cost under $0.03 per completed conversation.

This spec becomes the basis for prompts, evals, test cases, observability, and release review.
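One way to keep the spec testable is to hold it as data that both the application and the evals can read. The following is a minimal sketch under stated assumptions: the `BehaviorSpec` class, its field names, and the `billing_dispute` topic label are hypothetical, and the tool names are borrowed from the prompt example later in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorSpec:
    """Machine-readable form of a feature's behavior spec (hypothetical shape)."""
    primary_task: str
    allowed_tools: frozenset
    escalation_topics: tuple
    p95_latency_seconds: float
    max_avg_cost_usd: float

    def is_tool_allowed(self, tool_name: str) -> bool:
        # Anything not explicitly allowed is treated as disallowed.
        return tool_name in self.allowed_tools

    def requires_escalation(self, topic: str) -> bool:
        return topic in self.escalation_topics


SUPPORT_SPEC = BehaviorSpec(
    primary_task=(
        "Answer customer questions using only approved help center "
        "articles and account metadata."
    ),
    allowed_tools=frozenset(
        {"search_help_center", "get_subscription_plan", "create_ticket"}
    ),
    escalation_topics=("billing_dispute",),  # assumed topic label
    p95_latency_seconds=4.0,
    max_avg_cost_usd=0.03,
)
```

With the spec in code, a tool dispatcher can call `SUPPORT_SPEC.is_tool_allowed(...)` before executing anything the model requests, and the latency and cost targets can feed release checks directly.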

3. Version prompts and model settings together

One common mistake is treating prompts as unversioned strings. The prompt is part of your application logic. So are model settings, tool definitions, and retrieval settings.

Your team should know exactly which prompt version served a production request. You should also know who changed it, what changed, and which evals passed before release.

Example: prompt version history

Prompt: support_answer_v7
Owner: AI Platform Team
Model: gpt-4.1-mini
Temperature: 0.2
Max output tokens: 700
Tools: search_help_center, get_subscription_plan, create_ticket

Version history:
v7  2026-01-14  Added billing escalation rule
    Author: Maya R.
    Eval result: 91.2% pass rate, up from 87.4%
    Released to: 25% traffic

v6  2026-01-08  Reduced answer length and added citations
    Author: Luis T.
    Eval result: 87.4% pass rate
    Released to: 100% traffic

v5  2025-12-18  Added tool call format instructions
    Author: Priya S.
    Eval result: 84.9% pass rate
    Released to: 100% traffic

Prompt versioning prevents a painful class of bugs: “It worked last week, but no one knows what changed.” It also helps teams compare prompt iterations against the same dataset instead of relying on memory or a few manual examples.
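To make "prompt and settings versioned together" concrete, here is a rough sketch of a version record and an in-memory registry. `PromptVersion` and `PromptRegistry` are hypothetical names, not the API of any particular tool, and a real registry would persist to a database or to version control. The field values mirror the v7 example above; the template text is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    """Prompt text plus every setting that affects its behavior."""
    name: str
    version: int
    template: str
    model: str
    temperature: float
    max_output_tokens: int
    tools: tuple
    author: str
    change_note: str

    @property
    def version_id(self) -> str:
        return f"{self.name}_v{self.version}"


class PromptRegistry:
    """In-memory registry (illustrative only)."""

    def __init__(self) -> None:
        self._versions = {}

    def register(self, prompt: PromptVersion) -> None:
        # Released versions are immutable: any change needs a new version number.
        if prompt.version_id in self._versions:
            raise ValueError(f"{prompt.version_id} already exists")
        self._versions[prompt.version_id] = prompt

    def get(self, version_id: str) -> PromptVersion:
        return self._versions[version_id]


registry = PromptRegistry()
registry.register(PromptVersion(
    name="support_answer",
    version=7,
    template="(prompt text omitted)",
    model="gpt-4.1-mini",
    temperature=0.2,
    max_output_tokens=700,
    tools=("search_help_center", "get_subscription_plan", "create_ticket"),
    author="Maya R.",
    change_note="Added billing escalation rule",
))
```

Because the record is frozen and the registry rejects duplicates, a trace that logs `support_answer_v7` always resolves to the same prompt, model, and parameters.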

4. Build an evaluation set before launch

Manual testing in a chat window is useful early, but it cannot be your release process. You need repeatable evals that catch regressions before users do.

Start with 50 to 100 cases for a new feature. Include realistic user inputs, expected behavior, edge cases, and examples that should fail safely. For high-risk flows, use more. A customer support agent, coding agent, or financial analysis assistant may need hundreds or thousands of cases over time.

If your team is new to eval design, start with the basics of LLM evaluation: define the task, create a dataset, select scoring methods, run tests repeatedly, and track results by prompt version.

Eval case types to include

  • Happy paths: Common requests the model should answer correctly.
  • Ambiguous inputs: Requests that require clarification.
  • Missing context: Questions where the correct response is “I do not have enough information.”
  • Tool failures: Search API timeout, bad JSON, missing account record, permission denied.
  • Adversarial prompts: Attempts to override system instructions or access restricted data.
  • Long context: Requests with long documents, chat history, or retrieved chunks.
  • Cost-heavy cases: Inputs likely to trigger long outputs, repeated tool calls, or large retrieval payloads.

Example: eval table for a support assistant release

Eval group             Cases  v6 pass rate  v7 pass rate  Required gate  Status
Help center answers    120    89.2%         94.1%         90%            Pass
Billing escalation     45     73.3%         91.1%         90%            Pass
Prompt injection       60     96.7%         95.0%         95%            Pass
Tool failure handling  50     78.0%         86.0%         90%            Fail
Cost limit behavior    35     88.6%         91.4%         90%            Pass

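In this example, v7 should not go to 100% traffic. The team improved billing escalation, but tool failure handling still misses the release gate. Prompt injection also slipped from 96.7% to 95.0%, which only just meets its 95% gate and deserves a closer look before the next release.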
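The gate decision itself is easy to automate. The sketch below uses the pass rates and gates from the example table; the `EvalGroup` class and the helper function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalGroup:
    name: str
    old_pass_pct: float   # previous version (v6)
    new_pass_pct: float   # candidate version (v7)
    gate_pct: float       # minimum pass rate required to release


def groups_below_gate(groups):
    """Names of groups where the candidate misses its release gate."""
    return [g.name for g in groups if g.new_pass_pct < g.gate_pct]


def groups_that_regressed(groups):
    """Names of groups where the candidate scores lower than the old version."""
    return [g.name for g in groups if g.new_pass_pct < g.old_pass_pct]


def can_promote(groups) -> bool:
    return not groups_below_gate(groups)


V7_RESULTS = [
    EvalGroup("Help center answers", 89.2, 94.1, 90.0),
    EvalGroup("Billing escalation", 73.3, 91.1, 90.0),
    EvalGroup("Prompt injection", 96.7, 95.0, 95.0),
    EvalGroup("Tool failure handling", 78.0, 86.0, 90.0),
    EvalGroup("Cost limit behavior", 88.6, 91.4, 90.0),
]

print("Below gate:", groups_below_gate(V7_RESULTS))
print("Regressed:", groups_that_regressed(V7_RESULTS))
print("Promote to 100%:", can_promote(V7_RESULTS))
```

Reporting regressions separately from gate failures is a design choice: a group can pass its gate and still be worse than the version it replaces.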

5. Use the right scoring method for each behavior

No single eval method covers every LLM behavior. Use different scoring methods for different checks.

  • Exact match: Best for structured outputs, labels, routing decisions, and enum values.
  • Schema validation: Best for JSON outputs and tool call arguments.
  • Reference comparison: Best when there is a known answer or approved response.
  • LLM-as-a-judge: Useful for tone, completeness, groundedness, and reasoning quality when deterministic checks are too rigid.
  • Custom code: Best for business rules, policy checks, cost checks, and safety filters.

Use LLM-as-a-judge carefully. It works best when you give the judge a clear rubric, examples of good and bad answers, and a constrained output format such as a score plus a short reason. Do not use it as the only gate for high-risk behavior.
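The deterministic methods in the list are usually a few lines of code each. As an illustration, here is a sketch of an exact-match scorer and a very small schema check for tool call arguments. The function names are hypothetical, and the schema check only verifies required keys and their Python types; a production system would more likely use a full JSON Schema validator.

```python
import json


def score_exact_match(output: str, expected: str) -> bool:
    """Exact match after trimming whitespace and ignoring case."""
    return output.strip().lower() == expected.strip().lower()


def score_tool_arguments(raw_arguments: str, required: dict) -> bool:
    """Check that raw_arguments is a JSON object with the required typed keys.

    `required` maps argument name to expected Python type,
    for example {"customer_id": str}.
    """
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return False
    if not isinstance(args, dict):
        return False
    # A missing key yields None, which fails the isinstance check.
    return all(isinstance(args.get(key), typ) for key, typ in required.items())


print(score_exact_match(" Billing ", "billing"))
print(score_tool_arguments('{"customer_id": "cus_8129"}', {"customer_id": str}))
print(score_tool_arguments('{"customer_id": 8129}', {"customer_id": str}))
```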

Example judge rubric

Score the assistant response from 1 to 5.

Criteria:
1. Uses only the provided context.
2. Answers the user's question directly.
3. Includes a citation when making a factual claim.
4. Refuses or escalates when the context is insufficient.
5. Does not invent account details, policy terms, or refund promises.

Return JSON:
{
  "score": number,
  "pass": boolean,
  "reason": string
}
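A judge is itself an LLM, so its output deserves the same validation as any other model output. The sketch below parses the JSON format from the rubric and rejects anything malformed. `parse_judge_verdict` is a hypothetical helper, and the pass threshold of 4 is an assumption, not part of the rubric. The function recomputes `pass` from the score so that the gate does not depend on the judge's own boolean.

```python
import json


def parse_judge_verdict(raw: str, pass_threshold: int = 4) -> dict:
    """Validate a judge response and derive pass/fail from the score."""
    data = json.loads(raw)  # bad JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")

    reason = data.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")

    return {"score": score, "pass": score >= pass_threshold, "reason": reason}


verdict = parse_judge_verdict(
    '{"score": 2, "pass": true, "reason": "Invented a refund policy."}'
)
print(verdict["pass"])  # derived from the score, not the judge's "pass" field
```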

6. Test tool calls as first-class behavior

Many agent failures are tool failures. The model may call the wrong tool, pass malformed arguments, call tools in the wrong order, retry too often, or continue as if a failed tool succeeded.

Do not test tool calls only through end-to-end happy paths. Create specific tests for tool behavior.

Tool-call test cases

  • User asks for plan details, and the model must call get_subscription_plan before answering.
  • Search returns no results, and the model must ask a clarifying question.
  • Ticket creation API returns a 500, and the model must tell the user the request could not be completed.
  • Tool returns stale or incomplete data, and the model must avoid overclaiming.
  • User requests an action that requires confirmation, and the model must ask before calling the tool.

Example: trace of a failed LLM call

Trace ID: trc_7J4k9
User: "Cancel my subscription and refund the last charge."

Step 1: LLM call
Prompt version: support_agent_v7
Model: gpt-4.1-mini
Input tokens: 2,842
Output tokens: 188

Step 2: Tool call
Tool: get_subscription_plan
Arguments: {"customer_id":"cus_8129"}
Result: {"plan":"Pro","renewal_date":"2026-02-01"}

Step 3: Tool call
Tool: issue_refund
Arguments: {"customer_id":"cus_8129","amount":"last_charge"}
Result: error
Error: permission_denied

Step 4: LLM response
"I've canceled your subscription and processed the refund."

Failure:
The model claimed success after a permission_denied tool result.

Expected behavior:
Explain that the refund could not be processed automatically and create a support ticket.

This trace gives your team a concrete bug to fix. The issue is not “the model is bad.” The issue is that the prompt, tool policy, or post-tool validation allows the assistant to claim success after a failed tool call.
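Post-tool validation can start as a simple check that runs before the response reaches the user. The sketch below is a crude heuristic under stated assumptions: the status values, the phrase list, and the function name are all hypothetical, and phrase matching will miss paraphrases, so it complements evals instead of replacing them.

```python
FAILED_STATUSES = {"error", "permission_denied", "timeout"}

# Illustrative phrases only; a real list would come from reviewing traces.
SUCCESS_CLAIMS = (
    "processed the refund",
    "canceled your subscription",
    "has been refunded",
)


def claims_success_after_failure(tool_results, response: str) -> bool:
    """True if any tool failed but the response still claims success."""
    any_failed = any(r["status"] in FAILED_STATUSES for r in tool_results)
    text = response.lower()
    return any_failed and any(phrase in text for phrase in SUCCESS_CLAIMS)


# Tool results and final response taken from the trace above.
trace_tools = [
    {"tool": "get_subscription_plan", "status": "ok"},
    {"tool": "issue_refund", "status": "permission_denied"},
]
trace_response = "I've canceled your subscription and processed the refund."

if claims_success_after_failure(trace_tools, trace_response):
    print("Blocked: response claims success after a failed tool call")
```

When the check fires, the application can replace the response with a safe fallback and log the trace for review.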

7. Add guardrails around cost and latency

LLM apps can fail financially before they fail technically. A prompt chain that works in a demo can become too expensive in production if it retrieves too much context, calls the model too many times, or allows long outputs for simple requests.

Set cost and latency budgets before launch.

Example budgets

  • Simple classification: under 800 input tokens, under 100 output tokens, under $0.001 per request.
  • Support answer: under 5,000 input tokens, under 800 output tokens, under $0.03 per conversation.
  • Research workflow: under 8 model calls, under 30 seconds, under $0.50 per completed report.
  • Coding agent task: under 20 tool calls, under 5 minutes, hard stop at a configured spend limit.

Build these limits into the application. Do not rely on a dashboard review after a bill arrives.

Cost controls to implement

  • Maximum number of model calls per user request
  • Maximum number of tool calls per workflow
  • Token limits per prompt stage
  • Cheaper model for routing, classification, and simple extraction
  • Timeouts and retry limits for model calls and tools
  • Per-tenant or per-user spend limits
  • Alerts when cost per task rises by 20% or more
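Several of these controls can live in one small per-request object that the workflow consults before each call. This is a sketch, not a definitive implementation: `RequestBudget` and `BudgetExceeded` are hypothetical names, the limits are illustrative, and the cost passed to `add_cost` is assumed to be computed elsewhere from token counts and your provider's pricing.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request would go past a hard limit."""


class RequestBudget:
    def __init__(self, max_model_calls: int, max_tool_calls: int,
                 max_cost_usd: float) -> None:
        self.max_model_calls = max_model_calls
        self.max_tool_calls = max_tool_calls
        self.max_cost_usd = max_cost_usd
        self.model_calls = 0
        self.tool_calls = 0
        self.cost_usd = 0.0

    def start_model_call(self) -> None:
        # Check before the call so the limit is a hard stop, not a report.
        if self.model_calls >= self.max_model_calls:
            raise BudgetExceeded("model call limit reached")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded("spend limit reached")
        self.model_calls += 1

    def start_tool_call(self) -> None:
        if self.tool_calls >= self.max_tool_calls:
            raise BudgetExceeded("tool call limit reached")
        self.tool_calls += 1

    def add_cost(self, cost_usd: float) -> None:
        # Called after each model call with its measured cost.
        self.cost_usd += cost_usd


budget = RequestBudget(max_model_calls=8, max_tool_calls=20, max_cost_usd=0.50)
budget.start_model_call()
budget.add_cost(0.04)
print(budget.model_calls, budget.cost_usd)
```

Catching `BudgetExceeded` at the top of the workflow gives one place to return a fallback answer and emit an alert.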

8. Monitor LLM behavior, not only infrastructure

Infrastructure metrics tell you whether your service is up. They do not tell you whether your LLM app is giving correct answers, using tools safely, or drifting after a prompt change.

You need LLM observability that records the full request path: prompt version, model, parameters, retrieved context, tool calls, outputs, latency, token counts, cost, user feedback, and eval results.

Production metrics to track

  • Quality: user rating, thumbs-down rate, escalation rate, eval pass rate on sampled production traces.
  • Grounding: citation coverage, unsupported claim rate, retrieval miss rate.
  • Tool use: tool-call success rate, invalid argument rate, tool timeout rate, retry count.
  • Reliability: model error rate, fallback rate, JSON parse failure rate.
  • Cost: average cost per request, p95 cost per request, spend by feature and tenant.
  • Latency: p50, p95, and p99 latency by workflow stage.

Set alerts on behavior metrics. For example, a sudden rise in “tool permission denied followed by success claim” is more useful than a generic 200 OK rate.
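Once traces are stored in a consistent shape, behavior metrics are straightforward aggregations. The sketch below computes a nearest-rank percentile and a tool-call success rate. The trace layout (`latency_ms`, `tool_calls`, `ok`) is an assumed format for illustration, not the schema of any specific observability product.

```python
def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def tool_success_rate(traces):
    """Fraction of tool calls that succeeded, or None if there were none."""
    calls = [call for trace in traces for call in trace["tool_calls"]]
    if not calls:
        return None
    return sum(1 for call in calls if call["ok"]) / len(calls)


traces = [
    {"latency_ms": 1200, "tool_calls": [{"tool": "search_help_center", "ok": True}]},
    {"latency_ms": 900, "tool_calls": []},
    {"latency_ms": 4100, "tool_calls": [
        {"tool": "get_subscription_plan", "ok": True},
        {"tool": "issue_refund", "ok": False},
    ]},
    {"latency_ms": 1500, "tool_calls": []},
]

latencies = [t["latency_ms"] for t in traces]
print("p50 latency (ms):", percentile(latencies, 50))
print("p95 latency (ms):", percentile(latencies, 95))
print("tool success rate:", tool_success_rate(traces))
```

Grouping the same calculations by prompt version is what lets you compare releases instead of watching a single blended number.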

9. Design prompt changes as small, reviewable changes

A prompt change should have a clear purpose. Avoid large rewrites that mix tone changes, policy changes, tool instructions, and output format changes in one release. If results improve or regress, you will not know which change caused it.

Example: before and after prompt iteration

Before: weak tool failure handling

You are a helpful support assistant.
Answer the user's question using the available tools.
If needed, create a support ticket.

After: specific behavior for failed actions

You are a support assistant for account and billing questions.

Rules:
- Use help center context and account tools before answering account-specific questions.
- Never claim that an account action succeeded unless the tool result confirms success.
- If a tool returns permission_denied, timeout, or error, explain that the action could not be completed automatically.
- For refund disputes, billing errors, or failed account actions, create a support ticket when the user agrees.
- Do not promise refunds, credits, cancellations, or policy exceptions.

The second prompt gives the model clear rules for a known failure mode. It is easier to test because the expected behavior is explicit.

10. Use release gates for LLM changes

Do not ship prompt changes directly to all users because a few manual tests looked good. Use release gates. Your gates should match the risk of the feature.

Example release checklist

  • Prompt version has an owner, description, and change reason.
  • Model, parameters, tools, and retrieval settings are recorded.
  • Eval dataset ran against old and new versions.
  • All required eval groups passed release thresholds.
  • Tool failure tests passed, including timeout and permission errors.
  • Prompt injection and data access tests passed.
  • Cost and latency budgets passed on test runs.
  • Rollback version is selected and tested.
  • Production monitoring dashboard is filtered by prompt version.
  • Gradual rollout plan is set: 5%, 25%, 50%, 100%.

For low-risk copy generation, your release process can be lightweight. For an agent that takes user actions, handles regulated data, or writes code, use stricter gates.

11. Roll out gradually and compare versions in production

Offline evals are necessary, but production traffic will always reveal cases your dataset missed. Use gradual rollouts and compare versions against real usage.

A practical rollout plan:

  1. Staging: Run internal test cases and seeded edge cases.
  2. Dogfood: Send internal team traffic to the new version for 1 to 3 days.
  3. 5% rollout: Watch error rate, cost, latency, and user feedback.
  4. 25% rollout: Compare against the previous version on key metrics.
  5. 50% rollout: Run sampled production traces through evals.
  6. 100% rollout: Keep the previous version ready for rollback for at least one release cycle.
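Percentage rollouts need a stable way to decide which users see the new version. One common approach, sketched here with hypothetical function names, is to hash the user ID into one of 100 buckets. The same user then stays on the same version between requests, and anyone included at 5% is still included at 25%.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def choose_version(user_id: str, new_version: str, old_version: str,
                   rollout_percent: int, salt: str = "support_answer") -> str:
    """Serve new_version to roughly rollout_percent of users."""
    if rollout_bucket(user_id, salt) < rollout_percent:
        return new_version
    return old_version


for percent in (5, 25, 50, 100):
    served = choose_version("user_42", "support_answer_v7",
                            "support_answer_v6", percent)
    print(percent, served)
```

Using a per-feature salt keeps rollouts for different features independent, so the same users are not always the first to receive every change.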

If you run multi-step prompt chains, you can also test specific stages. For complex chains, an LLM compiler approach can help teams reason about prompt workflows, dependencies, and execution plans more systematically.

12. Build a feedback loop from production to evals

Your eval dataset should grow from real failures. When a user gives negative feedback, a tool call fails, or an engineer finds a bad trace, convert it into a test case.

Production failure to eval case

Production issue:
User asked for a refund after a duplicate charge.
Assistant said a refund was issued.
Refund tool returned permission_denied.

New eval case:
Input: "I was charged twice. Refund the duplicate charge."
Mock tool result: permission_denied
Expected: Assistant explains it cannot process the refund directly and offers to create a support ticket.
Scoring:
- Must not claim refund success.
- Must mention support ticket.
- Must not promise refund approval.

This is how your system gets better over time. Every serious failure should become a regression test.
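The eval case above translates almost directly into code. The sketch below stores the case as data and scores a response with substring checks. `EvalCase` and `score_response` are hypothetical names, and the phrase lists are illustrative. Substring checks are deliberately strict and simple, so a case like this is often paired with a judge-based score for paraphrases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    user_input: str
    mock_tool_result: str
    must_contain: tuple
    must_not_contain: tuple


def score_response(case: EvalCase, response: str) -> list:
    """Return a list of failure messages; an empty list means the case passed."""
    text = response.lower()
    failures = []
    for phrase in case.must_not_contain:
        if phrase in text:
            failures.append(f"forbidden phrase: {phrase}")
    for phrase in case.must_contain:
        if phrase not in text:
            failures.append(f"missing phrase: {phrase}")
    return failures


DUPLICATE_CHARGE_CASE = EvalCase(
    user_input="I was charged twice. Refund the duplicate charge.",
    mock_tool_result="permission_denied",
    must_contain=("support ticket",),
    must_not_contain=(
        "refund was issued",
        "refund has been issued",
        "refund will be approved",
    ),
)

response = ("I can't process the refund directly. "
            "Would you like me to create a support ticket?")
print(score_response(DUPLICATE_CHARGE_CASE, response))
```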

13. Assign clear ownership

LLM apps often fail when ownership is vague. Product owns the user outcome. Engineering owns runtime behavior. AI or platform teams may own shared prompt infrastructure, eval tooling, and observability. Legal, security, or support may own policy rules for specific domains.

Write this down for each LLM feature.

Example ownership map

Area              Owner                     Review cadence
Prompt behavior   Feature engineering team  Every release
Eval dataset      AI engineering team       Weekly
Tool permissions  Backend platform team     Every API change
Support policy    Support operations        Monthly or policy change
Cost budget       Engineering manager       Weekly during rollout

Without clear owners, prompt fixes turn into scattered edits across code, docs, and chat threads.

14. Keep the workflow simple enough to follow

A good AI development process should reduce risk without slowing every change to a crawl. Start with the smallest process that catches your most common failures.

For many teams, this baseline works well:

  • Version every production prompt.
  • Keep at least 100 eval cases for each important LLM feature.
  • Run evals before every prompt or model change.
  • Trace every production request with prompt version, model, tools, cost, and latency.
  • Add failed production cases back into evals every week.
  • Use gradual rollout for high-risk changes.
  • Set hard cost, timeout, and retry limits.

As your app grows, add stronger review, larger eval datasets, automated red-team tests, and more detailed release gates.

Common mistakes to avoid

  • Treating prompts as unversioned strings: You lose history, ownership, and rollback.
  • Skipping evals: Manual testing misses regressions and edge cases.
  • Testing only happy paths: Real users trigger ambiguity, missing data, and policy conflicts.
  • Ignoring tool-call failures: Agents often fail when tools return errors, stale data, or permission denials.
  • Shipping without cost limits: A working workflow can still become too expensive.
  • Monitoring only infrastructure metrics: A 200 OK response can contain a bad answer.
  • Changing too much at once: Large prompt rewrites make regressions hard to diagnose.
  • Not saving production failures: If a bug does not become an eval case, it can return later.

A practical operating model

Here is a simple way to run AI software development for LLM apps:

  1. Write the behavior spec for the LLM feature.
  2. Create prompt versions with owners and change notes.
  3. Build an eval dataset with happy paths, edge cases, tool failures, and safety cases.
  4. Trace every model call, retrieval step, tool call, output, cost, and latency.
  5. Set release gates for quality, safety, cost, and latency.
  6. Roll out gradually and compare versions in production.
  7. Turn production failures into new eval cases.
  8. Review the system weekly while the feature is active.

This workflow gives your team a better chance of shipping LLM features that keep working after launch. It also gives engineers the evidence they need to improve the system without guessing.


PromptLayer helps AI teams manage prompt versions, run evals, trace LLM requests, debug failures, and ship LLM apps with more control. Create an account at https://dashboard.promptlayer.com/create-account.
