How to Do AI Prompt Engineering in LLM Apps
Prompt engineering in an LLM app is software engineering. You are designing an interface between your product, your data, your tools, your model, and your users. A good prompt is not a clever paragraph that works once in a notebook. It is a versioned, tested, observable part of your application.
If you are building customer support agents, coding assistants, document review workflows, sales copilots, or internal automation, your prompt needs to survive messy inputs, model updates, latency limits, cost pressure, and changing product requirements.
This guide covers a practical workflow for doing prompt engineering in production LLM apps.
Start with the application behavior, not the prompt
Before writing a prompt, define what the app must do. Treat the prompt as one part of the system design.
Write down:
- User intent: What is the user trying to accomplish?
- Inputs: What data will the model receive at runtime?
- Allowed actions: Can the model call tools, ask follow-up questions, draft text, or make decisions?
- Output contract: Does the app need JSON, Markdown, plain text, citations, tool calls, or a ranked list?
- Failure behavior: What should happen when context is missing, ambiguous, unsafe, or too large?
- Success metrics: What makes one response better than another?
For example, “answer customer questions” is too broad. A better spec is:
- Answer billing questions using only the retrieved help center articles and account metadata.
- Ask a clarifying question if the user refers to an invoice but no invoice ID is present.
- Return a short answer, then list the source article titles used.
- Never promise refunds. If the user asks for a refund, route to the billing tool.
That spec gives you something testable. It also prevents the prompt from becoming a hidden product requirements document.
Separate instructions, context, examples, and output format
Many weak prompts fail because they mix everything into one block. Keep the prompt structured so you can debug it.
Before
You are a helpful support assistant. Answer the customer using our policies. Be concise and friendly. If you do not know, say so. Return JSON. The customer is asking about a refund and here are some docs...
After
SYSTEM:
You are a billing support assistant for Acme Cloud.
TASK:
Answer the user's billing question using only the provided policy excerpts and account metadata.
RULES:
- Do not invent refund eligibility.
- If the policy excerpts do not answer the question, say you need to route the issue to billing support.
- If the user requests a refund, do not approve or deny it. Return action_required: "billing_review".
- Keep the customer-facing answer under 120 words.
CONTEXT:
Account plan: {{plan}}
Account region: {{region}}
Policy excerpts:
{{retrieved_policy_docs}}
USER MESSAGE:
{{user_message}}
OUTPUT FORMAT:
Return valid JSON:
{
  "answer": string,
  "action_required": "none" | "billing_review" | "missing_information",
  "sources": string[]
}
This structure makes each part easier to test. If the model invents a refund policy, inspect the rules and retrieved documents. If the app breaks on parsing, tighten the output format. If the answer is too vague, improve the task or add examples.
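A minimal sketch of how a template like the one above might be filled and its output checked in Python. The helper names `render_prompt` and `parse_response` are assumptions made for illustration, not part of any SDK, and the model call itself is left out.

```python
import json
import re

ALLOWED_ACTIONS = {"none", "billing_review", "missing_information"}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# Shortened version of the CONTEXT and USER MESSAGE sections shown above.
TEMPLATE = """CONTEXT:
Account plan: {{plan}}
Account region: {{region}}
Policy excerpts:
{{retrieved_policy_docs}}
USER MESSAGE:
{{user_message}}"""


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Fill {{name}} placeholders in one pass; fail loudly on a missing variable."""
    def fill(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing runtime variable: {name}")
        return variables[name]

    # A single pass means braces inside user text are never re-expanded.
    return PLACEHOLDER.sub(fill, template)


def parse_response(raw: str) -> dict:
    """Parse the model reply and enforce the output contract (raises ValueError)."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    if not isinstance(data.get("answer"), str):
        raise ValueError("answer must be a string")
    if data.get("action_required") not in ALLOWED_ACTIONS:
        raise ValueError("action_required is not an allowed value")
    sources = data.get("sources")
    if not isinstance(sources, list) or not all(isinstance(s, str) for s in sources):
        raise ValueError("sources must be a list of strings")
    return data
```

Keeping rendering and parsing as separate functions mirrors the prompt sections: a rendering failure points at runtime variables, while a parsing failure points at the output format.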
Use examples, but do not optimize for one example
Few-shot examples can improve consistency, especially for classification, extraction, formatting, and policy decisions. The mistake is tuning the prompt until it passes one example while failing the next 20.
Use examples that represent real input patterns:
- A normal successful case
- A missing context case
- An ambiguous user request
- A policy boundary case
- A malformed or noisy input
Classification example
EXAMPLES:
User: "Can you cancel my subscription before the next renewal?"
Output:
{
  "intent": "cancel_subscription",
  "needs_tool": true,
  "confidence": 0.94
}
User: "Why was I charged twice?"
Output:
{
  "intent": "billing_dispute",
  "needs_tool": true,
  "confidence": 0.91
}
User: "Thanks, that fixed it."
Output:
{
  "intent": "other",
  "needs_tool": false,
  "confidence": 0.88
}
Keep examples short and targeted. If you need 25 examples to make one prompt behave, the task may need a classifier step, a retrieval step, or a narrower prompt.
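One way to keep few-shot examples maintainable is to store them as data and render the block at runtime. The sketch below assumes a hypothetical `format_examples` helper and reuses the three examples from the text.

```python
import json

# Example pairs copied from the text; extend with real traffic patterns.
EXAMPLES = [
    {
        "user": "Can you cancel my subscription before the next renewal?",
        "output": {"intent": "cancel_subscription", "needs_tool": True, "confidence": 0.94},
    },
    {
        "user": "Why was I charged twice?",
        "output": {"intent": "billing_dispute", "needs_tool": True, "confidence": 0.91},
    },
    {
        "user": "Thanks, that fixed it.",
        "output": {"intent": "other", "needs_tool": False, "confidence": 0.88},
    },
]


def format_examples(examples: list[dict]) -> str:
    """Render example pairs as a few-shot block with JSON outputs."""
    blocks = []
    for example in examples:
        user_line = "User: " + json.dumps(example["user"])
        output_json = json.dumps(example["output"], indent=2)
        blocks.append(user_line + "\nOutput:\n" + output_json)
    return "EXAMPLES:\n" + "\n\n".join(blocks)


few_shot_block = format_examples(EXAMPLES)
```

Because the examples are plain data, adding a missing-context or noisy-input case is a one-entry change, and every example output can be checked against the same schema the app parses.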
Keep business logic out of long prompt paragraphs
Prompts often become the place where product logic goes to hide. That creates real problems. Engineers cannot easily test it, product teams cannot review it, and small edits can change behavior in production.
Do not bury rules like this:
Be helpful, but remember that enterprise users get premium support unless they are on trial, except if the issue is related to billing, in which case ask for an invoice ID, but do not ask for one if they already provided it...
Move logic into code, config, or a policy table when possible:
{
  "support_tier": "premium",
  "is_trial": false,
  "issue_type": "billing",
  "required_fields": ["invoice_id"],
  "missing_fields": []
}
Then let the prompt reason over clean state:
Use the provided support state to decide the next response.
If missing_fields is empty, do not ask for more information.
If issue_type is billing and required data is present, prepare the billing support response.
This is similar to feature engineering in traditional ML. You improve the model input by giving it clear, useful variables instead of raw, tangled state.
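As a rough illustration, the support state could be computed in code before the prompt is rendered. The `build_support_state` function, the `REQUIRED_FIELDS` table, and the "standard" tier name are assumptions for this sketch; the tier rule is one possible reading of the tangled paragraph above.

```python
# Hypothetical policy table: fields each issue type needs before the model responds.
REQUIRED_FIELDS = {"billing": ["invoice_id"]}


def build_support_state(plan: str, is_trial: bool, issue_type: str, provided: dict) -> dict:
    """Resolve support rules in code and return clean state for the prompt."""
    # Assumed reading of the rule: enterprise users get premium support unless on trial.
    support_tier = "premium" if plan == "enterprise" and not is_trial else "standard"
    required = REQUIRED_FIELDS.get(issue_type, [])
    # A field counts as missing only if the user has not already supplied it.
    missing = [name for name in required if not provided.get(name)]
    return {
        "support_tier": support_tier,
        "is_trial": is_trial,
        "issue_type": issue_type,
        "required_fields": list(required),
        "missing_fields": missing,
    }


state = build_support_state("enterprise", False, "billing", {"invoice_id": "INV-1"})
```

Each branch here can be unit tested and reviewed like any other code, which is not true of the same rule written as a prompt paragraph.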
Break complex workflows into prompt chains
A single giant prompt is hard to debug. It also increases token cost, latency, and failure surface area. If the task has multiple decisions, split it into smaller steps.
A customer support agent might use this chain:
- Intent detection: Classify the request.
- Retrieval query generation: Create search queries for internal docs.
- Answer drafting: Write an answer using retrieved context.
- Policy check: Verify that the answer does not violate refund, legal, or security rules.
- Final response: Return the customer-facing message and required action.
This pattern is easier to evaluate because each step has a smaller job. If the answer is wrong, you can inspect whether the classifier failed, retrieval returned poor context, or the final prompt ignored the policy.
For multi-step LLM systems, use prompt chaining instead of asking one prompt to classify, retrieve, reason, format, and validate in a single call.
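The chain can be sketched as a plain function that calls the model once per step and keeps every intermediate output. Here `call_model` and `search_docs` are injected callables standing in for a real model client and retrieval layer, both hypothetical, and the final step is handled in code rather than by another model call.

```python
from typing import Callable

ModelFn = Callable[[str, str], str]  # (step_name, prompt) -> model text
SearchFn = Callable[[str], str]      # search query -> retrieved context


def run_support_chain(user_message: str, call_model: ModelFn, search_docs: SearchFn) -> dict:
    """Run each step separately and keep every intermediate output for inspection."""
    steps = {}
    steps["intent"] = call_model("intent", "Classify the request: " + user_message)
    steps["query"] = call_model("query", "Write a search query for: " + user_message)
    context = search_docs(steps["query"])
    steps["draft"] = call_model(
        "draft", "Context:\n" + context + "\n\nAnswer the user: " + user_message
    )
    steps["policy_check"] = call_model(
        "policy_check",
        "Reply PASS if the answer follows refund, legal, and security rules, "
        "otherwise FAIL.\n" + steps["draft"],
    )
    # Final step in plain code: only a draft that passed the check is released.
    if steps["policy_check"].strip().upper() == "PASS":
        final = {"answer": steps["draft"], "action_required": "none"}
    else:
        final = {
            "answer": "I need to route this to billing support.",
            "action_required": "billing_review",
        }
    return {"steps": steps, "final": final}


def fake_model(step: str, prompt: str) -> str:
    """Stand-in for a real model client so the chain can run in tests."""
    canned = {
        "intent": "billing_dispute",
        "query": "duplicate charge policy",
        "draft": "We are reviewing the duplicate charge.",
        "policy_check": "PASS",
    }
    return canned[step]
```

Since each step is keyed by name in the returned `steps` dictionary, a wrong answer can be traced to the classifier, the retrieval query, the draft, or the policy check without rerunning the whole chain.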
Design prompts around model limits
Every model has tradeoffs. A prompt that works well on GPT-4.1 may behave differently on Claude, Gemini, a smaller open model, or a fine-tuned model. Before shipping, define the model constraints that matter for your app.
- Context window: How much input can the model handle before retrieval quality or instruction following drops?
- Output length: Can the model reliably return the full object you need?
- Tool calling: Does the model call tools consistently under ambiguous conditions?
- Structured output: Does it return valid JSON at the rate your app requires?
- Latency: Can the model meet your user-facing response budget?
- Cost: Can the workflow run at expected production volume?
Concrete example: if your app has a 3-second response target, a 6-step agent with long context may miss it even when each individual prompt looks good in testing, because the latencies of sequential calls add up. You may need a smaller model for classification, cached retrieval, shorter context, or an async workflow.
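The arithmetic behind the latency example can be checked with a few lines of Python. The step names and millisecond figures below are illustrative assumptions, not measurements.

```python
def check_latency_budget(step_latencies_ms: dict[str, int], budget_ms: int) -> dict:
    """Sum sequential step latencies and compare the total to the response budget."""
    total_ms = sum(step_latencies_ms.values())
    slowest_step = max(step_latencies_ms, key=step_latencies_ms.get)
    return {
        "total_ms": total_ms,
        "over_budget_ms": max(0, total_ms - budget_ms),
        "slowest_step": slowest_step,
    }


# Illustrative per-step latencies for a 6-step agent.
step_latencies = {
    "intent": 400,
    "query": 500,
    "retrieval": 300,
    "draft": 1200,
    "policy_check": 700,
    "final": 400,
}
report = check_latency_budget(step_latencies, budget_ms=3000)
```

With these made-up numbers no single step exceeds the budget, yet the chain totals 3500 ms and overshoots the 3000 ms target by 500 ms, with the draft step as the first candidate to shorten.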
Use evals before changing prompts
Prompt edits without evals are guesses. You need a dataset that represents real production traffic and expected behavior.
Start with 30 to 100 examples for one workflow. Include normal cases and failure cases. For each example, store:
- The user input
- Runtime variables
- Retrieved context, if used
- The expected behavior
- Scoring criteria
- The prompt version and model version
Use a mix of deterministic checks and model-graded checks.
Deterministic checks
- JSON is valid.
- Required keys are present.
- Answer is under 120 words.
- Action is one of the allowed enum values.
- Sources are included when retrieval context is used.
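The deterministic checks listed above map directly onto code. This is a sketch that assumes the billing output schema used earlier in this guide; `deterministic_checks` is a hypothetical helper name.

```python
import json

ALLOWED_ACTIONS = {"none", "billing_review", "missing_information"}
REQUIRED_KEYS = {"answer", "action_required", "sources"}


def deterministic_checks(raw: str, used_retrieval: bool, max_words: int = 120) -> dict[str, bool]:
    """Run the code-only checks on one raw model response."""
    results = {
        "valid_json_object": False,
        "required_keys": False,
        "under_word_limit": False,
        "allowed_action": False,
        "sources_present": False,
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return results
    if not isinstance(data, dict):
        return results  # parsed, but not the JSON object the app expects
    results["valid_json_object"] = True
    results["required_keys"] = REQUIRED_KEYS.issubset(data)

    answer = data.get("answer")
    results["under_word_limit"] = isinstance(answer, str) and len(answer.split()) < max_words

    action = data.get("action_required")
    results["allowed_action"] = isinstance(action, str) and action in ALLOWED_ACTIONS

    # Sources are only mandatory when retrieval context was supplied.
    sources = data.get("sources")
    has_sources = isinstance(sources, list) and len(sources) > 0
    results["sources_present"] = has_sources or not used_retrieval
    return results
```

Returning one boolean per check, rather than a single pass or fail, makes it easier to see which part of the output contract a prompt change broke.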
Model-graded checks
- Does the answer follow the policy?
- Does the answer use only the provided context?
- Did the model ask for missing information when needed?
- Is the tone appropriate for a customer-facing support response?
A simple eval table might look like this:
| Test case | Expected behavior | Pass condition |
|---|---|---|
| User asks for refund with no invoice ID | Ask for the invoice ID; do not approve or deny the refund | action_required equals "missing_information" |
| User asks if trial includes premium support | Answer using plan policy only | No unsupported claims |
| User says "cancel it" | Ask what they want to cancel | No tool call until target is clear |
Do not ship a prompt change because it improves one demo. Ship it because it improves your eval set without causing regressions in critical cases.
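A ship decision of this kind can be expressed as a comparison of two eval runs over the same cases. The sketch below assumes each run is a mapping from test case name to pass or fail; `compare_runs` and the case names in the usage example are hypothetical.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool], critical: set[str]) -> dict:
    """Compare per-case results for two prompt versions on the same eval set."""
    if not before:
        raise ValueError("eval set is empty")
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same test cases")
    # A regression is a case that passed before the change and fails after it.
    regressions = sorted(case for case in before if before[case] and not after[case])
    critical_regressions = [case for case in regressions if case in critical]
    rate_before = sum(before.values()) / len(before)
    rate_after = sum(after.values()) / len(after)
    return {
        "pass_rate_before": rate_before,
        "pass_rate_after": rate_after,
        "regressions": regressions,
        "critical_regressions": critical_regressions,
        "ship": rate_after > rate_before and not critical_regressions,
    }


old_run = {"refund_no_invoice": False, "trial_support": True, "cancel_it": True, "double_charge": False}
new_run = {"refund_no_invoice": True, "trial_support": True, "cancel_it": False, "double_charge": True}
decision = compare_runs(old_run, new_run, critical={"cancel_it"})
```

In this example the pass rate rises from 50% to 75%, but the change is still blocked because a case marked critical regressed.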
Version prompts like code
A production prompt needs version history, review, rollback, and release notes. Store the prompt text, model parameters, runtime variables, and expected schema together.
For each prompt version, record:
- What changed
- Who changed it
- Which eval dataset was run
- Pass rate before and after
- Known tradeoffs
- Whether the change shipped to staging or production
Example release note:
Prompt: billing_support_answer
Version: 18
Change: Added explicit rule to route refund requests to billing_review.
Eval result: 91% to 96% pass rate on billing_support_eval_v4.
Regression: Slightly longer answers in 3 of 80 cases.
Decision: Ship to staging for trace review.
This is where prompt management becomes important. A prompt edited in a local notebook, pasted into production, and forgotten will eventually create a debugging problem.
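One possible shape for a versioned prompt record, sketched as a frozen dataclass. The field names, the `example-model` identifier, and the parameter values are assumptions for illustration; a prompt management tool would provide its own equivalent.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class PromptVersion:
    """Prompt text plus everything needed to reproduce its behavior."""
    name: str
    version: int
    template: str
    model: str
    model_params: dict
    variables: tuple
    output_schema: dict
    change_note: str

    def fingerprint(self) -> str:
        """Short content hash so a trace can show exactly which prompt ran."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v18 = PromptVersion(
    name="billing_support_answer",
    version=18,
    template="Answer the user's billing question using only the provided policy excerpts.",
    model="example-model",  # hypothetical model identifier
    model_params={"temperature": 0.2},
    variables=("plan", "region", "retrieved_policy_docs", "user_message"),
    output_schema={"required": ["answer", "action_required", "sources"]},
    change_note="Added explicit rule to route refund requests to billing_review.",
)
# A new version is a new record, never an in-place edit of the old one.
v19 = replace(v18, version=19, change_note="Shortened answers.")
```

Storing the template, parameters, variables, and schema in one immutable record means a rollback restores all of them together, not just the prompt text.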
Trace every production request
When an LLM app fails, the final answer is rarely enough to debug the issue. You need the full trace.
Capture:
- Prompt version
- Model name and parameters
- User input
- Runtime variables
- Retrieved documents and scores
- Tool calls and tool outputs
- Intermediate model responses
- Final output
- Latency and token usage
- User feedback or downstream outcome
For a failure case, a useful trace might show that retrieval returned the wrong policy article, while the answer prompt followed that bad context correctly. Without the trace, the team may waste time rewriting the answer prompt.
If your app uses agents, traces are even more important. You need to see which tool the model selected, what arguments it used, whether the tool returned an error, and how the model recovered.
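A trace can be as simple as an append-only list of events attached to one request. This is a minimal in-memory sketch; the `Trace` class, the event kinds, and the tool and document names are hypothetical, and a production system would send the events to a tracing backend instead.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects the steps of one request so a failure can be replayed."""
    prompt_version: str
    model: str
    user_input: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list = field(default_factory=list)

    def record(self, kind: str, name: str, payload: dict, started: float) -> None:
        """Append one step (retrieval, tool_call, model_response) with its latency."""
        latency_ms = (time.perf_counter() - started) * 1000
        self.events.append(
            {"kind": kind, "name": name, "payload": payload, "latency_ms": latency_ms}
        )

    def events_of(self, kind: str) -> list:
        """Return recorded events of one kind, in the order they happened."""
        return [event for event in self.events if event["kind"] == kind]


trace = Trace(
    prompt_version="billing_support_answer@18",
    model="example-model",  # hypothetical model identifier
    user_input="Why was I charged twice?",
)
t0 = time.perf_counter()
trace.record("retrieval", "help_center_search", {"doc_titles": ["Refund policy"], "scores": [0.41]}, t0)
t1 = time.perf_counter()
trace.record("tool_call", "lookup_invoice", {"args": {"invoice_id": None}, "error": "missing invoice_id"}, t1)
```

Reading the events in order shows whether the problem started with a low-scoring retrieval result or with a tool call made without the arguments it needed.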
Use screenshots and artifacts during review
Prompt reviews work better when reviewers can see real behavior. For each meaningful prompt change, save artifacts that make the change easy to inspect:
- A prompt diff between the old and new version
- A trace for a passing case
- A trace for a failure case
- An eval result summary
- Three to five representative outputs before and after the change
For example, if you change a tool-calling prompt, include a trace screenshot showing the old version calling refund_customer too early and the new version returning billing_review instead. That gives reviewers concrete behavior, not just prompt text.
Treat prompts as part of application design
Prompt engineering should not sit apart from the rest of engineering. The prompt depends on your retrieval design, data model, tools, UI, error states, and product rules.
Ask these questions during design review:
- Can the UI collect missing information before calling the model?
- Can code enforce a rule instead of asking the model to remember it?
- Can retrieval return smaller, more relevant chunks?
- Can the model choose between a limited set of actions instead of free-form behavior?
- Can a cheaper model handle an early classification step?
- Can the app validate the model output before showing it to the user?
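Two of these questions, limiting the action set and enforcing a rule in code, can be illustrated together. The sketch assumes the billing output schema from earlier in this guide; the `refund_request` intent label and the fixed customer message are hypothetical.

```python
from enum import Enum


class Action(str, Enum):
    """The only actions the app accepts from the model."""
    NONE = "none"
    BILLING_REVIEW = "billing_review"
    MISSING_INFORMATION = "missing_information"


def enforce_refund_rule(intent: str, model_output: dict) -> dict:
    """Apply the refund rule in code, whatever the model returned."""
    # Raises ValueError if the model produced an action outside the enum.
    action = Action(model_output["action_required"])
    # A refund request must never be answered directly by the model.
    if intent == "refund_request" and action is Action.NONE:
        return {
            "answer": "I've sent your refund request to our billing team for review.",
            "action_required": Action.BILLING_REVIEW.value,
            "sources": [],
        }
    return {**model_output, "action_required": action.value}


unsafe = {"answer": "Sure, your refund is approved!", "action_required": "none", "sources": []}
safe = enforce_refund_rule("refund_request", unsafe)
```

The prompt can still ask the model to route refunds, but the guarantee now lives in code that runs on every response before it reaches the user.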
In some systems, prompts become a compilation target. Higher-level task specs, schemas, tools, and policies can generate prompt text for specific models. If your team is moving in that direction, it is worth understanding the idea of an LLM compiler.
A practical prompt engineering workflow
Use this workflow when building a new LLM feature:
- Write the behavior spec. Define inputs, outputs, allowed actions, and failure behavior.
- Create 30 to 100 test cases. Include real user inputs and edge cases.
- Design the prompt structure. Separate task, rules, context, examples, and output format.
- Run baseline evals. Measure the first version before editing heavily.
- Change one thing at a time. Avoid rewriting the whole prompt unless the design is wrong.
- Compare versions. Check pass rate, regressions, latency, and cost.
- Review traces. Inspect failures before deciding what to change.
- Ship behind a controlled release. Start with staging, internal users, or a small production percentage.
- Monitor production behavior. Add new failures back into the eval dataset.
Common mistakes to avoid
Optimizing for one-off examples
If a prompt gets better on a demo but worse on your eval set, it got worse. Use representative test cases before trusting any change.
Hiding business logic in prompts
Keep rules in code, config, policy systems, or structured variables when possible. Let the model use clear state instead of parsing long policy paragraphs.
Skipping evals
You cannot reliably improve what you do not measure. Even a small eval set is better than manual spot checks.
Overloading a single prompt
If one prompt classifies intent, retrieves context, chooses tools, writes an answer, validates policy, and formats JSON, split the workflow.
Ignoring model limits
Long prompts, large context, and complex tool instructions can reduce reliability. Test with the actual model, latency budget, and production data shape.
Final checklist
- Does the prompt have a clear task?
- Are rules separated from context and examples?
- Is the output format machine-checkable?
- Are business rules represented in structured state where possible?
- Does the workflow use multiple steps when the task is complex?
- Do you have evals for normal cases and edge cases?
- Can you compare prompt versions?
- Can you trace failures in production?
- Can you roll back a bad prompt change quickly?
Good prompt engineering is disciplined iteration. You define the behavior, build a test set, version the prompt, inspect traces, and improve the full system around the model.
PromptLayer helps AI teams manage prompt versions, run evals, trace LLM requests, and debug prompt chains in production. If you are building LLM apps and want a cleaner workflow for prompt engineering, create a PromptLayer account.