How to Write a Prompt Definition for LLM Apps
What a prompt definition is
A prompt definition is the full specification your application uses to call an LLM for a specific task. It includes the instruction text, runtime inputs, context sources, output format, examples, model settings, business rules, tests, and version metadata.
In production LLM apps, the prompt string is only one part of the system. A useful prompt definition tells another engineer exactly how the LLM call should behave, what data it depends on, how to test it, and what can safely change.
If you only define the instruction text, you leave too much behavior hidden in application code, retrieval logic, product assumptions, and undocumented release decisions. That makes the prompt harder to debug, evaluate, and maintain.
If you need a shorter baseline definition first, PromptLayer also has a glossary entry for a prompt. This article focuses on the production-ready definition you need when shipping LLM features.
Start with the feature contract
Before writing the instruction, define the feature contract. This keeps the prompt tied to product behavior instead of a vague model interaction.
- Feature name: A stable name such as support_ticket_triage_v3.
- User-facing purpose: What the feature does for the user.
- LLM task: The exact job the model performs, such as classification, extraction, rewrite, routing, ranking, or planning.
- Caller: The service, workflow, agent, or background job that invokes the prompt.
- Success criteria: What a correct response must satisfy.
- Failure behavior: What the app should do when confidence is low, context is missing, or the model returns invalid output.
Example:
Feature: Support ticket triage
LLM task: Classify an inbound support ticket by urgency, product area, and required team.
Caller: api/tickets/triage.ts
Success: Returns valid JSON with urgency, category, routing_team, and rationale.
Failure: If the ticket lacks enough information, set urgency to "unknown" and routing_team to "manual_review".
This step prevents a common mistake: starting with “You are a helpful assistant” and then adding rules as bugs appear. Define the job first. Then write the prompt around that job.
Separate instructions, inputs, context, and rules
A prompt definition should make each layer explicit. Mixing everything into one freeform prompt creates brittle behavior and makes changes risky.
1. Instruction text
The instruction text tells the model what to do. Keep it task-specific and direct.
You classify support tickets for a B2B developer tools company.
Read the ticket and return a structured triage decision.
Use only the provided ticket text, customer metadata, and product taxonomy.
Do not invent missing customer details.
Avoid placing hidden product policy inside scattered prompt sentences. If a business rule affects routing, billing, safety, compliance, or customer experience, define it as a rule with an owner.
2. Runtime inputs
List every variable passed into the prompt. Include type, source, required status, and an example value.
| Input | Type | Required | Source | Example |
|---|---|---|---|---|
| ticket_text | string | yes | Support platform webhook | "Our API keys stopped working after rotation." |
| customer_plan | enum | yes | Billing service | "enterprise" |
| product_taxonomy | array | yes | Internal config | ["auth", "billing", "observability"] |
This helps you catch mismatches between prompt assumptions and application data. For example, if the prompt says “use the customer tier” but the caller passes plan_name only for paid accounts, your evals should cover missing values.
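One way to enforce the input table is to check values in the caller before the prompt is rendered. The sketch below is a minimal illustration, not a prescribed implementation: `InputSpec` and `validate_inputs` are hypothetical names, and the enum values for `customer_plan` are assumed to be `free`, `pro`, and `enterprise`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class InputSpec:
    """Declared shape of one runtime input (hypothetical helper type)."""
    name: str
    py_type: type
    required: bool = True
    allowed: Optional[tuple] = None  # enum values, if the input is an enum


# Mirrors the input table; the enum values are an assumption for illustration.
INPUT_SPECS = [
    InputSpec("ticket_text", str),
    InputSpec("customer_plan", str, allowed=("free", "pro", "enterprise")),
    InputSpec("product_taxonomy", list),
]


def validate_inputs(values: dict[str, Any], specs=INPUT_SPECS) -> list[str]:
    """Return human-readable problems; an empty list means the inputs are usable."""
    problems = []
    for spec in specs:
        value = values.get(spec.name)
        if value is None:
            if spec.required:
                problems.append(f"missing required input: {spec.name}")
            continue
        if not isinstance(value, spec.py_type):
            problems.append(f"{spec.name}: expected {spec.py_type.__name__}")
        elif spec.allowed is not None and value not in spec.allowed:
            problems.append(f"{spec.name}: unexpected value {value!r}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log every mismatch for a request in a single trace entry.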
3. Context sources
Context is any task-specific information injected at runtime. It may come from retrieval, user state, product docs, database records, prior messages, tool results, or another model call.
Define context sources separately from the instruction text:
- Source name: For example, docs_search_results.
- Retrieval method: Keyword search, vector search, SQL query, API call, cache lookup, or static config.
- Freshness requirement: For example, “updated within 24 hours” or “read at request time.”
- Maximum size: Token budget, result count, or character limit.
- Trust level: Internal source, user-provided source, generated source, or external source.
- Conflict handling: What to do when two sources disagree.
This is especially important when you use retrieval or runtime enrichment. PromptLayer’s glossary entry on prompt augmentation covers this pattern in more detail.
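The context fields listed above can be enforced at render time. The following is a rough sketch under stated assumptions: `ContextSource` and `render_context` are hypothetical names, a character limit stands in for a token budget, and snippets are assumed to arrive already ranked from most to least relevant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSource:
    """Declared properties of one context source (hypothetical helper type)."""
    name: str
    trust_level: str  # "internal", "user", "generated", or "external"
    max_chars: int    # character limit standing in for a token budget


def render_context(source: ContextSource, snippets: list[str]) -> str:
    """Join ranked snippets under the size limit and label the trust level."""
    kept, used = [], 0
    for snippet in snippets:
        if used + len(snippet) > source.max_chars:
            break  # drop this and all lower-ranked snippets
        kept.append(snippet)
        used += len(snippet)
    header = f"[source={source.name} trust={source.trust_level}]"
    return "\n".join([header, *kept])
```

Labeling each block with its source name and trust level also gives you something to log, so a bad answer can be traced back to the context that was injected.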
4. Business rules
Business rules should be named, testable, and owned. Do not bury them in a paragraph that only the model sees.
Example:
Rule: Enterprise escalation
Owner: Support Ops
Definition: If customer_plan is "enterprise" and urgency is "high", set routing_team to "enterprise_support".
Test cases: triage_014, triage_022, triage_031
Last reviewed: 2026-01-15
When rules live outside the prompt text, your team can review them like application logic. You can also add eval cases for each rule instead of hoping the model follows a long instruction block.
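As one possible reading of "review them like application logic," the enterprise escalation rule can be expressed as data plus a small function applied to the model's decision. This is a sketch, assuming the rule is enforced in the caller after the model responds; `BusinessRule` and `apply_rules` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BusinessRule:
    """A named, owned rule that can be unit tested (hypothetical helper type)."""
    rule_id: str
    owner: str
    applies: Callable[[dict, dict], bool]  # (inputs, decision) -> bool
    apply: Callable[[dict], dict]          # decision -> updated decision


ENTERPRISE_ESCALATION = BusinessRule(
    rule_id="enterprise_escalation",
    owner="Support Ops",
    applies=lambda inputs, decision: (
        inputs.get("customer_plan") == "enterprise"
        and decision.get("urgency") == "high"
    ),
    apply=lambda decision: {**decision, "routing_team": "enterprise_support"},
)


def apply_rules(inputs: dict, decision: dict, rules: list) -> tuple:
    """Apply each matching rule in order and record which rule IDs fired."""
    fired = []
    for rule in rules:
        if rule.applies(inputs, decision):
            decision = rule.apply(decision)
            fired.append(rule.rule_id)
    return decision, fired
```

Recording which rule IDs fired makes it straightforward to tie a production decision back to the eval cases that cover that rule.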
Specify the output schema
Production prompts should define the response shape. A clear schema reduces parsing errors, makes evals easier, and keeps downstream code stable.
For a classification task, use a small JSON object with constrained values:
{
  "urgency": "low | medium | high | unknown",
  "category": "auth | billing | observability | integrations | unknown",
  "routing_team": "support | enterprise_support | engineering | manual_review",
  "confidence": 0.0,
  "rationale": "Short explanation based only on provided inputs."
}
Then add response rules:
- Return valid JSON only.
- Do not include Markdown.
- Use unknown when the ticket lacks enough information.
- Keep rationale under 240 characters.
- Set confidence between 0 and 1.
If you use structured outputs or tool calling, document that too. Include the schema version, parser expectations, and what happens when validation fails.
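The response rules above translate directly into a deterministic validator. The sketch below uses only the standard library and checks the example triage schema; `validate_triage_output` is a hypothetical name, and a real service might use a schema library or the provider's structured output feature instead.

```python
import json

# Allowed enum values, copied from the example schema.
ENUMS = {
    "urgency": {"low", "medium", "high", "unknown"},
    "category": {"auth", "billing", "observability", "integrations", "unknown"},
    "routing_team": {"support", "enterprise_support", "engineering", "manual_review"},
}


def validate_triage_output(raw: str):
    """Parse the model response and return (parsed_or_None, list_of_errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, ["invalid JSON"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for field, allowed in ENUMS.items():
        value = data.get(field)
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field}: not an allowed value")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        errors.append("confidence: must be a number between 0 and 1")

    rationale = data.get("rationale")
    if not isinstance(rationale, str) or len(rationale) >= 240:
        errors.append("rationale: must be a string under 240 characters")
    return data, errors
```

The caller can then decide what each failure means: for example, retry on invalid JSON and send schema failures to manual review.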
Add examples that cover normal and edge cases
Examples give the model concrete behavior to copy. They also help reviewers understand the intended output. Include a small set of examples in the prompt definition, then keep a larger set in your eval dataset.
Use examples that cover:
- A common successful case.
- A missing-context case.
- A case where two categories look plausible.
- A business-rule case.
- A case that should route to manual review.
Example:
Input:
ticket_text: "We rotated our API keys and now all requests return 401."
customer_plan: "enterprise"
Expected output:
{
  "urgency": "high",
  "category": "auth",
  "routing_team": "enterprise_support",
  "confidence": 0.86,
  "rationale": "Enterprise customer reports production authentication failures after key rotation."
}
Do not rely on one golden example. A triage prompt with 3 examples may look fine during a demo and fail on real tickets that mention billing, API limits, migrations, and vague error messages in the same request.
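If examples are stored as data rather than pasted into the instruction, the same records can feed both the prompt and the eval dataset. A minimal sketch, assuming each example is a dict with `inputs` and `expected` keys (a hypothetical storage shape, not something the format above requires):

```python
import json


def render_examples(examples: list[dict]) -> str:
    """Format (inputs, expected) pairs as few-shot text for the prompt."""
    blocks = []
    for example in examples:
        input_lines = "\n".join(
            f"{key}: {json.dumps(value)}" for key, value in example["inputs"].items()
        )
        expected = json.dumps(example["expected"], indent=2)
        blocks.append(f"Input:\n{input_lines}\nExpected output:\n{expected}")
    return "\n\n".join(blocks)
```

Using `json.dumps` for both inputs and outputs keeps quoting consistent, so the examples the model sees match the format the parser expects.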
Document model settings and execution behavior
The same prompt can behave differently across models and settings. Include the execution configuration in the prompt definition.
- Model: For example, gpt-4.1-mini, claude-3-5-sonnet, or your approved model alias.
- Temperature: Use lower values such as 0 to 0.3 for extraction and classification.
- Max output tokens: Set a limit that matches the schema.
- Tool access: List tools the model can call, if any.
- Timeouts: Define request timeout and retry behavior.
- Fallback: Define backup model, cached response, manual queue, or safe default.
For multi-step systems, define where this prompt sits in the chain. A routing prompt, retrieval prompt, summarization prompt, and final response prompt have different failure modes. If your app uses several LLM calls, document the flow with prompt chaining instead of treating every step as an isolated string.
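The timeout, retry, and fallback settings can be sketched as a small wrapper around the model call. This is an illustration under assumptions: `primary` and `backup` are hypothetical zero-argument callables that wrap your actual client, and they are assumed to raise `TimeoutError` or `ConnectionError` on transient failures. A real client library will have its own exception types.

```python
from typing import Callable

# Safe default taken from the failure behavior in the feature contract.
SAFE_DEFAULT = {"urgency": "unknown", "routing_team": "manual_review"}

# Transient errors assumed for this sketch; substitute your client's exceptions.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def call_with_fallback(
    primary: Callable[[], dict],
    backup: Callable[[], dict],
    max_attempts: int = 2,
) -> tuple[dict, str]:
    """Try the primary call up to max_attempts times, then the backup, then a safe default."""
    for _ in range(max_attempts):
        try:
            return primary(), "primary"
        except TRANSIENT_ERRORS:
            continue
    try:
        return backup(), "backup"
    except TRANSIENT_ERRORS:
        return dict(SAFE_DEFAULT), "safe_default"
```

Returning which path produced the result lets the caller log fallback events alongside the output.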
Write the complete prompt definition
A practical prompt definition can fit into a structured document. Your format can be YAML, JSON, a database record, or a prompt management tool. The key is that engineers can review, test, version, and run it consistently.
id: support_ticket_triage
version: 3.2.0
owner: support-platform
status: production
task:
  type: classification
  purpose: Route inbound support tickets to the correct team.
model:
  provider: openai
  model: gpt-4.1-mini
  temperature: 0.1
  max_output_tokens: 300
inputs:
  ticket_text:
    type: string
    required: true
    source: support_webhook
  customer_plan:
    type: enum
    required: true
    values: [free, pro, enterprise]
    source: billing_service
  product_taxonomy:
    type: array
    required: true
    source: internal_config
context:
  product_taxonomy:
    freshness: deployed_config
    max_items: 20
    trust_level: internal
  customer_metadata:
    freshness: request_time
    trust_level: internal
business_rules:
  - id: enterprise_escalation
    owner: support_ops
    rule: If customer_plan is enterprise and urgency is high, route to enterprise_support.
instruction: |
  You classify support tickets for a B2B developer tools company.
  Use only the provided ticket text, customer metadata, and product taxonomy.
  Return valid JSON that matches the schema.
  Do not invent missing details.
output_schema:
  urgency: low | medium | high | unknown
  category: auth | billing | observability | integrations | unknown
  routing_team: support | enterprise_support | engineering | manual_review
  confidence: number between 0 and 1
  rationale: string under 240 characters
validation:
  on_invalid_json: retry_once
  on_schema_failure: manual_review
  on_low_confidence: manual_review
evals:
  dataset: support_ticket_triage_eval_set
  minimum_accuracy: 0.92
  required_cases:
    - enterprise_escalation
    - missing_context
    - ambiguous_category
release:
  change_requires_eval_pass: true
  change_requires_approval: true
This definition gives you something closer to application code than a chat message. It is easier to diff, review, test, and roll back.
Version prompts like production artifacts
Changing a production prompt without versioning is a common source of regressions. A small wording change can affect classification thresholds, schema compliance, refusal behavior, tone, or tool use.
Use semantic or date-based versions. Track at least:
- Prompt definition version.
- Instruction text diff.
- Model and setting changes.
- Context source changes.
- Schema changes.
- Eval results before release.
- Owner and approver.
- Release time and rollback target.
For example, changing temperature from 0.1 to 0.7 should count as a versioned change for a classification prompt. So should adding a new category, changing retrieval ranking, or editing a business rule.
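A release check can catch behavior-affecting edits that ship without a version change. The sketch below assumes the prompt definition has been loaded into a plain dict; `VERSIONED_FIELDS`, `definition_fingerprint`, and `needs_version_bump` are hypothetical names, and the list of tracked fields is an assumption you would adapt to your own format.

```python
import hashlib
import json

# Fields whose changes should force a new version (assumed list for illustration).
VERSIONED_FIELDS = ("instruction", "model", "context", "output_schema", "business_rules")


def definition_fingerprint(definition: dict) -> str:
    """Hash the behavior-affecting fields so any change to them is detectable."""
    relevant = {key: definition.get(key) for key in VERSIONED_FIELDS}
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_version_bump(old: dict, new: dict) -> bool:
    """True when behavior-affecting fields changed but the version string did not."""
    changed = definition_fingerprint(old) != definition_fingerprint(new)
    return changed and old.get("version") == new.get("version")
```

Serializing with sorted keys keeps the fingerprint stable when only the key order in the file changes.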
A prompt management workflow helps keep these changes visible to the team instead of spread across code comments, notebooks, and dashboard edits.
Connect the definition to evals
A prompt definition is incomplete without tests. You need evals that measure whether the prompt still does its job after edits, model upgrades, retrieval changes, or schema changes.
Start with 30 to 100 labeled examples for a narrow task. For higher-risk workflows, use more. Include real production-like cases, not only clean examples written by the team.
For each eval case, store:
- Input values.
- Injected context.
- Expected output or grading rubric.
- Business rule coverage.
- Known edge case label.
- Prompt version tested.
- Model version tested.
Use deterministic checks where possible. For structured outputs, validate JSON, required fields, enum values, and schema compliance. For classification, compare expected labels. For generated text, use rubric grading plus spot checks for high-impact cases.
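For classification, the deterministic check can be as simple as an exact match on the label fields, gated by a minimum accuracy. A minimal sketch, assuming each eval case is a dict with `id`, `inputs`, and `expected` keys and that `predict` is a hypothetical function wrapping the prompt call:

```python
def run_evals(
    cases,
    predict,
    minimum_accuracy=0.92,
    label_fields=("urgency", "category", "routing_team"),
):
    """Score predictions against expected labels with exact-match checks."""
    failures = []
    for case in cases:
        actual = predict(case["inputs"])
        mismatched = [
            name for name in label_fields
            if actual.get(name) != case["expected"].get(name)
        ]
        if mismatched:
            failures.append({"id": case["id"], "fields": mismatched})
    # A case counts as correct only when every label field matches.
    accuracy = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return {
        "accuracy": accuracy,
        "passed": accuracy >= minimum_accuracy,
        "failures": failures,
    }
```

Keeping the failing case IDs and fields in the report makes it clear which labels regressed, not only that the overall score dropped.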
Do not wait until the prompt is “final” to add evals. Add tests when you first define the prompt. Then every prompt change becomes easier to judge.
Trace prompt behavior in production
Even strong evals miss some real-world cases. Your prompt definition should include what you log and inspect in production.
- Prompt version.
- Model name and settings.
- Resolved input variables.
- Context source IDs and retrieval metadata.
- Output and validation result.
- Latency, token usage, and cost.
- Fallback or retry events.
- User feedback or downstream correction.
This matters when a user reports a bad answer. You need to know whether the failure came from the instruction, missing context, stale retrieval results, schema drift, model behavior, or a downstream parser.
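The logged fields above can be captured in one record per LLM call. The sketch below is an assumption about shape, not a required format: `PromptTrace` is a hypothetical type, cost is left to be derived from token counts, and user feedback is assumed to be attached later by a separate process.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class PromptTrace:
    """One record per LLM call (hypothetical shape for illustration)."""
    prompt_id: str
    prompt_version: str
    model: str
    inputs: dict              # resolved input variables
    context_source_ids: list  # IDs of the retrieved or injected context
    output: str
    validation_errors: list
    latency_ms: float
    input_tokens: int
    output_tokens: int
    fallback_used: bool = False
    timestamp: float = field(default_factory=time.time)

    def to_log_line(self) -> str:
        """Serialize to a single JSON line for a log pipeline."""
        return json.dumps(asdict(self), sort_keys=True)
```

With the prompt version and context source IDs in every record, you can filter traces by version after a release and compare failure rates before and after the change.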
Handle chains and agents with stricter boundaries
For agents and workflows, define each LLM call separately. Do not use one giant prompt definition for planning, retrieval, tool selection, user response, and post-processing.
A useful pattern is:
- Planner prompt: Decide the next step or tool.
- Retriever prompt or query builder: Convert user intent into search queries or filters.
- Tool result summarizer: Compress tool output into task-specific context.
- Final response prompt: Produce the user-facing response.
- Verifier prompt: Check the output against policy, schema, or task criteria.
Each prompt needs its own inputs, context rules, schema, evals, and version history. A failure in a chained workflow is much easier to debug when every step has a clear contract.
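One way to make each step's contract explicit is to check every step's output before passing state to the next step. This is a rough sketch: `Step`, `StepContractError`, and `run_chain` are hypothetical names, each `run` callable stands in for a separately defined prompt, and the contract here is reduced to a list of required output keys.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """One LLM call in a chain, with its own version and output contract."""
    name: str
    version: str
    run: Callable[[dict], dict]
    required_keys: tuple  # keys the step's output must contain


class StepContractError(Exception):
    """Raised when a step's output does not satisfy its declared contract."""


def run_chain(steps: list, state: dict) -> dict:
    """Run steps in order, merging each output into state and checking its contract."""
    for step in steps:
        output = step.run(dict(state))  # pass a copy so steps cannot mutate shared state
        missing = [key for key in step.required_keys if key not in output]
        if missing:
            raise StepContractError(f"{step.name}@{step.version} missing {missing}")
        state = {**state, **output}
    return state
```

Because the error names the step and its version, a failure in a long workflow points at one prompt definition instead of the whole chain.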
If your team is working on compiler-like orchestration for LLM workflows, the LLM compiler concept is useful for thinking about how prompts, tools, intermediate representations, and execution plans fit together.
Common mistakes to avoid
Defining only the instruction text
The model does not run on instruction text alone. It runs with inputs, context, model settings, schemas, tools, and application logic. Define all of them.
Omitting context sources
If the prompt uses retrieved docs, customer records, conversation history, or tool output, document where that context comes from and how it is selected. Otherwise, you cannot tell whether a bad answer came from the model or from bad context.
Mixing business rules into ad hoc prompts
Business rules should be reviewable and testable. If a support escalation rule changes, you should know which prompts and eval cases are affected.
Skipping examples
Examples reduce ambiguity. They also help new engineers understand the expected behavior without reading production logs.
Failing to specify output schema
Freeform output increases parser failures and downstream branching. If your app expects structured data, define the schema and validate it.
Changing production prompts without versioning or tests
Prompt edits are code changes when they affect production behavior. Version them, run evals, and keep a rollback path.
A simple checklist
Before shipping a prompt definition, confirm that you have:
- A clear feature contract.
- Named runtime inputs with types and sources.
- Documented context sources and freshness rules.
- Separate business rules with owners.
- Task-specific instruction text.
- A strict output schema.
- Examples for common and edge cases.
- Model settings and fallback behavior.
- Eval cases tied to the prompt version.
- Production tracing for inputs, context, output, cost, and errors.
- A release process with approval and rollback.
Final take
A good prompt definition gives your team a stable contract for an LLM call. It makes the prompt easier to test, review, debug, and improve. The instruction text still matters, but production reliability usually depends on everything around it: context, schemas, evals, versions, and traces.
If your team treats prompts as production artifacts, you can change them with more confidence and fewer surprises.
PromptLayer helps teams manage prompt definitions, versions, evals, datasets, and traces in one place. If you are building or shipping LLM-powered applications, create an account at https://dashboard.promptlayer.com/create-account.