
How to Write a Prompt Definition for LLM Apps

May 29, 2026

What a prompt definition is

A prompt definition is the full specification your application uses to call an LLM for a specific task. It includes the instruction text, runtime inputs, context sources, output format, examples, model settings, business rules, tests, and version metadata.

In production LLM apps, the prompt string is only one part of the system. A useful prompt definition tells another engineer exactly how the LLM call should behave, what data it depends on, how to test it, and what can safely change.

If you only define the instruction text, you leave too much behavior hidden in application code, retrieval logic, product assumptions, and undocumented release decisions. That makes the prompt harder to debug, evaluate, and maintain.

If you need a shorter baseline definition first, PromptLayer also has a glossary entry for “prompt.” This article focuses on the production-ready definition you need when shipping LLM features.

Start with the feature contract

Before writing the instruction, define the feature contract. This keeps the prompt tied to product behavior instead of a vague model interaction.

  • Feature name: A stable name such as support_ticket_triage. Track the version separately so the name does not change with each release.
  • User-facing purpose: What the feature does for the user.
  • LLM task: The exact job the model performs, such as classification, extraction, rewrite, routing, ranking, or planning.
  • Caller: The service, workflow, agent, or background job that invokes the prompt.
  • Success criteria: What a correct response must satisfy.
  • Failure behavior: What the app should do when confidence is low, context is missing, or the model returns invalid output.

Example:

Feature: Support ticket triage
LLM task: Classify an inbound support ticket by urgency, product area, and required team.
Caller: api/tickets/triage.ts
Success: Returns valid JSON with urgency, category, routing_team, confidence, and rationale.
Failure: If the ticket lacks enough information, set urgency to "unknown" and routing_team to "manual_review".

This step prevents a common mistake: starting with “You are a helpful assistant” and then adding rules as bugs appear. Define the job first. Then write the prompt around that job.
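As a rough illustration, the contract can also live next to the calling code so the failure behavior is not only prose. The sketch below is one possible shape, not a required format: the FeatureContract class, its field names, and the TRIAGE_CONTRACT constant are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureContract:
    """Hypothetical container for the feature contract fields."""
    feature: str
    llm_task: str
    caller: str
    success_fields: tuple[str, ...]
    failure_defaults: dict[str, str] = field(default_factory=dict)

    def failure_response(self) -> dict[str, str]:
        # Return a copy so callers cannot mutate the contract's defaults.
        return dict(self.failure_defaults)


TRIAGE_CONTRACT = FeatureContract(
    feature="Support ticket triage",
    llm_task="Classify an inbound support ticket by urgency, product area, and required team.",
    caller="api/tickets/triage.ts",
    # Keys the caller expects in a valid response.
    success_fields=("urgency", "category", "routing_team", "confidence", "rationale"),
    # Values to use when the ticket lacks enough information.
    failure_defaults={"urgency": "unknown", "routing_team": "manual_review"},
)
```

Keeping the failure defaults in one place means the app and the prompt text can both refer to the same documented behavior.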

Separate instructions, inputs, context, and rules

A prompt definition should make each layer explicit. Mixing everything into one freeform prompt creates brittle behavior and makes changes risky.

1. Instruction text

The instruction text tells the model what to do. Keep it task-specific and direct.

You classify support tickets for a B2B developer tools company.
Read the ticket and return a structured triage decision.
Use only the provided ticket text, customer metadata, and product taxonomy.
Do not invent missing customer details.

Avoid placing hidden product policy inside scattered prompt sentences. If a business rule affects routing, billing, safety, compliance, or customer experience, define it as a rule with an owner.

2. Runtime inputs

List every variable passed into the prompt. Include type, source, required status, and an example value.

Input            | Type   | Required | Source                   | Example
ticket_text      | string | yes      | Support platform webhook | "Our API keys stopped working after rotation."
customer_plan    | enum   | yes      | Billing service          | "enterprise"
product_taxonomy | array  | yes      | Internal config          | ["auth", "billing", "observability"]

This helps you catch mismatches between prompt assumptions and application data. For example, if the prompt says “use the customer plan” but the caller passes customer_plan only for paid accounts, the input is not really always present, and your evals should cover missing values.
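A small check in the caller can surface these mismatches before the model is ever invoked. The following is a minimal sketch, assuming the three inputs from this article and the plan values free, pro, and enterprise. INPUT_SPEC and validate_inputs are hypothetical names, not part of any library.

```python
from typing import Any, Optional

# Hypothetical spec: input name -> (expected Python type, allowed values or None).
INPUT_SPEC: dict[str, tuple[type, Optional[set]]] = {
    "ticket_text": (str, None),
    "customer_plan": (str, {"free", "pro", "enterprise"}),
    "product_taxonomy": (list, None),
}


def validate_inputs(inputs: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the inputs are usable."""
    problems = []
    for name, (expected_type, allowed) in INPUT_SPEC.items():
        if inputs.get(name) is None:
            problems.append(f"missing required input: {name}")
            continue
        value = inputs[name]
        if not isinstance(value, expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
        elif allowed is not None and value not in allowed:
            problems.append(f"{name} has unexpected value: {value!r}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log every mismatch for a single request.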

3. Context sources

Context is any task-specific information injected at runtime. It may come from retrieval, user state, product docs, database records, prior messages, tool results, or another model call.

Define context sources separately from the instruction text:

  • Source name: For example, docs_search_results.
  • Retrieval method: Keyword search, vector search, SQL query, API call, cache lookup, or static config.
  • Freshness requirement: For example, “updated within 24 hours” or “read at request time.”
  • Maximum size: Token budget, result count, or character limit.
  • Trust level: Internal source, user-provided source, generated source, or external source.
  • Conflict handling: What to do when two sources disagree.

This is especially important when you use retrieval or runtime enrichment. PromptLayer’s glossary entry on prompt augmentation covers this pattern in more detail.
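To make the source fields concrete, here is a minimal sketch of a context source record with a size budget and a simple conflict rule. The ContextSource class, both helper functions, and the trust ranking are assumptions made for illustration. Your own conflict policy may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSource:
    """Hypothetical record for one documented context source."""
    name: str
    retrieval_method: str
    max_items: int
    trust_level: str  # "internal", "generated", "external", or "user"


def select_context(source: ContextSource, results: list[str]) -> list[str]:
    """Apply the source's size budget before anything reaches the prompt."""
    return results[: source.max_items]


def resolve_conflict(candidates: list[tuple[str, str]]) -> str:
    """Pick the value from the most trusted source.

    Each candidate is (trust_level, value). The ranking is an assumed policy.
    """
    rank = {"internal": 0, "generated": 1, "external": 2, "user": 3}
    return min(candidates, key=lambda c: rank[c[0]])[1]
```

For example, if the billing service and the ticket text disagree about the customer's plan, this policy would keep the internal value.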

4. Business rules

Business rules should be named, testable, and owned. Do not bury them in a paragraph that only the model sees.

Example:

Rule: Enterprise escalation
Owner: Support Ops
Definition: If customer_plan is "enterprise" and urgency is "high", set routing_team to "enterprise_support".
Test cases: triage_014, triage_022, triage_031
Last reviewed: 2026-01-15

When rules live outside the prompt text, your team can review them like application logic. You can also add eval cases for each rule instead of hoping the model follows a long instruction block.
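One way to make a rule like this testable is to enforce it in code after the model call rather than relying on the instruction alone. The sketch below assumes the triage decision is already parsed into a dict. The function name apply_enterprise_escalation is hypothetical.

```python
def apply_enterprise_escalation(decision: dict, customer_plan: str) -> dict:
    """Rule: Enterprise escalation (owner: Support Ops).

    If customer_plan is "enterprise" and urgency is "high",
    set routing_team to "enterprise_support".
    """
    if customer_plan == "enterprise" and decision.get("urgency") == "high":
        # Return a new dict so the original model output stays available for tracing.
        return {**decision, "routing_team": "enterprise_support"}
    return decision
```

Each listed test case can then become a plain unit test against this function, independent of model behavior.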

Specify the output schema

Production prompts should define the response shape. A clear schema reduces parsing errors, makes evals easier, and keeps downstream code stable.

For a classification task, use a small JSON object with constrained values:

{
  "urgency": "low | medium | high | unknown",
  "category": "auth | billing | observability | integrations | unknown",
  "routing_team": "support | enterprise_support | engineering | manual_review",
  "confidence": 0.0,
  "rationale": "Short explanation based only on provided inputs."
}

Then add response rules:

  • Return valid JSON only.
  • Do not include Markdown.
  • Use unknown when the ticket lacks enough information.
  • Keep rationale under 240 characters.
  • Set confidence between 0 and 1.

If you use structured outputs or tool calling, document that too. Include the schema version, parser expectations, and what happens when validation fails.
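The response rules translate fairly directly into deterministic checks. The following is a minimal sketch using only the Python standard library. ALLOWED and validate_response are hypothetical names, and a real system might use a schema library or provider-side structured outputs instead.

```python
import json

# Allowed enum values, copied from the schema in this article.
ALLOWED = {
    "urgency": {"low", "medium", "high", "unknown"},
    "category": {"auth", "billing", "observability", "integrations", "unknown"},
    "routing_team": {"support", "enterprise_support", "engineering", "manual_review"},
}


def validate_response(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the response is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]

    errors = []
    for name, allowed in ALLOWED.items():
        value = data.get(name)
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{name} is missing or not an allowed value")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        errors.append("confidence must be a number between 0 and 1")

    rationale = data.get("rationale")
    if not isinstance(rationale, str) or len(rationale) >= 240:
        errors.append("rationale must be a string under 240 characters")
    return errors
```

A response wrapped in Markdown fences fails the first check, which covers the “valid JSON only” and “no Markdown” rules together.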

Add examples that cover normal and edge cases

Examples give the model concrete behavior to copy. They also help reviewers understand the intended output. Include a small set of examples in the prompt definition, then keep a larger set in your eval dataset.

Use examples that cover:

  • A common successful case.
  • A missing-context case.
  • A case where two categories look plausible.
  • A business-rule case.
  • A case that should route to manual review.

Example:

Input:
ticket_text: "We rotated our API keys and now all requests return 401."
customer_plan: "enterprise"

Expected output:
{
  "urgency": "high",
  "category": "auth",
  "routing_team": "enterprise_support",
  "confidence": 0.86,
  "rationale": "Enterprise customer reports production authentication failures after key rotation."
}

Do not rely on one golden example. A triage prompt with only two or three examples may look fine during a demo and still fail on real tickets that mention billing, API limits, migrations, and vague error messages in the same request.
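If examples are stored as data instead of pasted into the instruction, the same records can feed both the prompt and the eval dataset. The sketch below renders stored examples into the Input / Expected output layout used in this article. The second example (the missing-context case) and the names EXAMPLES and render_examples are hypothetical and only illustrate the pattern.

```python
import json

EXAMPLES = [
    {
        "input": {
            "ticket_text": "We rotated our API keys and now all requests return 401.",
            "customer_plan": "enterprise",
        },
        "expected": {
            "urgency": "high",
            "category": "auth",
            "routing_team": "enterprise_support",
            "confidence": 0.86,
            "rationale": "Enterprise customer reports production authentication failures after key rotation.",
        },
    },
    {
        # Hypothetical missing-context case, invented for this sketch.
        "input": {"ticket_text": "It is broken.", "customer_plan": "free"},
        "expected": {
            "urgency": "unknown",
            "category": "unknown",
            "routing_team": "manual_review",
            "confidence": 0.2,
            "rationale": "Ticket does not say what is broken or which product area is affected.",
        },
    },
]


def render_examples(examples: list[dict]) -> str:
    """Render examples as text blocks that can be appended to the instruction."""
    blocks = []
    for example in examples:
        lines = ["Input:"]
        lines += [f"{key}: {json.dumps(value)}" for key, value in example["input"].items()]
        lines += ["", "Expected output:", json.dumps(example["expected"], indent=2)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```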

Document model settings and execution behavior

The same prompt can behave differently across models and settings. Include the execution configuration in the prompt definition.

  • Model: For example, gpt-4.1-mini, claude-3-5-sonnet, or your approved model alias.
  • Temperature: Use lower values such as 0 to 0.3 for extraction and classification.
  • Max output tokens: Set a limit that matches the schema.
  • Tool access: List tools the model can call, if any.
  • Timeouts: Define request timeout and retry behavior.
  • Fallback: Define backup model, cached response, manual queue, or safe default.
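The retry and fallback settings can be sketched as a small wrapper around the model call. This is an illustration under stated assumptions: primary and backup stand in for whatever client functions your app uses, SAFE_DEFAULT and call_with_fallback are hypothetical names, and a real implementation would catch the timeout errors raised by your specific provider SDK.

```python
from typing import Callable

# Safe default used when every attempt fails (sends the ticket to manual review).
SAFE_DEFAULT = {"urgency": "unknown", "routing_team": "manual_review"}


def call_with_fallback(
    primary: Callable[[], dict],
    backup: Callable[[], dict],
    max_retries: int = 1,
) -> dict:
    """Try the primary model, then the backup, then fall back to a safe default."""
    for call in (primary, backup):
        for _attempt in range(max_retries + 1):
            try:
                return call()
            except (TimeoutError, ConnectionError):
                continue  # retry this option, then move on to the next one
    return dict(SAFE_DEFAULT)
```

Documenting the retry count and the final default in the prompt definition keeps this behavior visible to reviewers, not only to whoever wrote the wrapper.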

For multi-step systems, define where this prompt sits in the chain. A routing prompt, retrieval prompt, summarization prompt, and final response prompt have different failure modes. If your app uses several LLM calls, document the flow as a prompt chain instead of treating every step as an isolated string.

Write the complete prompt definition

A practical prompt definition can fit into a structured document. Your format can be YAML, JSON, a database record, or a prompt management tool. The key is that engineers can review, test, version, and run it consistently.

id: support_ticket_triage
version: 3.2.0
owner: support-platform
status: production

task:
  type: classification
  purpose: Route inbound support tickets to the correct team.

model:
  provider: openai
  model: gpt-4.1-mini
  temperature: 0.1
  max_output_tokens: 300

inputs:
  ticket_text:
    type: string
    required: true
    source: support_webhook
  customer_plan:
    type: enum
    required: true
    values: [free, pro, enterprise]
    source: billing_service
  product_taxonomy:
    type: array
    required: true
    source: internal_config

context:
  product_taxonomy:
    freshness: deployed_config
    max_items: 20
    trust_level: internal
  customer_metadata:
    freshness: request_time
    trust_level: internal

business_rules:
  - id: enterprise_escalation
    owner: support_ops
    rule: If customer_plan is enterprise and urgency is high, route to enterprise_support.

instruction: |
  You classify support tickets for a B2B developer tools company.
  Use only the provided ticket text, customer metadata, and product taxonomy.
  Return valid JSON that matches the schema.
  Do not invent missing details.

output_schema:
  urgency: low | medium | high | unknown
  category: auth | billing | observability | integrations | unknown
  routing_team: support | enterprise_support | engineering | manual_review
  confidence: number between 0 and 1
  rationale: string under 240 characters

validation:
  on_invalid_json: retry_once
  on_schema_failure: manual_review
  on_low_confidence: manual_review

evals:
  dataset: support_ticket_triage_eval_set
  minimum_accuracy: 0.92
  required_cases:
    - enterprise_escalation
    - missing_context
    - ambiguous_category

release:
  change_requires_eval_pass: true
  change_requires_approval: true

This definition gives you something closer to application code than a chat message. It is easier to diff, review, test, and roll back.
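To show how a caller might consume such a definition, the sketch below renders the final prompt text from an already-parsed definition. Parsing YAML needs a third-party library, so this example starts from a plain dict that holds only a subset of the fields. DEFINITION and render_prompt are hypothetical names, and the layout of the rendered prompt is an assumption.

```python
import json

# A subset of the definition, as it might look after parsing.
DEFINITION = {
    "id": "support_ticket_triage",
    "version": "3.2.0",
    "instruction": (
        "You classify support tickets for a B2B developer tools company.\n"
        "Use only the provided ticket text, customer metadata, and product taxonomy.\n"
        "Return valid JSON that matches the schema.\n"
        "Do not invent missing details.\n"
    ),
    "inputs": {
        "ticket_text": {"type": "string", "required": True},
        "customer_plan": {"type": "enum", "required": True},
        "product_taxonomy": {"type": "array", "required": True},
    },
}


def render_prompt(definition: dict, values: dict) -> str:
    """Combine the instruction with resolved inputs, failing fast on missing ones."""
    missing = [
        name
        for name, spec in definition["inputs"].items()
        if spec.get("required") and name not in values
    ]
    if missing:
        raise ValueError(f"missing required inputs: {missing}")

    lines = [definition["instruction"].strip(), ""]
    for name in definition["inputs"]:
        if name in values:
            lines.append(f"{name}: {json.dumps(values[name])}")
    return "\n".join(lines)
```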

Version prompts like production artifacts

Changing a production prompt without versioning is a common source of regressions. A small wording change can affect classification thresholds, schema compliance, refusal behavior, tone, or tool use.

Use semantic or date-based versions. Track at least:

  • Prompt definition version.
  • Instruction text diff.
  • Model and setting changes.
  • Context source changes.
  • Schema changes.
  • Eval results before release.
  • Owner and approver.
  • Release time and rollback target.

For example, changing temperature from 0.1 to 0.7 should count as a versioned change for a classification prompt. So should adding a new category, changing retrieval ranking, or editing a business rule.
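A release check can flag edits that change behavior without a version change. The following is a minimal sketch, assuming definitions are available as parsed dicts. TRACKED_SECTIONS, changed_sections, and needs_version_bump are hypothetical names chosen for this example.

```python
# Sections whose changes should always produce a new version.
TRACKED_SECTIONS = (
    "model",
    "inputs",
    "context",
    "business_rules",
    "instruction",
    "output_schema",
)


def changed_sections(old: dict, new: dict) -> list[str]:
    """List the tracked sections that differ between two definitions."""
    return [name for name in TRACKED_SECTIONS if old.get(name) != new.get(name)]


def needs_version_bump(old: dict, new: dict) -> bool:
    """True when tracked content changed but the version string did not."""
    return bool(changed_sections(old, new)) and old.get("version") == new.get("version")
```

Run against the temperature example, changed_sections reports the model section, and needs_version_bump stays true until the version field is updated.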

A prompt management workflow helps keep these changes visible to the team instead of spread across code comments, notebooks, and dashboard edits.

Connect the definition to evals

A prompt definition is incomplete without tests. You need evals that measure whether the prompt still does its job after edits, model upgrades, retrieval changes, or schema changes.

Start with 30 to 100 labeled examples for a narrow task. For higher-risk workflows, use more. Include real production-like cases, not only clean examples written by the team.

For each eval case, store:

  • Input values.
  • Injected context.
  • Expected output or grading rubric.
  • Business rule coverage.
  • Known edge case label.
  • Prompt version tested.
  • Model version tested.

Use deterministic checks where possible. For structured outputs, validate JSON, required fields, enum values, and schema compliance. For classification, compare expected labels. For generated text, use rubric grading plus spot checks for high-impact cases.

Do not wait until the prompt is “final” to add evals. Add tests when you first define the prompt. Then every prompt change becomes easier to judge.
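For a classification task, a first eval runner can be very small. The sketch below compares expected labels and checks the result against a minimum accuracy. The run_evals function and the case layout are assumptions made for illustration, and classify stands in for whatever function wraps your model call.

```python
from typing import Callable

# Label fields compared for the triage task.
LABEL_FIELDS = ("urgency", "category", "routing_team")


def run_evals(
    cases: list[dict],
    classify: Callable[[dict], dict],
    minimum_accuracy: float = 0.92,
) -> dict:
    """Run every case and report accuracy plus the IDs of failing cases."""
    failures = []
    for case in cases:
        output = classify(case["input"])
        if any(output.get(name) != case["expected"][name] for name in LABEL_FIELDS):
            failures.append(case["id"])
    accuracy = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return {
        "accuracy": accuracy,
        "failures": failures,
        "passed": accuracy >= minimum_accuracy,
    }
```

Reporting the failing case IDs, not only the score, lets you see which business rule or edge case a prompt change broke.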

Trace prompt behavior in production

Even strong evals miss some real-world cases. Your prompt definition should include what you log and inspect in production.

  • Prompt version.
  • Model name and settings.
  • Resolved input variables.
  • Context source IDs and retrieval metadata.
  • Output and validation result.
  • Latency, token usage, and cost.
  • Fallback or retry events.
  • User feedback or downstream correction.

This matters when a user reports a bad answer. You need to know whether the failure came from the instruction, missing context, stale retrieval results, schema drift, model behavior, or a downstream parser.
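One structured log record per call is usually enough to answer those questions later. The following is a minimal sketch of such a record. TraceRecord, its field names, and to_log_line are hypothetical, and a tracing tool may capture the same data for you.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceRecord:
    """Hypothetical per-call trace record for one LLM request."""
    prompt_id: str
    prompt_version: str
    model: str
    temperature: float
    inputs: dict
    context_source_ids: list
    output: str
    validation_errors: list
    latency_ms: float
    input_tokens: int
    output_tokens: int
    fallback_used: bool = False
    timestamp: float = field(default_factory=time.time)


def to_log_line(record: TraceRecord) -> str:
    """Serialize the record as one JSON line for a log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because the prompt version and resolved inputs sit in the same record as the output, a bad answer can be replayed against the exact definition that produced it.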

Handle chains and agents with stricter boundaries

For agents and workflows, define each LLM call separately. Do not use one giant prompt definition for planning, retrieval, tool selection, user response, and post-processing.

A useful pattern is:

  1. Planner prompt: Decide the next step or tool.
  2. Retriever prompt or query builder: Convert user intent into search queries or filters.
  3. Tool result summarizer: Compress tool output into task-specific context.
  4. Final response prompt: Produce the user-facing response.
  5. Verifier prompt: Check the output against policy, schema, or task criteria.

Each prompt needs its own inputs, context rules, schema, evals, and version history. A failure in a chained workflow is much easier to debug when every step has a clear contract.
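A chain runner that checks each step's inputs makes those contracts enforceable. The sketch below is one way to do it, with each step's LLM call replaced by a plain function. Step, run_chain, and the state dict convention are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """Hypothetical wrapper for one prompt in a chain."""
    name: str
    version: str
    required_inputs: tuple
    run: Callable[[dict], dict]  # takes the shared state, returns new keys


def run_chain(steps: list, state: dict) -> dict:
    """Run steps in order, failing with the step name when its contract is not met."""
    state = dict(state)  # do not mutate the caller's dict
    for step in steps:
        missing = [key for key in step.required_inputs if key not in state]
        if missing:
            raise ValueError(f"{step.name}@{step.version} is missing inputs: {missing}")
        state.update(step.run(state))
    return state
```

When a step fails, the error names the prompt and version involved, which narrows debugging to one definition instead of the whole workflow.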

If your team is working on compiler-like orchestration for LLM workflows, the LLM compiler concept is useful for thinking about how prompts, tools, intermediate representations, and execution plans fit together.

Common mistakes to avoid

Defining only the instruction text

The model does not run on instruction text alone. It runs with inputs, context, model settings, schemas, tools, and application logic. Define all of them.

Omitting context sources

If the prompt uses retrieved docs, customer records, conversation history, or tool output, document where that context comes from and how it is selected. Otherwise, you cannot tell whether a bad answer came from the model or from bad context.

Mixing business rules into ad hoc prompts

Business rules should be reviewable and testable. If a support escalation rule changes, you should know which prompts and eval cases are affected.

Skipping examples

Examples reduce ambiguity. They also help new engineers understand the expected behavior without reading production logs.

Failing to specify output schema

Freeform output increases parser failures and downstream branching. If your app expects structured data, define the schema and validate it.

Changing production prompts without versioning or tests

Prompt edits are code changes when they affect production behavior. Version them, run evals, and keep a rollback path.

A simple checklist

Before shipping a prompt definition, confirm that you have:

  • A clear feature contract.
  • Named runtime inputs with types and sources.
  • Documented context sources and freshness rules.
  • Separate business rules with owners.
  • Task-specific instruction text.
  • A strict output schema.
  • Examples for common and edge cases.
  • Model settings and fallback behavior.
  • Eval cases tied to the prompt version.
  • Production tracing for inputs, context, output, cost, and errors.
  • A release process with approval and rollback.

Final take

A good prompt definition gives your team a stable contract for an LLM call. It makes the prompt easier to test, review, debug, and improve. The instruction text still matters, but production reliability usually depends on everything around it: context, schemas, evals, versions, and traces.

If your team treats prompts as production artifacts, you can change them with more confidence and fewer surprises.


PromptLayer helps teams manage prompt definitions, versions, evals, datasets, and traces in one place. If you are building or shipping LLM-powered applications, create an account at https://dashboard.promptlayer.com/create-account.
