How to Define Few-Shot Context
Few-shot context is the set of examples you give an LLM so it can infer the pattern you want: the task, the input shape, the reasoning style, the output format, and the edge cases that matter. For teams shipping LLM-powered products, defining this context well can make the difference between a prompt that works in a demo and one that stays reliable in production.
Few-shot prompting works because models can adapt to patterns inside the prompt without retraining. This is a form of in-context learning. The examples act like temporary training data for that request: no model weights change, and the effect ends when the request does.
The hard part is choosing the right examples. More examples do not always mean better performance. Poor examples can teach the model the wrong schema, hide failure cases, increase latency, and consume budget inside the context window.
Start with the behavior you need
Before writing examples, define the target behavior in concrete terms. A useful few-shot prompt starts with a clear contract:
- Task: What should the model do?
- Inputs: What variables will change at runtime?
- Output: What exact schema should the model return?
- Decision rules: What should the model optimize for?
- Failure behavior: What should happen when the input is incomplete, unsafe, or out of scope?
- Evaluation: How will you decide whether the output is correct?
For example, if you are building a support ticket classifier, do not start by pasting ten random tickets into a prompt. Start by defining the labels, the JSON output, and the edge cases.
Task: Classify incoming support tickets.
Allowed labels:
- billing
- bug
- account_access
- feature_request
- other
Return valid JSON only:
{
"label": "billing | bug | account_access | feature_request | other",
"confidence": 0.0,
"reason": "short explanation"
}
If the ticket contains multiple issues, choose the label that represents the user's primary request.
Once you have that contract, examples can reinforce it instead of replacing it.
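The same contract can be enforced in code, so that evals and production share one definition of a correct response. The following is a minimal sketch, not a definitive implementation: the function name `validate_output` is hypothetical, and the 0 to 1 range for `confidence` is an assumption, since the contract above only shows a numeric placeholder.

```python
import json

# Labels copied from the task contract above.
ALLOWED_LABELS = {"billing", "bug", "account_access", "feature_request", "other"}


def validate_output(raw: str) -> dict:
    """Parse a model response and check it against the ticket contract.

    Raises ValueError when the response breaks the contract.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if set(data) != {"label", "confidence", "reason"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")

    label = data["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")

    confidence = data["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    # Assumption: confidence is a probability-like score in [0, 1].
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    reason = data["reason"]
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return data
```

A validator like this can run both in evals and in the application path, so a contract violation is counted the same way in both places.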
Use examples to teach boundaries, not obvious cases
A common mistake is choosing only easy examples. If every example is clean, short, and obvious, the model may perform well in testing and fail on real user input.
Good few-shot context should include examples that define the boundaries of the task. For a classifier, include cases where labels are easy to confuse. For an extraction task, include missing fields, noisy text, and values that should be ignored. For an agent workflow, include examples where the agent should stop, ask for clarification, or refuse to call a tool.
Example selection criteria
| Criterion | What to include | Example |
|---|---|---|
| Representative cases | Inputs that match normal production traffic | A typical billing question with one clear request |
| Boundary cases | Inputs that sit between two labels or actions | A user mentions a refund and a login issue, but mainly asks about account access |
| Negative cases | Inputs that should not trigger the desired behavior | A feature complaint that should not be classified as a bug |
| Format stress tests | Messy, long, or incomplete inputs | A ticket with copied logs, typos, and missing account details |
| High-value failures | Cases where a wrong output creates real cost | A cancellation request incorrectly routed as a feature request |
If you are documenting this for your team, include a screenshot or table like this next to the prompt. It gives reviewers a faster way to see why each example exists.
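One way to keep that table honest is to tag each example with the criterion it serves and check that no criterion is left uncovered. This is a sketch under stated assumptions: the `criterion` field and the criterion names below are hypothetical, invented here to mirror the table rows, and do not come from any specific tool.

```python
# Hypothetical criterion names, one per row of the selection table.
CRITERIA = (
    "representative",
    "boundary",
    "negative",
    "format_stress",
    "high_value_failure",
)


def missing_criteria(examples):
    """Return the selection criteria that no example is tagged with."""
    covered = {example["criterion"] for example in examples}
    unknown = covered - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    # Preserve the table order so the report is easy to scan.
    return [name for name in CRITERIA if name not in covered]


examples = [
    {"criterion": "representative",
     "input": "I was charged twice for my subscription this month."},
    {"criterion": "boundary",
     "input": "I want a refund, but first I need to get back into my account."},
]
print(missing_criteria(examples))
```

With the two examples shown, the report lists the three criteria that still need an example.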
Keep labels and output schemas consistent
Inconsistent labels are one of the fastest ways to degrade few-shot performance. If one example uses "account", another uses "account_access", and the instructions say "login_issue", the model has to guess which pattern wins.
Use the same labels, field names, casing, and data types everywhere:
- Use "account_access" every time, not "Account Access" in one example and "login" in another.
- Return "confidence" as a number every time, not sometimes as "high".
- Use the same JSON field order when possible.
- Keep explanations short if production expects short explanations.
This matters even more when downstream code parses the model response. If your application expects strict JSON, your few-shot examples should all return strict JSON. Do not include prose around the output unless your parser expects prose.
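These consistency rules are easy to check mechanically before a prompt change is reviewed. The sketch below assumes each example is stored as a dict with an `output` dict; that storage shape and the function name `find_inconsistencies` are illustrative assumptions.

```python
def find_inconsistencies(examples, allowed_labels):
    """Return human-readable problems found in few-shot example outputs."""
    problems = []
    if not examples:
        return problems

    # The first example defines the expected field names and their order.
    reference_fields = list(examples[0]["output"])
    for i, example in enumerate(examples, start=1):
        output = example["output"]
        fields = list(output)
        if fields != reference_fields:
            problems.append(
                f"example {i}: fields {fields} do not match {reference_fields}"
            )

        label = output.get("label")
        if not isinstance(label, str) or label not in allowed_labels:
            problems.append(f"example {i}: label {label!r} is not allowed")

        confidence = output.get("confidence")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            problems.append(
                f"example {i}: confidence {confidence!r} is not a number"
            )
    return problems
```

Running a check like this in CI catches a stray "Account Access" or a string-valued confidence before the model ever sees it.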
Separate instructions from examples
Another common mistake is mixing instructions and examples so the model cannot tell which text is the rule and which text is sample data. Use clear sections and stable delimiters.
A clean structure looks like this:
System:
You classify support tickets into one allowed label.
Instructions:
- Return valid JSON only.
- Use one label from the Allowed labels list.
- If multiple issues appear, choose the primary user request.
Allowed labels:
billing, bug, account_access, feature_request, other
Examples:
Example 1 input:
"I was charged twice for my subscription this month."
Example 1 output:
{
"label": "billing",
"confidence": 0.94,
"reason": "The user is asking about an incorrect charge."
}
Example 2 input:
"I can't reset my password because the email never arrives."
Example 2 output:
{
"label": "account_access",
"confidence": 0.91,
"reason": "The user cannot access the account due to a password reset issue."
}
User input:
{{ticket_text}}
This format reduces ambiguity. The runtime variable appears once, after the examples. The examples use the same schema the application expects. The model sees the rule, the pattern, and the current input in a predictable order.
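Assembling this layout from data, instead of hand-editing one long string, keeps the section order fixed as examples are added or removed. A minimal sketch follows; `render_prompt` and the example dict shape are assumptions made for illustration.

```python
import json


def render_prompt(instructions, allowed_labels, examples, ticket_text):
    """Build the prompt in a fixed order: rules, labels, examples, input."""
    lines = [
        "System:",
        "You classify support tickets into one allowed label.",
        "Instructions:",
    ]
    lines += [f"- {rule}" for rule in instructions]
    lines += ["Allowed labels:", ", ".join(allowed_labels), "Examples:"]

    for i, example in enumerate(examples, start=1):
        lines.append(f"Example {i} input:")
        # json.dumps quotes the ticket text, matching the layout above.
        lines.append(json.dumps(example["input"]))
        lines.append(f"Example {i} output:")
        lines.append(json.dumps(example["output"], indent=2))

    # The runtime input appears exactly once, after all examples.
    lines += ["User input:", ticket_text]
    return "\n".join(lines)
```

Because every example output is serialized by the same call, field order and formatting stay identical across examples.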
Use the minimum number of examples that changes behavior
Adding too many examples is easy. It can also make your prompt slower, more expensive, and harder to maintain. Every example competes with the user input, retrieved context, tool results, and system instructions for tokens.
Start small:
- Run a zero-shot version with clear instructions and no examples.
- Add one strong example if the model misses the format or core behavior. This is one-shot prompting.
- Add two to five examples only if they cover distinct failure modes.
- Remove examples that do not improve evaluation results.
For many classification, routing, and structured extraction prompts, three to five examples are enough. For complex writing style transfer or multi-step reasoning tasks, you may need more. Measure the tradeoff instead of guessing.
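The start-small procedure can be sketched as a greedy loop that keeps a candidate only when it improves the eval score by a meaningful margin. This is one possible approach, not the only one: `evaluate` is a hypothetical callable you would supply (it runs your eval set with the given examples and returns a score), and the `min_gain` and `max_examples` defaults are arbitrary assumptions.

```python
def select_examples(candidates, evaluate, min_gain=0.01, max_examples=5):
    """Greedily keep examples that raise the eval score by at least min_gain.

    evaluate(examples) must return a score where higher is better.
    """
    selected = []
    best_score = evaluate(selected)  # zero-shot baseline
    for candidate in candidates:
        if len(selected) >= max_examples:
            break
        score = evaluate(selected + [candidate])
        if score - best_score >= min_gain:
            selected.append(candidate)
            best_score = score
    return selected, best_score
```

The result depends on candidate order, so it is worth trying the strongest boundary cases first. Each call to `evaluate` is a full eval run, which is why the candidate pool should stay small.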
Track token cost and latency
Few-shot context has a direct cost. If each example is 250 tokens and you include eight examples, you have added about 2,000 input tokens to every request before the user input appears.
That cost can be worth it if accuracy improves enough. It can be wasteful if two examples produce the same result. Track these numbers during prompt review:
- Tokens per example: Count both the input and output portions of each example, since both are sent as prompt input.
- Total prompt tokens: Include system instructions, variables, retrieved context, and examples.
- Latency impact: Compare p50 and p95 latency with and without examples.
- Cost per successful task: A longer prompt may be cheaper if it reduces retries or manual review.
For example, a support classifier that costs $0.002 per request with zero-shot prompting may cost $0.004 with few-shot context. If the few-shot version cuts misroutes by 30%, the extra cost may be reasonable. If accuracy only improves by 1%, remove or shorten the examples.
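The tradeoff becomes concrete once a misroute has a price. The sketch below uses the per-request costs from the text; the 10% baseline misroute rate and the $0.50 handling cost per misroute are hypothetical numbers chosen only to illustrate the arithmetic.

```python
def added_input_tokens(tokens_per_example, example_count):
    """Extra input tokens that few-shot examples add to every request."""
    return tokens_per_example * example_count


def expected_cost_per_request(model_cost, misroute_rate, misroute_cost):
    """Model cost plus the average downstream cost of misrouted tickets."""
    return model_cost + misroute_rate * misroute_cost


def break_even_misroute_cost(cost_a, rate_a, cost_b, rate_b):
    """Misroute cost at which prompt B starts to beat the cheaper prompt A."""
    return (cost_b - cost_a) / (rate_a - rate_b)


# Assumptions: 10% baseline misroutes; few-shot cuts that by 30%, to 7%.
zero_shot = expected_cost_per_request(0.002, 0.10, 0.50)
few_shot = expected_cost_per_request(0.004, 0.07, 0.50)
threshold = break_even_misroute_cost(0.002, 0.10, 0.004, 0.07)

print(added_input_tokens(250, 8))  # eight 250-token examples
print(f"zero-shot ${zero_shot:.3f}  few-shot ${few_shot:.3f}")
print(f"break-even misroute cost ${threshold:.3f}")
```

Under these assumed numbers, the few-shot prompt is cheaper overall whenever a misroute costs more than about seven cents to fix. With a smaller accuracy gain, the break-even cost rises and the examples are harder to justify.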
Use variables carefully
Few-shot prompts often combine fixed examples with runtime variables. Keep those variables easy to inspect. A prompt with hidden preprocessing, dynamic example selection, and retrieved context can become hard to debug.
Use clear variable names:
{{ticket_text}}
{{customer_plan}}
{{allowed_labels}}
{{retrieved_policy}}
When a model produces a bad output, you should be able to inspect the final rendered prompt and answer three questions:
- What examples did the model see?
- What variable values were inserted?
- Did any instruction conflict with the examples?
A PromptLayer trace is useful here because it can show the rendered prompt, variables, selected examples, model settings, response, latency, and token usage in one place. For internal documentation, include a screenshot of a trace that shows the examples and runtime variables side by side.
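If you also want a local record of what was rendered, a small helper can fill the `{{variable}}` placeholders and return the final prompt together with the values used. This is a generic sketch using only the standard library; it is not PromptLayer's API, and `render_with_trace` is a hypothetical name.

```python
import re

# Matches placeholders such as {{ticket_text}} or {{ ticket_text }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_with_trace(template, variables):
    """Fill placeholders and return a record suitable for debugging."""
    needed = set(PLACEHOLDER.findall(template))
    missing = needed - set(variables)
    if missing:
        # Fail loudly instead of sending a prompt with unfilled placeholders.
        raise KeyError(f"missing variables: {sorted(missing)}")

    rendered = PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)
    return {
        "rendered_prompt": rendered,
        "variables": {name: variables[name] for name in sorted(needed)},
        "unused_variables": sorted(set(variables) - needed),
    }
```

Storing this record next to the model response makes the three questions above answerable from one place. The `unused_variables` list also flags values that were passed in but never reached the prompt.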
Test zero-shot and few-shot versions against the same dataset
Few-shot context should earn its place. Do not keep examples because they feel safer. Compare versions against the same evaluation dataset.
A simple eval comparison might look like this:
| Prompt version | Examples | Accuracy | Invalid JSON rate | Avg input tokens | p95 latency |
|---|---|---|---|---|---|
| v1 zero-shot | 0 | 84% | 6% | 420 | 1.2s |
| v2 few-shot | 3 | 91% | 1% | 1,150 | 1.8s |
| v3 few-shot | 7 | 92% | 1% | 2,230 | 2.7s |
In this example, v2 is probably the best production choice. v3 adds four examples but improves accuracy by only one point, while nearly doubling average input tokens (1,150 to 2,230) and adding 0.9s of p95 latency. The better prompt is not always the longest prompt.
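The accuracy and invalid JSON columns in a table like this can be computed with a few lines of code. The sketch below assumes you already have the raw model responses and the gold labels for one prompt version; `score_run` is a hypothetical name, and counting an unparseable response as incorrect is a design choice made here, not something the table prescribes.

```python
import json


def score_run(raw_outputs, gold_labels):
    """Compute accuracy and invalid-JSON rate for one prompt version."""
    if len(raw_outputs) != len(gold_labels):
        raise ValueError("outputs and labels must have the same length")
    if not gold_labels:
        raise ValueError("the eval dataset is empty")

    correct = 0
    invalid = 0
    for raw, gold in zip(raw_outputs, gold_labels):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            invalid += 1  # an unparseable response also counts as wrong
            continue
        if isinstance(parsed, dict) and parsed.get("label") == gold:
            correct += 1

    total = len(gold_labels)
    return {"accuracy": correct / total, "invalid_json_rate": invalid / total}
```

Running the same function over every prompt version, on the same dataset, keeps the comparison fair.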
Re-test after model changes
Few-shot context is model-dependent. A prompt that works well on one model can become too verbose, too weak, or misleading on another. Even a minor model version change can affect how the model follows examples, handles JSON, or balances instructions against sample outputs.
Re-run evals when you change:
- The model provider
- The model version
- Temperature or decoding settings
- System instructions
- Output schema
- Example selection logic
- Retrieved context or tool results
Keep old prompt versions and eval results. If a model upgrade causes regressions, you need to know whether the failure came from the model, the examples, the schema, or the surrounding workflow.
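One lightweight way to notice that any item on this list has changed is to fingerprint the full configuration and compare it with the fingerprint stored alongside the last eval run. This is a sketch under assumptions: the config keys and values shown are hypothetical placeholders, and the config must be JSON-serializable.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable SHA-256 hash of a JSON-serializable prompt configuration."""
    # sort_keys makes the hash independent of dict key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_reeval(current_config, last_evaluated_fingerprint):
    """True when anything in the config changed since the last eval run."""
    return config_fingerprint(current_config) != last_evaluated_fingerprint


# Hypothetical configuration; every value here is a placeholder.
config = {
    "provider": "example-provider",
    "model": "example-model-v1",
    "temperature": 0.0,
    "system_instructions": "You classify support tickets into one allowed label.",
    "output_schema": ["label", "confidence", "reason"],
    "example_ids": ["billing-01", "access-02"],
}
baseline = config_fingerprint(config)
print(needs_reeval({**config, "model": "example-model-v2"}, baseline))
```

Saving the fingerprint with each set of eval results also helps answer the question in the paragraph above: if the fingerprint did not change but results did, the regression likely came from outside the prompt configuration.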
Common mistakes to avoid
Adding too many examples
More examples can dilute the pattern, increase cost, and push important context out of the context window. Start with the smallest set that improves your eval metrics.
Using inconsistent labels or schemas
If your examples use different field names or label formats, the model may copy the wrong one. Treat example outputs as part of your API contract.
Choosing only easy examples
Easy examples make the prompt look better than it is. Include ambiguous and high-risk cases from production data.
Mixing rules and examples
Use section headers and delimiters. Put instructions before examples. Put the current user input after examples.
Ignoring token cost
Track cost, latency, and context usage for every prompt version. A few-shot prompt that improves accuracy but doubles latency may still be the wrong fit for a real-time path.
Failing to re-test after model changes
Do not assume examples transfer cleanly between models. Re-run the same eval set before shipping a model or prompt update.
A practical workflow for defining few-shot context
- Define the task contract: Write the task, allowed outputs, schema, and failure behavior.
- Create a zero-shot baseline: Test clear instructions without examples.
- Collect candidate examples: Pull real or realistic inputs that represent common cases, edge cases, and costly failures.
- Normalize outputs: Make every example follow the same label set and schema.
- Add examples one at a time: Measure the impact after each addition.
- Review token cost: Remove examples that do not improve results.
- Run evals: Compare zero-shot, one-shot, and few-shot versions on the same dataset.
- Trace production behavior: Inspect rendered prompts, variables, examples, and outputs when failures occur.
- Version everything: Save prompt versions, model settings, eval results, and notes about why examples changed.
What to include in your prompt documentation
Good documentation helps engineers review prompt changes without reading every token manually. For each few-shot prompt, include:
- An annotated prompt showing instructions, variables, examples, and output schema
- A table explaining why each example was selected
- A PromptLayer trace showing the final rendered prompt with variables and examples
- An eval comparison between zero-shot, one-shot, and few-shot versions
- Token usage, latency, and cost by prompt version
- Known failure cases and planned follow-up tests
This gives your team a practical review process. It also makes prompt changes easier to debug after deployment.
Final checklist
- Does every example teach a distinct behavior?
- Are the labels and output schema identical across all examples?
- Do the examples include edge cases and realistic production inputs?
- Are instructions clearly separated from examples?
- Have you measured zero-shot versus few-shot performance?
- Do you know the added token cost and latency?
- Can you inspect the final rendered prompt in a trace?
- Will you re-run evals after model or prompt changes?
Few-shot context works best when you treat examples as testable prompt assets, not filler text. Choose them with intent, keep them consistent, and measure whether they improve the behavior your application needs.
PromptLayer helps AI teams manage prompt versions, trace variables and examples, run evals, and compare prompt behavior before shipping changes. To start building and testing few-shot prompts with better visibility, create a PromptLayer account.