How to Debug LLM Tool Calls
Tool calls make LLM apps useful, but they also add new failure modes. A model can choose the wrong tool, pass malformed arguments, ignore a tool result, retry too many times, or complete the task with stale context. When this happens in production, reading the final answer is rarely enough. You need to inspect the full path: prompt, tool schema, model response, tool execution, returned data, and final model output.
A good debugging process turns tool call failures into observable, repeatable cases. Instead of asking “why did the model do that?” in the abstract, you should be able to answer specific questions: What tools were available? What arguments did the model generate? Did validation pass? What did the API return? Did the model use the result correctly?
Start by capturing the full tool call trace
You cannot debug what you did not record. For every LLM request that can call tools, log the complete execution path.
- User input: The raw message or task that started the run.
- System and developer prompts: The instructions that shaped tool selection and argument generation.
- Available tools: Tool names, descriptions, JSON schemas, required fields, enums, and defaults.
- Model output: The selected tool, arguments, reasoning metadata if available, and any assistant message.
- Validation result: Whether the generated arguments matched the schema.
- Tool execution: Request payload, response body, status code, latency, timeout, and exception details.
- Follow-up model call: The tool result passed back to the model and the final answer.
This is where LLM observability matters. A standard application log may show that an API call failed, but it often misses the model decision that caused the bad call. For tool-using agents, you need both application telemetry and model-level traces.
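As a minimal sketch of such a trace, the record below keeps one entry per tool-capable request. The class name `ToolCallTrace` and all field names are assumptions made for illustration, not a standard format. Adapt them to whatever your logging or tracing stack expects.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCallTrace:
    """One record per tool-capable LLM request (hypothetical field names)."""
    user_input: str
    prompts: dict[str, str]                # e.g. {"system": "...", "developer": "..."}
    available_tools: list[dict[str, Any]]  # names, descriptions, JSON schemas
    model_output: dict[str, Any]           # selected tool, arguments, assistant text
    validation_ok: Optional[bool] = None
    validation_errors: list[str] = field(default_factory=list)
    tool_execution: dict[str, Any] = field(default_factory=dict)
    final_answer: Optional[str] = None
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        # Sorted keys keep serialized traces easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


trace = ToolCallTrace(
    user_input="Where is order ORD-12345?",
    prompts={"system": "You are a support agent."},
    available_tools=[{"name": "get_order_status"}],
    model_output={"tool": "get_order_status", "arguments": {"order_id": "ORD-12345"}},
)
# Fill in later stages as the run progresses.
trace.validation_ok = True
trace.tool_execution = {"status_code": 200, "latency_ms": 182, "body": {"status": "shipped"}}
trace.final_answer = "Your order has shipped."
print(trace.to_json())
```

Keeping the model decision and the tool execution in the same record means a single lookup by `trace_id` answers both "what did the model ask for?" and "what did the API do?".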
Classify the failure before changing prompts
Many teams react to tool call bugs by adding another sentence to the system prompt. That can help, but only after you know what failed. Most tool call issues fall into a few categories.
1. The model chose the wrong tool
The model selected a tool that does not match the user’s intent. For example, a support agent calls refund_order when the user only asked for the refund policy.
Common causes include vague tool names, overlapping tool descriptions, missing negative instructions, or too many tools exposed at once.
Fixes to try:
- Rename tools with action-specific names, such as get_refund_policy and issue_refund.
- Add clear “use this when” and “do not use this when” language to each tool description.
- Split high-risk tools from read-only tools.
- Limit the tool set based on the user’s current workflow or permissions.
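The last two fixes can be sketched as a filter that runs before each model call. The registry layout, the `risk` and `requires` keys, and the `refunds:write` permission string are all hypothetical. The point is only that the model never sees a tool it should not be able to pick.

```python
from typing import Any

# Hypothetical tool registry: each entry carries a risk level and the
# permission a user needs before the tool is exposed to the model.
TOOLS: list[dict[str, Any]] = [
    {"name": "get_refund_policy", "risk": "read", "requires": None},
    {"name": "get_order_status", "risk": "read", "requires": None},
    {"name": "issue_refund", "risk": "write", "requires": "refunds:write"},
]


def tools_for_request(user_permissions: set[str], allow_writes: bool) -> list[dict[str, Any]]:
    """Return only the tools this user and workflow step should see."""
    selected = []
    for tool in TOOLS:
        # Hide write tools entirely during read-only workflow steps.
        if tool["risk"] == "write" and not allow_writes:
            continue
        # Hide tools the user lacks permission for.
        if tool["requires"] is not None and tool["requires"] not in user_permissions:
            continue
        selected.append(tool)
    return selected


# A policy question from an unprivileged user only exposes read tools.
print([t["name"] for t in tools_for_request(set(), allow_writes=False)])
```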
2. The model generated malformed arguments
The model picked the right tool but sent invalid arguments. For example, it passes "tomorrow morning" into a field that expects an ISO timestamp, or it omits a required customer_id.
Fixes to try:
- Use strict JSON schema validation before executing the tool.
- Add examples for tricky fields, such as dates, currency, IDs, and enum values.
- Prefer enums over free text when the valid choices are known.
- Return validation errors to the model in a compact, structured format.
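Here is one way the validate-before-execute step might look, using only the standard library. The function name `validate_delivery_args` and the two fields it checks are assumptions based on the examples in this section. A real system would more likely validate against the tool's full JSON schema.

```python
from datetime import date
from typing import Any, Optional


def validate_delivery_args(args: dict[str, Any]) -> Optional[dict[str, str]]:
    """Return None when the arguments are valid, else a compact error dict."""
    if "customer_id" not in args:
        return {
            "error": "INVALID_ARGUMENTS",
            "field": "customer_id",
            "expected": "non-empty string",
            "received": "missing",
        }
    raw = args.get("delivery_date")
    try:
        # Raises ValueError for free text such as "tomorrow morning".
        date.fromisoformat(str(raw))
    except ValueError:
        return {
            "error": "INVALID_ARGUMENTS",
            "field": "delivery_date",
            "expected": "ISO 8601 date, for example 2026-05-28",
            "received": str(raw),
        }
    return None


print(validate_delivery_args({"customer_id": "C-1", "delivery_date": "tomorrow morning"}))
```

The returned dictionary can be serialized and passed back to the model as the tool result, so the next attempt has a concrete field and format to correct.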
A useful validation error might look like this:
{
  "error": "INVALID_ARGUMENTS",
  "field": "delivery_date",
  "expected": "ISO 8601 date, for example 2026-05-28",
  "received": "tomorrow morning"
}
3. The tool failed at runtime
The arguments were valid, but the downstream system failed. The API may return a 500, hit a rate limit, time out, or reject the request because the user lacks permission.
Fixes to try:
- Separate model errors from infrastructure errors in your logs.
- Record status codes, timeout duration, retry count, and response snippets.
- Return safe error summaries to the model instead of raw stack traces.
- Use idempotency keys for write actions, such as payments, refunds, and account updates.
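A rough sketch of a tool wrapper that applies these fixes follows. `ToolTimeout`, `run_tool`, and the `idempotency_key` argument name are hypothetical. The sketch assumes your tool client raises distinguishable exceptions and that write tools accept an idempotency key.

```python
import uuid
from typing import Any, Callable


class ToolTimeout(Exception):
    """Hypothetical exception raised by a tool client on timeout."""


def run_tool(tool: Callable[..., dict[str, Any]], args: dict[str, Any],
             is_write: bool = False) -> dict[str, Any]:
    """Execute a tool and return a safe summary for the model."""
    call_args = dict(args)
    if is_write:
        # Reuse the caller's key if one was passed, otherwise create one.
        # For retries, the caller should pass the same key again.
        call_args.setdefault("idempotency_key", uuid.uuid4().hex)
    try:
        return {"ok": True, "result": tool(**call_args)}
    except ToolTimeout:
        return {"ok": False, "error": "TIMEOUT", "retryable": True}
    except PermissionError:
        return {"ok": False, "error": "PERMISSION_DENIED", "retryable": False}
    except Exception as exc:
        # Log the full exception elsewhere; the model only sees the type name,
        # never the message or stack trace.
        return {"ok": False, "error": "TOOL_ERROR",
                "detail": type(exc).__name__, "retryable": False}
```

Returning a `retryable` flag gives both the agent loop and the model a clear signal about whether another attempt is worth making.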
4. The model ignored the tool result
The tool returned the right data, but the model answered with a different value. For example, the inventory API returns “out of stock,” but the assistant tells the user the item is available.
Fixes to try:
- Place tool results in a clearly labeled message, such as TOOL_RESULT.
- Instruct the model to treat tool results as the source of truth for specific fields.
- Ask the model to cite the exact field it used for critical answers.
- Run assertions on the final answer when the expected value is known.
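For the inventory example, an assertion on the final answer could be as simple as the keyword check below. This is a deliberately crude sketch: the `in_stock` field and the phrase lists are assumptions, and keyword matching will miss paraphrases, so treat it as a first-pass filter rather than a complete check.

```python
def answer_conflicts_with_stock(tool_result: dict, final_answer: str) -> bool:
    """Flag answers whose availability claim contradicts the tool result."""
    text = final_answer.lower()
    # Check negative phrasings first, since "unavailable" contains "available".
    says_unavailable = any(
        phrase in text for phrase in ("out of stock", "unavailable", "not available")
    )
    says_available = not says_unavailable and any(
        phrase in text for phrase in ("in stock", "available")
    )
    in_stock = bool(tool_result["in_stock"])  # hypothetical field name
    return (says_available and not in_stock) or (says_unavailable and in_stock)


# The tool says out of stock, but the answer claims availability.
print(answer_conflicts_with_stock({"in_stock": False}, "Good news, the item is available!"))
```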
5. The agent gets stuck in a retry loop
Some agents call the same tool repeatedly with the same bad arguments. This usually means the model is not receiving a useful error, the retry policy is too loose, or the agent has no clear stop condition.
Fixes to try:
- Set a maximum number of tool calls per run, such as 5 for simple workflows or 15 for research agents.
- Stop retries when the same tool and arguments fail twice.
- Tell the model when it should ask the user for missing information.
- Summarize previous failed attempts in the next model call.
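The first two fixes can be combined in a small guard object that the agent loop consults before each tool call. `RetryGuard` and its method names are hypothetical. The defaults mirror the numbers suggested above.

```python
import json
from typing import Any


class RetryGuard:
    """Tracks tool calls in one run and decides when to stop."""

    def __init__(self, max_calls: int = 5, max_identical_failures: int = 2):
        self.max_calls = max_calls
        self.max_identical_failures = max_identical_failures
        self.calls = 0
        self.failures: dict[str, int] = {}

    @staticmethod
    def _key(tool: str, args: dict[str, Any]) -> str:
        # Sorted keys so argument order does not hide a duplicate call.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def allow(self, tool: str, args: dict[str, Any]) -> bool:
        """Return False when the run should stop instead of calling the tool."""
        if self.calls >= self.max_calls:
            return False
        if self.failures.get(self._key(tool, args), 0) >= self.max_identical_failures:
            return False
        self.calls += 1
        return True

    def record_failure(self, tool: str, args: dict[str, Any]) -> None:
        """Count a failed call so identical retries can be detected."""
        key = self._key(tool, args)
        self.failures[key] = self.failures.get(key, 0) + 1
```

When `allow` returns False, the loop can hand control back to the user or ask the model for a clarification question instead of issuing another identical call.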
Debug with a minimal reproduction
Once you find a bad trace, reduce it to the smallest case that still fails. Keep the same model, tool schema, relevant prompt sections, and user input. Remove unrelated conversation history, extra tools, and application state.
A minimal reproduction helps you test changes quickly. For example, if a booking assistant sends an invalid airport code, create a test case with only the flight search tool and the exact user request that caused the bug. Then change one variable at a time: tool description, schema, examples, or prompt instructions.
A good reproduction should include:
- The original user input.
- The exact tool definitions available to the model.
- The model name and temperature.
- The generated tool call.
- The expected tool call.
- The reason the original output failed.
Make tool schemas hard to misuse
Tool schemas are part of your prompt surface. If they are vague, the model will guess. If they are precise, the model has a better chance of producing valid calls.
Use these schema practices:
- Use specific names: Prefer search_customer_orders over search.
- Write short descriptions: A 2-sentence description usually works better than a long paragraph.
- Mark required fields carefully: Do not require fields the model cannot know.
- Use enums: Replace open text fields with fixed values when possible.
- Add field descriptions: Explain expected formats, units, and constraints.
- Separate read and write tools: Keep get_invoice separate from send_invoice.
Here is a cleaner schema pattern for a support tool:
{
  "name": "get_order_status",
  "description": "Use this to check the current shipping or delivery status for an existing order. Do not use it to create, cancel, or refund an order.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {
        "type": "string",
        "description": "The order ID shown to the customer, for example ORD-12345."
      }
    },
    "required": ["order_id"]
  }
}
Use evals for repeatable debugging
After you fix a tool call bug, turn it into a regression test. Otherwise, the same issue can return when you change the system prompt, add a tool, switch models, or adjust retrieval.
For tool calls, LLM evaluation should check more than the final answer. You can evaluate the intermediate steps too.
- Tool selection accuracy: Did the model choose the expected tool?
- Argument validity: Did the arguments pass schema validation?
- Argument correctness: Did the arguments match the user’s request?
- Execution success: Did the tool return a valid result?
- Result usage: Did the final answer reflect the tool output?
- Safety checks: Did the model avoid restricted tools or unauthorized actions?
For example, if a user asks, “What is the status of order ORD-12345?”, your test can assert that the model calls get_order_status with {"order_id":"ORD-12345"}. You do not need to wait until the final answer to catch the error.
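That assertion might be written as a tiny regression harness like the one below. The case format and the `call_model` callback are assumptions. `fake_model` stands in for a real model call so the sketch runs offline.

```python
from typing import Any, Callable

# A hypothetical regression case: user input plus the tool call we expect.
CASES = [
    {
        "input": "What is the status of order ORD-12345?",
        "expected_tool": "get_order_status",
        "expected_args": {"order_id": "ORD-12345"},
    },
]


def evaluate(call_model: Callable[[str], dict[str, Any]]) -> list[dict[str, Any]]:
    """Run each case and score tool selection and arguments separately."""
    results = []
    for case in CASES:
        call = call_model(case["input"])
        results.append({
            "input": case["input"],
            "tool_ok": call.get("tool") == case["expected_tool"],
            "args_ok": call.get("arguments") == case["expected_args"],
        })
    return results


def fake_model(user_input: str) -> dict[str, Any]:
    # Stand-in for a real model call; replace with your client.
    return {"tool": "get_order_status", "arguments": {"order_id": "ORD-12345"}}


print(evaluate(fake_model))
```

Scoring tool selection and arguments separately makes it clear which of the two regressed when a case starts failing.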
Consider LLM-as-a-judge for complex cases
Some tool call behavior is easy to check with code. JSON validity, required fields, exact IDs, and status codes should use deterministic checks.
Other cases need judgment. For example, did the model ask a reasonable clarification question instead of guessing? Did it choose the safest read-only tool before a write action? Did it use the retrieved policy correctly?
In those cases, LLM-as-a-judge can help grade traces. Use it carefully. Give the judge a rubric, include the tool definitions and tool results, and ask for a structured score. A simple 1 to 5 score with a short reason is often enough for triage.
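One way to wire this up is sketched below: build the judge prompt from the rubric, tool definitions, and trace, then strictly validate the structured reply. The rubric wording, the `tools` and `steps` keys, and the reply format are illustrative assumptions. The call to the judge model itself is left out.

```python
import json
from typing import Any

RUBRIC = (
    "Score 1 to 5. 5: the tool choice and arguments fully match the user's "
    "intent and the final answer reflects the tool result. 1: wrong tool or "
    "an answer that contradicts the tool result."
)


def build_judge_prompt(trace: dict[str, Any]) -> str:
    """Assemble the rubric, tool definitions, and trace into one prompt."""
    return (
        f"{RUBRIC}\n\n"
        f"Tool definitions:\n{json.dumps(trace['tools'], indent=2)}\n\n"
        f"Trace:\n{json.dumps(trace['steps'], indent=2)}\n\n"
        'Reply with JSON only: {"score": <1-5>, "reason": "<one sentence>"}'
    )


def parse_judge_reply(reply: str) -> dict[str, Any]:
    """Validate the judge's reply; raise ValueError if it is unusable."""
    data = json.loads(reply)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score!r}")
    return {"score": score, "reason": str(data.get("reason", ""))}
```

Rejecting malformed replies keeps a misbehaving judge from silently skewing your triage scores.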
Track the metrics that point to real failures
Tool call debugging gets easier when you track operational metrics over time. Start with a small set that maps to user impact.
- Tool call success rate: Percentage of tool calls that execute without validation or runtime errors.
- Invalid argument rate: Percentage of calls rejected by schema validation.
- Wrong tool rate: Percentage of reviewed traces where the selected tool was incorrect.
- Retry rate: Average number of tool retries per run.
- Loop termination rate: Percentage of runs stopped by max-call limits.
- Tool latency: p50, p95, and p99 latency by tool.
- Final answer mismatch rate: Percentage of answers that conflict with tool results.
For a customer support agent, a healthy baseline might be a tool call success rate above 98%, an invalid argument rate below 1%, and p95 tool latency under 2 seconds. Your numbers will vary by product, so watch the trend more than any fixed target. A sudden jump in invalid arguments after a prompt change is a strong signal that the change caused a regression.
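A few of these metrics can be computed directly from logged call records, as in the sketch below. The `status` and `latency_ms` keys and the status values are hypothetical. Rates are returned as fractions, and the percentile uses the simple nearest-rank method.

```python
import math
from typing import Any


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; assumes a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(calls: list[dict[str, Any]]) -> dict[str, float]:
    """Compute rates from call records with 'status' and 'latency_ms' keys."""
    total = len(calls)
    ok = sum(1 for c in calls if c["status"] == "ok")
    invalid = sum(1 for c in calls if c["status"] == "invalid_arguments")
    # Calls rejected by validation never reached the tool, so skip their latency.
    latencies = [c["latency_ms"] for c in calls if c["status"] != "invalid_arguments"]
    return {
        "success_rate": ok / total,
        "invalid_argument_rate": invalid / total,
        "p95_latency_ms": percentile(latencies, 95),
    }


calls = [{"status": "ok", "latency_ms": 100 * i} for i in range(1, 9)]
calls.append({"status": "invalid_arguments", "latency_ms": 0})
calls.append({"status": "runtime_error", "latency_ms": 2000})
print(summarize(calls))
```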
Add guardrails around high-risk tools
Some tools should never run based only on a loose model decision. Refunds, payments, account deletion, permission changes, and outbound emails need tighter controls.
Practical guardrails include:
- Permission checks: Verify that the user can perform the action before execution.
- Confirmation steps: Ask the user to confirm before irreversible actions.
- Dry runs: Let the model preview what would happen before committing the action.
- Amount limits: Require extra review for refunds above a threshold, such as $100.
- Idempotency: Prevent duplicate writes during retries.
- Policy checks: Block actions that violate business rules, even if the model requests them.
Guardrails should live in application code, not only in the prompt. A system prompt can tell the model not to issue refunds without confirmation, but your backend should still reject an unconfirmed refund request.
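A backend check along these lines might look like the following. The `RefundRequest` shape, the `refunds:write` permission string, and the reason codes are assumptions. The threshold reuses the $100 example from the list above.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD_USD = 100.0  # example threshold from the list above


@dataclass
class RefundRequest:
    order_id: str
    amount_usd: float
    user_confirmed: bool
    requester_permissions: frozenset


def check_refund(req: RefundRequest) -> tuple[bool, str]:
    """Return (allowed, reason). Runs in the backend, regardless of the prompt."""
    if "refunds:write" not in req.requester_permissions:
        return False, "PERMISSION_DENIED"
    if not req.user_confirmed:
        return False, "CONFIRMATION_REQUIRED"
    if req.amount_usd > REVIEW_THRESHOLD_USD:
        return False, "MANUAL_REVIEW_REQUIRED"
    return True, "OK"


# The model requested a refund, but the user never confirmed it.
request = RefundRequest("ORD-12345", 40.0, False, frozenset({"refunds:write"}))
print(check_refund(request))
```

Because the check runs after the model has produced its tool call, it holds even when a prompt change or a jailbreak attempt gets the model to skip the confirmation step.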
Handle prompt chaining and multi-step tool flows
Tool bugs often appear in multi-step flows. An agent may retrieve a document, summarize it, call an API, then write a final response. A small mistake in step 1 can corrupt step 4.
When debugging chained calls, inspect each boundary:
- What context entered this step?
- What decision did the model make?
- What tool was called?
- What data came back?
- What state was saved for the next step?
If your app uses planning, routing, or compiled prompt flows, an LLM compiler can make these paths easier to reason about by turning complex prompt logic into structured execution. The main debugging principle stays the same: capture every step and test each transition.
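To make each boundary inspectable, a chain runner can record what entered and left every step. The sketch below is a simplified, hypothetical runner, not a real planning framework. It assumes each step is a plain function that returns a new value rather than mutating its input.

```python
from typing import Any, Callable

Step = tuple[str, Callable[[Any], Any]]


def run_chain(steps: list[Step], initial: Any) -> tuple[Any, list[dict[str, Any]]]:
    """Run steps in order and record what entered and left each one."""
    log: list[dict[str, Any]] = []
    state = initial
    for name, fn in steps:
        entry: dict[str, Any] = {"step": name, "input": state}
        try:
            state = fn(state)
            entry["output"] = state
        except Exception as exc:
            # Stop at the first failing boundary and keep the evidence.
            entry["error"] = f"{type(exc).__name__}: {exc}"
            log.append(entry)
            break
        log.append(entry)
    return state, log


steps: list[Step] = [
    ("retrieve", lambda query: f"doc for {query}"),
    ("summarize", lambda doc: doc.upper()),
]
final, log = run_chain(steps, "refund policy")
for entry in log:
    print(entry)
```

Reading the log from the top shows the first step whose output does not match what the next step expected, which is usually where the real bug lives.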
A practical debugging checklist
Use this checklist when a tool call fails in production:
- Find the full trace for the run.
- Confirm which tools were available to the model.
- Check whether the chosen tool matched the user’s intent.
- Validate the generated arguments against the schema.
- Compare generated arguments with the expected arguments.
- Inspect the tool response, status code, latency, and errors.
- Check whether retries changed anything or repeated the same failure.
- Verify that the final answer used the tool result correctly.
- Create a minimal reproduction.
- Add the case to your eval suite before shipping the fix.
Common fixes that work in production
Most production fixes fall into a few practical buckets:
- Reduce tool ambiguity: Rename tools, improve descriptions, and remove overlapping capabilities.
- Strengthen validation: Reject bad arguments before execution and return clear errors.
- Constrain retries: Add max-call limits, duplicate-call detection, and stop conditions.
- Improve context: Pass only the relevant user state and tool results into each model call.
- Add eval coverage: Test tool choice, arguments, execution, and final answer alignment.
- Protect risky actions: Enforce permissions, confirmations, and idempotency in code.
The best debugging workflow is systematic. Capture traces, classify the failure, reproduce it, apply a targeted fix, and add a regression test. This keeps your tool-calling system stable as your prompts, tools, models, and product flows change.
PromptLayer helps teams trace LLM runs, debug tool calls, manage prompt versions, and build evals for production AI systems. If you are shipping agents or tool-using LLM apps, create a PromptLayer account to start tracking and improving your prompts today.