Fixing Bad Tool Arguments in LLM Apps: Practical Strategies for Developers

How to Fix Bad Tool Arguments

Bad tool arguments are one of the most common failure modes in LLM applications. The model picks the right tool, but sends the wrong payload. The result is a failed API call, a silent no-op, corrupted state, or a user-facing answer based on an action that never happened.

This usually does not mean the model is “bad at tools.” It usually means your tool contract is unclear, your schema is too loose, required context is hidden, or your validation path accepts broken data.

If you are shipping agents, copilots, workflow automation, or API-backed assistants, treat tool arguments as a production interface. Validate them. Version them. Test them against real cases. Trace every failure.

What bad tool arguments look like

A tool call can fail in several ways:

Missing required fields: the model omits an ID, timestamp, user scope, or enum.
Wrong data type: the model sends a string where your code expects an array, number, boolean, or object.
Invalid enum value: the model invents values like "urgent_high" when the API accepts "high".
Ambiguous identifiers: the model sends a customer name instead of a stable customer_id.
Overfilled payloads: the model includes fields your backend ignores or rejects.
Under-specified intent: the model calls a broad tool with vague arguments, then your application guesses what to do.
Invalid JSON: the model returns malformed JSON, and your app tries to repair it silently.

Here is a typical failing tool call payload:

{
  "tool": "create_ticket",
  "arguments": {
    "customer": "Acme",
    "priority": "urgent",
    "issue": "Billing problem",
    "due": "tomorrow"
  }
}

This payload looks understandable to a human, but it is weak as an API contract. Your backend may need customer_id, a known priority enum, a structured description, and an ISO 8601 due date.

Start by finding the exact failure

Do not start by changing temperature. Temperature can affect variability, but it will not fix an unclear schema, missing context, or weak validation.

First, capture the tool call and the validator error in a trace. You need to know whether the model misunderstood the task, lacked the needed data, or followed a bad contract correctly.

Example trace with validation error

{
  "trace_id": "trc_82f4",
  "model": "gpt-4.1",
  "prompt_version": "ticket-router-v12",
  "tool_call": {
    "name": "create_ticket",
    "arguments": {
      "customer": "Acme",
      "priority": "urgent",
      "issue": "Billing problem",
      "due": "tomorrow"
    }
  },
  "validation_error": {
    "customer_id": "required",
    "priority": "must be one of: low, medium, high",
    "description": "required",
    "due_date": "must be ISO 8601 date, received natural language"
  }
}

This trace tells you the problem is not random. The schema the model saw asks for too little, the tool description allows vague fields such as customer and due, and the prompt likely hides the source of valid customer IDs.
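
A minimal sketch of how you might assemble a trace record like the one above. The function name `build_trace` and the `uuid`-based trace ID are assumptions for illustration, not the API of any particular tracing library.

```python
import json
import uuid


def build_trace(model, prompt_version, tool_name, arguments, validation_error):
    """Bundle a tool call and its validation errors into one trace record."""
    return {
        "trace_id": f"trc_{uuid.uuid4().hex[:4]}",  # hypothetical ID format
        "model": model,
        "prompt_version": prompt_version,
        "tool_call": {"name": tool_name, "arguments": arguments},
        "validation_error": validation_error,
    }


trace = build_trace(
    model="gpt-4.1",
    prompt_version="ticket-router-v12",
    tool_name="create_ticket",
    arguments={"customer": "Acme", "priority": "urgent"},
    validation_error={"customer_id": "required"},
)
print(json.dumps(trace, indent=2))
```

Storing the raw arguments next to the validator output lets you see later whether the model guessed, lacked data, or followed a weak contract.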

Common mistake 1: vague tool descriptions

A tool description like this is too vague:

{
  "name": "create_ticket",
  "description": "Creates a support ticket for a customer."
}

The model does not know which fields matter, where IDs come from, what date format to use, or when it should ask a follow-up question.

Use a description that explains the tool boundary and failure behavior:

{
  "name": "create_ticket",
  "description": "Create a support ticket only when customer_id, priority, description, and due_date are known. Use customer_id from the provided customer lookup context. Do not guess IDs. If the customer_id is missing, ask a follow-up question or call lookup_customer first. due_date must be an ISO 8601 date in YYYY-MM-DD format."
}

Good tool descriptions should answer four questions:

When should the model call this tool?
When should the model avoid this tool?
Where should required values come from?
What exact format should each argument use?

Common mistake 2: overly broad tools

Broad tools produce broad arguments. A tool like update_account often causes the model to send vague actions such as "fix billing" or "change plan".

{
  "name": "update_account",
  "description": "Updates a customer account.",
  "parameters": {
    "type": "object",
    "properties": {
      "customer_id": { "type": "string" },
      "update": { "type": "string" }
    },
    "required": ["customer_id", "update"]
  }
}

This makes your backend parse intent from a free-text string. That is fragile.

Split broad tools into narrower tools when the action has different required fields or safety rules:

change_subscription_plan
update_billing_email
pause_subscription
create_refund_request

Narrow tools reduce argument ambiguity. They also make evals easier because each tool has a clearer expected payload.
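
As a rough sketch, two of the narrow tools above could be defined like this. The helper `narrow_tool`, the `new_plan` and `billing_email` field names, and the plan values are hypothetical and only show the shape of the idea.

```python
def narrow_tool(name, description, properties):
    """Build a tool definition where every property is required and extras are rejected."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "additionalProperties": False,
            "properties": properties,
            "required": sorted(properties),
        },
    }


CUSTOMER_ID = {"type": "string", "pattern": "^cus_[0-9]+$"}

TOOLS = [
    narrow_tool(
        "change_subscription_plan",
        "Move a known customer to a different plan. Do not guess customer_id.",
        {
            "customer_id": CUSTOMER_ID,
            # Hypothetical plan names; use the values your billing system accepts.
            "new_plan": {"type": "string", "enum": ["free", "pro", "enterprise"]},
        },
    ),
    narrow_tool(
        "update_billing_email",
        "Replace the billing email for a known customer.",
        {
            "customer_id": CUSTOMER_ID,
            "billing_email": {"type": "string", "format": "email"},
        },
    ),
]
```

Each tool now has a small, fixed payload, so there is no free-text update string for the backend to interpret.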

Common mistake 3: hidden required context

If a tool requires customer_id, account_id, workspace_id, or region, the model needs access to that value before it calls the tool.

Do not expect the model to infer stable IDs from names unless you gave it a lookup table or a lookup tool.

Weak context:

User: Create a high priority billing ticket for Acme.

Better context:

Known customers:
- Acme Corp: customer_id=cus_1042, region=us-east-1
- Acme Supplies: customer_id=cus_8831, region=eu-west-1

If the user says "Acme" and the target customer is ambiguous, ask a follow-up question before creating a ticket.

If your app already knows the active workspace or authenticated user, pass those values as structured context. Do not bury them in a long natural language paragraph.
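
A minimal sketch of resolving a customer mention against structured context before any tool call. The function `resolve_customer` and its return shape are assumptions for illustration; the customer records mirror the example above.

```python
KNOWN_CUSTOMERS = [
    {"name": "Acme Corp", "customer_id": "cus_1042", "region": "us-east-1"},
    {"name": "Acme Supplies", "customer_id": "cus_8831", "region": "eu-west-1"},
]


def resolve_customer(mention, customers):
    """Return ("ok", customer_id), ("ambiguous", names), or ("unknown", [])."""
    needle = mention.strip().lower()
    # An exact name match wins over partial matches.
    exact = [c for c in customers if c["name"].lower() == needle]
    if len(exact) == 1:
        return "ok", exact[0]["customer_id"]
    matches = [c for c in customers if needle in c["name"].lower()]
    if len(matches) == 1:
        return "ok", matches[0]["customer_id"]
    if matches:
        # More than one candidate: the caller should ask a follow-up question.
        return "ambiguous", [c["name"] for c in matches]
    return "unknown", []


print(resolve_customer("Acme", KNOWN_CUSTOMERS))
print(resolve_customer("Acme Corp", KNOWN_CUSTOMERS))
```

The ambiguous result is the signal to ask the user which account they mean instead of picking the first match.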

Common mistake 4: accepting invalid JSON silently

Silent repair creates hidden production bugs. If your parser accepts invalid JSON, coerces types, drops unknown fields, or fills defaults without logging, you lose the signal you need to fix the system.

Reject invalid payloads clearly. Return validation errors to the orchestration layer. Capture them in your trace data.

For example, this should fail validation:

{
  "customer_id": "cus_1042",
  "priority": "urgent",
  "description": "Customer says invoice total is wrong.",
  "due_date": "tomorrow"
}

Expected validation result:

{
  "valid": false,
  "errors": [
    {
      "path": "priority",
      "message": "Expected one of: low, medium, high"
    },
    {
      "path": "due_date",
      "message": "Expected YYYY-MM-DD"
    }
  ]
}

After validation fails, you have two safe options:

Ask the model to retry with the validation error and the original context.
Ask the user for missing information if the model cannot recover without guessing.
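
Here is a minimal validator sketch that produces the result shape shown above. It uses only the standard library. The function name `validate_create_ticket` and the "Required" and "Unknown field" messages are assumptions; in practice you may prefer a schema validation library.

```python
import re

ALLOWED_FIELDS = {"customer_id", "priority", "description", "due_date"}
PRIORITIES = ("low", "medium", "high")


def validate_create_ticket(args):
    """Check a create_ticket payload and report every problem, never repair it."""
    errors = []
    for field in sorted(ALLOWED_FIELDS - set(args)):
        errors.append({"path": field, "message": "Required"})
    for field in sorted(set(args) - ALLOWED_FIELDS):
        errors.append({"path": field, "message": "Unknown field"})
    if "priority" in args and args["priority"] not in PRIORITIES:
        errors.append({"path": "priority", "message": "Expected one of: low, medium, high"})
    # Shape check only; calendar validity would need a separate check.
    if "due_date" in args and not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", str(args["due_date"])):
        errors.append({"path": "due_date", "message": "Expected YYYY-MM-DD"})
    return {"valid": not errors, "errors": errors}


result = validate_create_ticket({
    "customer_id": "cus_1042",
    "priority": "urgent",
    "description": "Customer says invoice total is wrong.",
    "due_date": "tomorrow",
})
print(result)
```

The validator returns every error at once, which gives the retry step or the user a complete picture in a single round trip.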

Fix the schema before you tune the prompt

A weak schema forces the model to guess. Make the schema specific before you spend time on prompt wording.

Before: loose schema

{
  "name": "create_ticket",
  "parameters": {
    "type": "object",
    "properties": {
      "customer": {
        "type": "string"
      },
      "priority": {
        "type": "string"
      },
      "issue": {
        "type": "string"
      },
      "due": {
        "type": "string"
      }
    },
    "required": ["customer", "priority", "issue"]
  }
}

This schema allows natural language dates, customer names, vague issue text, and invented priority values. It also leaves due out of the required list, so the model can omit it entirely.

After: stricter schema

{
  "name": "create_ticket",
  "description": "Create a support ticket for a known customer. Use this only after customer_id is known. Do not guess customer_id.",
  "parameters": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "Stable customer ID from lookup_customer or provided context. Example: cus_1042",
        "pattern": "^cus_[0-9]+$"
      },
      "priority": {
        "type": "string",
        "enum": ["low", "medium", "high"],
        "description": "Use high only for outages, billing blocks, security issues, or explicit user urgency."
      },
      "description": {
        "type": "string",
        "minLength": 20,
        "description": "Clear support ticket description with the user-reported problem and relevant details."
      },
      "due_date": {
        "type": "string",
        "format": "date",
        "description": "Date in YYYY-MM-DD format. If the user gives a relative date, resolve it using the current date from context."
      }
    },
    "required": ["customer_id", "priority", "description", "due_date"]
  }
}

The stricter schema does three useful things:

It forces stable identifiers instead of names.
It restricts priority to valid values.
It blocks extra fields that your backend does not expect.
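
One caveat: in recent JSON Schema drafts, format is an annotation by default, so some validators will not reject "tomorrow" for due_date unless format checking is enabled. A minimal sketch of an explicit date check you could run alongside schema validation follows. The function name `is_iso_date` is an assumption for illustration.

```python
import re
from datetime import date

_DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value):
    """Return True only for real calendar dates written as YYYY-MM-DD."""
    if not isinstance(value, str) or not _DATE_SHAPE.fullmatch(value):
        return False
    year, month, day = (int(part) for part in value.split("-"))
    try:
        date(year, month, day)  # raises ValueError for dates like 2025-02-30
    except ValueError:
        return False
    return True


print(is_iso_date("2025-02-11"))
print(is_iso_date("tomorrow"))
```

Checking the shape first keeps other ISO 8601 spellings, such as compact dates without hyphens, from slipping through.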

Add a repair loop for recoverable errors

Validation should happen before the tool touches your backend. If validation fails, send a compact error back to the model and ask it to produce a corrected call.

Example retry instruction:

The previous tool call failed validation.

Errors:
- priority must be one of: low, medium, high
- due_date must use YYYY-MM-DD

Original user request:
"Create an urgent billing ticket for Acme due tomorrow."

Context:
- current_date: 2025-02-10
- Acme Corp customer_id: cus_1042

Return a corrected create_ticket tool call. Do not invent fields.

Expected corrected payload:

{
  "customer_id": "cus_1042",
  "priority": "high",
  "description": "Customer reported a billing problem and asked for urgent support.",
  "due_date": "2025-02-11"
}

Keep this retry loop bounded. One retry is often enough. Two retries may be acceptable for low-risk workflows. For financial, legal, medical, or account-changing actions, fail closed and ask the user to confirm missing or ambiguous details.
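
A minimal sketch of a bounded retry loop. The names `call_with_retry`, `fake_model`, and `validate` are hypothetical; `fake_model` stands in for a real model call so the example runs on its own.

```python
def call_with_retry(request_call, validate, max_retries=1):
    """Request a tool call, feeding validation errors back at most max_retries times."""
    errors = []
    for attempt in range(max_retries + 1):
        # The first attempt gets an empty error list; later attempts get the previous errors.
        args = request_call(errors)
        errors = validate(args)
        if not errors:
            return {"status": "ok", "arguments": args, "attempts": attempt + 1}
    # Fail closed: hand the remaining errors to the user instead of guessing.
    return {"status": "needs_user_input", "errors": errors, "attempts": max_retries + 1}


def fake_model(errors):
    # Stand-in for a model call: corrects the priority once it sees an error.
    priority = "high" if errors else "urgent"
    return {"customer_id": "cus_1042", "priority": priority}


def validate(args):
    if args.get("priority") not in ("low", "medium", "high"):
        return ["priority must be one of: low, medium, high"]
    return []


result = call_with_retry(fake_model, validate, max_retries=1)
print(result["status"], result["attempts"])
```

With max_retries=0 the same call returns needs_user_input, which is the behavior you want for high-risk actions.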

Use prompt changes only after the contract is clear

Prompt wording still matters. Once your schema is strict and your validation is visible, update the prompt to teach the model how to behave around missing or ambiguous arguments.

Add rules like these:

If a required field is missing, do not call the tool.
If an ID is required and only a name is present, call the lookup tool first.
If two records match the user request, ask a follow-up question.
Use enum values exactly as defined in the schema.
Use the provided current date to resolve relative dates.
Do not add fields outside the schema.

Avoid relying on generic instructions such as “be accurate” or “make sure the JSON is valid.” They are too broad to debug. Use concrete rules tied to specific fields and failure cases.

Version every prompt and schema change

Changing a tool description or prompt without versioning makes regressions hard to explain. If tool argument quality improves today and breaks next week, you need to know what changed.

Track these as versioned artifacts:

System prompt
Developer prompt
Tool descriptions
JSON schemas
Context construction logic
Validation and retry behavior
Model name and settings

Do not compare results across untracked changes. You will waste time guessing whether the schema, prompt, context, or model caused the difference.
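
One lightweight way to tie these artifacts together is a single fingerprint over all of them. This is a sketch under assumptions: the function `config_fingerprint` and the keys in the example dictionary are hypothetical, and a prompt management tool may already do this for you.

```python
import hashlib
import json


def config_fingerprint(artifacts):
    """Hash a JSON-serializable bundle of prompts, schemas, and settings."""
    # Sorted keys make the hash independent of dictionary ordering.
    canonical = json.dumps(artifacts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


run_a = {"prompt_version": "ticket-router-v12", "schema_version": "v3", "model": "gpt-4.1"}
run_b = {"model": "gpt-4.1", "schema_version": "v3", "prompt_version": "ticket-router-v12"}
run_c = {"prompt_version": "ticket-router-v12", "schema_version": "v4", "model": "gpt-4.1"}

print(config_fingerprint(run_a) == config_fingerprint(run_b))
print(config_fingerprint(run_a) == config_fingerprint(run_c))
```

Attach the fingerprint to every trace and eval run so two results are only compared when you know what differed between them.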

Build evals that target bad arguments

Testing only happy paths gives you false confidence. Add eval cases that force the model to handle missing IDs, ambiguous names, invalid dates, enum boundaries, and user requests that should not call a tool.

Even a small eval set can catch many argument regressions. Start with 20 to 50 cases per important tool. Include real production failures as soon as you see them.

Example eval table

| Case | Input | Expected behavior | Before fix | After fix |
| --- | --- | --- | --- | --- |
| Known customer | Create a high priority billing ticket for Acme Corp due tomorrow. | Call `create_ticket` with `customer_id=cus_1042`, `priority=high`, ISO date. | Failed: used customer name and natural language date. | Passed |
| Ambiguous customer | Create a ticket for Acme. | Ask follow-up question because Acme Corp and Acme Supplies both match. | Failed: guessed `cus_1042`. | Passed |
| Invalid priority wording | Make this ticket super urgent. | Map to `high` only if policy allows it, otherwise ask for confirmation. | Failed: sent `urgent`. | Passed |
| Missing due date | Create a billing ticket for Acme Corp. | Ask for due date or use documented default if product policy allows it. | Failed: omitted `due_date`. | Passed |
| Unsupported action | Delete all unpaid invoices for Acme Corp. | Do not call `create_ticket`. Refuse or route to approved workflow. | Failed: created vague ticket. | Passed |
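
Cases like these can be stored as data and checked automatically. The sketch below is illustrative: the `EVAL_CASES` layout and the `check_case` function are assumptions, and `tool_call` is whatever your app recorded for the run, or None when the model asked a question or refused.

```python
EVAL_CASES = [
    {
        "name": "known_customer",
        "input": "Create a high priority billing ticket for Acme Corp due tomorrow.",
        "expect_tool": "create_ticket",
        "expect_args": {"customer_id": "cus_1042", "priority": "high"},
    },
    {
        "name": "ambiguous_customer",
        "input": "Create a ticket for Acme.",
        "expect_tool": None,  # the model should ask a follow-up question
    },
    {
        "name": "unsupported_action",
        "input": "Delete all unpaid invoices for Acme Corp.",
        "expect_tool": None,  # no tool call expected
    },
]


def check_case(case, tool_call):
    """Return True when the recorded behavior matches the expectation."""
    if case["expect_tool"] is None:
        return tool_call is None
    if tool_call is None or tool_call["name"] != case["expect_tool"]:
        return False
    expected = case.get("expect_args", {})
    # Only the listed arguments are compared; other fields are left to schema validation.
    return all(tool_call["arguments"].get(k) == v for k, v in expected.items())


guessed = {"name": "create_ticket", "arguments": {"customer_id": "cus_1042"}}
print(check_case(EVAL_CASES[1], guessed))
```

Treating "no tool call" as an explicit expected outcome is what lets the ambiguous and unsupported cases fail when the model guesses.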

Track both exact-match and behavior-level metrics:

Valid argument rate: percentage of tool calls that pass schema validation.
Correct tool rate: percentage of cases where the model chose the expected tool or no tool.
Required field completion: percentage of calls with all required fields present.
Invalid enum rate: percentage of calls with invented enum values.
Unsafe guess rate: percentage of calls where the model guessed IDs or ambiguous fields.

For example, a useful before-and-after result might look like this:

| Metric | Before | After schema and prompt fix |
| --- | --- | --- |
| Valid argument rate | 71% | 96% |
| Correct tool rate | 84% | 93% |
| Invalid enum rate | 12% | 1% |
| Unsafe guess rate | 9% | 2% |
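
A minimal sketch of computing these rates from per-case eval results. The result keys (`called_tool`, `valid`, `tool_correct`, `all_required_present`, `invalid_enum`, `unsafe_guess`) are hypothetical labels you would record for each case, and the sample data is made up for the example.

```python
def argument_metrics(results):
    """Aggregate per-case eval results into percentage metrics."""
    calls = [r for r in results if r["called_tool"]]

    def pct(count, total):
        return round(100 * count / total, 1) if total else 0.0

    return {
        # Rates about arguments are measured over cases where a tool was called.
        "valid_argument_rate": pct(sum(r["valid"] for r in calls), len(calls)),
        "required_field_completion": pct(sum(r["all_required_present"] for r in calls), len(calls)),
        "invalid_enum_rate": pct(sum(r["invalid_enum"] for r in calls), len(calls)),
        "unsafe_guess_rate": pct(sum(r["unsafe_guess"] for r in calls), len(calls)),
        # Tool choice is measured over all cases, including no-tool cases.
        "correct_tool_rate": pct(sum(r["tool_correct"] for r in results), len(results)),
    }


sample = [
    {"called_tool": True, "valid": True, "all_required_present": True,
     "invalid_enum": False, "unsafe_guess": False, "tool_correct": True},
    {"called_tool": True, "valid": False, "all_required_present": True,
     "invalid_enum": True, "unsafe_guess": False, "tool_correct": True},
    {"called_tool": False, "tool_correct": True},
    {"called_tool": True, "valid": True, "all_required_present": True,
     "invalid_enum": False, "unsafe_guess": True, "tool_correct": False},
]
metrics = argument_metrics(sample)
print(metrics)
```

Run the same function over the eval set before and after a change so both columns are computed the same way.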

Use production traces to expand your test set

Your eval set should not stay static. Every bad argument in production is a candidate test case.

When a tool call fails, save the full debugging bundle:

User input
Prompt version
Tool schema version
Retrieved or injected context
Raw tool call arguments
Validation error
Final user-visible response

Turn repeated failures into regression tests. If the model keeps sending "urgent" instead of "high", add enum-specific eval cases. If it keeps guessing customer IDs, add ambiguity cases.

A practical fix workflow

Use this order when you see bad tool arguments:

1. Capture the failure: trace the prompt, context, tool call, schema, and validation error.
2. Classify the issue: missing field, wrong type, invalid enum, ambiguous ID, invalid JSON, or wrong tool.
3. Tighten the schema: add required fields, enums, formats, patterns, descriptions, and additionalProperties: false.
4. Fix context: pass stable IDs, current date, user scope, permissions, and lookup results in a structured form.
5. Improve tool descriptions: state when to call the tool, when not to call it, and how to handle missing data.
6. Add validation: reject bad payloads before backend execution.
7. Add a bounded retry: allow the model to correct recoverable errors using the validation message.
8. Add eval cases: include the failure and nearby edge cases.
9. Add versioning: record prompt, schema, model, and context changes together.
10. Compare results: run the same eval set before and after the fix.
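
The classification step can be partly automated. The sketch below covers only some of the categories (wrong tool, invalid JSON, wrong type, missing field, invalid enum); ambiguous IDs usually need context to detect. The function `classify_failure` and its parameters are assumptions for illustration.

```python
import json


def classify_failure(raw_arguments, required, enums, expected_tool, actual_tool):
    """Assign a failed tool call to one coarse category."""
    if actual_tool != expected_tool:
        return "wrong_tool"
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(args, dict):
        return "wrong_type"
    if any(field not in args for field in required):
        return "missing_field"
    for field, allowed in enums.items():
        if field in args and args[field] not in allowed:
            return "invalid_enum"
    return "unclassified"


REQUIRED = ["customer_id", "priority"]
ENUMS = {"priority": ("low", "medium", "high")}

print(classify_failure('{"customer_id": "cus_1042", "priority": "urgent"}',
                       REQUIRED, ENUMS, "create_ticket", "create_ticket"))
```

Counting failures per category tells you whether to spend time on the schema, the context, or the tool description first.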

Do not hide bad arguments behind backend defaults

Backend defaults can be useful, but they can also hide model failures. If your API defaults missing priority to medium, you may never notice that the model failed to classify urgency. If your backend accepts customer names and picks the first match, you may create tickets on the wrong account.

Use defaults only when they are part of the product contract. If a default affects user data, billing, permissions, or external actions, require the model or user to provide the value explicitly.
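
A minimal sketch of applying only documented defaults and reporting which ones were used, so the trace shows what the model left out. The function `apply_contract_defaults` and the `channel` default are hypothetical.

```python
# Hypothetical: the only default the product contract documents.
CONTRACT_DEFAULTS = {"channel": "email"}


def apply_contract_defaults(args, contract_defaults):
    """Fill documented defaults only and return the list of fields that were filled."""
    filled = dict(args)  # copy so the raw model output stays untouched for tracing
    applied = []
    for field, value in contract_defaults.items():
        if field not in filled:
            filled[field] = value
            applied.append(field)
    return filled, applied


raw = {"customer_id": "cus_1042"}
filled, applied = apply_contract_defaults(raw, CONTRACT_DEFAULTS)
print(filled, applied)
```

Because priority has no documented default here, it stays missing and validation still rejects the call instead of silently recording medium.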

Production checklist

Before you trust an LLM tool in production, make sure you can answer yes to these questions:

Does every tool have a specific description?
Are tools narrow enough that arguments are easy to validate?
Are all required IDs and context values available before the tool call?
Does the schema reject unknown fields?
Are enums explicit?
Are dates, IDs, currency, and quantities constrained?
Does validation run before backend execution?
Are validation errors captured in traces?
Does the system avoid guessing when required context is missing?
Do evals include failure cases, ambiguous cases, and no-tool cases?
Are prompt and schema changes versioned?

The bottom line

Bad tool arguments are usually a systems problem, not a model personality problem. Fix the contract first. Make the schema strict, make context explicit, validate every payload, trace every failure, and test the edge cases that break real workflows.

Once you do that, prompt edits become targeted. You can see which rule failed, which version introduced the regression, and whether the fix improved production behavior.

PromptLayer helps AI teams trace tool calls, version prompts and schemas, run evals, and compare changes before they reach production. If you are building LLM-powered tools or agents, create a PromptLayer account to start tracking and improving your tool argument quality.
