Building an Anthropic Agent Loop: Key Steps and Common Pitfalls

An Anthropic agent loop is the runtime pattern that lets Claude reason, call tools, read tool results, and continue until it can produce a final answer. The loop is simple in shape, but production reliability depends on tight tool schemas, clear stop conditions, visible state, and strong evaluation.

If you are building with Claude, the basic loop looks like this:

1. Send Claude a user task, a system prompt, and a list of tools.
2. Claude either returns a final answer or requests a tool call.
3. Your application validates and executes the tool call.
4. Your application sends the tool result back to Claude.
5. The loop repeats until Claude returns a final answer or your guardrails stop execution.

This pattern powers research agents, support agents, coding workflows, internal copilots, and task automation systems. It also creates new failure modes. A weak loop can run forever, call unsafe tools, hide broken state, or let the model fabricate data that should have come from a tool.

The minimum Anthropic agent loop

At runtime, your application owns the loop. Claude decides when it wants to use a tool, but your code decides whether the tool call is allowed, how it runs, what result gets returned, and when the loop stops.

A useful loop diagram should show these steps clearly:

User request: The original task or instruction.
Claude call: The messages, system prompt, and tool definitions sent to Anthropic.
Tool decision: Whether Claude returns text or a tool_use block.
Tool executor: Your application code that validates and runs the tool.
Tool result: The structured result sent back to Claude.
Stop condition: Final answer, max turns, timeout, budget cap, or policy failure.

For a post, docs page, or internal design review, include a screenshot or diagram of this loop. It helps engineers see where model behavior ends and application control begins.

Define the agent goal in plain, testable terms

A common mistake is giving the agent a vague goal such as “help the user solve their problem.” That gives you no clean way to test success or failure.

Use a goal that names the task, allowed actions, and completion criteria:

You are a support triage agent.

Goal:
Classify the incoming support request, search the customer account record, and draft a response for a support rep.

You may:
- Look up customer account metadata.
- Search existing support tickets.
- Draft a suggested reply.

You must not:
- Send messages to customers.
- Change account settings.
- Guess account data if tools fail.

You are done when:
- You have assigned one category.
- You have included supporting evidence.
- You have drafted a response for review.

This kind of prompt is easier to evaluate than a broad assistant prompt. It also keeps your loop safer because Claude has a narrower job.

Keep the system prompt small

Overloaded system prompts are one of the fastest ways to make an agent brittle. Teams often pack policies, examples, tool docs, formatting rules, business logic, and fallback behavior into one long prompt. The model then misses key instructions or follows the wrong one at the wrong time.

Use the system prompt for durable rules:

The agent’s role.
The task boundary.
Safety constraints.
Rules for tool use.
Output format requirements.

Move volatile context into user messages, retrieval results, or tool outputs. Put complex business logic in code when possible. If a rule must be exact, enforce it outside the model.
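
As a minimal sketch of enforcing an exact rule outside the model, the check below validates the agent's final JSON in application code instead of relying on the prompt alone. The helper name validate_final_output and the category list are assumptions made for illustration; the three required keys mirror the output format this article's system prompt asks for.

```python
import json

# Hypothetical category list; replace it with your own taxonomy.
ALLOWED_CATEGORIES = {"billing", "technical", "account", "other"}
REQUIRED_KEYS = {"category", "evidence", "draft_response"}


def validate_final_output(raw_text):
    """Parse the agent's final answer and enforce the output contract in code.

    Returns (parsed_dict, None) on success or (None, reason) on failure.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as error:
        return None, f"invalid_json: {error.msg}"

    if not isinstance(data, dict):
        return None, "not_an_object"

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return None, f"missing_keys: {sorted(missing)}"

    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None, f"unknown_category: {category}"

    return data, None
```

A failed check can be logged and turned into a structured failure, so a malformed answer never reaches the support rep as if it were valid.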

Design tool schemas Claude can use correctly

Tool schemas should be narrow, typed, and hard to misuse. Avoid generic tools such as run_action with a freeform action_name. Give Claude explicit tools with strict input schemas.

Example Anthropic tool schema:

{
  "name": "search_support_tickets",
  "description": "Search prior support tickets for a customer by customer_id and short query.",
  "input_schema": {
    "type": "object",
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "The internal customer ID, such as cus_12345."
      },
      "query": {
        "type": "string",
        "description": "A short search query. Maximum 12 words."
      },
      "limit": {
        "type": "integer",
        "description": "Maximum number of tickets to return.",
        "minimum": 1,
        "maximum": 5
      }
    },
    "required": ["customer_id", "query", "limit"],
    "additionalProperties": false
  }
}

Include a screenshot or example of your tool schema in the article or internal docs. Engineers need to see the exact contract, not a summary.
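
The schema tells Claude what to send, but your application should still check what actually arrives. The sketch below is a hand-rolled validator for the search_support_tickets input above, written with the standard library only. The function name validate_search_input is an assumption, and in a real system you might prefer a JSON Schema validation library. Note that the 12-word limit lives only in the description text, so it has to be enforced in code.

```python
def validate_search_input(tool_input):
    """Return a list of problems; an empty list means the input is acceptable."""
    if not isinstance(tool_input, dict):
        return ["input must be an object"]

    problems = []
    allowed = {"customer_id", "query", "limit"}

    # Mirrors "additionalProperties": false.
    extra = set(tool_input) - allowed
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")

    # Mirrors the "required" list.
    for field in sorted(allowed - set(tool_input)):
        problems.append(f"missing required field: {field}")

    if "customer_id" in tool_input and not isinstance(tool_input["customer_id"], str):
        problems.append("customer_id must be a string")

    if "query" in tool_input:
        query = tool_input["query"]
        if not isinstance(query, str):
            problems.append("query must be a string")
        elif len(query.split()) > 12:
            problems.append("query must be at most 12 words")

    if "limit" in tool_input:
        limit = tool_input["limit"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(limit, bool) or not isinstance(limit, int):
            problems.append("limit must be an integer")
        elif not 1 <= limit <= 5:
            problems.append("limit must be between 1 and 5")

    return problems
```

If the list is non-empty, return it to Claude as an error tool result rather than running the tool with bad arguments.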

Build the loop in application code

The following Python example shows the core pattern. Treat it as a skeleton. In production, add retries, structured logging, auth checks, eval hooks, and tracing.

import os
import json
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Example model ID; replace it with a model currently available to your account.
MODEL = "claude-3-5-sonnet-20240620"
MAX_TURNS = 8

tools = [
    {
        "name": "search_support_tickets",
        "description": "Search prior support tickets for a customer by customer_id and short query.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "query": {"type": "string"},
                "limit": {"type": "integer", "minimum": 1, "maximum": 5}
            },
            "required": ["customer_id", "query", "limit"],
            "additionalProperties": False
        }
    }
]

system_prompt = """
You are a support triage agent.

Use tools when you need account or ticket data.
Do not invent tool results.
If a tool fails, say what failed and continue only if you have enough information.
Return a JSON object with: category, evidence, draft_response.
"""

def execute_tool(name, tool_input):
    if name == "search_support_tickets":
        return search_support_tickets(
            customer_id=tool_input["customer_id"],
            query=tool_input["query"],
            limit=tool_input["limit"]
        )

    raise ValueError(f"Tool not allowed: {name}")

def search_support_tickets(customer_id, query, limit):
    # Replace this with your database, API, or search call.
    return {
        "tickets": [
            {
                "id": "ticket_8841",
                "title": "Billing page shows stale invoice status",
                "status": "closed"
            }
        ]
    }

def run_agent(user_request):
    messages = [
        {
            "role": "user",
            "content": user_request
        }
    ]

    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL,
            max_tokens=1200,
            system=system_prompt,
            messages=messages,
            tools=tools
        )

        messages.append({
            "role": "assistant",
            "content": response.content
        })

        tool_uses = [
            block for block in response.content
            if block.type == "tool_use"
        ]

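        # No tool_use blocks: treat the response as the final answer.
        # In production, also check response.stop_reason (for example "max_tokens")
        # so a truncated reply is not mistaken for a finished one.
        if not tool_uses:
            return response.content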

        tool_results = []

        for tool_use in tool_uses:
            try:
                result = execute_tool(tool_use.name, tool_use.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": json.dumps(result)
                })
            except Exception as error:
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "is_error": True,
                    "content": str(error)
                })

        messages.append({
            "role": "user",
            "content": tool_results
        })

    raise RuntimeError("Agent stopped: max turns reached")

The important detail is that the application returns tool results using the matching tool_use_id. Claude should never be asked to make up the result of a tool call. Your code must run the tool and return the result.
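
One practical note on the skeleton: run_agent returns response.content, which is a list of content blocks rather than a string. A small helper along these lines can collect the text before you parse it. This is a sketch; final_text is a hypothetical name, and the stand-in objects below only imitate the type and text attributes of the SDK's text blocks so the example runs without an API call.

```python
from types import SimpleNamespace


def final_text(content_blocks):
    """Join the text of every block whose type is "text", skipping other blocks."""
    return "".join(
        block.text
        for block in content_blocks
        if getattr(block, "type", None) == "text"
    )


# Stand-in blocks for demonstration only; real blocks come from response.content.
example_content = [
    SimpleNamespace(type="text", text='{"category": "billing", '),
    SimpleNamespace(type="text", text='"evidence": "ticket_8841", "draft_response": "..."}'),
]
answer = final_text(example_content)
```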

If you already use Claude in your stack, PromptLayer’s Anthropic integration can help you capture requests, responses, prompt versions, and traces around these loops.

Add hard stop conditions

Missing stop conditions turn small bugs into expensive incidents. A production loop should stop for more than one reason.

Use at least these guards:

Max turns: For example, stop after 8 model calls.
Wall-clock timeout: For example, stop after 30 seconds.
Tool budget: For example, allow no more than 5 search calls.
Cost budget: Stop when estimated token cost exceeds a request limit.
Repeated tool call detection: Stop if Claude calls the same tool with the same input 3 times.
Policy failure: Stop if the requested action requires a permission the user does not have.

When the loop stops early, return a structured failure reason. Do not hide it behind a generic assistant response.

{
  "status": "stopped",
  "reason": "repeated_tool_call",
  "tool": "search_support_tickets",
  "input_hash": "a91c4e",
  "turn": 6
}
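
The guards above can be sketched as one small object that the loop consults before each model call and each tool call. The LoopGuard class, its method names, and its default limits are assumptions for illustration, not part of any SDK. Cost budgeting is left out because pricing depends on your model and account.

```python
import hashlib
import json
import time


class LoopGuard:
    """Tracks turns, elapsed time, tool calls, and repeated inputs for one run."""

    def __init__(self, max_turns=8, timeout_s=30.0, max_tool_calls=5,
                 max_repeats=3, clock=time.monotonic):
        self.max_turns = max_turns
        self.timeout_s = timeout_s
        self.max_tool_calls = max_tool_calls
        self.max_repeats = max_repeats
        self.clock = clock  # injectable so tests can fake time
        self.started_at = clock()
        self.turn = 0
        self.tool_calls = 0
        self.seen = {}  # (tool name, input hash) -> number of requests

    def _stop(self, reason, **extra):
        return {"status": "stopped", "reason": reason, "turn": self.turn, **extra}

    def before_model_call(self):
        """Call before each model request. Returns a stop dict or None."""
        if self.turn >= self.max_turns:
            return self._stop("max_turns")
        if self.clock() - self.started_at > self.timeout_s:
            return self._stop("timeout")
        self.turn += 1
        return None

    def before_tool_call(self, name, tool_input):
        """Call before each tool execution. Returns a stop dict or None."""
        if self.tool_calls >= self.max_tool_calls:
            return self._stop("tool_budget", tool=name)
        # Hash a canonical form of the input so key order does not matter.
        digest = hashlib.sha256(
            json.dumps(tool_input, sort_keys=True).encode("utf-8")
        ).hexdigest()[:6]
        key = (name, digest)
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] >= self.max_repeats:
            return self._stop("repeated_tool_call", tool=name, input_hash=digest)
        self.tool_calls += 1
        return None
```

When either method returns a dict, the loop should stop and return that dict to the caller instead of making the next call.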

Lock down tool permissions

Unsafe tool permissions are a high-risk agent bug. If Claude can call tools that send emails, issue refunds, delete data, or modify permissions, your application needs approval gates and scoped access.

Separate tools into risk levels:

Read-only tools: Search tickets, fetch account metadata, retrieve docs.
Draft tools: Create a proposed email, generate a report, prepare a patch.
Write tools: Send email, update billing, change user settings.
Destructive tools: Delete records, revoke access, cancel accounts.

For most agents, start with read-only and draft tools. Add write tools only after you have traces, evals, and approval flows. The model should request an action, but your application should decide whether that action can run.
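
One way to express these risk levels in code is a registry that maps each tool to a level, plus a single authorization function that the executor calls before running anything. The sketch below assumes hypothetical tool names (draft_reply, send_email, delete_account) and a simple approved_by argument standing in for a real approval flow.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = 1
    DRAFT = 2
    WRITE = 3
    DESTRUCTIVE = 4


# Hypothetical registry: every tool the agent may request, with its risk level.
TOOL_RISK = {
    "search_support_tickets": Risk.READ_ONLY,
    "draft_reply": Risk.DRAFT,
    "send_email": Risk.WRITE,
    "delete_account": Risk.DESTRUCTIVE,
}


def authorize(tool_name, approved_by=None):
    """Decide whether a requested tool call may run.

    Returns (allowed, reason). Tools missing from the registry are denied.
    """
    risk = TOOL_RISK.get(tool_name)
    if risk is None:
        return False, "tool_not_allowlisted"
    if risk in (Risk.READ_ONLY, Risk.DRAFT):
        return True, "auto_approved"
    # Write and destructive tools need an explicit human approval.
    if approved_by is None:
        return False, "human_approval_required"
    return True, f"approved_by:{approved_by}"
```

Denying unknown tools by default means a newly added tool cannot run until someone has assigned it a risk level.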

Make state visible and debuggable

Hidden state makes agent bugs hard to reproduce. If the loop stores memory, retrieved context, intermediate plans, selected tools, or transformed inputs, record them in a trace.

A useful trace view should include:

The system prompt version.
The full message list sent to Claude.
Tool schemas available on each turn.
Each tool_use request.
Validated tool inputs.
Raw tool results.
Final output.
Stop reason.
Latency, token counts, and estimated cost.

Include a screenshot of a trace view when documenting the agent. The best trace views let an engineer answer: “What did Claude know at this turn, what did it ask to do, what did our code return, and why did the loop continue?”
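
A trace does not need a special platform to get started. The sketch below records the fields listed above in plain dataclasses and serializes them to JSON. The class and field names are assumptions, and the code expects messages and tool requests to be converted to plain dicts first, because SDK response objects are not directly JSON-serializable.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TurnTrace:
    turn: int
    messages_sent: list                      # full message list for this call
    tool_requests: list = field(default_factory=list)   # each tool_use as a dict
    tool_results: list = field(default_factory=list)    # raw results returned
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0


@dataclass
class RunTrace:
    system_prompt_version: str
    tool_names: list
    turns: list = field(default_factory=list)
    final_output: str = ""
    stop_reason: str = ""

    def to_json(self):
        # asdict recurses into the nested TurnTrace entries.
        return json.dumps(asdict(self), indent=2)


# Usage: record one turn, then serialize the run.
trace = RunTrace(system_prompt_version="v3", tool_names=["search_support_tickets"])
trace.turns.append(TurnTrace(
    turn=1,
    messages_sent=[{"role": "user", "content": "Customer cus_12345 reports a stale invoice."}],
    tool_requests=[{"name": "search_support_tickets", "input": {"limit": 1}}],
))
trace.stop_reason = "max_turns"
trace_json = trace.to_json()
```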

For larger systems, read more about AI agent orchestration. The same tracing and control concerns become more important as you add routers, planners, workers, and evaluators.

Handle failed tool calls as first-class events

Tool failures are normal. APIs time out. Search returns no results. Auth fails. Inputs are invalid. Your loop should make these failures visible to Claude and to your logs.

Use structured tool errors:

{
  "error_type": "not_found",
  "message": "No customer found for customer_id cus_99999",
  "retryable": false
}

Then tell Claude how to behave when tools fail:

If a tool returns an error:
- Do not invent missing data.
- Explain which dependency failed.
- Ask for missing required input if the user can provide it.
- Stop if the task cannot be completed safely.
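
On the application side, a wrapper like the following sketch can turn any tool failure into the structured shape shown above and still return a well-formed tool_result block. ToolError, run_tool_safely, and lookup_customer are hypothetical names used for illustration.

```python
import json


class ToolError(Exception):
    """Raised by tool code to report a failure in a structured way."""

    def __init__(self, error_type, message, retryable=False):
        super().__init__(message)
        self.error_type = error_type
        self.message = message
        self.retryable = retryable


def run_tool_safely(tool_use_id, tool_fn, tool_input):
    """Run a tool and always return a tool_result block, even on failure."""
    try:
        result = tool_fn(**tool_input)
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": json.dumps(result)}
    except ToolError as error:
        payload = {"error_type": error.error_type, "message": error.message,
                   "retryable": error.retryable}
    except TimeoutError as error:
        payload = {"error_type": "timeout", "message": str(error) or "tool timed out",
                   "retryable": True}
    except Exception as error:
        # Keep internals out of the model's context; log the full exception elsewhere.
        payload = {"error_type": "internal_error", "message": type(error).__name__,
                   "retryable": False}
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "is_error": True, "content": json.dumps(payload)}


def lookup_customer(customer_id):
    # Placeholder tool: a real version would query your account store.
    if customer_id == "cus_99999":
        raise ToolError("not_found", "No customer found for customer_id cus_99999")
    return {"customer_id": customer_id, "plan": "pro"}
```

Because every path returns a tool_result with the matching tool_use_id, Claude always sees what happened and the loop never has to skip a pending tool call.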

Add a failed run analysis example to your article or internal docs. Show the original request, the tool failure, the model’s next action, and the prompt or schema change that fixed the issue.

Do not evaluate only happy paths

An agent that passes five clean demos is not production-ready. You need eval cases that attack the loop, tool use, and stop behavior.

Create a small eval set with cases like these:

Normal path: The user provides all required IDs and the tools return valid data.
Missing input: The customer ID is absent.
Tool no-results: Search returns an empty list.
Tool timeout: One tool call fails with a retryable error.
Permission boundary: The user asks the agent to perform a write action it cannot perform.
Prompt injection: Retrieved content tells the model to ignore prior instructions.
Loop trap: A task encourages repeated searches without new information.
Bad tool argument: The model tries to pass an invalid enum or extra field.

Score both final outputs and intermediate behavior. For example:

Did Claude call the correct tool?
Were tool arguments valid?
Did the agent stop within the turn limit?
Did it avoid inventing tool data?
Did it return a useful failure when blocked?
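
Several of these questions can be answered mechanically from a recorded run. The sketch below scores one run; the shape of the trace dict (tool_calls, turns, cited_ticket_ids) and the function name score_run are assumptions, so adapt them to whatever your tracing layer actually records.

```python
def score_run(trace, expected_tool, max_turns, known_ticket_ids):
    """Score intermediate behavior from a recorded run.

    `trace` is a hypothetical dict with keys:
      tool_calls: list of {"name": str, "input_valid": bool}
      turns: int, the number of model calls made
      cited_ticket_ids: list of ticket IDs mentioned in the final answer
    """
    tool_calls = trace["tool_calls"]
    return {
        "called_expected_tool": any(c["name"] == expected_tool for c in tool_calls),
        "all_arguments_valid": all(c["input_valid"] for c in tool_calls),
        "within_turn_limit": trace["turns"] <= max_turns,
        # Any cited ID that no tool returned counts as invented data.
        "no_invented_ids": set(trace["cited_ticket_ids"]) <= set(known_ticket_ids),
    }


# Usage: a run that cites a ticket the tools never returned.
example_scores = score_run(
    {
        "tool_calls": [{"name": "search_support_tickets", "input_valid": True}],
        "turns": 3,
        "cited_ticket_ids": ["ticket_8841", "ticket_0000"],
    },
    expected_tool="search_support_tickets",
    max_turns=8,
    known_ticket_ids=["ticket_8841"],
)
```

Scoring each check separately makes it easier to see which behavior regressed after a prompt or schema change.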

Refine prompts with before and after examples

Prompt refinement works best when tied to failed runs. Do not rewrite the prompt because it “feels vague.” Start with a trace and identify the exact failure.

Before

You can use the tools to help answer the user. Be accurate and helpful.

Observed failure

Claude called search_support_tickets, received an empty result, then wrote: “The customer has had several billing issues in the past.” That claim was not supported by tool output.

After

Use tool results as the only source for customer-specific claims.

If a tool returns no results:
- Say that no matching records were found.
- Do not infer prior issues.
- Ask for another identifier or search term if needed.

Include a before and after prompt refinement screenshot when possible. It helps reviewers see that the change came from a real failure, not a general preference.

When to use multiple agents

Start with one loop unless you have a clear reason to split responsibilities. A single well-instrumented agent is easier to test than several agents passing messages around.

Consider multiple agents when you have distinct tasks with different prompts, tools, or evaluation criteria. For example, a support workflow might use one agent to classify the ticket, one to retrieve evidence, and one to draft the response. If you go this route, read about multi-agent systems and agent-to-agent communication before adding more moving parts.

Production checklist

Before shipping an Anthropic agent loop, check these items:

The agent goal is specific and testable.
The system prompt is short enough to review.
Each tool has a strict schema with additionalProperties disabled where possible.
Tool execution uses an allowlist.
Write and destructive tools require extra approval.
The loop has max turns, timeout, budget, and repeated-call guards.
Claude never invents tool outputs.
Tool failures are returned as structured results.
Every run has a trace with prompts, tool calls, tool results, and stop reason.
Evals cover failed tools, missing inputs, permission issues, and prompt injection.
Prompt changes are tied to failed runs and versioned.

The core loop is not complicated. The engineering work sits around it: state control, tool validation, traces, evals, and careful prompt iteration. If you build those pieces early, your Anthropic agent will be much easier to debug and safer to ship.

PromptLayer helps teams manage prompts, trace Anthropic agent loops, evaluate behavior, and debug failed runs before they reach users. Create an account at https://dashboard.promptlayer.com/create-account to start tracking and improving your agent workflows.
