How to Add Observability to AI Agents

Jun 05, 2026

Agent observability starts with one goal: when an agent produces a bad answer, stalls, burns too much budget, or calls the wrong tool, you should be able to replay the path that led there.

For a simple chat completion, logging the prompt, response, model, latency, and cost may be enough. For an agent, it is not. A single user request can include planning, multiple model calls, tool selection, tool arguments, retrieval, memory reads and writes, retries, validation, guardrails, and final response synthesis.

This tutorial walks through a practical implementation path for adding observability to LLM-powered agents. The examples assume you already know tool calling, traces, prompts, and production logging. The focus is on what to capture, where to capture it, and how to make the data useful for debugging and evaluation.

1. Define the agent run as your top-level trace

Start by treating each user request as one agent run. The run should have a stable ID that follows every planning step, LLM call, tool call, memory operation, retry, and final response.

A good top-level agent trace includes:

  • run_id: Unique ID for this agent execution.
  • user_id or account_id: Redacted or hashed if needed.
  • session_id: Useful for multi-turn agents.
  • agent_name: For example, support_refund_agent.
  • agent_version: Git SHA, deployment version, or prompt release version.
  • environment: production, staging, or dev.
  • input: The user request, after applying your privacy policy.
  • final_output: The response sent back to the user or system.
  • status: success, failed, timeout, cancelled, or needs_review.
  • total_latency_ms: Wall-clock runtime.
  • total_cost_usd: Total estimated model and tool cost.
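
The fields above map naturally onto a small record type. The following is a minimal sketch, not a PromptLayer API: the AgentRun class, the hash_user_id helper, and the initial "running" status are assumptions made for illustration.

```python
import hashlib
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


def hash_user_id(raw_user_id: str) -> str:
    """Hash the user ID so runs can be grouped without storing the raw value."""
    return hashlib.sha256(raw_user_id.encode("utf-8")).hexdigest()[:16]


@dataclass
class AgentRun:
    """Hypothetical top-level trace record for one agent execution."""
    agent_name: str
    agent_version: str
    environment: str                      # production, staging, or dev
    user_id: str                          # expected to be hashed or redacted already
    input: str                            # expected to have passed your privacy policy
    session_id: Optional[str] = None
    run_id: str = field(default_factory=lambda: f"run_{uuid.uuid4().hex}")
    status: str = "running"               # assumed placeholder until the run finishes
    final_output: Optional[str] = None
    total_latency_ms: Optional[int] = None
    total_cost_usd: float = 0.0
    _started: float = field(default_factory=time.monotonic, repr=False)

    def finish(self, status: str, final_output: Optional[str] = None) -> None:
        """Close the run and record wall-clock latency in milliseconds."""
        self.status = status
        self.final_output = final_output
        self.total_latency_ms = int((time.monotonic() - self._started) * 1000)
```

Generating run_id when the record is created means every later span can reference it, even if the run fails before producing output.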

If you use PromptLayer, this top-level run can sit above your logged LLM requests and prompt versions. PromptLayer’s observability tooling is useful here because agent failures usually require both trace context and prompt-level detail.

2. Break the agent run into spans

Once you have a top-level run, divide the execution into spans. A span is a timed unit of work inside the run. For agents, these spans should match the actual reasoning and system actions.

Use span types like these:

  • planner: The model call or code path that decides what to do next.
  • llm_call: Any direct model request.
  • tool_call: API calls, database queries, browser actions, code execution, or internal services.
  • memory_read: Vector search, user profile lookup, previous conversation retrieval.
  • memory_write: Saved preferences, task state, summaries, or extracted facts.
  • validation: JSON schema checks, policy checks, output parsers, or business rule checks.
  • retry: A repeated attempt after parsing failure, tool failure, timeout, or low-confidence output.
  • final_response: The last synthesis step before returning output.

A trace with these spans lets you answer concrete questions:

  • Did the agent choose the wrong plan?
  • Did the planner choose the right tool but pass bad arguments?
  • Did a tool return stale or incomplete data?
  • Did memory retrieval inject irrelevant context?
  • Did retries fix the issue or make latency worse?
  • Did the final response ignore valid tool results?

This structure works for static agents with fixed flows, dynamic agents that choose actions at runtime, and plan-and-execute systems. If your team is comparing patterns, it helps to understand the difference between static agents, dynamic agents, and plan-and-execute agents because each style creates different observability needs.

3. Use a consistent event schema

Do not let every service log agent events differently. Create a shared schema and enforce it in your agent framework, SDK wrapper, or internal middleware.

Here is a practical span schema you can adapt:

{
  "run_id": "run_123",
  "span_id": "span_456",
  "parent_span_id": "span_001",
  "span_type": "tool_call",
  "name": "lookup_order",
  "status": "success",
  "started_at": "2026-06-05T14:05:10.120Z",
  "ended_at": "2026-06-05T14:05:10.481Z",
  "latency_ms": 361,
  "input": {
    "order_id": "ord_789"
  },
  "output": {
    "status": "shipped",
    "delivery_date": "2026-06-08"
  },
  "metadata": {
    "agent_version": "2026-06-05.3",
    "environment": "production",
    "attempt": 1
  },
  "error": null
}

For LLM calls, add model-specific fields:

{
  "span_type": "llm_call",
  "name": "planner_step",
  "model": "gpt-4.1",
  "prompt_version": "refund-agent-planner:v17",
  "temperature": 0.2,
  "input_tokens": 1840,
  "output_tokens": 312,
  "cost_usd": 0.0124,
  "finish_reason": "tool_calls",
  "tool_calls_requested": ["lookup_order"],
  "status": "success"
}

Keep the schema boring and stable. You can add fields over time, but avoid frequent renames. Dashboards, alerts, eval datasets, and incident queries depend on these names.
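
One way to enforce a shared schema is to validate every span at the point where it is emitted. This is a rough sketch under assumptions: normalize_span and the two constant sets are hypothetical names, and deriving latency_ms from the timestamps is one possible design choice, not a requirement.

```python
from datetime import datetime, timedelta

REQUIRED_SPAN_FIELDS = {
    "run_id", "span_id", "span_type", "name", "status", "started_at", "ended_at",
}
ALLOWED_SPAN_TYPES = {
    "planner", "llm_call", "tool_call", "memory_read",
    "memory_write", "validation", "retry", "final_response",
}


def _parse_ts(value: str) -> datetime:
    # Replace the trailing "Z" so older Python versions can parse the timestamp.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def normalize_span(span: dict) -> dict:
    """Reject spans that break the shared schema and derive latency_ms."""
    missing = REQUIRED_SPAN_FIELDS - span.keys()
    if missing:
        raise ValueError(f"span is missing fields: {sorted(missing)}")
    if span["span_type"] not in ALLOWED_SPAN_TYPES:
        raise ValueError(f"unknown span_type: {span['span_type']!r}")
    elapsed = _parse_ts(span["ended_at"]) - _parse_ts(span["started_at"])
    # Integer division by one millisecond avoids float rounding surprises.
    return {**span, "latency_ms": elapsed // timedelta(milliseconds=1)}
```

Computing latency from the two timestamps keeps the three fields from drifting apart when different services fill them in.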

4. Instrument the planning step

The planner is often where agent behavior changes the most. If you only log tool calls and final responses, you will miss the decision that caused the failure.

For each planning step, capture:

  • Planner prompt: System prompt, developer instructions, selected context, and tool definitions.
  • Planner output: The plan, selected tool, arguments, or next action.
  • Available tools: The tool names and schemas visible to the model at that step.
  • Decision metadata: Confidence score if you compute one, selected policy, or routing result.
  • Step number: For example, step_index: 3.

For a customer support agent, a bad trace might show this:

  1. User asks for a refund on a delayed order.
  2. Planner sees tools: lookup_order, check_refund_policy, issue_refund.
  3. Planner calls issue_refund immediately.
  4. Tool rejects the call because the order is ineligible without a policy check.
  5. Agent retries twice and then gives a vague apology.

Without planner observability, this looks like a tool failure. With planner observability, you can see the real issue: the prompt or policy instructions did not force the agent to check refund eligibility before issuing a refund.

5. Capture tool calls with arguments, outputs, and safety controls

Tool calls are the action layer of your agent. They also create many production incidents. Log enough detail to debug behavior, but apply redaction before storing sensitive data.

For every tool call, capture:

  • Tool name: For example, search_docs or send_email.
  • Tool version: Important when APIs or schemas change.
  • Input arguments: Redacted where needed.
  • Raw output: Or a sanitized summary if raw output contains sensitive data.
  • Latency: Tool latency often dominates total agent latency.
  • Status code: HTTP status, database status, or internal error code.
  • Side effect flag: Whether the tool changed external state.
  • Idempotency key: Required for tools that send emails, create tickets, charge cards, or update records.

Separate read-only tools from side-effect tools. A failed search_docs call is usually safe to retry. A failed charge_customer call needs stricter handling because the external system may have processed the request even if your agent timed out.

For side-effect tools, log the confirmation response from the external system. If the agent calls send_refund_email, store the email provider’s message ID. If it creates a support ticket, store the ticket ID. These fields make incident review much easier.
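
The side-effect flag and idempotency key can be attached when the tool span is built. The sketch below assumes a hand-maintained SIDE_EFFECT_TOOLS set and a key derived from the run ID, tool name, and arguments; both are illustrative choices, and your external systems may expect a different key format.

```python
import hashlib
import json

# Assumed registry of tools that change external state.
SIDE_EFFECT_TOOLS = {"issue_refund", "send_refund_email", "charge_customer"}


def idempotency_key(run_id: str, tool_name: str, args: dict) -> str:
    """Derive a stable key so a retry inside the same run reuses it."""
    payload = json.dumps(
        {"run_id": run_id, "tool": tool_name, "args": args}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_tool_span(run_id: str, tool_name: str, args: dict, tool_version: str) -> dict:
    """Build the tool_call span fields that are known before the call runs."""
    side_effect = tool_name in SIDE_EFFECT_TOOLS
    return {
        "run_id": run_id,
        "span_type": "tool_call",
        "name": tool_name,
        "tool_version": tool_version,
        "input": args,  # redact before storing in a real system
        "side_effect": side_effect,
        "idempotency_key": (
            idempotency_key(run_id, tool_name, args) if side_effect else None
        ),
    }
```

Because the key is deterministic, a retried call with the same arguments carries the same key, which lets the external system deduplicate the request.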

6. Track memory reads and writes

Memory bugs can be hard to spot because the agent may behave logically based on bad or irrelevant context. You need observability for both memory retrieval and memory updates.

For memory reads, capture:

  • Query text: The search query or generated retrieval query.
  • Memory source: Conversation history, vector database, CRM, user profile, or task state.
  • Top results: IDs, scores, titles, and short snippets.
  • Selection rule: Top-k, threshold, recency filter, or reranker output.
  • Injected context: The exact memory content added to the prompt, with redaction.

For memory writes, capture:

  • Written content: The fact, summary, preference, or state update.
  • Write reason: Why the agent saved it.
  • Expiration policy: Permanent, session-only, or time-limited.
  • Source span: The model call or tool result that produced the memory.

Example: a travel booking agent remembers that a user prefers morning flights. That memory may be useful. If the agent stores “user wants cheapest flight” after one constrained search, future recommendations may become worse. Logging the write reason helps you catch this class of error.

7. Make retries visible

Retries can hide reliability problems. A request may succeed after three attempts, but the user still waits 18 seconds and your model bill triples.

Track retries as first-class spans. For each retry, capture:

  • Attempt number: 1, 2, 3.
  • Retry reason: Timeout, rate limit, invalid JSON, schema mismatch, tool error, low confidence, or policy violation.
  • Changed input: Whether you modified the prompt, tool arguments, or model parameters.
  • Backoff delay: Time spent waiting before retry.
  • Final outcome: Whether the retry solved the issue.

Use retry observability to set limits. For example:

  • Allow 1 retry for invalid JSON.
  • Allow 2 retries for transient HTTP 429 responses.
  • Allow 0 automatic retries for payment, refund, or account-deletion tools unless you have idempotency keys.
  • Escalate to review after 3 failed planning loops.

In dashboards, separate “eventual success” from “success on first attempt.” A high eventual success rate can still mean users experience slow or unstable workflows.
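
The retry limits and the two success rates can be expressed in a few lines. This is a sketch under assumptions: the reason names, the SIDE_EFFECT_TOOLS set, and the run dictionaries with status and attempts keys are hypothetical shapes chosen for the example.

```python
# Assumed retry budget per retry reason, mirroring the example limits above.
RETRY_LIMITS = {"invalid_json": 1, "rate_limit": 2}

# Assumed set of tools that must not be retried blindly.
SIDE_EFFECT_TOOLS = {"issue_refund", "charge_customer", "delete_account"}


def allowed_retries(reason: str, tool_name: str = "", has_idempotency_key: bool = False) -> int:
    """Return how many automatic retries are allowed for this failure."""
    if tool_name in SIDE_EFFECT_TOOLS and not has_idempotency_key:
        return 0
    return RETRY_LIMITS.get(reason, 0)


def success_rates(runs: list) -> dict:
    """Split success into first-attempt success and eventual success."""
    total = len(runs)
    if total == 0:
        return {"first_attempt": 0.0, "eventual": 0.0}
    succeeded = [run for run in runs if run["status"] == "success"]
    first_try = [run for run in succeeded if run["attempts"] == 1]
    return {
        "first_attempt": len(first_try) / total,
        "eventual": len(succeeded) / total,
    }
```

A widening gap between the two rates is a signal that retries are masking an underlying reliability problem.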

8. Measure latency by component

Total latency is useful for alerting, but it is too broad for debugging. Break latency down by step.

Track at least these timing metrics:

  • prompt_assembly_ms: Time spent building messages and context.
  • memory_read_ms: Vector search, database lookup, and reranking time.
  • planner_llm_ms: Time spent in planning model calls.
  • tool_latency_ms: Time spent waiting on external tools.
  • validation_ms: Schema checks, guardrails, and output parsing.
  • final_llm_ms: Time spent generating the final response.
  • queue_ms: Time before execution starts, if you use workers.

Then create service-level targets. For example:

  • p95 total latency under 8 seconds for interactive support agents.
  • p95 planner step under 2 seconds.
  • p95 tool latency under 1 second for internal APIs.
  • Timeout rate below 1% of runs.

These numbers will vary by product. A coding agent can take longer than a chat support agent. A background research agent can run for minutes. The point is to define targets that match the user experience and business process.
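
To check targets like these, latency has to be aggregated per span type. A minimal sketch, assuming spans are dictionaries with span_type and latency_ms keys and using the nearest-rank percentile method; a production system would usually compute this in its metrics backend instead.

```python
import math
from collections import defaultdict


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not values:
        raise ValueError("percentile requires at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_span_type(spans: list) -> dict:
    """Group span latencies by span_type and return the p95 for each group."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["span_type"]].append(span["latency_ms"])
    return {span_type: percentile(values, 95) for span_type, values in grouped.items()}
```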

9. Attribute cost to steps, tools, prompts, and customers

Agent cost can rise quickly because each run may contain many model calls. You need cost visibility at the span level, not only at the request level.

Capture:

  • Input tokens and output tokens for every model call.
  • Model name and provider.
  • Prompt version used for each call.
  • Cached token counts if your provider reports them.
  • Estimated cost using your current price table.
  • Cost by agent version so you can detect regressions after deployment.

Cost observability helps you catch changes like these:

  • A new planner prompt adds 3,000 tokens to every run.
  • A retrieval change injects 20 documents instead of 5.
  • A retry bug doubles model calls for one customer segment.
  • A tool schema becomes too large and appears in every planning prompt.

When you ship prompt changes through PromptLayer, attach prompt versions to traces. That lets you compare cost, latency, and quality before and after a prompt release.
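
Span-level cost attribution can be sketched as follows. The model name and per-token prices here are placeholders, not real provider pricing, and the function names are hypothetical; substitute your own price table.

```python
from collections import defaultdict

# Placeholder prices in USD per one million tokens. Not real provider pricing.
PRICE_PER_MILLION_TOKENS = {
    "example-model": {"input": 2.00, "output": 8.00},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one model call from the current price table."""
    prices = PRICE_PER_MILLION_TOKENS[model]
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


def cost_by(spans: list, key: str) -> dict:
    """Sum llm_call cost grouped by a span field such as prompt_version."""
    totals = defaultdict(float)
    for span in spans:
        if span.get("span_type") == "llm_call":
            totals[span[key]] += span["cost_usd"]
    return dict(totals)
```

Grouping by prompt_version or agent_version makes a cost regression after a release visible as a change in a single bucket.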

10. Classify failures in a way engineers can act on

A generic failed status does not help much. Use failure categories that point to likely fixes.

Practical categories include:

  • planner_error: The agent selected the wrong next action.
  • tool_argument_error: The agent chose the right tool but passed invalid arguments.
  • tool_runtime_error: The tool failed because of an API, network, auth, or service issue.
  • memory_error: The agent used missing, stale, or irrelevant context.
  • format_error: The model returned invalid JSON or failed a schema check.
  • policy_error: The agent violated a product, safety, or compliance rule.
  • timeout: The run exceeded the time budget.
  • loop_detected: The agent repeated actions without progress.
  • user_input_error: The user request lacked required information.

Add a short failure summary when possible. For example:

{
  "status": "failed",
  "failure_type": "tool_argument_error",
  "failure_summary": "Planner called issue_refund without required order_id.",
  "failed_span_id": "span_009"
}

This structure makes alert routing easier. Tool runtime errors may go to the platform team. Planner errors may go to the AI engineering team. Policy errors may need review by product or legal stakeholders.
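
A classifier that maps exceptions to these categories might look like the sketch below. The custom exception classes, the "unknown" fallback, and the team names in ALERT_ROUTES are assumptions; categories such as planner_error usually come from evaluations or review, not from exceptions.

```python
import asyncio
import json


class ToolArgumentError(Exception):
    """Hypothetical error raised when tool arguments fail schema validation."""


class LoopDetected(Exception):
    """Hypothetical error raised when the agent repeats actions without progress."""


def classify_error(error: Exception) -> str:
    """Map an exception to one of the failure categories listed above."""
    if isinstance(error, (TimeoutError, asyncio.TimeoutError)):
        return "timeout"
    if isinstance(error, json.JSONDecodeError):
        return "format_error"
    if isinstance(error, ToolArgumentError):
        return "tool_argument_error"
    if isinstance(error, LoopDetected):
        return "loop_detected"
    if isinstance(error, (ConnectionError, PermissionError)):
        return "tool_runtime_error"
    return "unknown"  # assumed fallback; review these and add categories as needed


# Assumed routing table based on the ownership described above.
ALERT_ROUTES = {
    "tool_runtime_error": "platform-team",
    "planner_error": "ai-engineering",
    "policy_error": "product-and-legal",
}
```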

11. Add agent-specific evaluations

Logs show what happened. Evaluations help you decide whether it was good.

For agents, evaluate more than the final answer. Add checks at multiple points in the trace:

  • Plan quality: Did the agent choose a valid sequence of actions?
  • Tool choice: Did it select the correct tool for the task?
  • Argument correctness: Did it pass complete and valid tool arguments?
  • Context use: Did it use relevant retrieved information?
  • Faithfulness: Did the final response match tool outputs?
  • Task success: Did the agent complete the user’s goal?
  • Safety and policy: Did it avoid disallowed actions?

You can score these with deterministic checks, LLM-as-judge evaluators, or review queues. Use deterministic checks wherever possible. For example, if a refund agent must call check_refund_policy before issue_refund, test that directly from the trace.

Save failed or borderline traces into datasets. Then run them before deploying prompt, model, retrieval, or tool changes. This turns production failures into regression tests.
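
The refund ordering rule mentioned above is a good candidate for a deterministic check. A minimal sketch, assuming spans are dictionaries with span_type, name, and status keys, already sorted by start time; the function name is hypothetical.

```python
def policy_checked_before_refund(spans: list) -> bool:
    """Return True if every issue_refund call follows a successful policy check."""
    policy_checked = False
    for span in spans:  # assumed to be ordered by started_at
        if span["span_type"] != "tool_call":
            continue
        if span["name"] == "check_refund_policy" and span["status"] == "success":
            policy_checked = True
        elif span["name"] == "issue_refund" and not policy_checked:
            return False
    return True
```

A check like this can run over saved failure traces in a regression dataset and fails fast when a prompt change removes the policy step.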

12. Connect traces to prompts and prompt versions

Agent observability becomes much more useful when every LLM span points to the exact prompt version used at runtime.

For each model call, log:

  • Prompt template name.
  • Prompt version or commit hash.
  • Rendered messages after variable injection.
  • Input variables.
  • Model parameters.
  • Tool schemas included in the request.

This matters during incidents. If the refund agent started failing after yesterday’s deployment, you should be able to compare traces before and after the prompt change. Look for differences in instructions, tool schemas, examples, retrieved context, and output format requirements.

PromptLayer is built around this workflow: prompt management, request logging, evaluations, and trace visibility in one place. You can use it to connect production behavior back to the prompt and dataset changes that caused it.

13. Add observability at the agent framework boundary

The best place to instrument is usually the framework boundary, not scattered throughout business logic. Wrap the core actions your agent can perform.

At minimum, create wrappers for:

  • call_model()
  • call_tool()
  • read_memory()
  • write_memory()
  • validate_output()
  • retry_step()

Here is a simplified Python-style example:

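# Helper names used here (start_agent_run, trace_span, finish_agent_run,
# classify_error, load_customer_context, call_planner, call_tool,
# generate_final_response) are placeholders for your own wrappers.
async def run_agent(user_input, user_id):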
    run = start_agent_run(
        agent_name="support_refund_agent",
        user_id=user_id,
        input=user_input
    )

    try:
        memory = await trace_span(
            run_id=run.id,
            span_type="memory_read",
            name="load_customer_context",
            fn=lambda: load_customer_context(user_id)
        )

        plan = await trace_span(
            run_id=run.id,
            span_type="llm_call",
            name="planner_step",
            metadata={"prompt_version": "refund-agent-planner:v17"},
            fn=lambda: call_planner(user_input, memory)
        )

        tool_result = await trace_span(
            run_id=run.id,
            span_type="tool_call",
            name=plan.tool_name,
            input=plan.tool_args,
            fn=lambda: call_tool(plan.tool_name, plan.tool_args)
        )

        final = await trace_span(
            run_id=run.id,
            span_type="llm_call",
            name="final_response",
            metadata={"prompt_version": "refund-agent-final:v9"},
            fn=lambda: generate_final_response(user_input, tool_result)
        )

        finish_agent_run(run.id, status="success", output=final)
        return final

    except Exception as error:
        finish_agent_run(
            run.id,
            status="failed",
            failure_type=classify_error(error),
            error=str(error)
        )
        raise
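
The example relies on a trace_span helper without showing it. One possible shape is sketched below, assuming an in-memory SPANS list as a stand-in for a real tracing backend; the field names follow the schema from step 3, and none of this is a specific SDK's API.

```python
import asyncio
import inspect
import time
import uuid

SPANS = []  # stand-in sink; a real system would send spans to a tracing backend


async def trace_span(run_id, span_type, name, fn, input=None, metadata=None):
    """Run fn(), record one span describing the call, and return fn's result."""
    span = {
        "run_id": run_id,
        "span_id": f"span_{uuid.uuid4().hex[:8]}",
        "span_type": span_type,
        "name": name,
        "input": input,  # redact before storing in a real system
        "metadata": metadata or {},
        "status": "running",
        "error": None,
    }
    started = time.monotonic()
    try:
        result = fn()
        if inspect.isawaitable(result):  # support both sync and async callables
            result = await result
        span["status"] = "success"
        return result
    except Exception as error:
        span["status"] = "failed"
        span["error"] = repr(error)
        raise  # let the caller classify and handle the failure
    finally:
        # Runs on success and failure, so every attempt leaves a span behind.
        span["latency_ms"] = int((time.monotonic() - started) * 1000)
        SPANS.append(span)


async def lookup_order():
    """Hypothetical tool used only to demonstrate the wrapper."""
    return {"status": "shipped"}


result = asyncio.run(
    trace_span("run_123", "tool_call", "lookup_order", fn=lookup_order)
)
```

Recording the span in the finally block means failed calls are traced too, which is the case you most need during incident review.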

If you use OpenAI’s agent tooling, you can pair framework-level traces with PromptLayer’s OpenAI Agents SDK integration so model calls and agent steps stay connected.

14. Redact sensitive data before storage

Agent traces can contain user messages, API responses, internal records, and generated tool arguments. Treat observability data as production data.

Add redaction before logs leave your application boundary. Common fields to redact include:

  • Email addresses.
  • Phone numbers.
  • Access tokens and API keys.
  • Payment information.
  • Health, legal, or financial records.
  • Internal credentials in tool outputs.

Use allowlists for high-risk tools. For example, if a billing API returns 40 fields and you only need invoice_status, amount_due, and invoice_id for debugging, store only those fields.

Also add access controls. Engineers debugging model behavior may need rendered prompts and tool summaries. They may not need full customer records.
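
Redaction and allowlisting can start small. The sketch below uses deliberately simple regular expressions that will miss some formats and may over-match digit sequences such as long order numbers; the get_invoice tool name and the default-deny behavior are assumptions for illustration.

```python
import re

# Simplistic patterns for illustration; real redaction needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Assumed per-tool allowlist of output fields that are safe to store.
TOOL_OUTPUT_ALLOWLIST = {
    "get_invoice": {"invoice_status", "amount_due", "invoice_id"},
}


def redact_text(text: str) -> str:
    """Mask email addresses and phone-like numbers in free text."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def filter_tool_output(tool_name: str, output: dict) -> dict:
    """Keep only allowlisted fields; store nothing for tools without an entry."""
    allowed = TOOL_OUTPUT_ALLOWLIST.get(tool_name)
    if allowed is None:
        return {"redacted": True}  # default deny for unknown tools
    return {key: value for key, value in output.items() if key in allowed}
```

Defaulting to deny for tools without an allowlist entry means a newly added tool cannot leak fields into traces by accident.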

15. Build dashboards around engineering questions

Dashboards should help your team decide what to fix. Avoid vanity charts that show request volume but do not explain behavior.

Useful agent dashboards include:

  • Run success rate: Broken down by agent version, model, customer tier, and environment.
  • Failure type distribution: Planner errors, tool errors, memory errors, format errors, timeouts.
  • Latency by span type: Planner, tools, memory, validation, final response.
  • Cost by agent and prompt version: Daily cost, p95 cost per run, and cost per successful task.
  • Retry rate: Attempts per run and retry reasons.
  • Tool error rate: By tool name and version.
  • Loop rate: Runs where the agent exceeded your step threshold or repeated the same action.
  • Eval scores: Task success, tool choice accuracy, final answer faithfulness, and policy pass rate.

Pair each dashboard with a threshold. For example:

  • Alert when p95 latency exceeds 10 seconds for 15 minutes.
  • Alert when tool runtime errors exceed 3% for a critical tool.
  • Alert when average cost per successful run increases by 25% after deployment.
  • Alert when planner errors exceed 5% on reviewed traces.
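
A threshold like the cost alert above reduces to a small comparison. This sketch assumes you already have per-run costs for successful runs before and after a deployment; the function name and the 25% default are illustrative.

```python
def cost_regressed(baseline_costs: list, current_costs: list, max_increase: float = 0.25) -> bool:
    """Return True if average cost per successful run rose by more than max_increase."""
    if not baseline_costs or not current_costs:
        return False  # not enough data to compare
    baseline_avg = sum(baseline_costs) / len(baseline_costs)
    current_avg = sum(current_costs) / len(current_costs)
    return current_avg > baseline_avg * (1 + max_increase)
```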

16. Use traces during incident review

When an agent incident occurs, review a sample of traces instead of guessing from aggregate metrics.

A practical incident workflow:

  1. Filter failed or high-latency runs by time window, agent version, and customer segment.
  2. Group by failure type.
  3. Open representative traces for the largest groups.
  4. Check planner decisions before tool calls.
  5. Compare tool arguments with tool schemas.
  6. Check memory snippets injected into prompts.
  7. Compare prompt versions before and after the issue started.
  8. Save useful examples to an eval dataset.
  9. Run the dataset against the fix before redeploying.

For example, suppose a support agent starts telling users their orders are delayed when they are not. Traces may show that the retrieval step is pulling stale shipment updates from old conversations. The fix may be a memory filter, not a model change.

17. Roll out observability in stages

You do not need perfect tracing on day one. Roll it out in layers.

Stage 1: Basic run logging

  • Run ID.
  • User input and final output.
  • Status.
  • Total latency.
  • Total cost.
  • Agent version.

Stage 2: LLM and tool spans

  • Each model call.
  • Prompt version.
  • Token usage.
  • Each tool call.
  • Tool arguments and outputs.
  • Failure categories.

Stage 3: Memory, retries, and validation

  • Memory reads and writes.
  • Retry attempts and reasons.
  • Schema validation results.
  • Loop detection.
  • Side-effect tracking.

Stage 4: Evals and regression datasets

  • Trace-level evals.
  • Planner and tool-choice evals.
  • Production failure datasets.
  • Pre-deployment regression runs.
  • Dashboards tied to release health.

This staged approach keeps the work manageable. It also gives you useful debugging data early, before you add more advanced evaluation workflows.

Agent observability checklist

Use this checklist when you instrument your next agent release:

  • Every run has a unique run_id.
  • Every span includes run_id, span_id, type, status, start time, end time, and latency.
  • Planner steps log prompts, available tools, selected actions, and outputs.
  • Tool calls log arguments, outputs, status, latency, version, and side-effect risk.
  • Memory reads log queries, retrieved items, scores, and injected context.
  • Memory writes log content, reason, source, and expiration policy.
  • Retries log attempt number, reason, changed input, delay, and outcome.
  • LLM calls log model, prompt version, parameters, tokens, cost, and finish reason.
  • Failures use actionable categories.
  • Sensitive data is redacted before storage.
  • Dashboards break down success, latency, cost, retries, and failures by agent version.
  • Production failures can be saved into eval datasets.

Final thoughts

Good agent observability gives your team a clear path from a bad production outcome to the exact plan, prompt, tool call, memory item, retry, or validation step that caused it. Start with structured traces. Add prompt versions, tool details, memory events, retries, latency, cost, and failure categories. Then connect those traces to evaluations so each incident improves your test coverage.

The result is a tighter engineering loop: observe production behavior, identify the failing step, fix the prompt or system logic, test against real examples, and ship with more confidence.


If you want to manage prompts, trace agent runs, inspect LLM requests, and connect production failures to evaluations, create a PromptLayer account at https://dashboard.promptlayer.com/create-account.
