Tracing LLM Calls in Production: A Guide for Developers and AI Teams

How to Trace LLM Calls in Production

Production LLM failures are often hard to debug because the final answer hides the path that produced it. A bad response may come from the prompt, retrieved context, model parameters, tool output, schema parsing, retries, or a stale prompt version.

Tracing gives your team a step-by-step record of what happened during an LLM request. A useful trace should answer practical questions:

Which prompt version ran?
Which model and parameters were used?
What retrieval results were injected into the context?
Which tools were called?
Where did latency increase?
Which step failed?
What did the model see, and what did it return?

Good tracing is a core part of LLM observability, but it should not replace evals, monitoring, alerts, or product analytics. Traces help you debug individual executions. Evals help you measure quality across many executions.

What an LLM Trace Should Capture

A production LLM trace should represent the full request path, not only the final model response. For most applications, you want one top-level trace per user request or background job. Inside that trace, create spans for each meaningful operation.

Recommended span types

Request span: User request, route, tenant, environment, request ID, session ID, and latency.
Prompt span: Prompt template name, prompt version ID, variables, release label, and commit or deployment ID.
Retrieval span: Query, index name, filters, returned document IDs, scores, and token counts.
Model span: Provider, model, parameters, input token count, output token count, cost, latency, finish reason, and response format.
Tool span: Tool name, arguments, status, latency, return payload summary, and error details.
Parser span: JSON parsing, schema validation, repair attempts, and final structured output.
Eval span: Online checks, policy checks, LLM judge scores, or deterministic assertions.

Use span names that match how engineers talk about your system. For example, retrieve_policy_docs, call_support_agent_model, and validate_ticket_json are easier to debug than generic names like step_1 and llm_call.
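One way to keep span types consistent is to give each one a typed shape. The sketch below is illustrative only: the interface and field names are assumptions, not the schema of any particular tracing SDK, and it covers just three of the span types listed above.

```typescript
// Hypothetical span schema: field names are illustrative, not from a specific SDK.
type SpanStatus = "ok" | "error";

interface BaseSpan {
  name: string; // e.g. "retrieve_policy_docs"
  status: SpanStatus;
  latencyMs: number;
}

interface RetrievalSpan extends BaseSpan {
  kind: "retrieval";
  index: string;
  documentIds: string[];
}

interface ModelSpan extends BaseSpan {
  kind: "model";
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

interface ToolSpan extends BaseSpan {
  kind: "tool";
  toolName: string;
  errorCode?: string;
}

type TraceSpan = RetrievalSpan | ModelSpan | ToolSpan;

// The "kind" discriminant lets the compiler narrow each span to its own fields.
function describeSpan(span: TraceSpan): string {
  switch (span.kind) {
    case "retrieval":
      return `${span.name}: ${span.documentIds.length} docs from ${span.index}`;
    case "model":
      return `${span.name}: ${span.model} (${span.inputTokens} in, ${span.outputTokens} out)`;
    case "tool":
      return `${span.name}: ${span.toolName} ${span.errorCode ?? span.status}`;
  }
}
```

A typed union like this makes it harder for one service to emit a tool span without a tool name, or a model span without token counts.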

Example: Trace Timeline for a Support Agent

Here is a compact example of a trace timeline for a customer support agent that retrieves policy documents, calls a model, attempts a refund tool call, and fails because of invalid tool arguments.

Trace: support_agent.request
Trace ID: trc_8f91c2
Environment: production
User ID: user_8421
Prompt: support_agent_v3
Prompt Version ID: prv_2026_06_03_17
Model: gpt-4.1-mini

0ms      ├─ request.start
8ms      ├─ auth.check_user_entitlements                  ok       8ms
19ms     ├─ prompt.load                                   ok      11ms
42ms     ├─ retrieval.search_policy_docs                  ok      23ms
43ms     │   ├─ query: "refund delayed package"
43ms     │   ├─ index: "support_policy_prod"
43ms     │   └─ docs: doc_102 score=0.91, doc_087 score=0.84
91ms     ├─ llm.call.plan_response                        ok      49ms
96ms     ├─ tool.call.create_refund                       error    5ms
96ms     │   ├─ error: INVALID_ARGUMENT
96ms     │   └─ reason: refund_amount_cents must be <= order_total_cents
128ms    ├─ llm.call.recover_after_tool_error             ok      32ms
133ms    └─ response.sent                                 ok       5ms

Total latency: 133ms
Status: degraded_success

Example trace timeline showing retrieval, two model calls, and a failed tool call with its error details.

This trace is useful because it shows where the failure happened. Retrieval worked, and the first model call completed quickly. The refund tool failed because the model produced invalid arguments: a refund amount larger than the order total. The recovery model call then handled the failure and returned a safer answer to the user.

Use Nested Spans for Agents and Chains

Flat logs break down quickly when you ship agents, routing logic, retrieval, parallel calls, or multi-step workflows. Nested spans let you group related work under a parent operation.

support_agent.request
├─ load_prompt
│  ├─ fetch_template
│  └─ render_variables
├─ retrieve_context
│  ├─ embed_query
│  └─ vector_search
├─ agent_loop
│  ├─ llm.call.decide_next_action
│  ├─ tool.call.get_order
│  ├─ llm.call.decide_refund
│  ├─ tool.call.create_refund
│  └─ llm.call.final_answer
└─ postprocess
   ├─ validate_response
   └─ save_conversation_summary

Nested spans make agent loops easier to inspect because each decision, tool call, and recovery step stays attached to the same request.
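To make the parent and child relationship concrete, here is a minimal sketch of a span tree built with a stack. TreeTracer and renderTree are hypothetical names, and the class is a stand-in for a real tracing SDK. It handles synchronous code only, whereas production SDKs also propagate the current span across asynchronous calls.

```typescript
// Minimal in-memory span tree; a stand-in for a real tracing SDK.
interface SpanNode {
  name: string;
  children: SpanNode[];
}

class TreeTracer {
  readonly root: SpanNode;
  private stack: SpanNode[];

  constructor(rootName: string) {
    this.root = { name: rootName, children: [] };
    this.stack = [this.root];
  }

  // Runs fn inside a child span attached to whichever span is currently open.
  span<T>(name: string, fn: () => T): T {
    const node: SpanNode = { name, children: [] };
    this.stack[this.stack.length - 1].children.push(node);
    this.stack.push(node);
    try {
      return fn();
    } finally {
      this.stack.pop(); // close the span even if fn throws
    }
  }
}

// Renders the tree with two spaces of indentation per nesting level.
function renderTree(node: SpanNode, depth = 0): string[] {
  const lines = [`${"  ".repeat(depth)}${node.name}`];
  for (const child of node.children) {
    lines.push(...renderTree(child, depth + 1));
  }
  return lines;
}

const tracer = new TreeTracer("support_agent.request");
tracer.span("retrieve_context", () => {
  tracer.span("embed_query", () => {});
  tracer.span("vector_search", () => {});
});
tracer.span("agent_loop", () => {
  tracer.span("llm.call.decide_next_action", () => {});
  tracer.span("tool.call.get_order", () => {});
});

console.log(renderTree(tracer.root).join("\n"));
```

Because each span attaches to whichever span is open when it starts, the tree mirrors the call structure without any step needing to know its parent.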

If you use prompt chains or compiler-style planning, tracing becomes more important. A chain can fail in a planner prompt, a generated intermediate step, or a downstream tool. If your team is working with compiled workflows, review the concept of an LLM compiler and trace the generated steps as first-class spans.

Add Prompt and Version Metadata to Every Trace

Omitting prompt version IDs is one of the most common tracing mistakes. If a response fails and the trace only says support_prompt, your team cannot tell which version of the template produced the output.

Attach prompt metadata to the prompt span and the model span. That makes it possible to compare latency, cost, and quality by prompt version.

{
  "trace_id": "trc_8f91c2",
  "span_name": "llm.call.plan_response",
  "attributes": {
    "prompt.name": "support_agent_v3",
    "prompt.version_id": "prv_2026_06_03_17",
    "prompt.release_label": "production",
    "prompt.git_sha": "9c1a77e",
    "model.provider": "openai",
    "model.name": "gpt-4.1-mini",
    "model.temperature": 0.2,
    "model.max_output_tokens": 600,
    "app.environment": "production",
    "app.route": "/api/support/chat",
    "app.tenant_id": "tenant_431",
    "deployment.id": "deploy_2026_06_06_04"
  }
}

Example prompt and model metadata attached to a production LLM span.

At minimum, include these fields:

Prompt name: A stable human-readable name, such as invoice_extraction_v2.
Prompt version ID: An immutable version identifier.
Release label: For example, production, staging, or canary.
Model name: The exact model used.
Parameters: Temperature, max tokens, response format, seed, tool choice, and timeout.
Deployment ID: The app version that made the call.
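A small check can enforce these minimum fields before a span is exported. The sketch below reuses the attribute keys from the example span earlier in this section. The choice of which keys to require, and the missingAttributes helper itself, are assumptions to adapt.

```typescript
// Attribute keys follow the example span in this section; the required set is an assumption.
const REQUIRED_MODEL_SPAN_KEYS = [
  "prompt.name",
  "prompt.version_id",
  "prompt.release_label",
  "model.name",
  "model.temperature",
  "deployment.id"
] as const;

type Attributes = Record<string, string | number | boolean | undefined>;

// Returns the required keys that are missing or empty on a span.
// A value of 0 (for example, temperature 0) counts as present.
function missingAttributes(attributes: Attributes): string[] {
  return REQUIRED_MODEL_SPAN_KEYS.filter((key) => {
    const value = attributes[key];
    return value === undefined || value === "";
  });
}

const missing = missingAttributes({
  "prompt.name": "support_agent_v3",
  "model.name": "gpt-4.1-mini",
  "model.temperature": 0.2
});
console.log(missing);
```

Running a check like this in tests or in a staging exporter catches incomplete spans before they reach production dashboards.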

Trace Retrieval and Tool Calls

Many LLM bugs start outside the model. If you skip retrieval and tool spans, you may blame the prompt when the real issue is stale context, an empty search result, a bad filter, or a tool schema mismatch.

Retrieval spans

For retrieval-augmented generation, trace the retrieval query, filters, index, document IDs, scores, and token counts. Avoid storing full raw documents if they contain sensitive data. Store hashes, IDs, titles, snippets, or redacted summaries instead.

{
  "span_name": "retrieval.search_policy_docs",
  "status": "ok",
  "latency_ms": 23,
  "attributes": {
    "retrieval.index": "support_policy_prod",
    "retrieval.query_redacted": "refund delayed package",
    "retrieval.top_k": 5,
    "retrieval.filter": {
      "locale": "en-US",
      "policy_version": "2026-05"
    },
    "retrieval.results": [
      {
        "document_id": "doc_102",
        "score": 0.91,
        "tokens": 312
      },
      {
        "document_id": "doc_087",
        "score": 0.84,
        "tokens": 228
      }
    ]
  }
}

Example retrieval span with document IDs and scores instead of full raw document text.

Tool spans

Tool calls should record the tool name, arguments, result status, latency, and error type. Redact sensitive arguments before storing them.

{
  "span_name": "tool.call.create_refund",
  "status": "error",
  "latency_ms": 5,
  "attributes": {
    "tool.name": "create_refund",
    "tool.version": "2026-04-18",
    "tool.arguments_redacted": {
      "order_id": "ord_9132",
      "refund_amount_cents": 12999,
      "reason": "delayed_package"
    },
    "tool.error_code": "INVALID_ARGUMENT",
    "tool.error_message": "refund_amount_cents must be <= order_total_cents",
    "tool.retryable": false
  }
}

Example failed tool call span with a clear error code and redacted arguments.

Failed tool calls are especially important for agents. A model may recover gracefully, but your system still needs to track the failed step. Otherwise, you will miss silent reliability problems.
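One way to keep recovered failures visible is to derive the trace status from its spans instead of from the final response alone. The sketch below assumes the last span in the list produced the user-facing response. The degraded_success label matches the timeline example earlier, while success and failed are assumed names.

```typescript
type SpanStatus = "ok" | "error";
type TraceStatus = "success" | "degraded_success" | "failed";

interface SpanSummary {
  name: string;
  status: SpanStatus;
}

// Assumes the last span in the list is the one that produced the user-facing response.
function deriveTraceStatus(spans: SpanSummary[]): TraceStatus {
  if (spans.length === 0) return "failed";
  const finalSpan = spans[spans.length - 1];
  if (finalSpan.status === "error") return "failed";
  // The request succeeded, but an earlier step failed and was recovered.
  return spans.some((span) => span.status === "error") ? "degraded_success" : "success";
}

const status = deriveTraceStatus([
  { name: "retrieval.search_policy_docs", status: "ok" },
  { name: "llm.call.plan_response", status: "ok" },
  { name: "tool.call.create_refund", status: "error" },
  { name: "llm.call.recover_after_tool_error", status: "ok" },
  { name: "response.sent", status: "ok" }
]);
console.log(status);
```

Counting degraded_success separately from success gives you a reliability signal that user-facing error rates would hide.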

Instrument the LLM Call Path

You can implement tracing with OpenTelemetry-style spans, your own logging wrapper, or an AI engineering platform. The key is consistency. Every LLM call should pass through the same wrapper so you do not rely on each engineer to remember the right fields.

TypeScript example

import { trace, SpanStatusCode } from "@opentelemetry/api";
import type { Span } from "@opentelemetry/api";
import OpenAI from "openai";

// promptStore, searchPolicyDocs, renderPrompt, hashUserId, and redact are
// application-specific helpers assumed to be defined elsewhere.
const tracer = trace.getTracer("support-agent");
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Runs fn inside an active span, records any failure, and always ends the span.
// startActiveSpan does not end spans on its own, so the finally block is required.
async function withSpan<T>(name: string, fn: (span: Span) => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      return await fn(span);
    } catch (error) {
      span.recordException(error instanceof Error ? error : String(error));
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : "Unknown error"
      });
      throw error;
    } finally {
      span.end();
    }
  });
}

async function runSupportAgent(input: {
  userId: string;
  tenantId: string;
  message: string;
}) {
  return withSpan("support_agent.request", async (traceSpan) => {
    traceSpan.setAttributes({
      "app.environment": process.env.NODE_ENV,
      "app.tenant_id": input.tenantId,
      "user.id_hash": hashUserId(input.userId)
    });

    const prompt = await withSpan("prompt.load", async (span) => {
      const loadedPrompt = await promptStore.get("support_agent_v3", {
        label: "production"
      });

      span.setAttributes({
        "prompt.name": loadedPrompt.name,
        "prompt.version_id": loadedPrompt.versionId,
        "prompt.release_label": "production"
      });

      return loadedPrompt;
    });

    const docs = await withSpan("retrieval.search_policy_docs", async (span) => {
      const results = await searchPolicyDocs({
        query: redact(input.message),
        topK: 5
      });

      span.setAttributes({
        "retrieval.index": "support_policy_prod",
        "retrieval.top_k": 5,
        "retrieval.result_count": results.length,
        "retrieval.document_ids": results.map((doc) => doc.id)
      });

      return results;
    });

    const response = await withSpan("llm.call.plan_response", async (span) => {
      span.setAttributes({
        "model.provider": "openai",
        "model.name": "gpt-4.1-mini",
        "model.temperature": 0.2,
        "prompt.version_id": prompt.versionId
      });

      const completion = await openai.responses.create({
        model: "gpt-4.1-mini",
        input: renderPrompt(prompt, {
          message: input.message,
          policyDocs: docs
        }),
        temperature: 0.2
      });

      span.setAttributes({
        "model.input_tokens": completion.usage?.input_tokens,
        "model.output_tokens": completion.usage?.output_tokens,
        // The Responses API reports a status and incomplete_details rather than
        // a finish_reason field on output items.
        "model.finish_reason": completion.incomplete_details?.reason ?? completion.status
      });

      return completion;
    });

    traceSpan.setStatus({ code: SpanStatusCode.OK });
    return response;
  });
}

Example TypeScript tracing wrapper for prompt loading, retrieval, and an LLM call.

This pattern keeps tracing close to the workflow code without scattering logging statements across every file. You can adapt the same approach for Python, background jobs, batch evaluation runs, or agent loops.

Redact Sensitive Data Before It Enters Your Trace Store

Production traces often contain user messages, retrieved text, tool arguments, internal notes, and model outputs. Some of that data may include emails, names, addresses, API keys, payment details, medical information, or confidential business data.

Do not log raw sensitive data by default. Redact or hash it before you send it to your tracing backend.

Practical redaction rules

Hash user IDs and account IDs when exact values are not needed for debugging.
Redact emails, phone numbers, tokens, API keys, and payment identifiers.
Store document IDs and retrieval scores instead of full private documents.
Store short snippets only when they are safe and useful.
Apply retention rules. For example, keep full debug traces for 7 days and metadata-only traces for 90 days.
Restrict access to traces that include prompt inputs or outputs.

function redact(input: string): string {
  return input
    .replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, "[email_redacted]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[ssn_redacted]")
    .replace(/\b(?:sk|pk)_[A-Za-z0-9_]{16,}\b/g, "[api_key_redacted]");
}

Example redaction function for common sensitive fields. Production systems usually need stricter rules.
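For the first rule, hashing IDs, a keyed hash is usually safer than a plain one because short, guessable IDs can otherwise be recovered by hashing candidates. The sketch below shows one possible hashUserId using Node's built-in crypto module. The TRACE_HASH_SECRET environment variable and the 16-character truncation are assumptions, not requirements.

```typescript
import { createHmac } from "node:crypto";

// Keyed hash so IDs cannot be recovered by hashing guesses without the secret.
// TRACE_HASH_SECRET is a hypothetical environment variable name.
function hashUserId(
  userId: string,
  secret: string = process.env.TRACE_HASH_SECRET ?? ""
): string {
  if (secret === "") {
    throw new Error("A hashing secret is required");
  }
  // Truncated to 16 hex characters to keep span attributes short.
  return createHmac("sha256", secret).update(userId).digest("hex").slice(0, 16);
}

// The same input and secret always produce the same value, so traces for one
// user can still be grouped without storing the raw ID.
console.log(hashUserId("user_8421", "example-secret"));
```

Rotating the secret breaks the link between old and new hashes, which can be useful when retention rules require it.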

Track Cost, Latency, and Quality Signals

A trace should make debugging easier, but it should also support operational review. Attach cost and latency metadata to model spans so your team can answer questions like:

Did the new prompt version increase output tokens?
Are retries driving up cost?
Which tool calls add the most latency?
Which model produces the most schema validation failures?
Do failed retrieval calls correlate with lower answer quality?
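The first two questions can be answered by aggregating model spans by prompt version. The sketch below is a rough illustration: the prices are placeholders, not real provider pricing, and the model name, interfaces, and summarizeByPromptVersion helper are all hypothetical.

```typescript
// Prices are placeholders in USD per one million tokens, not real provider pricing.
interface ModelPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

const PRICES: Record<string, ModelPrice | undefined> = {
  "example-small-model": { inputPerMillion: 0.5, outputPerMillion: 2.0 }
};

interface ModelSpanUsage {
  promptVersionId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

interface VersionSummary {
  calls: number;
  outputTokens: number;
  costUsd: number;
  unpricedCalls: number; // calls whose model has no entry in PRICES
}

// Groups model spans by prompt version and totals calls, output tokens, and cost.
function summarizeByPromptVersion(spans: ModelSpanUsage[]): Map<string, VersionSummary> {
  const summaries = new Map<string, VersionSummary>();
  for (const span of spans) {
    const summary = summaries.get(span.promptVersionId) ??
      { calls: 0, outputTokens: 0, costUsd: 0, unpricedCalls: 0 };
    const price = PRICES[span.model];
    summary.calls += 1;
    summary.outputTokens += span.outputTokens;
    if (price === undefined) {
      // Surface unknown models instead of silently treating them as free.
      summary.unpricedCalls += 1;
    } else {
      summary.costUsd +=
        (span.inputTokens * price.inputPerMillion +
          span.outputTokens * price.outputPerMillion) / 1e6;
    }
    summaries.set(span.promptVersionId, summary);
  }
  return summaries;
}
```

Retries show up naturally in this view if each retry is recorded as its own model span: the call count and cost for a version rise while the request count stays flat.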

For quality, add lightweight online checks where they fit. For example, you might attach a schema validation result, a refusal classifier, a toxicity check, or an LLM judge score. If you use judge models, define the rubric clearly and track judge model versions too. Read more about LLM-as-a-judge if you use model-based scoring in production or offline evals.

Tracing and LLM evaluation should work together. A failed trace can become an eval example. A failing eval can link back to representative traces. This creates a practical loop between debugging and regression testing.

Set Sampling Rules for Production

You rarely need to store every detail for every request forever. Full tracing can get expensive and noisy. Use sampling rules that match risk and traffic volume.

Example sampling plan

100% of errors: Store full traces for failed requests, failed tool calls, parser errors, and timeout events.
100% of canary releases: Store full traces for new prompt versions during the first few hours or first 1,000 requests.
10% of normal production traffic: Store detailed traces for routine successful requests.
1% of high-volume low-risk traffic: Store metadata-only traces for cheap aggregate analysis.
On demand: Temporarily increase sampling for a tenant, route, model, or prompt version during an incident.

Sample at the trace level when possible. If you sample each span independently, you may keep a model span without the retrieval or tool spans that explain it.
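A trace-level decision can be made deterministic by hashing the trace ID, so every service reaches the same answer for the same trace. The sketch below assumes the decision is made once the trace is complete, so that errors are already known. The shouldKeepTrace helper and its inputs are hypothetical, and the default rate mirrors the 10% figure in the plan above.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units (identical to the byte-wise
// version for ASCII trace IDs). Deterministic across processes.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

interface SamplingInput {
  traceId: string;
  hasError: boolean;
  isCanary: boolean;
}

// Keeps every error and canary trace, plus a stable fraction of the rest.
function shouldKeepTrace(input: SamplingInput, baseRate = 0.1): boolean {
  if (input.hasError || input.isCanary) return true;
  // Map the hash to [0, 1) and compare against the sampling rate.
  return fnv1a(input.traceId) / 0x100000000 < baseRate;
}

console.log(shouldKeepTrace({ traceId: "trc_8f91c2", hasError: true, isCanary: false }));
```

Because the decision depends only on the trace ID and trace-level flags, all spans in a trace are kept or dropped together.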

Avoid Over-Instrumenting Noisy Events

Too much tracing can make production debugging harder. You do not need a span for every string concatenation, every token streamed to the client, or every small helper function.

Create spans for operations that have at least one of these traits:

They call an external service.
They can fail independently.
They add meaningful latency.
They change the model input.
They affect user-visible output.
They help explain cost or quality.

For streaming responses, avoid logging every token as an event unless you are debugging a narrow issue. A better default is to record first-token latency, total output tokens, completion status, and final redacted output.
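A small accumulator can capture those streaming defaults without emitting an event per token. The StreamMetrics class and the stream.* attribute names below are hypothetical. The sketch counts chunks rather than tokens, on the assumption that exact token totals come from the provider's usage data once the stream ends.

```typescript
// Collects summary metrics for one streamed response.
class StreamMetrics {
  private readonly startedAtMs: number;
  private firstChunkAtMs: number | undefined = undefined;
  private chunkCount = 0;

  constructor(startedAtMs: number) {
    this.startedAtMs = startedAtMs;
  }

  // Call once per streamed chunk; only the first call sets first-token time.
  onChunk(nowMs: number): void {
    if (this.firstChunkAtMs === undefined) {
      this.firstChunkAtMs = nowMs;
    }
    this.chunkCount += 1;
  }

  // Call when the stream ends; returns attributes to attach to the model span.
  finish(nowMs: number, completed: boolean) {
    return {
      "stream.first_token_latency_ms":
        this.firstChunkAtMs === undefined ? undefined : this.firstChunkAtMs - this.startedAtMs,
      "stream.chunk_count": this.chunkCount,
      "stream.total_latency_ms": nowMs - this.startedAtMs,
      "stream.completed": completed
    };
  }
}

const metrics = new StreamMetrics(1000);
metrics.onChunk(1180);
metrics.onChunk(1200);
console.log(metrics.finish(1500, true));
```

In a real handler you would pass Date.now() or a monotonic clock value at each call and set the returned attributes on the model span.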

Use Traces During Incidents

When an LLM incident happens, traces help you reduce guesswork. Start with a small set of failing traces and compare them against successful traces for the same route, prompt, and model.

Useful incident questions

Did failures start after a prompt release or model change?
Are failures isolated to one tenant, locale, route, or retrieval index?
Did tool latency increase before model timeouts started?
Are parser failures tied to a specific model response format?
Did retrieval return empty or low-score results?
Did the model call use the intended prompt version?

For example, if refund requests start failing after a prompt update, filter traces by prompt.version_id. Then compare tool arguments generated by the old and new prompt versions. You may find that the new prompt stopped instructing the model to cap refund amounts at the order total.
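That comparison can start with a simple error-rate breakdown by prompt version. The sketch below assumes tool spans have already been exported into flat records. The ToolCallRecord shape and the version IDs prv_old and prv_new are hypothetical.

```typescript
interface ToolCallRecord {
  promptVersionId: string;
  toolName: string;
  status: "ok" | "error";
}

// Returns the error rate (0 to 1) for one tool, keyed by prompt version ID.
function toolErrorRateByPromptVersion(
  records: ToolCallRecord[],
  toolName: string
): Map<string, number> {
  const counts = new Map<string, { total: number; errors: number }>();
  for (const record of records) {
    if (record.toolName !== toolName) continue;
    const entry = counts.get(record.promptVersionId) ?? { total: 0, errors: 0 };
    entry.total += 1;
    if (record.status === "error") entry.errors += 1;
    counts.set(record.promptVersionId, entry);
  }
  const rates = new Map<string, number>();
  counts.forEach((entry, versionId) => rates.set(versionId, entry.errors / entry.total));
  return rates;
}

const rates = toolErrorRateByPromptVersion(
  [
    { promptVersionId: "prv_old", toolName: "create_refund", status: "ok" },
    { promptVersionId: "prv_old", toolName: "create_refund", status: "ok" },
    { promptVersionId: "prv_new", toolName: "create_refund", status: "error" },
    { promptVersionId: "prv_new", toolName: "create_refund", status: "ok" },
    { promptVersionId: "prv_new", toolName: "get_order", status: "ok" }
  ],
  "create_refund"
);
console.log(rates);
```

A jump in error rate between versions tells you which traces to open first. The argument comparison itself still needs a look at individual spans.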

Turn Production Failures Into Test Cases

A trace should not end its life as a debugging artifact. When you find a meaningful failure, convert it into a regression test.

Find the failed trace.
Extract the redacted user input, prompt version, retrieved document IDs, tool responses, and expected behavior.
Add it to an evaluation dataset.
Run it against the current prompt and candidate prompt changes.
Keep the trace link attached to the dataset example.

This workflow helps prevent repeated failures. It also gives prompt changes a clearer release process. Before shipping a new prompt version, run it against real cases that previously failed.
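The extraction step can be sketched as a small conversion function. The FailedTrace and EvalExample shapes below are assumptions for illustration, so map them onto whatever your trace store and eval tooling actually use.

```typescript
interface FailedTrace {
  traceId: string;
  promptVersionId: string;
  redactedInput: string;
  retrievedDocumentIds: string[];
  toolResponses: { toolName: string; status: "ok" | "error"; errorCode?: string }[];
}

interface EvalExample {
  id: string;
  input: string;
  context: {
    documentIds: string[];
    toolResponses: FailedTrace["toolResponses"];
  };
  expectedBehavior: string;
  source: { traceId: string; promptVersionId: string };
}

// Copies only redacted, replayable fields and keeps a link back to the trace.
function toEvalExample(trace: FailedTrace, expectedBehavior: string): EvalExample {
  return {
    id: `eval_${trace.traceId}`,
    input: trace.redactedInput,
    context: {
      documentIds: [...trace.retrievedDocumentIds],
      toolResponses: trace.toolResponses.map((response) => ({ ...response }))
    },
    expectedBehavior,
    source: { traceId: trace.traceId, promptVersionId: trace.promptVersionId }
  };
}

const example = toEvalExample(
  {
    traceId: "trc_8f91c2",
    promptVersionId: "prv_2026_06_03_17",
    redactedInput: "refund delayed package",
    retrievedDocumentIds: ["doc_102", "doc_087"],
    toolResponses: [
      { toolName: "create_refund", status: "error", errorCode: "INVALID_ARGUMENT" }
    ]
  },
  "Refund amount must not exceed the order total"
);
console.log(example.id);
```

Keeping the source trace ID and prompt version on the example lets a failing eval link back to the execution that motivated it.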

Common Mistakes When Tracing LLM Calls

Tracing only the final model response

If you only store the final response, you miss the prompt, retrieval context, tools, retries, and parser steps that shaped it. Trace the full workflow.

Logging raw sensitive data

Raw prompts and outputs can contain private data. Redact before storage. Use access controls and retention policies.

Missing prompt version IDs

Prompt names are not enough. Store immutable prompt version IDs on every relevant span.

Ignoring retrieval and tool spans

RAG and agent failures often come from retrieval or tools. Trace them as first-class operations.

Over-instrumenting low-value events

Too many spans create noise and cost. Focus on operations that affect output, reliability, latency, or spend.

Treating tracing as a replacement for evals

Traces explain individual executions. Evals measure behavior across examples. You need both for production LLM systems.

Production Trace Checklist

Create one trace per user request, job, or agent run.
Use nested spans for prompt loading, retrieval, model calls, tool calls, parsing, and postprocessing.
Attach prompt name, prompt version ID, release label, model, parameters, and deployment ID.
Record latency, token usage, cost, status, finish reason, and retry count.
Trace retrieval queries, filters, document IDs, scores, and result counts.
Trace tool arguments in redacted form, tool versions, errors, and return statuses.
Redact sensitive data before storage.
Sample successful traffic, but keep full traces for errors and canaries.
Link traces to eval examples when failures become regression tests.
Review trace quality during every prompt or agent release.

Final Takeaway

Tracing LLM calls in production gives your team the execution history behind each response. The best traces show prompt versions, retrieval context, model parameters, tool calls, errors, latency, cost, and quality checks in one place.

Start with the core workflow. Trace the steps that change model input, call external systems, add latency, or affect user-visible output. Keep sensitive data out of your trace store. Then connect traces to evals so production failures turn into better tests.

PromptLayer helps AI teams manage prompts, trace LLM requests, inspect prompt versions, debug agent workflows, and connect production behavior back to evaluations. If you are shipping LLM-powered applications, create a PromptLayer account at https://dashboard.promptlayer.com/create-account.
