How to Trace LLM Calls in Production
Production LLM failures are often hard to debug because the final answer hides the path that produced it. A bad response may come from the prompt, retrieved context, model parameters, tool output, schema parsing, retries, or a stale prompt version.
Tracing gives your team a step-by-step record of what happened during an LLM request. A useful trace should answer practical questions:
- Which prompt version ran?
- Which model and parameters were used?
- What retrieval results were injected into the context?
- Which tools were called?
- Where did latency increase?
- Which step failed?
- What did the model see, and what did it return?
Good tracing is a core part of LLM observability, but it should not replace evals, monitoring, alerts, or product analytics. Traces help you debug individual executions. Evals help you measure quality across many executions.
What an LLM Trace Should Capture
A production LLM trace should represent the full request path, not only the final model response. For most applications, you want one top-level trace per user request or background job. Inside that trace, create spans for each meaningful operation.
Recommended span types
- Request span: User request, route, tenant, environment, request ID, session ID, and latency.
- Prompt span: Prompt template name, prompt version ID, variables, release label, and commit or deployment ID.
- Retrieval span: Query, index name, filters, returned document IDs, scores, and token counts.
- Model span: Provider, model, parameters, input token count, output token count, cost, latency, finish reason, and response format.
- Tool span: Tool name, arguments, status, latency, return payload summary, and error details.
- Parser span: JSON parsing, schema validation, repair attempts, and final structured output.
- Eval span: Online checks, policy checks, LLM judge scores, or deterministic assertions.
Use span names that match how engineers talk about your system. For example, retrieve_policy_docs, call_support_agent_model, and validate_ticket_json are easier to debug than generic names like step_1 and llm_call.
Example: Trace Timeline for a Support Agent
Here is a compact example of a trace timeline for a customer support agent that retrieves policy documents, calls a model, attempts a refund tool call that fails because of invalid tool arguments, and then recovers with a second model call.
Trace: support_agent.request
Trace ID: trc_8f91c2
Environment: production
User ID: user_8421
Prompt: support_agent_v3
Prompt Version ID: prv_2026_06_03_17
Model: gpt-4.1-mini
0ms ├─ request.start
8ms ├─ auth.check_user_entitlements ok 8ms
19ms ├─ prompt.load ok 11ms
42ms ├─ retrieval.search_policy_docs ok 23ms
43ms │ ├─ query: "refund delayed package"
43ms │ ├─ index: "support_policy_prod"
43ms │ └─ docs: doc_102 score=0.91, doc_087 score=0.84
91ms ├─ llm.call.plan_response ok 49ms
96ms ├─ tool.call.create_refund error 5ms
96ms │ ├─ error: INVALID_ARGUMENT
96ms │ └─ reason: refund_amount_cents must be <= order_total_cents
128ms ├─ llm.call.recover_after_tool_error ok 32ms
133ms └─ response.sent ok 5ms
Total latency: 133ms
Status: degraded_success
This trace is useful because it shows where the failure happened. The first model call was fast. Retrieval worked. The refund tool failed because the model produced invalid arguments. The recovery model call then handled the failure and returned a safer answer to the user.
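Once spans are stored as structured records, the same reading of the timeline can be done programmatically. The sketch below is illustrative only: the `TimelineSpan` shape and the `summarizeTrace` helper are hypothetical names, not part of any tracing SDK, and the total assumes the spans ran sequentially without overlap, as they do in the example timeline.

```typescript
// Hypothetical span record; field names are illustrative, not a specific SDK schema.
interface TimelineSpan {
  name: string;
  status: "ok" | "error";
  durationMs: number;
}

// Durations copied from the example timeline above.
const supportAgentSpans: TimelineSpan[] = [
  { name: "auth.check_user_entitlements", status: "ok", durationMs: 8 },
  { name: "prompt.load", status: "ok", durationMs: 11 },
  { name: "retrieval.search_policy_docs", status: "ok", durationMs: 23 },
  { name: "llm.call.plan_response", status: "ok", durationMs: 49 },
  { name: "tool.call.create_refund", status: "error", durationMs: 5 },
  { name: "llm.call.recover_after_tool_error", status: "ok", durationMs: 32 },
  { name: "response.sent", status: "ok", durationMs: 5 },
];

// Returns the first failing span, the slowest span, and the summed duration.
function summarizeTrace(spans: TimelineSpan[]) {
  const firstError = spans.find((span) => span.status === "error") ?? null;
  let slowest: TimelineSpan | null = null;
  let totalMs = 0;
  for (const span of spans) {
    if (slowest === null || span.durationMs > slowest.durationMs) {
      slowest = span;
    }
    totalMs += span.durationMs;
  }
  return {
    firstErrorSpan: firstError ? firstError.name : null,
    slowestSpan: slowest ? slowest.name : null,
    totalMs,
  };
}
```

For the example trace, this points at tool.call.create_refund as the first failing step and llm.call.plan_response as the slowest one, which matches the manual reading of the timeline.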
Use Nested Spans for Agents and Chains
Flat logs break down quickly when you ship agents, routing logic, retrieval, parallel calls, or multi-step workflows. Nested spans let you group related work under a parent operation.
support_agent.request
├─ load_prompt
│ ├─ fetch_template
│ └─ render_variables
├─ retrieve_context
│ ├─ embed_query
│ └─ vector_search
├─ agent_loop
│ ├─ llm.call.decide_next_action
│ ├─ tool.call.get_order
│ ├─ llm.call.decide_refund
│ ├─ tool.call.create_refund
│ └─ llm.call.final_answer
└─ postprocess
  ├─ validate_response
  └─ save_conversation_summary
If you use prompt chains or compiler-style planning, tracing becomes more important. A chain can fail in a planner prompt, a generated intermediate step, or a downstream tool. If your team is working with compiled workflows, review the concept of an LLM compiler and trace the generated steps as first-class spans.
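Most tracing backends store spans as a flat list in which each span carries its parent's ID, and the nesting is reconstructed at read time. As a minimal sketch of that idea, the hypothetical `renderSpanTree` helper below rebuilds an indented tree from such a list. The `FlatSpan` shape is an assumption for illustration, and spans whose parent is missing from the list are silently skipped.

```typescript
// Hypothetical flat span record: parentId is null for the top-level span.
interface FlatSpan {
  id: string;
  parentId: string | null;
  name: string;
}

// Groups spans by parent, then walks the tree depth-first to build display lines.
function renderSpanTree(spans: FlatSpan[]): string[] {
  const childrenByParent = new Map<string | null, FlatSpan[]>();
  for (const span of spans) {
    const siblings = childrenByParent.get(span.parentId) ?? [];
    siblings.push(span);
    childrenByParent.set(span.parentId, siblings);
  }

  const lines: string[] = [];
  const visit = (parentId: string, prefix: string): void => {
    const children = childrenByParent.get(parentId) ?? [];
    children.forEach((child, index) => {
      const isLast = index === children.length - 1;
      lines.push(`${prefix}${isLast ? "└─ " : "├─ "}${child.name}`);
      // Keep the vertical guide only while more siblings follow.
      visit(child.id, prefix + (isLast ? "  " : "│ "));
    });
  };

  for (const root of childrenByParent.get(null) ?? []) {
    lines.push(root.name);
    visit(root.id, "");
  }
  return lines;
}
```

Storing the parent ID on every span, rather than nesting JSON documents, keeps writes cheap and lets late-arriving spans attach to a trace that is already partly stored.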
Add Prompt and Version Metadata to Every Trace
Omitting prompt version IDs is one of the most common tracing mistakes. If a response fails and the trace only says support_prompt, your team cannot tell which version of the template produced the output.
Attach prompt metadata to the prompt span and the model span. That makes it possible to compare latency, cost, and quality by prompt version.
{
  "trace_id": "trc_8f91c2",
  "span_name": "llm.call.plan_response",
  "attributes": {
    "prompt.name": "support_agent_v3",
    "prompt.version_id": "prv_2026_06_03_17",
    "prompt.release_label": "production",
    "prompt.git_sha": "9c1a77e",
    "model.provider": "openai",
    "model.name": "gpt-4.1-mini",
    "model.temperature": 0.2,
    "model.max_output_tokens": 600,
    "app.environment": "production",
    "app.route": "/api/support/chat",
    "app.tenant_id": "tenant_431",
    "deployment.id": "deploy_2026_06_06_04"
  }
}
At minimum, include these fields:
- Prompt name: A stable human-readable name, such as invoice_extraction_v2.
- Prompt version ID: An immutable version identifier.
- Release label: For example, production, staging, or canary.
- Model name: The exact model used.
- Parameters: Temperature, max tokens, response format, seed, tool choice, and timeout.
- Deployment ID: The app version that made the call.
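One way to keep these fields from drifting is to check them in the shared call wrapper before a span is exported. The sketch below assumes the attribute keys used in the JSON example above; `REQUIRED_TRACE_FIELDS` and `missingTraceFields` are hypothetical names, and temperature stands in for the full parameter set.

```typescript
// Attribute keys follow the example span above; adjust to your own naming scheme.
const REQUIRED_TRACE_FIELDS = [
  "prompt.name",
  "prompt.version_id",
  "prompt.release_label",
  "model.name",
  "model.temperature",
  "deployment.id",
] as const;

type SpanAttributes = Record<string, string | number | boolean | undefined>;

// Returns the required keys that are absent or empty on a span's attributes.
function missingTraceFields(attributes: SpanAttributes): string[] {
  return REQUIRED_TRACE_FIELDS.filter((key) => {
    const value = attributes[key];
    return value === undefined || value === "";
  });
}
```

In practice you might log a warning in production and fail fast in tests when the returned list is not empty.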
Trace Retrieval and Tool Calls
Many LLM bugs start outside the model. If you skip retrieval and tool spans, you may blame the prompt when the real issue is stale context, an empty search result, a bad filter, or a tool schema mismatch.
Retrieval spans
For retrieval-augmented generation, trace the retrieval query, filters, index, document IDs, scores, and token counts. Avoid storing full raw documents if they contain sensitive data. Store hashes, IDs, titles, snippets, or redacted summaries instead.
{
  "span_name": "retrieval.search_policy_docs",
  "status": "ok",
  "latency_ms": 23,
  "attributes": {
    "retrieval.index": "support_policy_prod",
    "retrieval.query_redacted": "refund delayed package",
    "retrieval.top_k": 5,
    "retrieval.filter": {
      "locale": "en-US",
      "policy_version": "2026-05"
    },
    "retrieval.results": [
      {
        "document_id": "doc_102",
        "score": 0.91,
        "tokens": 312
      },
      {
        "document_id": "doc_087",
        "score": 0.84,
        "tokens": 228
      }
    ]
  }
}
Tool spans
Tool calls should record the tool name, arguments, result status, latency, and error type. Redact sensitive arguments before storing them.
{
  "span_name": "tool.call.create_refund",
  "status": "error",
  "latency_ms": 5,
  "attributes": {
    "tool.name": "create_refund",
    "tool.version": "2026-04-18",
    "tool.arguments_redacted": {
      "order_id": "ord_9132",
      "refund_amount_cents": 12999,
      "reason": "delayed_package"
    },
    "tool.error_code": "INVALID_ARGUMENT",
    "tool.error_message": "refund_amount_cents must be <= order_total_cents",
    "tool.retryable": false
  }
}
Failed tool calls are especially important for agents. A model may recover gracefully, but your system still needs to track the failed step. Otherwise, you will miss silent reliability problems.
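A rough sketch of how a tool wrapper can record a failed step even when the agent goes on to recover is shown below. Everything here is hypothetical: `traceToolCall`, the `ToolSpan` shape, and the in-memory `sink` array stand in for whatever tracing client you use, and the wrapper is synchronous only to keep the example short. The wrapper returns the failure as a value so the agent loop can decide how to recover, while the span is written either way.

```typescript
// Hypothetical span shape for a tool call.
interface ToolSpan {
  name: string;
  status: "ok" | "error";
  latencyMs: number;
  errorCode?: string;
  errorMessage?: string;
}

type ToolOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; errorCode: string; errorMessage: string };

// Runs a tool, always records a span, and hands failures back to the caller.
function traceToolCall<T>(
  toolName: string,
  sink: ToolSpan[],
  run: () => T,
  now: () => number = () => Date.now(),
): ToolOutcome<T> {
  const startedAt = now();
  const name = `tool.call.${toolName}`;
  try {
    const value = run();
    sink.push({ name, status: "ok", latencyMs: now() - startedAt });
    return { ok: true, value };
  } catch (err) {
    const errorMessage = err instanceof Error ? err.message : String(err);
    // Assumes tools attach a string `code` to thrown errors; falls back otherwise.
    const rawCode = (err as { code?: unknown } | null)?.code;
    const errorCode = typeof rawCode === "string" ? rawCode : "UNKNOWN";
    sink.push({ name, status: "error", latencyMs: now() - startedAt, errorCode, errorMessage });
    return { ok: false, errorCode, errorMessage };
  }
}
```

Because the error span is written before the recovery step runs, a request that ends in a successful response still shows up when you query for failed tool calls.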
Instrument the LLM Call Path
You can implement tracing with OpenTelemetry-style spans, your own logging wrapper, or an AI engineering platform. The key is consistency. Every LLM call should pass through the same wrapper so you do not rely on each engineer to remember the right fields.
TypeScript example
import { trace, SpanStatusCode } from "@opentelemetry/api";

// promptStore, searchPolicyDocs, renderPrompt, redact, hashUserId, and the
// openai client are application helpers assumed to be defined elsewhere.
const tracer = trace.getTracer("support-agent");

async function runSupportAgent(input: {
  userId: string;
  tenantId: string;
  message: string;
}) {
  return tracer.startActiveSpan("support_agent.request", async (traceSpan) => {
    traceSpan.setAttributes({
      "app.environment": process.env.NODE_ENV,
      "app.tenant_id": input.tenantId,
      "user.id_hash": hashUserId(input.userId)
    });
    try {
      const prompt = await tracer.startActiveSpan("prompt.load", async (span) => {
        try {
          const loadedPrompt = await promptStore.get("support_agent_v3", {
            label: "production"
          });
          span.setAttributes({
            "prompt.name": loadedPrompt.name,
            "prompt.version_id": loadedPrompt.versionId,
            "prompt.release_label": "production"
          });
          return loadedPrompt;
        } finally {
          // startActiveSpan does not end spans for you; end each child explicitly.
          span.end();
        }
      });
      const docs = await tracer.startActiveSpan("retrieval.search_policy_docs", async (span) => {
        try {
          const results = await searchPolicyDocs({
            query: redact(input.message),
            topK: 5
          });
          span.setAttributes({
            "retrieval.index": "support_policy_prod",
            "retrieval.top_k": 5,
            "retrieval.result_count": results.length,
            "retrieval.document_ids": results.map((doc) => doc.id)
          });
          return results;
        } finally {
          span.end();
        }
      });
      const response = await tracer.startActiveSpan("llm.call.plan_response", async (span) => {
        try {
          span.setAttributes({
            "model.provider": "openai",
            "model.name": "gpt-4.1-mini",
            "model.temperature": 0.2,
            "prompt.version_id": prompt.versionId
          });
          const completion = await openai.responses.create({
            model: "gpt-4.1-mini",
            input: renderPrompt(prompt, {
              message: input.message,
              policyDocs: docs
            }),
            temperature: 0.2
          });
          span.setAttributes({
            "model.input_tokens": completion.usage?.input_tokens,
            "model.output_tokens": completion.usage?.output_tokens,
            // The Responses API reports completion through status and
            // incomplete_details rather than a finish_reason field.
            "model.finish_reason": completion.incomplete_details?.reason ?? completion.status
          });
          return completion;
        } finally {
          span.end();
        }
      });
      traceSpan.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (error) {
      // recordException accepts an Error or a string, not an unknown value.
      traceSpan.recordException(error instanceof Error ? error : String(error));
      traceSpan.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : "Unknown error"
      });
      throw error;
    } finally {
      traceSpan.end();
    }
  });
}
This pattern keeps tracing close to the workflow code without scattering logging statements across every file. You can adapt the same approach for Python, background jobs, batch evaluation runs, or agent loops.
Redact Sensitive Data Before It Enters Your Trace Store
Production traces often contain user messages, retrieved text, tool arguments, internal notes, and model outputs. Some of that data may include emails, names, addresses, API keys, payment details, medical information, or confidential business data.
Do not log raw sensitive data by default. Redact or hash it before you send it to your tracing backend.
Practical redaction rules
- Hash user IDs and account IDs when exact values are not needed for debugging.
- Redact emails, phone numbers, tokens, API keys, and payment identifiers.
- Store document IDs and retrieval scores instead of full private documents.
- Store short snippets only when they are safe and useful.
- Apply retention rules. For example, keep full debug traces for 7 days and metadata-only traces for 90 days.
- Restrict access to traces that include prompt inputs or outputs.
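The hashing and field-level rules above could be sketched as follows. This is one possible approach, not a complete privacy solution: `hashIdentifier` and `redactFields` are hypothetical helpers, the secret is assumed to come from your own secret manager, and truncating the digest to 16 hex characters trades collision resistance for shorter attribute values.

```typescript
import { createHmac } from "node:crypto";

// Keyed hash so the raw ID never reaches the trace store. A keyed HMAC makes it
// harder to reverse IDs by hashing guesses than a plain unsalted hash would.
function hashIdentifier(rawId: string, secret: string): string {
  return createHmac("sha256", secret).update(rawId).digest("hex").slice(0, 16);
}

// Replaces the values of sensitive top-level keys before arguments are attached
// to a span. Nested objects are not traversed in this sketch.
function redactFields(
  args: Record<string, unknown>,
  sensitiveKeys: readonly string[],
): Record<string, unknown> {
  const redacted: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(args)) {
    redacted[key] = sensitiveKeys.includes(key) ? "[redacted]" : value;
  }
  return redacted;
}
```

The same input and secret always produce the same hash, so engineers can still group traces by user without seeing the underlying ID.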
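// Pattern-based redaction for free text. The patterns are examples, not an
// exhaustive list; extend them for the identifiers your product handles.
function redact(input: string): string {
  return input
    // Email addresses
    .replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, "[email_redacted]")
    // US Social Security numbers in 123-45-6789 form
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[ssn_redacted]")
    // API keys that start with sk_ or pk_
    .replace(/\b(?:sk|pk)_[A-Za-z0-9_]{16,}\b/g, "[api_key_redacted]");
}
Track Cost, Latency, and Quality Signals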
A trace should make debugging easier, but it should also support operational review. Attach cost and latency metadata to model spans so your team can answer questions like:
- Did the new prompt version increase output tokens?
- Are retries driving up cost?
- Which tool calls add the most latency?
- Which model produces the most schema validation failures?
- Do failed retrieval calls correlate with lower answer quality?
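The first few questions reduce to grouping model spans by prompt version and comparing averages. Here is a minimal sketch under stated assumptions: the `ModelSpanUsage` shape is hypothetical, and the per-million-token prices are placeholders you would replace with your provider's current rates.

```typescript
// Hypothetical subset of model span fields needed for cost review.
interface ModelSpanUsage {
  promptVersionId: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

// Placeholder pricing in USD per million tokens; not real provider rates.
interface TokenPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

interface VersionSummary {
  calls: number;
  avgOutputTokens: number;
  avgLatencyMs: number;
  totalCostUsd: number;
}

function estimateCostUsd(span: ModelSpanUsage, pricing: TokenPricing): number {
  return (
    (span.inputTokens * pricing.inputPerMillionUsd +
      span.outputTokens * pricing.outputPerMillionUsd) / 1e6
  );
}

// Aggregates call count, average output tokens, average latency, and cost per version.
function summarizeByPromptVersion(
  spans: ModelSpanUsage[],
  pricing: TokenPricing,
): Map<string, VersionSummary> {
  const totals = new Map<string, { calls: number; outputTokens: number; latencyMs: number; costUsd: number }>();
  for (const span of spans) {
    const t = totals.get(span.promptVersionId) ?? { calls: 0, outputTokens: 0, latencyMs: 0, costUsd: 0 };
    t.calls += 1;
    t.outputTokens += span.outputTokens;
    t.latencyMs += span.latencyMs;
    t.costUsd += estimateCostUsd(span, pricing);
    totals.set(span.promptVersionId, t);
  }
  const summary = new Map<string, VersionSummary>();
  totals.forEach((t, versionId) => {
    summary.set(versionId, {
      calls: t.calls,
      avgOutputTokens: t.outputTokens / t.calls,
      avgLatencyMs: t.latencyMs / t.calls,
      totalCostUsd: t.costUsd,
    });
  });
  return summary;
}
```

If retries are recorded as separate model spans under the same trace, they are counted here automatically, which makes retry-driven cost visible in the per-version totals.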
For quality, add lightweight online checks where they fit. For example, you might attach a schema validation result, a refusal classifier, a toxicity check, or an LLM judge score. If you use judge models, define the rubric clearly and track judge model versions too. Read more about LLM-as-a-judge if you use model-based scoring in production or offline evals.
Tracing and LLM evaluation should work together. A failed trace can become an eval example. A failing eval can link back to representative traces. This creates a practical loop between debugging and regression testing.
Set Sampling Rules for Production
You rarely need to store every detail for every request forever. Full tracing can get expensive and noisy. Use sampling rules that match risk and traffic volume.
Example sampling plan
- 100% of errors: Store full traces for failed requests, failed tool calls, parser errors, and timeout events.
- 100% of canary releases: Store full traces for new prompt versions during the first few hours or first 1,000 requests.
- 10% of normal production traffic: Store detailed traces for routine successful requests.
- 1% of high-volume low-risk traffic: Store metadata-only traces for cheap aggregate analysis.
- On demand: Temporarily increase sampling for a tenant, route, model, or prompt version during an incident.
Sample at the trace level when possible. If you sample each span independently, you may keep a model span without the retrieval or tool spans that explain it.
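A trace-level decision along the lines of the example plan might look like the sketch below. It assumes the decision is made after the request finishes, so the error status is already known, and it hashes the trace ID so every span in a trace gets the same outcome. The `TraceSummary` fields, the rates, and the FNV-1a-style hash are illustrative choices, not a standard.

```typescript
type SamplingDecision = "full" | "metadata_only" | "drop";

// Hypothetical per-trace facts known once the request has finished.
interface TraceSummary {
  traceId: string;
  hasError: boolean;
  isCanary: boolean;
  highVolumeLowRisk: boolean;
}

// FNV-1a-style 32-bit hash over UTF-16 code units, mapped to [0, 1).
// Deterministic, so the same trace ID always lands in the same bucket.
function traceIdToUnitInterval(traceId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) / 0x100000000;
}

function decideSampling(trace: TraceSummary): SamplingDecision {
  // Errors and canary traffic are always kept in full.
  if (trace.hasError || trace.isCanary) return "full";
  const bucket = traceIdToUnitInterval(trace.traceId);
  // High-volume, low-risk routes: keep about 1% as metadata only.
  if (trace.highVolumeLowRisk) return bucket < 0.01 ? "metadata_only" : "drop";
  // Normal production traffic: keep about 10% in full.
  return bucket < 0.1 ? "full" : "drop";
}
```

Hashing the trace ID instead of calling a random number generator means separate services that see the same trace ID reach the same decision without coordinating.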
Avoid Over-Instrumenting Noisy Events
Too much tracing can make production debugging harder. You do not need a span for every string concatenation, every token streamed to the client, or every small helper function.
Create spans for operations that have at least one of these traits:
- They call an external service.
- They can fail independently.
- They add meaningful latency.
- They change the model input.
- They affect user-visible output.
- They help explain cost or quality.
For streaming responses, avoid logging every token as an event unless you are debugging a narrow issue. A better default is to record first-token latency, total output tokens, completion status, and final redacted output.
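As an illustration of that default, the sketch below consumes a stream of text chunks and records one summary instead of one event per chunk. The `summarizeStream` helper and `StreamSummary` shape are hypothetical. It takes a synchronous iterable and an injectable clock to keep the example short and testable, whereas a real provider stream would be an async iterable. Chunks are not tokens, so the output token count should still come from the provider's usage data.

```typescript
// Hypothetical summary attached to the model span once streaming finishes.
interface StreamSummary {
  firstTokenLatencyMs: number | null;
  totalLatencyMs: number;
  chunkCount: number;
  outputChars: number;
}

// Accumulates the streamed text and timing without emitting per-chunk events.
function summarizeStream(
  chunks: Iterable<string>,
  now: () => number = () => Date.now(),
): { text: string; summary: StreamSummary } {
  const startedAt = now();
  let firstTokenLatencyMs: number | null = null;
  let chunkCount = 0;
  let text = "";
  for (const chunk of chunks) {
    if (firstTokenLatencyMs === null) {
      // Time from request start to the first chunk the client could render.
      firstTokenLatencyMs = now() - startedAt;
    }
    chunkCount += 1;
    text += chunk;
  }
  return {
    text,
    summary: {
      firstTokenLatencyMs,
      totalLatencyMs: now() - startedAt,
      chunkCount,
      outputChars: text.length,
    },
  };
}
```

The returned text would still pass through redaction before it is stored as the final output on the span.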
Use Traces During Incidents
When an LLM incident happens, traces help you reduce guesswork. Start with a small set of failing traces and compare them against successful traces for the same route, prompt, and model.
Useful incident questions
- Did failures start after a prompt release or model change?
- Are failures isolated to one tenant, locale, route, or retrieval index?
- Did tool latency increase before model timeouts started?
- Are parser failures tied to a specific model response format?
- Did retrieval return empty or low-score results?
- Did the model call use the intended prompt version?
For example, if refund requests start failing after a prompt update, filter traces by prompt.version_id. Then compare tool arguments generated by the old and new prompt versions. You may find that the new prompt stopped instructing the model to cap refund amounts at the order total.
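That comparison can be sketched as a small aggregation over exported tool spans. The `RefundToolCall` shape below is an assumption: it presumes each record already joins the model-generated refund amount with the order total from your order system, and the version IDs in the usage note are made up.

```typescript
// Hypothetical record built from tool spans joined with order data.
interface RefundToolCall {
  promptVersionId: string;
  refundAmountCents: number;
  orderTotalCents: number;
}

// For each prompt version, returns the fraction of refund calls whose
// requested amount exceeded the order total.
function overRefundRateByVersion(calls: RefundToolCall[]): Map<string, number> {
  const counts = new Map<string, { total: number; over: number }>();
  for (const call of calls) {
    const c = counts.get(call.promptVersionId) ?? { total: 0, over: 0 };
    c.total += 1;
    if (call.refundAmountCents > call.orderTotalCents) c.over += 1;
    counts.set(call.promptVersionId, c);
  }
  const rates = new Map<string, number>();
  counts.forEach((c, versionId) => rates.set(versionId, c.over / c.total));
  return rates;
}
```

A rate near zero for a version such as prv_old and a noticeably higher rate for prv_new would point at the prompt change rather than at the tool or the model.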
Turn Production Failures Into Test Cases
A trace should not end its life as a debugging artifact. When you find a meaningful failure, convert it into a regression test.
- Find the failed trace.
- Extract the redacted user input, prompt version, retrieved document IDs, tool responses, and expected behavior.
- Add it to an evaluation dataset.
- Run it against the current prompt and candidate prompt changes.
- Keep the trace link attached to the dataset example.
This workflow helps prevent repeated failures. It also gives prompt changes a clearer release process. Before shipping a new prompt version, run it against real cases that previously failed.
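The conversion step in that workflow can be sketched as a plain mapping from a stored trace to a dataset row. Both shapes below are hypothetical and would need to match your trace export and your eval tooling. The expected behavior is supplied by a person, since the trace only records what happened, not what should have happened.

```typescript
// Hypothetical shape of a failed trace after redaction.
interface FailedTrace {
  traceId: string;
  redactedInput: string;
  promptVersionId: string;
  retrievedDocumentIds: string[];
  toolResponses: { tool: string; status: "ok" | "error"; errorCode?: string }[];
}

// Hypothetical eval dataset row.
interface EvalExample {
  id: string;
  input: string;
  context: {
    promptVersionId: string;
    retrievedDocumentIds: string[];
    toolResponses: FailedTrace["toolResponses"];
  };
  expectedBehavior: string;
  sourceTraceId: string; // Keeps the link back to the original trace.
}

function traceToEvalExample(trace: FailedTrace, expectedBehavior: string): EvalExample {
  return {
    id: `eval_${trace.traceId}`,
    input: trace.redactedInput,
    context: {
      promptVersionId: trace.promptVersionId,
      // Copy arrays so later edits to the dataset do not mutate the trace record.
      retrievedDocumentIds: [...trace.retrievedDocumentIds],
      toolResponses: trace.toolResponses.map((response) => ({ ...response })),
    },
    expectedBehavior,
    sourceTraceId: trace.traceId,
  };
}
```

For the refund failure shown earlier, the expected behavior might read "refund amount must not exceed the order total", and the example would be run against both the current prompt version and each candidate.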
Common Mistakes When Tracing LLM Calls
Tracing only the final model response
If you only store the final response, you miss the prompt, retrieval context, tools, retries, and parser steps that shaped it. Trace the full workflow.
Logging raw sensitive data
Raw prompts and outputs can contain private data. Redact before storage. Use access controls and retention policies.
Missing prompt version IDs
Prompt names are not enough. Store immutable prompt version IDs on every relevant span.
Ignoring retrieval and tool spans
RAG and agent failures often come from retrieval or tools. Trace them as first-class operations.
Over-instrumenting low-value events
Too many spans create noise and cost. Focus on operations that affect output, reliability, latency, or spend.
Treating tracing as a replacement for evals
Traces explain individual executions. Evals measure behavior across examples. You need both for production LLM systems.
Production Trace Checklist
- Create one trace per user request, job, or agent run.
- Use nested spans for prompt loading, retrieval, model calls, tool calls, parsing, and postprocessing.
- Attach prompt name, prompt version ID, release label, model, parameters, and deployment ID.
- Record latency, token usage, cost, status, finish reason, and retry count.
- Trace retrieval queries, filters, document IDs, scores, and result counts.
- Trace tool arguments in redacted form, tool versions, errors, and return statuses.
- Redact sensitive data before storage.
- Sample successful traffic, but keep full traces for errors and canaries.
- Link traces to eval examples when failures become regression tests.
- Review trace quality during every prompt or agent release.
Final Takeaway
Tracing LLM calls in production gives your team the execution history behind each response. The best traces show prompt versions, retrieval context, model parameters, tool calls, errors, latency, cost, and quality checks in one place.
Start with the core workflow. Trace the steps that change model input, call external systems, add latency, or affect user-visible output. Keep sensitive data out of your trace store. Then connect traces to evals so production failures turn into better tests.
PromptLayer helps AI teams manage prompts, trace LLM requests, inspect prompt versions, debug agent workflows, and connect production behavior back to evaluations. If you are shipping LLM-powered applications, create a PromptLayer account at https://dashboard.promptlayer.com/create-account.