How to Trace LLM Calls in Production
How to Trace LLM Calls in Production
Tracing LLM calls in production means recording what happened during each model-powered request: the prompt, model, parameters, response, latency, token usage, tool calls, retrieval context, errors, retries, and final output. For teams shipping LLM applications, this is the difference between guessing why a user got a bad answer and knowing exactly which prompt version, context chunk, or tool response caused the issue.
Traditional application logs are not enough. An LLM request often includes multiple steps: prompt assembly, retrieval, model calls, function calls, agent planning, structured output parsing, guardrails, evaluators, and fallbacks. If you only log the final response, you lose the path that produced it.
A good trace gives you a timeline of the full workflow. When a user reports a bad answer, you should be able to open one request and answer these questions in minutes:
- Which prompt template and version ran?
- What variables were passed into the prompt?
- Which model and parameters were used?
- What retrieved documents or context were included?
- Did the model call any tools?
- Did any tool fail, timeout, or return bad data?
- Was there a retry or fallback?
- How many tokens did the request use?
- How long did each step take?
- Which output did the user actually see?
What to Trace in an LLM Request
Start by tracing the data that helps you debug real production failures. Avoid logging everything without a plan, especially if your application handles private customer data. Your trace should be useful, searchable, and safe.
1. Request metadata
Every trace should include stable identifiers that let you connect the LLM call to your application state.
- Trace ID: one ID for the full request or workflow.
- Span ID: one ID for each step inside the workflow.
- User ID or account ID: preferably hashed or internal-only.
- Environment: production, staging, preview, or local.
- Feature name: support bot, code agent, sales assistant, report generator.
- Release version: app version, git SHA, or deployment ID.
Example: if a customer reports that your support bot gave the wrong refund policy, you need to connect that conversation to the exact deployed code and prompt version that generated it.
2. Prompt version and input variables
Prompt changes are code changes. If your traces do not record prompt versions, you will struggle to separate model behavior from prompt drift.
For each model call, capture:
- The prompt template name.
- The prompt version or commit hash.
- The rendered prompt, if safe to store.
- The input variables used to render the prompt.
- The system, developer, and user messages as separate fields when using chat models.
Storing the rendered prompt is useful, but it can also contain sensitive data. Many teams store both a redacted prompt for search and a restricted full prompt for approved debugging.
3. Model configuration
Small model configuration changes can cause large behavior changes. Your trace should record the exact configuration used for each call.
- Provider, such as OpenAI, Anthropic, Google, or a self-hosted model.
- Model name, such as
gpt-4.1,claude-3-5-sonnet, or an internal model ID. - Temperature.
- Top-p.
- Max tokens.
- Response format or JSON schema.
- Tool definitions passed to the model.
- Timeout settings.
This matters when you compare a working trace to a failing trace. A support answer might degrade because someone increased temperature from 0.2 to 0.8, changed the model, or removed a schema constraint.
4. Retrieval context
For RAG applications, the retrieved context often explains the final answer. Trace the retrieval step, not only the final LLM call.
Capture:
- The user query sent to your retriever.
- The rewritten query, if you use query rewriting.
- Embedding model version.
- Vector index name and version.
- Top-k value.
- Document IDs returned.
- Similarity scores.
- The exact chunks included in the prompt, with redaction where needed.
If a model gives an incorrect answer, the root cause may be retrieval. The model may have received stale documentation, irrelevant chunks, or no useful context at all.
5. Tool calls and agent steps
Agents add more places for failures to hide. A model may choose the wrong tool, pass malformed arguments, call tools in the wrong order, or trust a tool result that should have been rejected.
Trace each tool call with:
- Tool name.
- Arguments passed to the tool.
- Validation errors.
- Tool response.
- Latency.
- Retries.
- Final status: success, failure, timeout, or skipped.
For agentic workflows, also capture planner outputs, intermediate reasoning summaries if your model and policy allow it, branch choices, and stop conditions. You do not need to store private chain-of-thought. Store concise step summaries that explain what the system did.
6. Output and post-processing
The raw model response is often not the final user-visible response. Your application may parse JSON, enforce a schema, run a guardrail, call an evaluator, rewrite the answer, or trigger a fallback.
Record:
- Raw model output.
- Parsed output.
- Parser errors.
- Validation results.
- Guardrail decisions.
- Evaluator scores.
- Final response sent to the user.
This helps you find issues where the model did the right thing, but your parser or post-processing logic broke the response.
Use Spans for Multi-Step LLM Workflows
A trace should represent one full user request. Spans should represent each step inside that request. This structure gives you a readable timeline and lets you measure latency at the right level.
For example, a customer support agent trace might look like this:
- HTTP request received with user message and conversation ID.
- Intent classification using a small model.
- Policy retrieval from a vector database.
- Main answer generation using retrieved chunks.
- Refund policy check using a structured evaluator.
- Final response sent to the user.
If the request takes 9 seconds, spans tell you where the time went. Maybe retrieval took 300 ms, the main model call took 5 seconds, and an evaluator retry took 3 seconds. Without spans, you only see a slow request.
Connect Tracing With LLM Observability
Tracing is one part of a broader production monitoring setup. You also need aggregate views, alerting, evaluation, and cost tracking. If you are defining this system for your team, it helps to separate tracing from LLM observability. Tracing explains one request. Observability helps you understand patterns across many requests.
Useful production metrics include:
- Error rate: failed model calls, parser failures, tool errors, and timeout rates.
- Latency: p50, p90, and p99 by feature, model, and prompt version.
- Cost: input tokens, output tokens, and estimated spend per request.
- Fallback rate: how often your app uses backup prompts, backup models, or safe responses.
- Evaluation score: quality, correctness, format compliance, safety, or task-specific scores.
- User feedback: thumbs up, thumbs down, edits, escalations, or support tickets.
A practical dashboard might show that version 12 of your support prompt has a 4.8 percent parser failure rate, while version 11 had 0.6 percent. That gives you a specific regression to inspect.
Add Evaluations to Your Traces
Tracing tells you what happened. Evaluations help you decide whether the output was good. For production LLM systems, you should attach evaluation results to the trace whenever possible.
Common evaluation types include:
- Exact checks: JSON schema validity, required fields, banned phrases, citation presence.
- Reference-based checks: compare the output to a known correct answer.
- Retrieval checks: confirm the answer uses the provided context.
- LLM-based grading: ask a separate model to score the response against a rubric.
- Business checks: detect refund promises, legal advice, medical advice, or pricing errors.
If you are building your first evaluation layer, start with LLM evaluation for high-risk or high-volume workflows. Use simple deterministic checks first, then add model-based grading where rules are too rigid.
For example, a RAG answer can be evaluated with three checks:
- Does the answer include at least one citation?
- Are all cited document IDs present in the retrieved context?
- Does an evaluator model score the answer as grounded in the provided context?
For subjective tasks, an LLM-as-a-judge setup can work well if you use clear rubrics, calibration examples, and periodic review against real user outcomes.
Protect Sensitive Data in Traces
Production traces can contain private user messages, customer records, API responses, and internal business data. Treat trace storage as production data storage.
Use these controls before you log full LLM payloads:
- Redact sensitive fields: remove passwords, API keys, payment data, access tokens, and private identifiers.
- Hash stable IDs: keep joins possible without exposing raw user identifiers.
- Set retention windows: for example, keep full traces for 14 or 30 days, then keep only metrics.
- Restrict access: give full trace access only to engineers and operators who need it.
- Separate environments: keep production traces separate from staging and local development data.
- Sample carefully: for very high-volume systems, sample successful traces but keep all failures.
A good default is to log full traces for failures, timeouts, parser errors, low evaluation scores, and explicit negative feedback. For successful traffic, sample enough to analyze quality and cost trends without storing unnecessary data.
Trace Prompt Chains and Compiled Workflows
Many LLM apps are no longer a single prompt call. They use prompt chains, routers, planners, workers, validators, and fallbacks. If your application composes prompts dynamically, you need traces that preserve the chain structure.
For a workflow with multiple LLM steps, record each prompt as its own span. Include parent-child relationships so you can see how one model call influenced the next. If your team is using compiler-style patterns for LLM workflows, an LLM compiler can make this structure more explicit by turning high-level task definitions into planned model and tool calls.
For example, a code review agent might run these spans:
- Summarize pull request diff.
- Identify risky files.
- Retrieve repository rules.
- Generate review comments.
- Validate comments against repository rules.
- Post approved comments to GitHub.
If the agent posts a bad comment, you need to know whether the summary missed context, retrieval returned the wrong rule, or the final validation step failed.
Implementation Pattern
You can implement LLM tracing with a simple pattern: create a trace at the start of a user request, create spans around each LLM or tool step, and attach structured metadata to each span.
A basic trace object should include:
trace_idnameuser_idoraccount_idenvironmentreleasestarted_atended_atstatus
A model span should include:
span_idparent_span_idprovidermodelprompt_nameprompt_versioninput_messagesoutputinput_tokensoutput_tokenslatency_mscost_usderror
A tool span should include:
tool_nameargumentsresultlatency_msstatuserror
Keep the schema consistent. If every team logs different field names for model, prompt version, and cost, your traces become hard to query.
Production Alerts Worth Setting Up
Once traces are flowing, add alerts for failures that affect users or costs. Start with a small set so your team pays attention when alerts fire.
- Parser failure rate above 2 percent for 10 minutes.
- p95 latency above 12 seconds for a user-facing workflow.
- Tool timeout rate above 5 percent.
- Cost per request increases by more than 30 percent after a deploy.
- Fallback rate doubles compared with the previous day.
- Evaluation pass rate drops below an agreed threshold, such as 90 percent.
Tune these numbers for your product. A background report generator may tolerate 60 seconds of latency. A chat assistant usually cannot.
Common Mistakes
- Logging only the final response. This hides prompt, retrieval, tool, and parsing failures.
- Skipping prompt versions. You cannot debug regressions if you do not know which prompt ran.
- Storing sensitive data without controls. Traces are useful, but they need access control and retention rules.
- Ignoring successful requests. Failures teach you a lot, but sampled successful traces help you understand normal behavior.
- Mixing staging and production data. Keep environments clean so metrics stay accurate.
- Tracing without evaluation. A trace can look technically successful while the answer is wrong.
A Practical Rollout Plan
You do not need to trace every workflow on day one. Start with the LLM path that has the highest user impact or support load.
- Pick one production workflow. Choose something like a support bot, sales assistant, code agent, or RAG answer flow.
- Add trace IDs and spans. Cover the main model call, retrieval step, tool calls, and final response.
- Record prompt versions and model settings. Make every output reproducible enough to debug.
- Add basic evaluations. Start with schema checks, citation checks, or task-specific pass/fail rules.
- Review 20 failing traces. Look for repeated causes such as bad retrieval, missing context, or parser issues.
- Create 3 to 5 alerts. Focus on quality, latency, cost, and tool failures.
- Expand to more workflows. Reuse the same trace schema so your team can compare systems.
The goal is simple: when something breaks in production, your team should be able to find the request, inspect the full path, identify the failing step, and ship a fix with confidence.
PromptLayer helps AI teams trace LLM calls, manage prompt versions, evaluate outputs, and monitor production behavior in one place. If you are building or shipping LLM-powered applications, create a PromptLayer account to start tracing your prompts and workflows.