How to Track LLM Usage, Cost, and Quality
Tracking LLM usage, cost, and quality is a production requirement once your app has real users. Without request-level records, you cannot explain a cost spike, debug a bad answer, compare prompt versions, or prove that a model change improved quality.
Good tracking gives your team a shared view of four things:
- Usage: who called which model, how often, and through which feature.
- Cost: prompt tokens, completion tokens, cached tokens, tool calls, retries, and total spend.
- Quality: task success, user feedback, eval scores, regression status, and error categories.
- Traceability: the full path from user request to prompt, model call, retrieved context, tool call, and final response.
This guide walks through a practical tracking setup for teams shipping LLM-powered products, agents, internal copilots, and AI workflows.
Start with a request-level LLM call log
Aggregate charts are useful, but they are not enough. If your only view is “daily tokens by model,” you will struggle to debug individual failures. Track every LLM request as a structured event.
At minimum, each call should include:
- Request ID
- User or account ID, with sensitive values hashed or redacted
- Environment, such as production, staging, or development
- Feature or workflow name
- Prompt name and prompt version
- Model and provider
- Input tokens, output tokens, cached tokens, and total tokens
- Estimated cost
- Latency
- Status, including success, error, timeout, refusal, or parse failure
- Trace ID and parent step ID for multi-step workflows
- Evaluation status or score, when available
Example LLM call log table
| Timestamp | Trace ID | Feature | Prompt | Version | Model | Tokens | Cost | Latency | Status | Quality |
|---|---|---|---|---|---|---|---|---|---|---|
| 2026-06-06 10:14:22 | trc_9f42 | support_reply | draft_response | v18 | gpt-4.1-mini | 1,842 | $0.0061 | 1.4s | success | pass |
| 2026-06-06 10:15:03 | trc_9f43 | invoice_agent | extract_fields | v07 | claude-3-5-sonnet | 4,210 | $0.0580 | 3.8s | json_parse_error | fail |
| 2026-06-06 10:15:44 | trc_9f44 | search_answer | rag_answer | v31 | gpt-4.1 | 8,905 | $0.1182 | 6.2s | success | needs_review |
Use this log as your source of truth. Dashboards, alerts, eval reports, and review queues should all point back to individual records.
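The field list above could be captured as one structured event per call. The sketch below is a minimal illustration, not a required schema: the class name `LLMCallRecord`, the field names, the `openai` provider label, and the input/output token split are all assumptions chosen to match the first row of the example table. Whether cached tokens are counted inside input tokens depends on your provider, so check that before summing totals.

```python
import json
import uuid
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LLMCallRecord:
    """One structured event per LLM request (field names are illustrative)."""
    trace_id: str
    feature: str
    prompt_name: str
    prompt_version: str
    provider: str
    model: str
    environment: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    estimated_cost_usd: float
    latency_ms: int
    status: str  # e.g. success, error, timeout, refusal, parse_failure
    user_hash: Optional[str] = None
    parent_span_id: Optional[str] = None
    eval_status: Optional[str] = None
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def total_tokens(self) -> int:
        # Assumes cached tokens are already included in input_tokens.
        return self.input_tokens + self.output_tokens

    def to_json(self) -> str:
        payload = asdict(self)
        payload["total_tokens"] = self.total_tokens
        return json.dumps(payload, sort_keys=True)


# Illustrative values; the 1,500 / 342 split is assumed, only the total is from the table.
record = LLMCallRecord(
    trace_id="trc_9f42",
    feature="support_reply",
    prompt_name="draft_response",
    prompt_version="v18",
    provider="openai",
    model="gpt-4.1-mini",
    environment="production",
    input_tokens=1500,
    output_tokens=342,
    cached_tokens=0,
    estimated_cost_usd=0.0061,
    latency_ms=1400,
    status="success",
    eval_status="pass",
)
print(record.to_json())
```

Emitting one JSON line per call keeps the log easy to ship to whatever store your dashboards and review queues read from.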
Define a metadata schema before traffic grows
Metadata turns raw model calls into useful engineering data. You need enough metadata to answer questions such as:
- Which customer account drove the cost increase?
- Did the new prompt version cause more tool failures?
- Which workflow step adds the most latency?
- Are users downvoting answers from a specific model?
- Do failures cluster around one document type, locale, or integration?
A good metadata schema stays stable, even as prompts and models change. Keep field names consistent across services. Avoid dumping arbitrary blobs into a single “metadata” field if your team will need to filter by those values later.
Example metadata schema for LLM tracking
| Field | Example | Purpose | PII Risk |
|---|---|---|---|
| trace_id | trc_9f42 | Links all calls in one user request or agent run | Low |
| user_hash | u_82ab91 | Groups usage by user without storing raw email | Medium |
| account_id | acct_1042 | Supports customer-level cost and quality reports | Medium |
| feature | support_reply | Separates product surfaces and workflows | Low |
| prompt_name | draft_response | Connects call behavior to prompt ownership | Low |
| prompt_version | v18 | Supports rollbacks and regression checks | Low |
| retrieval_collection | help_center_v3 | Debugs RAG answer quality | Low |
| tool_name | create_invoice | Tracks agent tool behavior | Low |
| input_classification | billing_question | Groups requests by task type | Low |
| contains_sensitive_data | false | Routes records to the right retention policy | High if wrong |
Do not log raw secrets, API keys, passwords, medical records, full payment details, or private customer content unless you have a clear retention, access, and redaction policy. For many teams, the safer default is to log structured metadata, token counts, prompt versions, and redacted inputs.
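As one way to apply that default, the sketch below hashes user identifiers with a keyed HMAC and masks obvious secrets before a record is stored. The helper names, the `u_` prefix, the 12-character truncation, and the `sk-` key pattern are assumptions for illustration. Regex redaction catches only simple cases, so treat it as a first filter rather than a complete policy.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical pattern for keys shaped like "sk-..."; adapt it to your own secret formats.
API_KEY_RE = re.compile(r"\bsk-[A-Za-z0-9_-]{16,}\b")


def hash_user_id(raw_id: str, secret: bytes) -> str:
    """Return a stable pseudonymous ID; the keyed hash resists simple lookup attacks."""
    digest = hmac.new(secret, raw_id.lower().encode("utf-8"), hashlib.sha256)
    return "u_" + digest.hexdigest()[:12]


def redact(text: str) -> str:
    """Mask emails and key-like strings before the text reaches the log store."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return API_KEY_RE.sub("[API_KEY]", text)


SECRET = b"load-this-from-a-secret-manager"  # placeholder, never hard-code in production
user_hash = hash_user_id("Jane@Example.com", SECRET)
clean = redact("Contact jane@example.com with key sk-abcdefghijklmnop1234")
print(user_hash, clean)
```

Because the hash is keyed, rotating the secret breaks linkage to older records, which may or may not be what your retention policy wants.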
Track usage by feature, model, prompt, and customer
Usage tracking should tell you where model calls come from and whether they match product value. A weekly report that says only "10 million tokens used" is less useful than one that says:
- The support_reply feature used 4.2 million tokens and served 18,400 conversations.
- The invoice_agent workflow used 2.1 million tokens, but 22% came from retries.
- One enterprise account generated 31% of total cost due to long PDF inputs.
- Prompt version v19 increased average input tokens by 38% after adding extra examples.
Group usage by:
- Feature: product area or workflow name
- Prompt: prompt template and version
- Model: provider, model name, and model version if available
- Customer: account, workspace, plan, or internal team
- Environment: production, staging, development, and batch jobs
- Step: planner, retriever, generator, critic, tool caller, summarizer, or evaluator
This breakdown helps you set budgets and assign ownership. If a prompt creates excessive cost, the prompt owner should see it. If one workflow keeps timing out, the team that owns that workflow should get the alert.
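A grouping like this can be computed directly from request-level records. The following is a minimal sketch that assumes each record is a dict with `total_tokens` and an `is_retry` flag; those field names and the sample numbers are illustrative. Passing a different key (`prompt_version`, `account_id`, `environment`) gives the other breakdowns.

```python
from collections import defaultdict


def usage_by(records, key):
    """Aggregate call count, tokens, and retry share for each value of `key`."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "retry_tokens": 0})
    for r in records:
        bucket = totals[r[key]]
        bucket["calls"] += 1
        bucket["tokens"] += r["total_tokens"]
        if r.get("is_retry"):
            bucket["retry_tokens"] += r["total_tokens"]
    return {
        name: {
            **b,
            "retry_share": b["retry_tokens"] / b["tokens"] if b["tokens"] else 0.0,
        }
        for name, b in totals.items()
    }


# Made-up sample records for illustration.
records = [
    {"feature": "support_reply", "total_tokens": 1842, "is_retry": False},
    {"feature": "invoice_agent", "total_tokens": 800, "is_retry": False},
    {"feature": "invoice_agent", "total_tokens": 200, "is_retry": True},
]
report = usage_by(records, "feature")
print(report["invoice_agent"])
```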
Calculate cost at the call level
LLM cost tracking should happen per call, not only per provider invoice. Provider invoices arrive too late for engineering decisions, and they rarely map cleanly to your product features.
For each LLM call, store:
- Input tokens
- Output tokens
- Cached input tokens, if the provider reports them
- Reasoning tokens, if exposed by the model API
- Embedding tokens, for retrieval or indexing calls
- Tool call cost, if external APIs charge per request
- Retry count and retry cost
- Total estimated cost in USD or your reporting currency
A simple cost formula looks like this:
total_cost =
    (input_tokens / 1_000_000 * input_price_per_1m) +
    (output_tokens / 1_000_000 * output_price_per_1m) +
    tool_cost +
    retry_cost

Store the pricing version used at the time of calculation. Model prices change. If you recalculate old usage with new prices, historical reports can drift and confuse finance or product teams.
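The formula and the pricing-version advice can be combined in a small helper. This is a sketch under stated assumptions: `PriceSheet`, `estimate_cost`, and the version label are hypothetical names, and the prices are placeholders rather than any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """Prices per 1M tokens, tagged with the version that was in effect."""
    version: str
    input_price_per_1m: float
    output_price_per_1m: float


def estimate_cost(input_tokens, output_tokens, prices, tool_cost=0.0, retry_cost=0.0):
    """Apply the per-call cost formula and keep the pricing version with the result."""
    total = (
        input_tokens / 1_000_000 * prices.input_price_per_1m
        + output_tokens / 1_000_000 * prices.output_price_per_1m
        + tool_cost
        + retry_cost
    )
    return {"total_cost_usd": round(total, 6), "pricing_version": prices.version}


# Placeholder prices, not real provider rates.
prices = PriceSheet(version="2026-06-01", input_price_per_1m=2.00, output_price_per_1m=8.00)
cost = estimate_cost(input_tokens=1500, output_tokens=342, prices=prices)
print(cost)
```

Storing `pricing_version` next to the amount lets you reproduce a historical figure later without guessing which rates applied.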
Build a dashboard that answers operational questions
Your dashboard should help engineers act. Avoid dashboards that look busy but fail to answer concrete questions.
Example LLM usage, cost, and quality dashboard
| Panel | Metric | Useful Filter | Action if Unhealthy |
|---|---|---|---|
| Daily cost | Total spend by feature and model | Environment, account, prompt version | Check top callers, retries, long contexts, and model mix |
| Token usage | Input, output, cached, and total tokens | Prompt, workflow step, customer plan | Trim context, improve retrieval, cap output length |
| Latency | p50, p95, p99 response time | Model, region, tool name | Inspect slow traces and external tool calls |
| Error rate | Timeouts, provider errors, parse errors, refusals | Prompt version, model, endpoint | Fix schema handling, retry policy, or provider fallback |
| Quality score | Eval pass rate and user feedback | Dataset, task type, release | Review failed examples and compare prompt versions |
| Agent trace health | Failed steps per run and tool success rate | Agent name, step type, tool | Inspect step-level traces and tool inputs |
For production LLM systems, LLM observability means more than logging the final answer. You need enough context to inspect the prompt, model response, tool outputs, retrieved documents, errors, and eval results for a single run.
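The latency panel in the table above depends on percentiles rather than averages. As a rough sketch, assuming records carry `model` and `latency_ms` fields (illustrative names), a nearest-rank percentile over request-level logs might look like this. Most metrics backends compute percentiles for you, often with a slightly different interpolation method.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_panel(records):
    """Return p50, p95, and p99 latency in milliseconds for each model."""
    by_model = defaultdict(list)
    for r in records:
        by_model[r["model"]].append(r["latency_ms"])
    return {
        model: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for model, vals in by_model.items()
    }


# Synthetic data: 100 calls with latencies 100 ms, 200 ms, ..., 10,000 ms.
records = [{"model": "model-a", "latency_ms": ms} for ms in range(100, 10_100, 100)]
panel = latency_panel(records)
print(panel)
```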
Version every prompt you ship
If you do not version prompts, your tracking data loses a major debugging dimension. A model may appear unstable when the real cause is a prompt edit that changed output format, examples, tone, or context order.
Track these fields for every request:
- prompt_name
- prompt_version
- template_variables, with sensitive values redacted
- model
- model_parameters, such as temperature, max tokens, top p, and response format
- release_tag, such as checkout-agent-2026-06-06
This makes rollbacks faster. If a release increases parse errors from 1.2% to 8.9%, you can compare prompt versions instead of searching through code commits and deployment logs.
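With the prompt version stored on every call, that comparison becomes a simple group-by. The sketch below assumes each record has `prompt_version` and `status` fields and uses synthetic data built to reproduce the 1.2% and 8.9% rates from the example; the version labels are illustrative.

```python
from collections import Counter


def parse_error_rate_by_version(records):
    """Return the share of calls per prompt version that ended in a JSON parse error."""
    calls = Counter()
    errors = Counter()
    for r in records:
        calls[r["prompt_version"]] += 1
        if r["status"] == "json_parse_error":
            errors[r["prompt_version"]] += 1
    return {version: errors[version] / calls[version] for version in calls}


def make_records(version, ok, failed):
    """Build synthetic records for one prompt version."""
    return (
        [{"prompt_version": version, "status": "success"}] * ok
        + [{"prompt_version": version, "status": "json_parse_error"}] * failed
    )


records = make_records("v18", ok=988, failed=12) + make_records("v19", ok=911, failed=89)
rates = parse_error_rate_by_version(records)
print(rates)
```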
Link traces across agent steps
Agent workflows need trace-level tracking. A final answer may look wrong because the planner picked the wrong tool, the retriever returned stale documents, the model produced malformed JSON, or the tool call failed and a fallback hid the error.
Use a single trace_id for the full run and a span_id for each step. Each step should record its parent span, inputs, outputs, status, latency, cost, and related prompt version.
This is especially important for plan-and-execute agents, where the plan, each action, and the final synthesis can fail independently.
Example trace structure for an agent run
| Trace ID | Span ID | Parent | Step | Prompt Version | Status | Cost |
|---|---|---|---|---|---|---|
| trc_agent_118 | spn_001 | | plan | planner_v12 | success | $0.014 |
| trc_agent_118 | spn_002 | spn_001 | retrieve_contract | | success | $0.002 |
| trc_agent_118 | spn_003 | spn_001 | extract_terms | extractor_v08 | json_parse_error | $0.021 |
| trc_agent_118 | spn_004 | spn_003 | retry_extract_terms | extractor_v08 | success | $0.020 |
| trc_agent_118 | spn_005 | spn_001 | final_answer | answer_v05 | success | $0.011 |
Do not let retries disappear from your logs. Retries often hide cost and quality problems. Track both the failed attempt and the successful retry.
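To make retry cost visible, a trace summary can total every span, including failed attempts. This sketch mirrors the example trace above; the `Span` class and `summarize_trace` helper are illustrative names, not part of any specific tracing library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    step: str
    status: str
    cost_usd: float
    prompt_version: Optional[str] = None


def summarize_trace(spans):
    """Total the cost of a run and surface the cost of failed attempts."""
    failed = [s for s in spans if s.status != "success"]
    return {
        "total_cost_usd": round(sum(s.cost_usd for s in spans), 6),
        "failed_cost_usd": round(sum(s.cost_usd for s in failed), 6),
        "failed_steps": [s.step for s in failed],
    }


T = "trc_agent_118"
spans = [
    Span(T, "spn_001", None, "plan", "success", 0.014, "planner_v12"),
    Span(T, "spn_002", "spn_001", "retrieve_contract", "success", 0.002),
    Span(T, "spn_003", "spn_001", "extract_terms", "json_parse_error", 0.021, "extractor_v08"),
    Span(T, "spn_004", "spn_003", "retry_extract_terms", "success", 0.020, "extractor_v08"),
    Span(T, "spn_005", "spn_001", "final_answer", "success", 0.011, "answer_v05"),
]
summary = summarize_trace(spans)
print(summary)
```

In this run the failed extraction accounts for $0.021 of the $0.068 total, which would be invisible if only the successful retry were logged.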
Measure quality with evals and review queues
Quality tracking should combine automated evaluation, user feedback, and targeted review. No single metric covers every failure mode.
Common quality signals include:
- Binary pass or fail: Did the response meet the task requirement?
- Rubric score: Rate correctness, completeness, tone, citation quality, and formatting.
- Schema validity: Did the output parse and match the required contract?
- Tool success: Did the agent call the correct tool with valid arguments?
- User feedback: Thumbs up, thumbs down, edits, regenerated responses, or support escalations.
- Regression status: Did a new prompt or model perform worse on a fixed dataset?
Use LLM evaluation to test prompt and model changes before release. For subjective tasks, an LLM-as-a-judge workflow can help score outputs against a rubric, as long as you audit the judge and keep examples of bad judgments.
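Schema validity is usually the cheapest of these signals to automate. A minimal check, using only the standard library and a hypothetical invoice contract (`invoice_number`, `total`), could look like the sketch below. A dedicated validation library would give richer errors.

```python
import json

# Hypothetical output contract for an invoice extraction prompt.
REQUIRED = {"invoice_number": str, "total": (int, float)}


def check_schema(raw_output, required):
    """Classify a model output as pass, json_parse_error, or schema_error."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return "json_parse_error"
    if not isinstance(data, dict):
        return "schema_error"
    for key, expected_type in required.items():
        if key not in data or not isinstance(data[key], expected_type):
            return "schema_error"
    return "pass"


print(check_schema('{"invoice_number": "INV-7", "total": 120.5}', REQUIRED))
print(check_schema('{"invoice_number": "INV-7", "total": ', REQUIRED))
print(check_schema('{"invoice_number": "INV-7"}', REQUIRED))
```

Storing the returned label in the call record's status or quality field keeps parse failures and contract violations separately countable.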
Set up a review process for collected data. Logging thousands of failures without reviewing them creates storage cost and false confidence. A practical review loop looks like this:
- Sample 50 to 100 production traces per high-volume feature each week.
- Review all high-cost outliers, parse failures, and user-downvoted responses.
- Tag failure causes, such as retrieval miss, prompt ambiguity, wrong tool, stale context, or unsafe response.
- Add representative failures to an eval dataset.
- Test prompt, retrieval, and model changes against that dataset before release.
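The first two steps of that loop can be sketched as a queue builder: always include parse failures, downvotes, and high-cost outliers, then add a random sample of the rest. The field names, the $0.10 cost threshold, and the `build_review_queue` helper are assumptions for illustration.

```python
import random


def build_review_queue(traces, sample_size=50, cost_threshold_usd=0.10, seed=None):
    """Combine must-review traces with a random sample of the remaining traffic."""
    must_review = [
        t for t in traces
        if t["status"] == "json_parse_error"
        or t.get("user_feedback") == "down"
        or t["cost_usd"] >= cost_threshold_usd
    ]
    must_ids = {t["trace_id"] for t in must_review}
    rest = [t for t in traces if t["trace_id"] not in must_ids]
    rng = random.Random(seed)  # a fixed seed makes the weekly sample reproducible
    sampled = rng.sample(rest, min(sample_size, len(rest)))
    return must_review + sampled


# Synthetic traces: 200 normal calls, then three flagged for different reasons.
traces = [
    {"trace_id": f"trc_{i}", "status": "success", "cost_usd": 0.01, "user_feedback": None}
    for i in range(200)
]
traces[3]["status"] = "json_parse_error"
traces[7]["user_feedback"] = "down"
traces[11]["cost_usd"] = 0.50

queue = build_review_queue(traces, sample_size=50, seed=7)
print(len(queue))
```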
Track failed requests as first-class records
Many teams log successful responses and miss failed requests. This creates a biased picture of the system. Failed calls often contain the most useful debugging data.
Track failures such as:
- Provider 429 rate limits
- Provider 500 errors
- Timeouts
- Client-side cancellation
- Malformed JSON
- Schema validation errors
- Tool call failures
- Empty responses
- Safety refusals
- Context length errors
Include partial data when a request fails. For example, you can still store the prompt version, model, feature, input token estimate, trace ID, latency before failure, and error code.
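One way to guarantee that failed calls are recorded is to write the log entry in a `finally` block, so it runs whether the call succeeds or raises. In this sketch `tracked_call`, `flaky_call`, and the in-memory `sink` list are hypothetical stand-ins for your client wrapper and log store.

```python
import time


def tracked_call(call_fn, base_record, sink):
    """Run call_fn and always append a record to sink, even when the call fails."""
    record = dict(base_record)
    started = time.monotonic()
    try:
        response = call_fn()
        record["status"] = "success"
        record["output_tokens"] = response.get("output_tokens")
        return response
    except TimeoutError:
        record["status"] = "timeout"
        raise
    except Exception as exc:
        record["status"] = "error"
        record["error_type"] = type(exc).__name__
        raise
    finally:
        # Latency before failure is still useful debugging data.
        record["latency_ms"] = int((time.monotonic() - started) * 1000)
        sink.append(record)


def flaky_call():
    raise TimeoutError("provider did not respond")  # simulated failure


sink = []
base = {
    "trace_id": "trc_9f43",
    "feature": "invoice_agent",
    "prompt_version": "v07",
    "input_tokens_estimate": 3900,  # illustrative estimate made before the call
}
try:
    tracked_call(flaky_call, base, sink)
except TimeoutError:
    pass  # the caller still sees the error; the record is already stored

print(sink[0])
```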
Set alerts for cost, latency, error rate, and quality drops
Alerts should catch real problems without paging your team for normal variation. Start with thresholds that map to user impact or budget impact.
Example LLM alert configuration
| Alert | Condition | Window | Severity | Owner | First Check |
|---|---|---|---|---|---|
| Cost spike | Spend is 2x higher than same hour average | 60 minutes | High | AI platform | Top features, retries, long contexts |
| Parse errors | JSON parse error rate exceeds 5% | 30 minutes | High | Feature owner | Prompt version, response format, model change |
| Latency | p95 latency exceeds 8 seconds | 15 minutes | Medium | Backend | Provider status, tool latency, token count |
| Quality regression | Eval pass rate drops below 92% | Per release | Blocker | Prompt owner | Failed eval cases and recent prompt diff |
| Missing traces | More than 1% of requests lack trace ID | 24 hours | Medium | AI platform | SDK instrumentation and async jobs |
For cost alerts, compare against expected traffic. A 2x spike during a product launch may be healthy. A 2x spike at 3 a.m. caused by retry loops needs immediate attention.
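Two of the conditions from the alert table can be expressed as small predicates. This is a sketch, not a monitoring setup: the function names are illustrative, and the `min_calls` guard is an added assumption to avoid alerting on a handful of requests.

```python
def cost_spike(current_hour_spend, same_hour_history, multiplier=2.0):
    """True when spend exceeds `multiplier` times the average for the same hour."""
    if not same_hour_history:
        return False  # no baseline yet, so do not alert
    baseline = sum(same_hour_history) / len(same_hour_history)
    return baseline > 0 and current_hour_spend > multiplier * baseline


def parse_error_alert(parse_errors, total_calls, threshold=0.05, min_calls=50):
    """True when the JSON parse error rate exceeds the threshold on enough traffic."""
    if total_calls < min_calls:
        return False
    return parse_errors / total_calls > threshold


# Illustrative numbers: the same hour averaged $12 over the last three weeks.
print(cost_spike(42.0, [10.0, 12.0, 14.0]))
print(cost_spike(20.0, [10.0, 12.0, 14.0]))
print(parse_error_alert(parse_errors=9, total_calls=100))
```

Comparing against the same hour on previous days is a simple way to account for expected traffic patterns. A launch-day override would still need to be handled separately.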
Connect tracking to release gates
Tracking becomes more valuable when it affects releases. Add gates for prompt and model changes, especially on workflows that write data, take actions, or answer customers directly.
A practical release checklist:
- Run the new prompt against a fixed eval dataset.
- Compare pass rate, average cost, p95 latency, and parse error rate against the current production version.
- Review at least 20 failed or borderline examples.
- Canary the new version to 5% to 10% of traffic.
- Watch cost, error rate, and quality for at least one business cycle.
- Keep rollback ready by retaining the previous prompt version.
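The comparison step in this checklist can be automated as a gate that returns the metrics blocking a release. The thresholds below are assumptions: the 92% pass-rate floor echoes the alert table, and the 10% tolerance for cost, latency, and parse errors is an arbitrary starting point to tune per workflow.

```python
def release_gate(baseline, candidate, min_pass_rate=0.92, max_relative_increase=0.10):
    """Return the names of metrics that block the release (empty list means go)."""
    failures = []
    if (
        candidate["pass_rate"] < min_pass_rate
        or candidate["pass_rate"] < baseline["pass_rate"]
    ):
        failures.append("pass_rate")
    for metric in ("avg_cost_usd", "p95_latency_s", "parse_error_rate"):
        if candidate[metric] > baseline[metric] * (1 + max_relative_increase):
            failures.append(metric)
    return failures


# Made-up eval results for the production prompt and a candidate version.
baseline = {"pass_rate": 0.95, "avg_cost_usd": 0.010, "p95_latency_s": 4.0, "parse_error_rate": 0.012}
candidate = {"pass_rate": 0.96, "avg_cost_usd": 0.0105, "p95_latency_s": 6.0, "parse_error_rate": 0.010}

blocking = release_gate(baseline, candidate)
print(blocking)
```

Here the candidate improves quality but regresses p95 latency by 50%, so the gate flags latency for review before the canary step.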
If you use prompt chains or compiler-style systems, connect each generated or selected prompt back to the run that created it. An LLM compiler can make workflows more dynamic, but you still need versioned artifacts and traceable execution.
Common mistakes to avoid
Logging sensitive data without a policy
Raw prompts and outputs may contain customer data, secrets, legal text, health details, or internal business information. Redact or hash sensitive fields before storage. Restrict access. Define retention windows. For example, keep full redacted traces for 30 days, keep metadata for 180 days, and add traces to eval datasets only after review.
Tracking only aggregate metrics
Aggregate metrics hide the examples your team needs to debug. Keep request-level logs and link every chart back to the underlying traces.
Failing to version prompts
Prompt edits can change cost and quality as much as model changes. Treat prompts as versioned production assets. Store the prompt version on every call.
Missing failed requests
If you log only successful calls, your quality numbers will look better than reality. Record failed calls, partial responses, provider errors, timeouts, retries, and parse failures.
Not linking traces across agent steps
Agent failures often happen before the final response. Link planner steps, retrieval, tool calls, retries, and final answers under one trace ID.
Collecting data without review
Data does not improve your system by itself. Assign owners, review failed examples, label causes, add important cases to eval datasets, and track whether fixes work.
A simple implementation plan
If your team is starting from basic logs, use this rollout plan:
- Week 1: Add request-level logging for model, prompt version, tokens, cost, latency, status, and trace ID.
- Week 2: Add metadata for feature, account, environment, workflow step, and model parameters.
- Week 3: Build dashboards for cost by feature, error rate by prompt version, and p95 latency by model.
- Week 4: Add eval scores, user feedback, and review queues for failed or low-quality examples.
- Week 5: Add alerts and release gates for prompt and model changes.
You do not need a perfect tracking system on day one. You do need consistent IDs, prompt versions, cost fields, failure records, and a review loop. Those pieces make every later improvement easier.
What good LLM tracking gives your team
A mature tracking setup lets you answer production questions quickly:
- Which prompt version caused the regression?
- Which customer, feature, or workflow caused the cost spike?
- Are retries hiding provider instability?
- Did the model migration improve quality enough to justify the cost?
- Which failed examples should become eval cases?
- Where should the team optimize context length, retrieval, or tool calls?
The goal is simple: make LLM behavior measurable at the level where engineers can act. Track the call, connect it to the trace, attach cost and quality, and review the examples that matter.
PromptLayer helps AI teams track prompts, versions, LLM requests, traces, costs, evals, and production behavior in one place. If you are building or shipping LLM applications, you can create a PromptLayer account and start instrumenting your workflows.