How to Track LLM Usage, Cost, and Quality

Jun 06, 2026

Tracking LLM usage, cost, and quality is a production requirement once your app has real users. Without request-level records, you cannot explain a cost spike, debug a bad answer, compare prompt versions, or prove that a model change improved quality.

Good tracking gives your team a shared view of four things:

  • Usage: who called which model, how often, and through which feature.
  • Cost: prompt tokens, completion tokens, cached tokens, tool calls, retries, and total spend.
  • Quality: task success, user feedback, eval scores, regression status, and error categories.
  • Traceability: the full path from user request to prompt, model call, retrieved context, tool call, and final response.

This guide walks through a practical tracking setup for teams shipping LLM-powered products, agents, internal copilots, and AI workflows.

Start with a request-level LLM call log

Aggregate charts are useful, but they are not enough. If your only view is “daily tokens by model,” you will struggle to debug individual failures. Track every LLM request as a structured event.

At minimum, each call should include:

  • Request ID
  • User or account ID, with sensitive values hashed or redacted
  • Environment, such as production, staging, or development
  • Feature or workflow name
  • Prompt name and prompt version
  • Model and provider
  • Input tokens, output tokens, cached tokens, and total tokens
  • Estimated cost
  • Latency
  • Status, including success, error, timeout, refusal, or parse failure
  • Trace ID and parent step ID for multi-step workflows
  • Evaluation status or score, when available

Example LLM call log table

| Timestamp | Trace ID | Feature | Prompt | Version | Model | Tokens | Cost | Latency | Status | Quality |
|---|---|---|---|---|---|---|---|---|---|---|
| 2026-06-06 10:14:22 | trc_9f42 | support_reply | draft_response | v18 | gpt-4.1-mini | 1,842 | $0.0061 | 1.4s | success | pass |
| 2026-06-06 10:15:03 | trc_9f43 | invoice_agent | extract_fields | v07 | claude-3-5-sonnet | 4,210 | $0.0580 | 3.8s | json_parse_error | fail |
| 2026-06-06 10:15:44 | trc_9f44 | search_answer | rag_answer | v31 | gpt-4.1 | 8,905 | $0.1182 | 6.2s | success | needs_review |

Use this log as your source of truth. Dashboards, alerts, eval reports, and review queues should all point back to individual records.
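
As a rough illustration of the field list above, a call record could be modeled as a small dataclass that serializes to one JSON line per request. This is a minimal sketch, not a prescribed schema: the class name `LLMCallRecord`, the field names, the `provider` value, and the 1,500/342 split of the 1,842 tokens in the first table row are all assumptions made for the example.

```python
import json
import uuid
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LLMCallRecord:
    """One structured event per LLM request (field names are illustrative)."""
    trace_id: str
    feature: str
    prompt_name: str
    prompt_version: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    estimated_cost_usd: float
    latency_ms: int
    status: str  # e.g. "success", "error", "timeout", "refusal", "parse_failure"
    environment: str = "production"
    user_hash: Optional[str] = None       # hashed, never the raw identifier
    parent_step_id: Optional[str] = None  # set for multi-step workflows
    eval_status: Optional[str] = None     # filled in later, when available
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def total_tokens(self) -> int:
        # Assumes cached tokens are already counted inside input_tokens.
        return self.input_tokens + self.output_tokens

    def to_json(self) -> str:
        """Serialize to one JSON line, including the derived token total."""
        payload = asdict(self)
        payload["total_tokens"] = self.total_tokens
        return json.dumps(payload, sort_keys=True)


record = LLMCallRecord(
    trace_id="trc_9f42", feature="support_reply",
    prompt_name="draft_response", prompt_version="v18",
    provider="openai", model="gpt-4.1-mini",
    input_tokens=1500, output_tokens=342, cached_tokens=0,
    estimated_cost_usd=0.0061, latency_ms=1400, status="success",
)
print(record.to_json())
```

Writing one JSON object per line keeps the log easy to load into a warehouse or search index later, whichever store your team uses.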

Define a metadata schema before traffic grows

Metadata turns raw model calls into useful engineering data. You need enough metadata to answer questions such as:

  • Which customer account drove the cost increase?
  • Did the new prompt version cause more tool failures?
  • Which workflow step adds the most latency?
  • Are users downvoting answers from a specific model?
  • Do failures cluster around one document type, locale, or integration?

A good metadata schema stays stable, even as prompts and models change. Keep field names consistent across services. Avoid dumping arbitrary blobs into a single “metadata” field if your team will need to filter by those values later.

Example metadata schema for LLM tracking

| Field | Example | Purpose | PII Risk |
|---|---|---|---|
| trace_id | trc_9f42 | Links all calls in one user request or agent run | Low |
| user_hash | u_82ab91 | Groups usage by user without storing raw email | Medium |
| account_id | acct_1042 | Supports customer-level cost and quality reports | Medium |
| feature | support_reply | Separates product surfaces and workflows | Low |
| prompt_name | draft_response | Connects call behavior to prompt ownership | Low |
| prompt_version | v18 | Supports rollbacks and regression checks | Low |
| retrieval_collection | help_center_v3 | Debugs RAG answer quality | Low |
| tool_name | create_invoice | Tracks agent tool behavior | Low |
| input_classification | billing_question | Groups requests by task type | Low |
| contains_sensitive_data | false | Routes records to the right retention policy | High if wrong |

Do not log raw secrets, API keys, passwords, medical records, full payment details, or private customer content unless you have a clear retention, access, and redaction policy. For many teams, the safer default is to log structured metadata, token counts, prompt versions, and redacted inputs.
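
One way to approach the hashing and redaction described above is sketched here. The helper names `user_hash` and `redact`, the placeholder key, and both regular expressions are assumptions for illustration; the patterns are deliberately rough and would miss many kinds of sensitive data, so treat this as a starting point rather than a complete redaction policy.

```python
import hashlib
import hmac
import re

# Assumption: in real use the key comes from a secret store, not source code.
HASH_KEY = b"replace-with-a-secret-key"

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Rough pattern for 13 to 16 digit card-like numbers with optional separators.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def user_hash(raw_user_id: str) -> str:
    """Keyed hash so one user groups together without storing the raw ID."""
    digest = hmac.new(
        HASH_KEY, raw_user_id.lower().encode("utf-8"), hashlib.sha256
    )
    return "u_" + digest.hexdigest()[:12]


def redact(text: str) -> str:
    """Replace obvious emails and card-like numbers before text is logged."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub("[CARD]", text)


print(user_hash("jane@example.com"))
print(redact("Contact jane@example.com about card 4111 1111 1111 1111 today"))
```

A keyed hash (HMAC) is used instead of a plain hash so that someone with a list of known emails cannot recompute the mapping without the key.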

Track usage by feature, model, prompt, and customer

Usage tracking should tell you where model calls come from and whether they match product value. A weekly report that says only "10 million tokens used" is less useful than one that says:

  • The support_reply feature used 4.2 million tokens and served 18,400 conversations.
  • The invoice_agent workflow used 2.1 million tokens, but 22% came from retries.
  • One enterprise account generated 31% of total cost due to long PDF inputs.
  • Prompt version v19 increased average input tokens by 38% after adding extra examples.

Group usage by:

  • Feature: product area or workflow name
  • Prompt: prompt template and version
  • Model: provider, model name, and model version if available
  • Customer: account, workspace, plan, or internal team
  • Environment: production, staging, development, and batch jobs
  • Step: planner, retriever, generator, critic, tool caller, summarizer, or evaluator

This breakdown helps you set budgets and assign ownership. If a prompt creates excessive cost, the prompt owner should see it. If one workflow keeps timing out, the team that owns that workflow should get the alert.
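
The grouping above can be sketched as a small aggregation over call records. The record shape (`feature`, `tokens`, `is_retry`) and the function name `usage_by` are hypothetical, and the sample numbers are made up for the example.

```python
from collections import defaultdict

# Hypothetical call records; real ones would come from the call log.
calls = [
    {"feature": "support_reply", "tokens": 1200, "is_retry": False},
    {"feature": "support_reply", "tokens": 800, "is_retry": False},
    {"feature": "invoice_agent", "tokens": 3000, "is_retry": False},
    {"feature": "invoice_agent", "tokens": 1000, "is_retry": True},
]


def usage_by(records, key):
    """Sum calls and tokens per value of `key`, plus the share spent on retries."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "retry_tokens": 0})
    for r in records:
        bucket = totals[r[key]]
        bucket["calls"] += 1
        bucket["tokens"] += r["tokens"]
        if r["is_retry"]:
            bucket["retry_tokens"] += r["tokens"]
    report = {}
    for name, b in totals.items():
        share = b["retry_tokens"] / b["tokens"] if b["tokens"] else 0.0
        report[name] = {**b, "retry_share": round(share, 3)}
    return report


for feature, stats in usage_by(calls, "feature").items():
    print(feature, stats)
```

The same function works for any other dimension stored on the record, such as model, prompt version, or account, by passing a different `key`.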

Calculate cost at the call level

LLM cost tracking should happen per call, not only per provider invoice. Provider invoices arrive too late for engineering decisions, and they rarely map cleanly to your product features.

For each LLM call, store:

  • Input tokens
  • Output tokens
  • Cached input tokens, if the provider reports them
  • Reasoning tokens, if exposed by the model API
  • Embedding tokens, for retrieval or indexing calls
  • Tool call cost, if external APIs charge per request
  • Retry count and retry cost
  • Total estimated cost in USD or your reporting currency

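A simple cost formula looks like this. It bills all input tokens at one rate, so add separate terms if your provider prices cached input or other token types differently: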

total_cost =
  (input_tokens / 1_000_000 * input_price_per_1m) +
  (output_tokens / 1_000_000 * output_price_per_1m) +
  tool_cost +
  retry_cost

Store the pricing version used at the time of calculation. Model prices change. If you recalculate old usage with new prices, historical reports can drift and confuse finance or product teams.
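
A minimal sketch of per-call cost calculation with a stored pricing version might look like the following. The prices, the model name `example-model`, and the version label are invented for illustration and are not real provider rates. The sketch also adds a cached-input term to the formula above, on the assumption that cached tokens are a subset of input tokens and are billed at a separate rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """Prices in USD per 1M tokens, frozen under a version label."""
    version: str
    input_per_1m: float
    output_per_1m: float
    cached_input_per_1m: float


# Hypothetical price sheet; look up real rates from your provider.
PRICES = {
    "example-model": PriceSheet(
        version="2026-06-01",
        input_per_1m=2.00,
        output_per_1m=8.00,
        cached_input_per_1m=0.50,
    )
}


def call_cost(model, input_tokens, output_tokens, cached_tokens=0,
              tool_cost=0.0, retry_cost=0.0):
    """Return the estimated cost plus the pricing version that produced it."""
    prices = PRICES[model]
    uncached = input_tokens - cached_tokens  # assumes cached is part of input
    total = (
        uncached / 1_000_000 * prices.input_per_1m
        + cached_tokens / 1_000_000 * prices.cached_input_per_1m
        + output_tokens / 1_000_000 * prices.output_per_1m
        + tool_cost
        + retry_cost
    )
    return {"total_cost_usd": round(total, 6), "pricing_version": prices.version}


result = call_cost("example-model", input_tokens=10_000,
                   output_tokens=2_000, cached_tokens=4_000)
print(result)
```

Storing `pricing_version` next to the cost means an old record can always be explained by the price sheet that was active when it was written.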

Build a dashboard that answers operational questions

Your dashboard should help engineers act. Avoid dashboards that look busy but fail to answer concrete questions.

Example LLM usage, cost, and quality dashboard

| Panel | Metric | Useful Filter | Action if Unhealthy |
|---|---|---|---|
| Daily cost | Total spend by feature and model | Environment, account, prompt version | Check top callers, retries, long contexts, and model mix |
| Token usage | Input, output, cached, and total tokens | Prompt, workflow step, customer plan | Trim context, improve retrieval, cap output length |
| Latency | p50, p95, p99 response time | Model, region, tool name | Inspect slow traces and external tool calls |
| Error rate | Timeouts, provider errors, parse errors, refusals | Prompt version, model, endpoint | Fix schema handling, retry policy, or provider fallback |
| Quality score | Eval pass rate and user feedback | Dataset, task type, release | Review failed examples and compare prompt versions |
| Agent trace health | Failed steps per run and tool success rate | Agent name, step type, tool | Inspect step-level traces and tool inputs |

For production LLM systems, LLM observability means more than logging the final answer. You need enough context to inspect the prompt, model response, tool outputs, retrieved documents, errors, and eval results for a single run.
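
The latency panel in the dashboard table relies on percentiles rather than averages. As a small sketch, the p50, p95, and p99 values could be computed from per-call latencies with the standard library. The function name `latency_percentiles` and the sample data are assumptions; a real dashboard would usually compute this in the metrics store instead.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 from a list of per-call latencies."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up sample: latencies of 1 ms through 100 ms.
sample = list(range(1, 101))
print(latency_percentiles(sample))
```

Tail percentiles matter here because a handful of slow traces can hide behind a healthy-looking average.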

Version every prompt you ship

If you do not version prompts, your tracking data loses a major debugging dimension. A model may appear unstable when the real cause is a prompt edit that changed output format, examples, tone, or context order.

Track these fields for every request:

  • prompt_name
  • prompt_version
  • template_variables, with sensitive values redacted
  • model
  • model_parameters, such as temperature, max tokens, top p, and response format
  • release_tag, such as checkout-agent-2026-06-06

This makes rollbacks faster. If a release increases parse errors from 1.2% to 8.9%, you can compare prompt versions instead of searching through code commits and deployment logs.
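
With `prompt_version` stored on every call, the comparison described above reduces to a short query. The sketch below assumes records shaped like the call log rows, with a `json_parse_error` status value as in the example table; the function name and the sample counts are hypothetical.

```python
from collections import Counter


def parse_error_rates(records, prompt_name):
    """Parse error rate per version of one prompt."""
    totals, errors = Counter(), Counter()
    for r in records:
        if r["prompt_name"] != prompt_name:
            continue
        totals[r["prompt_version"]] += 1
        if r["status"] == "json_parse_error":
            errors[r["prompt_version"]] += 1
    return {version: errors[version] / totals[version] for version in totals}


def _rows(version, status, count):
    # Helper that builds made-up records for the example.
    return [
        {"prompt_name": "extract_fields", "prompt_version": version,
         "status": status}
        for _ in range(count)
    ]


records = (
    _rows("v07", "success", 9) + _rows("v07", "json_parse_error", 1)
    + _rows("v08", "success", 7) + _rows("v08", "json_parse_error", 3)
)
print(parse_error_rates(records, "extract_fields"))
```

If the newer version shows a clearly higher rate, rolling back to the previous prompt version is a faster first response than bisecting code commits.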

Trace agent workflows step by step

Agent workflows need trace-level tracking. A final answer may look wrong because the planner picked the wrong tool, the retriever returned stale documents, the model produced malformed JSON, or the tool call failed and a fallback hid the error.

Use a single trace_id for the full run and a span_id for each step. Each step should record its parent span, inputs, outputs, status, latency, cost, and related prompt version.

This is especially important for plan-and-execute agents, where the plan, each action, and the final synthesis can fail independently.

Example trace structure for an agent run

| Trace ID | Span ID | Parent | Step | Prompt Version | Status | Cost |
|---|---|---|---|---|---|---|
| trc_agent_118 | spn_001 | | plan | planner_v12 | success | $0.014 |
| trc_agent_118 | spn_002 | spn_001 | retrieve_contract | | success | $0.002 |
| trc_agent_118 | spn_003 | spn_001 | extract_terms | extractor_v08 | json_parse_error | $0.021 |
| trc_agent_118 | spn_004 | spn_003 | retry_extract_terms | extractor_v08 | success | $0.020 |
| trc_agent_118 | spn_005 | spn_001 | final_answer | answer_v05 | success | $0.011 |

Do not let retries disappear from your logs. Retries often hide cost and quality problems. Track both the failed attempt and the successful retry.
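
To show how span records like the ones in the example trace can be rolled up, here is a minimal sketch that totals cost, isolates the cost of failed attempts, and rebuilds the parent-to-child structure. The dictionary shape and the function name `summarize_trace` are assumptions; the values mirror the example trace.

```python
from collections import defaultdict

# Spans from the example trace; parent is None for the root span.
spans = [
    {"span_id": "spn_001", "parent": None, "step": "plan",
     "status": "success", "cost": 0.014},
    {"span_id": "spn_002", "parent": "spn_001", "step": "retrieve_contract",
     "status": "success", "cost": 0.002},
    {"span_id": "spn_003", "parent": "spn_001", "step": "extract_terms",
     "status": "json_parse_error", "cost": 0.021},
    {"span_id": "spn_004", "parent": "spn_003", "step": "retry_extract_terms",
     "status": "success", "cost": 0.020},
    {"span_id": "spn_005", "parent": "spn_001", "step": "final_answer",
     "status": "success", "cost": 0.011},
]


def summarize_trace(spans):
    """Total cost, cost of failed spans, and the parent-to-children map."""
    failed = [s for s in spans if s["status"] != "success"]
    children = defaultdict(list)
    for s in spans:
        children[s["parent"]].append(s["span_id"])
    return {
        "total_cost": round(sum(s["cost"] for s in spans), 3),
        "failed_cost": round(sum(s["cost"] for s in failed), 3),
        "failed_spans": [s["span_id"] for s in failed],
        "children": dict(children),
    }


print(summarize_trace(spans))
```

Because the failed attempt and its retry are both recorded, the summary shows that the failed extraction accounts for a sizable part of the run's cost.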

Measure quality with evals and review queues

Quality tracking should combine automated evaluation, user feedback, and targeted review. No single metric covers every failure mode.

Common quality signals include:

  • Binary pass or fail: Did the response meet the task requirement?
  • Rubric score: Rate correctness, completeness, tone, citation quality, and formatting.
  • Schema validity: Did the output parse and match the required contract?
  • Tool success: Did the agent call the correct tool with valid arguments?
  • User feedback: Thumbs up, thumbs down, edits, regenerated responses, or support escalations.
  • Regression status: Did a new prompt or model perform worse on a fixed dataset?

Use LLM evaluation to test prompt and model changes before release. For subjective tasks, an LLM-as-a-judge workflow can help score outputs against a rubric, as long as you audit the judge and keep examples of bad judgments.

Set up a review process for collected data. Logging thousands of failures without reviewing them creates storage cost and false confidence. A practical review loop looks like this:

  1. Sample 50 to 100 production traces per high-volume feature each week.
  2. Review all high-cost outliers, parse failures, and user-downvoted responses.
  3. Tag failure causes, such as retrieval miss, prompt ambiguity, wrong tool, stale context, or unsafe response.
  4. Add representative failures to an eval dataset.
  5. Test prompt, retrieval, and model changes against that dataset before release.
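
Steps 1 and 2 of the loop above can be sketched as a function that always includes the must-review traces and then adds a random sample of the rest. The trace fields (`cost`, `status`, `feedback`), the $0.10 outlier threshold, and the function name are assumptions chosen for the example.

```python
import random


def build_review_queue(traces, sample_size=50, cost_threshold=0.10, seed=None):
    """Return all flagged traces plus a random sample of the remainder."""
    must_review = [
        t for t in traces
        if t["cost"] >= cost_threshold
        or t["status"] == "parse_failure"
        or t["feedback"] == "down"
    ]
    flagged_ids = {t["trace_id"] for t in must_review}
    rest = [t for t in traces if t["trace_id"] not in flagged_ids]
    rng = random.Random(seed)  # a seed makes the weekly sample reproducible
    sampled = rng.sample(rest, min(sample_size, len(rest)))
    return must_review + sampled


# Made-up week of traces with three that must be reviewed.
traces = [
    {"trace_id": f"trc_{i}", "cost": 0.01, "status": "success", "feedback": None}
    for i in range(200)
]
traces[0]["cost"] = 0.50
traces[1]["status"] = "parse_failure"
traces[2]["feedback"] = "down"

queue = build_review_queue(traces, sample_size=50, seed=7)
print(len(queue))
```

Sampling from the non-flagged traces matters as much as reviewing the flagged ones, since it is the only way to find failures that no automated signal caught.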

Track failed requests as first-class records

Many teams log successful responses and miss failed requests. This creates a biased picture of the system. Failed calls often contain the most useful debugging data.

Track failures such as:

  • Provider 429 rate limits
  • Provider 500 errors
  • Timeouts
  • Client-side cancellation
  • Malformed JSON
  • Schema validation errors
  • Tool call failures
  • Empty responses
  • Safety refusals
  • Context length errors

Include partial data when a request fails. For example, you can still store the prompt version, model, feature, input token estimate, trace ID, latency before failure, and error code.
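
One possible way to make sure failed calls still produce a record is to wrap the model call and write the record in a `finally` block. In this sketch, `tracked_call`, the `sink` list, and the status values are hypothetical, and the built-in `TimeoutError` stands in for whatever exception types your provider SDK actually raises.

```python
import time


def tracked_call(call_fn, *, trace_id, feature, prompt_version, model,
                 input_token_estimate, sink):
    """Run call_fn and append a record to sink whether it succeeds or fails."""
    start = time.monotonic()
    record = {
        "trace_id": trace_id,
        "feature": feature,
        "prompt_version": prompt_version,
        "model": model,
        "input_token_estimate": input_token_estimate,
        "status": "success",
        "error_code": None,
    }
    try:
        return call_fn()
    except TimeoutError:
        record["status"] = "timeout"
        record["error_code"] = "timeout"
        raise
    except Exception as exc:
        record["status"] = "error"
        record["error_code"] = type(exc).__name__
        raise
    finally:
        # Runs on success and failure, so latency before failure is kept.
        record["latency_ms"] = int((time.monotonic() - start) * 1000)
        sink.append(record)


def slow_provider():
    # Stand-in for a model call that times out.
    raise TimeoutError("provider did not respond")


log = []
try:
    tracked_call(slow_provider, trace_id="trc_9f45", feature="support_reply",
                 prompt_version="v18", model="gpt-4.1-mini",
                 input_token_estimate=1200, sink=log)
except TimeoutError:
    pass
print(log[0]["status"], log[0]["error_code"])
```

The wrapper re-raises the exception, so existing retry and fallback logic keeps working while the failed attempt stays visible in the log.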

Set alerts for cost, latency, error rate, and quality drops

Alerts should catch real problems without paging your team for normal variation. Start with thresholds that map to user impact or budget impact.

Example LLM alert configuration

| Alert | Condition | Window | Severity | Owner | First Check |
|---|---|---|---|---|---|
| Cost spike | Spend is 2x higher than same hour average | 60 minutes | High | AI platform | Top features, retries, long contexts |
| Parse errors | JSON parse error rate exceeds 5% | 30 minutes | High | Feature owner | Prompt version, response format, model change |
| Latency | p95 latency exceeds 8 seconds | 15 minutes | Medium | Backend | Provider status, tool latency, token count |
| Quality regression | Eval pass rate drops below 92% | Per release | Blocker | Prompt owner | Failed eval cases and recent prompt diff |
| Missing traces | More than 1% of requests lack trace ID | 24 hours | Medium | AI platform | SDK instrumentation and async jobs |

For cost alerts, compare against expected traffic. A 2x spike during a product launch may be healthy. A 2x spike at 3 a.m. caused by retry loops needs immediate attention.
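
Two of the alert conditions in the example configuration can be sketched as plain functions. The 2x factor and the 5% threshold come from that configuration; the function names and the `min_requests` guard (added so that a few errors in a tiny sample do not page anyone) are assumptions.

```python
def cost_spike(current_hour_spend, same_hour_history, factor=2.0):
    """True when spend exceeds `factor` times the same-hour average."""
    if not same_hour_history:
        return False  # no baseline yet, so do not alert
    baseline = sum(same_hour_history) / len(same_hour_history)
    return baseline > 0 and current_hour_spend > factor * baseline


def rate_exceeds(error_count, total_count, threshold=0.05, min_requests=20):
    """True when the error rate passes the threshold on enough traffic."""
    if total_count < min_requests:
        return False
    return error_count / total_count > threshold


# Made-up numbers: same-hour spend on previous days averaged $11.
print(cost_spike(25.0, [10.0, 12.0, 11.0]))
print(rate_exceeds(error_count=3, total_count=40))
```

Comparing against the same hour on previous days, rather than a flat daily budget, is one simple way to account for expected traffic patterns.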

Connect tracking to release gates

Tracking becomes more valuable when it affects releases. Add gates for prompt and model changes, especially on workflows that write data, take actions, or answer customers directly.

A practical release checklist:

  • Run the new prompt against a fixed eval dataset.
  • Compare pass rate, average cost, p95 latency, and parse error rate against the current production version.
  • Review at least 20 failed or borderline examples.
  • Canary the new version to 5% to 10% of traffic.
  • Watch cost, error rate, and quality for at least one business cycle.
  • Keep rollback ready by retaining the previous prompt version.
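
The comparison step in the checklist above could be automated with a gate function along these lines. The 92% pass rate and 5% parse error limits echo the alert thresholds used earlier; the cost and latency tolerances, the metric names, and the function name `release_gate` are hypothetical.

```python
def release_gate(baseline, candidate, *, min_pass_rate=0.92,
                 max_cost_increase=0.10, max_p95_increase=0.15,
                 max_parse_error_rate=0.05):
    """Return a list of blocking reasons; an empty list means the gate passes."""
    reasons = []
    if candidate["pass_rate"] < min_pass_rate:
        reasons.append("pass rate below minimum")
    if candidate["pass_rate"] < baseline["pass_rate"]:
        reasons.append("pass rate regressed against production")
    if candidate["avg_cost"] > baseline["avg_cost"] * (1 + max_cost_increase):
        reasons.append("average cost increased too much")
    if candidate["p95_latency_s"] > baseline["p95_latency_s"] * (1 + max_p95_increase):
        reasons.append("p95 latency increased too much")
    if candidate["parse_error_rate"] > max_parse_error_rate:
        reasons.append("parse error rate above limit")
    return reasons


# Made-up eval results for the production prompt and two candidates.
production = {"pass_rate": 0.94, "avg_cost": 0.010,
              "p95_latency_s": 4.0, "parse_error_rate": 0.012}
good = {"pass_rate": 0.95, "avg_cost": 0.0105,
        "p95_latency_s": 4.2, "parse_error_rate": 0.010}
bad = {"pass_rate": 0.90, "avg_cost": 0.020,
       "p95_latency_s": 4.0, "parse_error_rate": 0.089}

print(release_gate(production, good))
print(release_gate(production, bad))
```

Returning reasons instead of a single boolean gives the prompt owner a concrete starting point when a release is blocked.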

If you use prompt chains or compiler-style systems, connect each generated or selected prompt back to the run that created it. An LLM compiler can make workflows more dynamic, but you still need versioned artifacts and traceable execution.

Common mistakes to avoid

Logging sensitive data without a policy

Raw prompts and outputs may contain customer data, secrets, legal text, health details, or internal business information. Redact or hash sensitive fields before storage. Restrict access. Define retention windows. For example, keep full redacted traces for 30 days and metadata for 180 days, and add traces to eval datasets only after review.

Tracking only aggregate metrics

Aggregate metrics hide the examples your team needs to debug. Keep request-level logs and link every chart back to the underlying traces.

Failing to version prompts

Prompt edits can change cost and quality as much as model changes. Treat prompts as versioned production assets. Store the prompt version on every call.

Missing failed requests

If you log only successful calls, your quality numbers will look better than reality. Record failed calls, partial responses, provider errors, timeouts, retries, and parse failures.

Not linking traces across agent steps

Agent failures often happen before the final response. Link planner steps, retrieval, tool calls, retries, and final answers under one trace ID.

Collecting data without review

Data does not improve your system by itself. Assign owners, review failed examples, label causes, add important cases to eval datasets, and track whether fixes work.

A simple implementation plan

If your team is starting from basic logs, use this rollout plan:

  1. Week 1: Add request-level logging for model, prompt version, tokens, cost, latency, status, and trace ID.
  2. Week 2: Add metadata for feature, account, environment, workflow step, and model parameters.
  3. Week 3: Build dashboards for cost by feature, error rate by prompt version, and p95 latency by model.
  4. Week 4: Add eval scores, user feedback, and review queues for failed or low-quality examples.
  5. Week 5: Add alerts and release gates for prompt and model changes.

You do not need a perfect tracking system on day one. You do need consistent IDs, prompt versions, cost fields, failure records, and a review loop. Those pieces make every later improvement easier.

What good LLM tracking gives your team

A mature tracking setup lets you answer production questions quickly:

  • Which prompt version caused the regression?
  • Which customer, feature, or workflow caused the cost spike?
  • Are retries hiding provider instability?
  • Did the model migration improve quality enough to justify the cost?
  • Which failed examples should become eval cases?
  • Where should the team optimize context length, retrieval, or tool calls?

The goal is simple: make LLM behavior measurable at the level where engineers can act. Track the call, connect it to the trace, attach cost and quality, and review the examples that matter.


PromptLayer helps AI teams track prompts, versions, LLM requests, traces, costs, evals, and production behavior in one place. If you are building or shipping LLM applications, you can create a PromptLayer account and start instrumenting your workflows.
