Back

How to Monitor LLM Latency Spikes

Jun 05, 2026
How to Monitor LLM Latency Spikes

LLM latency spikes are hard to debug because “the model was slow” is rarely the full answer. A single request can include prompt assembly, retrieval, reranking, tool calls, model generation, JSON parsing, retries, guardrails, and post-processing. If you only track total request duration, you will know users are waiting, but you will not know where the time went.

Good latency monitoring breaks each LLM workflow into measurable spans, tracks model and token behavior, and alerts on the parts of the system that users actually feel. The goal is simple: find slow requests fast enough to fix them before they become a production incident.

Start with the latency numbers that matter

Do not rely on average latency. Averages hide the tail. Your users usually complain about p95 and p99 behavior, not the mean.

Track these metrics for every LLM workflow:

  • p50 latency: the normal experience for a typical request.
  • p95 latency: the slow experience affecting a meaningful share of users.
  • p99 latency: the worst production cases, often where incidents appear first.
  • Time to first token: how long streaming users wait before seeing output.
  • Tokens per second: generation speed after the model starts responding.
  • Total tokens: input tokens plus output tokens, grouped by workflow and model.
  • Retry count: retries often turn a 4-second request into a 20-second request.
  • Timeout count: timeouts show when a system is failing slowly instead of failing clearly.

For example, a support chatbot might have a p50 latency of 2.1 seconds and a p95 latency of 18 seconds. That spread tells you the normal path is fine, but some branch of the workflow is getting stuck. You need trace-level data to find that branch.

Trace the full LLM request path

An LLM request usually contains multiple steps. Each step needs its own span with start time, end time, inputs, outputs, status, and metadata.

A practical trace might include:

  1. User request received
  2. User metadata loaded
  3. Query rewritten for retrieval
  4. Vector search completed
  5. Documents reranked
  6. Prompt assembled
  7. LLM call started
  8. First token received
  9. LLM call completed
  10. JSON parsed
  11. Tool call executed
  12. Final response returned

This structure lets you separate provider latency from your own orchestration latency. If vector search takes 900 ms, reranking takes 1.8 seconds, and the model takes 3 seconds, you have a different problem than a request where the model alone takes 14 seconds.

If your team is still building this instrumentation, start with the basics of LLM observability: logs, traces, metrics, prompt versions, model metadata, and user-facing outcomes connected in one place.

Break latency into useful categories

Label each latency spike by cause. This makes your dashboard useful during an incident and helps you prioritize engineering work after it.

Provider latency

This is time spent waiting on the model provider. Track it by provider, model, region, endpoint, and status code. Also record whether the request used streaming, JSON mode, function calling, vision inputs, or a large context window.

Example fields to log:

  • provider: OpenAI, Anthropic, Google, self-hosted, or another provider
  • model: exact model name, not a vague label like “fast model”
  • region: if applicable
  • input tokens: prompt and context size
  • output tokens: generated response size
  • streaming: true or false
  • request ID: provider request ID when available

Prompt and context latency

Large prompts increase latency and cost. Long retrieved context can make a previously fast workflow slow after your knowledge base grows.

Watch for sudden increases in:

  • System prompt length
  • Number of retrieved documents
  • Total retrieved context tokens
  • Conversation history tokens
  • Tool result tokens

A common production issue: a chat workflow keeps appending conversation history without summarization. The first message takes 2 seconds. After 20 turns, the same workflow takes 11 seconds because every call sends thousands of extra tokens.

Retrieval latency

Retrieval-augmented generation adds several possible failure points. Measure vector search, metadata filtering, reranking, document hydration, and permission checks separately.

If retrieval latency spikes only for enterprise customers, the issue may be tenant-level document volume or access-control filtering. If it spikes only for a specific query type, the issue may be broad semantic search returning too many candidates.

Tool call latency

Agents and tool-using workflows can hide latency inside external systems. A model may respond quickly, then wait 8 seconds for a CRM, database, browser action, or internal API.

Log each tool call with:

  • Tool name
  • Arguments, with sensitive values redacted
  • Duration
  • Status code or error type
  • Retry count
  • Output size

This matters especially for agents. One slow tool can trigger a slow reasoning loop. Three retries inside a tool can look like one slow LLM request unless you trace the workflow properly.

Measure streaming separately from total completion time

For streaming interfaces, total latency is not the only user experience metric. A 12-second streamed answer can feel acceptable if the first token arrives in 700 ms. A 5-second non-streamed answer can feel broken if the UI stays blank.

Track both:

  • Time to first token: request start to first streamed token
  • Time to final token: request start to completed response

If time to first token spikes but tokens per second stays normal, the provider may be queueing the request or your prompt assembly may be slow. If time to first token is stable but total completion time spikes, output length or generation speed is probably the issue.

Group latency by prompt version and model version

Latency often changes after a prompt edit, model switch, or routing update. If your traces do not include prompt version and model version, you will waste time comparing requests that are not running the same workflow.

At minimum, tag each request with:

  • Prompt name
  • Prompt version
  • Model name
  • Temperature and key generation parameters
  • Workflow or chain name
  • Environment, such as staging or production
  • Deployment commit SHA

For prompt chains, record each step separately. If you use an LLM compiler pattern or any system that transforms tasks into multiple model calls, request-level latency alone will hide the slowest stage.

Set alerts on symptoms and causes

Alerting only on total latency creates noisy incidents. Alert on user-facing symptoms, then use cause-level metrics to route the problem.

Useful alert examples:

  • Workflow p95 latency above 10 seconds for 10 minutes: user-facing incident risk.
  • Time to first token above 3 seconds for 5 minutes: streaming experience degraded.
  • Provider p95 latency doubled compared with 1-hour baseline: possible provider issue.
  • Retry rate above 5%: errors are being hidden by retries and increasing latency.
  • Average input tokens increased by 50% after deploy: prompt or context regression.
  • Tool call p95 above 2 seconds: downstream service slowing the agent.

Use different thresholds for different workflows. A background research agent may tolerate 45 seconds. A customer support chat reply may need first token latency under 1.5 seconds and full response latency under 8 seconds.

Build latency dashboards around workflows

Provider dashboards are useful, but your users experience workflows. Build dashboards for the actual paths your application runs.

A strong dashboard includes:

  • Requests per minute by workflow
  • p50, p95, and p99 latency by workflow
  • Time to first token for streaming workflows
  • Latency by model and provider
  • Latency by prompt version
  • Input and output token distributions
  • Slowest traces in the last hour
  • Error rate, retry rate, and timeout rate
  • Tool latency by tool name
  • Retrieval latency by index or corpus

Include filters for tenant, customer plan, region, deployment, and environment. Many latency problems affect one customer segment before they affect everyone.

Keep slow request samples

Metrics tell you that something is slow. Samples show you what happened.

Keep a searchable set of slow traces, especially requests above p95 or p99. For each slow request, store enough context to debug it safely:

  • Prompt version
  • Rendered prompt, with secrets and sensitive data redacted
  • Model response
  • Token counts
  • Retrieval results
  • Tool calls
  • Errors and retries
  • User-visible response time

Review these samples weekly. You may find patterns that do not appear in high-level charts, such as one prompt producing overly long answers or one tool returning oversized payloads.

Connect latency monitoring to quality checks

Faster is not always better if quality drops. If you shorten context, switch models, reduce output tokens, or add caching, test whether the workflow still behaves correctly.

Use LLM evaluation to compare latency improvements against answer quality, task completion, format accuracy, and safety checks. For subjective outputs, an LLM as a judge setup can help score responses at scale, as long as you validate the judge against human-reviewed examples.

For example, you might test three versions of a retrieval prompt:

  • Version A: 8 retrieved documents, p95 latency 9.4 seconds, quality score 92%
  • Version B: 4 retrieved documents, p95 latency 6.1 seconds, quality score 91%
  • Version C: 2 retrieved documents, p95 latency 4.8 seconds, quality score 79%

Version B is probably the best tradeoff. Version C is faster, but it loses too much accuracy.

Use caching carefully

Caching can reduce latency, but it can also hide stale answers or incorrect personalization. Cache stable, repeated work first.

Good cache candidates include:

  • Static system prompt fragments
  • Embeddings for unchanged documents
  • Common retrieval queries
  • Tool results with clear expiration rules
  • Model responses for deterministic internal tasks

Be careful caching user-specific answers, permissioned content, or responses that depend on fresh data. Always include tenant, permissions, prompt version, model version, and relevant inputs in your cache key.

Watch for retry storms

Retries make transient failures less visible, but they can create major latency spikes. A single failed provider call retried twice may triple request time. If many requests retry at once, you can overload your own workers or hit provider rate limits.

Track retries as first-class events. Log the reason, delay, provider, and final outcome. Use exponential backoff with jitter. Set a strict maximum retry count, such as 1 or 2 for user-facing chat, and fail clearly when the system cannot recover.

If your app supports fallback models, record fallback behavior separately. A fallback may save the request, but it can change latency, cost, and output quality.

Create a latency incident checklist

During an incident, teams lose time asking the same questions repeatedly. Use a short checklist.

  1. Which workflow is slow?
  2. Did p95 or p99 change first?
  3. Did time to first token change?
  4. Did input or output tokens increase?
  5. Did a prompt version, model, deployment, or retrieval config change?
  6. Is the spike isolated to one provider, model, region, tenant, or tool?
  7. Did retries, rate limits, or timeouts increase?
  8. Do slow traces share the same step?
  9. Can you route to a faster model, reduce context, disable a slow tool, or increase timeout clarity?
  10. What test will prevent the same regression from shipping again?

Fix common causes of LLM latency spikes

Once you can see the slow step, the fixes are usually specific.

  • Large prompts: summarize conversation history, trim unused instructions, and cap retrieved context.
  • Long outputs: set max token limits, ask for concise answers, and stream responses.
  • Slow retrieval: reduce candidate count, improve filters, tune indexes, and cache common queries.
  • Slow tools: add timeouts, parallelize independent calls, cache stable results, and return partial answers when acceptable.
  • Provider slowness: add fallback routing, monitor by model, and test smaller models for simple tasks.
  • Retry delays: lower retry count for interactive paths and surface clear errors sooner.
  • Agent loops: cap steps, add stop conditions, and record every tool call and model call.

A good production target is not “no latency spikes.” LLM systems depend on external providers, variable outputs, changing inputs, and downstream services. A better target is fast detection, clear attribution, and safe mitigation.

What to instrument first

If you are starting today, instrument these five things first:

  1. Total latency by workflow, with p50, p95, and p99.
  2. Provider call latency, including time to first token and total tokens.
  3. Prompt version, model version, and deployment version on every request.
  4. Separate spans for retrieval, tool calls, and post-processing.
  5. Saved slow traces for requests above p95.

This gives your team enough data to answer the most important production question: is the model slow, is our orchestration slow, or did we ship a workflow change that made requests heavier?


PromptLayer helps AI teams monitor LLM latency, trace prompt chains, compare prompt versions, review slow requests, and connect production behavior to evaluations. If you are building or shipping LLM-powered applications, create a PromptLayer account to start tracking your workflows in production.

The first platform built for prompt engineering