How to Set Up Datadog for LLM Observability

Jun 06, 2026

Datadog can give your AI team a useful production view of LLM traffic, but only if you instrument the parts that make LLM systems different from normal backend services. CPU, memory, request count, and HTTP latency are not enough. You also need prompt version, model name, token usage, retrieval behavior, tool calls, eval results, cost, and failure modes at the request level.

For LLM applications, observability should answer questions like:

  • Which prompt version caused the increase in malformed JSON?
  • Which model has the highest timeout rate for agent tool calls?
  • Are retrieval misses causing lower answer quality?
  • Did the latest prompt release increase cost per successful request?
  • Are safety, correctness, or task-success evals trending down?

This guide shows a practical setup using Datadog APM, traces, logs, metrics, monitors, and LLM-specific metadata. It assumes you are already shipping an LLM app or agent and want production visibility without creating noisy dashboards that nobody acts on.

If you want a definition before implementation, see this overview of LLM observability.

What to Track for LLM Applications

A good Datadog setup starts with a clear telemetry model. For each LLM request, capture enough structured data to reconstruct what happened without storing sensitive user data by default.

Core request metadata

  • request_id: Your application-level request ID.
  • user_id_hash or tenant_id: Use a hashed or internal ID. Avoid raw email addresses or names.
  • environment: production, staging, preview, or development.
  • service: The app or worker that made the LLM call.
  • feature: chat, summarization, support_agent, code_review, extraction, or another product surface.
  • prompt_name: Stable prompt identifier, such as support_triage_v2.
  • prompt_version: The exact version shipped for that request.
  • model: For example, gpt-4.1, claude-sonnet-4, or a hosted open-weight model.
  • provider: OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Vertex AI, or another provider.
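
As a minimal sketch of the user_id_hash field above, a keyed hash can produce a stable identifier without exposing the raw value. The function name, the 16-character truncation, and the idea of loading the key from a secret store are assumptions for illustration, not part of any Datadog API.

```python
import hashlib
import hmac


def hash_user_id(raw_user_id: str, secret_key: bytes) -> str:
    """Return a stable, keyed hash of a user identifier for telemetry.

    A keyed HMAC (rather than a bare SHA-256) makes it harder to reverse
    the hash by guessing common values such as email addresses.
    """
    # Normalize so "Alice@Example.com " and "alice@example.com" match.
    normalized = raw_user_id.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    # Truncate to 16 hex characters (64 bits) to keep the field short.
    return digest[:16]


# Hypothetical usage: in a real service the key would come from a secret store.
user_id_hash = hash_user_id("alice@example.com", secret_key=b"example-only-key")
```

The same input and key always produce the same hash, so you can still group requests by user in traces and logs.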

LLM behavior metrics

  • input_tokens: Tokens sent to the model.
  • output_tokens: Tokens generated by the model.
  • total_tokens: Input plus output tokens.
  • estimated_cost_usd: Cost per LLM request, calculated in your app or enrichment pipeline.
  • llm_latency_ms: Time spent waiting on the model provider.
  • first_token_latency_ms: Useful for streaming chat and agent UX.
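  • finish_reason: Why generation ended, such as stop, length, tool_calls, or content_filter. Values are provider-specific (Anthropic, for example, reports end_turn, max_tokens, and tool_use), so normalize them before grouping.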
  • retry_count: Number of retries before success or failure.
  • cache_hit: true or false, if you use prompt or response caching.
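
The estimated_cost_usd field has to be computed somewhere. One possible sketch, assuming a per-model price table that you maintain yourself, is shown below. The model names and prices are placeholders, not real provider pricing.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1 million tokens. Replace with the
# current prices for the models you actually use.
PRICES_PER_MILLION = {
    "example-small": {"input": Decimal("0.50"), "output": Decimal("1.50")},
    "example-large": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def estimate_cost_usd(model, input_tokens, output_tokens):
    """Estimate the cost of one request, or return None for unknown models."""
    prices = PRICES_PER_MILLION.get(model)
    if prices is None:
        # Returning None instead of 0 keeps unknown models visible as gaps
        # rather than silently understating spend.
        return None
    cost = (
        Decimal(input_tokens) * prices["input"]
        + Decimal(output_tokens) * prices["output"]
    ) / Decimal(1_000_000)
    return float(cost)


# 2000 input tokens and 500 output tokens on the placeholder large model.
print(estimate_cost_usd("example-large", 2000, 500))  # 0.0135
```

Decimal arithmetic avoids accumulating rounding error in the intermediate sum; the result is converted to float only at the end because metric and span values are numeric.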

Quality and eval signals

Production LLM monitoring should include quality signals, not only uptime. Track eval scores, user feedback, task completion, refusal rate, schema validity, and guardrail outcomes. If you only track latency and error rate, you can miss a prompt regression that returns fluent but wrong answers.

  • eval_correctness_score: Numeric score from an offline or online eval.
  • eval_safety_score: Safety or policy compliance score.
  • json_valid: true or false for structured output workflows.
  • tool_success: true or false for agent tool calls.
  • retrieval_hit_count: Number of chunks retrieved.
  • retrieval_top_score: Highest similarity or reranker score.
  • user_rating: Thumbs up, thumbs down, 1 to 5, or an internal scale.

If your team is still designing evals, this guide to LLM evaluation covers the core patterns. For subjective outputs such as support responses or summaries, you may also use LLM-as-a-judge scoring, but you should validate judge behavior against human-labeled examples before relying on it for release gates.

Step 1: Set Up Datadog APM for Your LLM Service

Start with normal APM instrumentation for the service that calls your LLM provider. This gives you request traces, service maps, latency, error rate, and dependency timing.

For a Python service, install Datadog tracing:

pip install ddtrace datadog

Run your service with tracing enabled:

DD_SERVICE=ai-api \
DD_ENV=production \
DD_VERSION=2026.06.01 \
ddtrace-run uvicorn app.main:app --host 0.0.0.0 --port 8000

For containerized deployments, set the same values as environment variables:

DD_SERVICE=ai-api
DD_ENV=production
DD_VERSION=2026.06.01
DD_AGENT_HOST=datadog-agent
DD_TRACE_ENABLED=true
DD_LOGS_INJECTION=true

Use DD_VERSION for your application deploy version. Use a separate prompt_version tag for prompt releases. Mixing the two makes debugging harder because prompt changes can ship independently from code changes.

Suggested screenshot

Add a screenshot of the Datadog APM service page showing the LLM service, p95 latency, error rate, request volume, and recent deploy markers.

Step 2: Create Custom Spans Around LLM Calls

Datadog’s default tracing will show outbound HTTP requests, but your team needs an LLM-specific span with prompt, model, token, cost, retrieval, tool, and eval metadata.

Wrap each provider call with a custom span. Keep prompt text out of span tags unless you have a strict redaction and access control policy. Tags are searchable, which makes them powerful and risky.

from ddtrace import tracer
import time

def call_llm(
    client,
    messages,
    model,
    prompt_name,
    prompt_version,
    feature,
    tenant_id,
):
    # perf_counter() is monotonic, so the measured latency cannot go
    # negative if the system clock is adjusted mid-request.
    start = time.perf_counter()

    with tracer.trace("llm.request", service="ai-api", resource=prompt_name) as span:
        span.set_tag("llm.provider", "openai")
        span.set_tag("llm.model", model)
        span.set_tag("llm.prompt_name", prompt_name)
        span.set_tag("llm.prompt_version", prompt_version)
        span.set_tag("llm.feature", feature)
        span.set_tag("llm.tenant_id", tenant_id)
        span.set_tag("llm.streaming", False)

        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0.2,
            )

            latency_ms = int((time.perf_counter() - start) * 1000)

            usage = response.usage
            input_tokens = usage.prompt_tokens
            output_tokens = usage.completion_tokens
            total_tokens = usage.total_tokens

            span.set_metric("llm.latency_ms", latency_ms)
            span.set_metric("llm.input_tokens", input_tokens)
            span.set_metric("llm.output_tokens", output_tokens)
            span.set_metric("llm.total_tokens", total_tokens)
            span.set_tag("llm.finish_reason", response.choices[0].finish_reason)
            span.set_tag("llm.status", "success")

            return response

        except Exception as exc:
            # Re-raising inside the `with` block lets ddtrace mark the span
            # as an error and attach the exception type, message, and stack,
            # so only the LLM-specific tags are set by hand. Provider error
            # messages can echo request content, so review them for
            # sensitive data.
            span.set_tag("llm.status", "error")
            span.set_tag("llm.error_type", exc.__class__.__name__)
            raise

Use span tags for values you want to filter or group by, such as model, prompt version, feature, provider, and environment. Use span metrics for numeric values, such as latency, tokens, and cost.
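
One way to keep that tag and metric split consistent across call sites is a small helper that routes each field by type. This is a sketch, not part of ddtrace: split_span_fields and apply_to_span are hypothetical names, and the helper only assumes the span object exposes set_tag and set_metric as in the example above.

```python
from numbers import Real


def split_span_fields(fields):
    """Split a flat dict into (tags, metrics) based on value type."""
    tags, metrics = {}, {}
    for key, value in fields.items():
        if value is None:
            continue  # skip missing values instead of tagging "None"
        # bool is a subclass of int in Python, so check it first:
        # flags such as llm.streaming should stay filterable tags.
        if isinstance(value, bool) or not isinstance(value, Real):
            tags[key] = value
        else:
            metrics[key] = value
    return tags, metrics


def apply_to_span(span, fields):
    """Write fields to any span-like object with set_tag and set_metric."""
    tags, metrics = split_span_fields(fields)
    for key, value in tags.items():
        span.set_tag(key, value)
    for key, value in metrics.items():
        span.set_metric(key, value)


tags, metrics = split_span_fields({
    "llm.model": "example-large",
    "llm.streaming": False,
    "llm.input_tokens": 1840,
    "llm.estimated_cost_usd": 0.0098,
    "llm.cache_hit": None,
})
```

With this input, the model name and streaming flag become tags, the token count and cost become metrics, and the missing cache_hit value is dropped.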

Suggested screenshot

Add a screenshot of a Datadog trace view with a custom llm.request span nested under the product API request. Annotate model, prompt version, token count, latency, and status.

Step 3: Add RAG and Agent Spans

Production LLM failures often come from outside the model call: bad retrieval, missing context, invalid tool arguments, slow tools, or weak orchestration. Trace each step as its own span, using names like these:

  • http.request: Incoming user or system request.
  • rag.retrieval: Vector search, keyword search, reranking, or hybrid retrieval.
  • prompt.render: Prompt assembly and context construction.
  • llm.request: Model call.
  • tool.call: Agent tool call, API call, database lookup, or code execution.
  • output.validation: JSON schema validation, policy check, or response parser.
  • eval.score: Online eval, guardrail, or user feedback event.

For RAG systems, add tags that explain retrieval quality:

with tracer.trace("rag.retrieval", service="ai-api") as span:
    span.set_tag("rag.index", "help_center_prod")
    span.set_tag("rag.strategy", "hybrid")
    span.set_metric("rag.query_length", len(query))
    span.set_metric("rag.hit_count", len(chunks))
    span.set_metric("rag.top_score", top_score)
    span.set_metric("rag.latency_ms", retrieval_latency_ms)

For agents, use a separate span per tool call:

with tracer.trace("tool.call", service="ai-api", resource=tool_name) as span:
    span.set_tag("tool.name", tool_name)
    span.set_tag("tool.status", "success")
    span.set_metric("tool.latency_ms", latency_ms)
    # retry_count is numeric, so record it as a metric like the other counts.
    span.set_metric("tool.retry_count", retry_count)

This structure lets you debug specific failure modes. If answer quality drops while model latency stays flat, you can check retrieval hit count, reranker score, prompt version, and eval scores in the same trace.
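
Tool spans are most useful when failures are recorded as reliably as successes. The sketch below wraps a tool invocation in a context manager that sets the status and latency on both paths. record_tool_call and tool.error_type are hypothetical names, and RecordingSpan is only a stand-in so the example runs without a Datadog agent; in a real service you would pass the span returned by tracer.trace.

```python
import time
from contextlib import contextmanager


@contextmanager
def record_tool_call(span, tool_name, retry_count=0):
    """Record status, error type, and latency for one tool call."""
    span.set_tag("tool.name", tool_name)
    span.set_metric("tool.retry_count", retry_count)
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span.set_tag("tool.status", "error")
        span.set_tag("tool.error_type", type(exc).__name__)
        raise  # let the caller and the tracer see the failure
    else:
        span.set_tag("tool.status", "success")
    finally:
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        span.set_metric("tool.latency_ms", elapsed_ms)


class RecordingSpan:
    """Stand-in span that stores tags and metrics in dictionaries."""

    def __init__(self):
        self.tags, self.metrics = {}, {}

    def set_tag(self, key, value):
        self.tags[key] = value

    def set_metric(self, key, value):
        self.metrics[key] = value


ok_span = RecordingSpan()
with record_tool_call(ok_span, "lookup_order"):
    pass  # the real tool call would go here

failed_span = RecordingSpan()
try:
    with record_tool_call(failed_span, "lookup_order", retry_count=2):
        raise TimeoutError("upstream timed out")
except TimeoutError:
    pass
```

After this runs, ok_span carries tool.status "success" and failed_span carries tool.status "error" with tool.error_type "TimeoutError", and both have a latency metric.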

Suggested screenshot

Add an annotated LLM request trace showing rag.retrieval, prompt.render, llm.request, tool.call, and eval.score spans. Add callouts for where quality, latency, and cost data appear.

Step 4: Send Custom Metrics for Cost, Tokens, and Quality

Traces help debug single requests. Metrics help detect system-wide changes. Send custom metrics for values your team reviews every day or uses in alerts.

Use DogStatsD or the Datadog API to emit metrics. Here is a simple DogStatsD example:

from datadog import initialize, statsd

initialize(statsd_host="datadog-agent", statsd_port=8125)

def record_llm_metrics(
    model,
    prompt_name,
    prompt_version,
    feature,
    input_tokens,
    output_tokens,
    latency_ms,
    estimated_cost_usd,
    json_valid,
):
    tags = [
        f"model:{model}",
        f"prompt_name:{prompt_name}",
        f"prompt_version:{prompt_version}",
        f"feature:{feature}",
        "env:production",
    ]

    statsd.increment("llm.requests", tags=tags)
    statsd.histogram("llm.input_tokens", input_tokens, tags=tags)
    statsd.histogram("llm.output_tokens", output_tokens, tags=tags)
    statsd.histogram("llm.latency_ms", latency_ms, tags=tags)
    statsd.distribution("llm.estimated_cost_usd", estimated_cost_usd, tags=tags)

    if not json_valid:
        statsd.increment("llm.json_invalid", tags=tags)

Useful metric names include:

  • llm.requests: Count of LLM calls.
  • llm.errors: Provider errors, validation errors, timeout errors, and policy errors.
  • llm.latency_ms: LLM provider latency.
  • llm.first_token_latency_ms: Streaming response latency.
  • llm.input_tokens: Prompt and context token volume.
  • llm.output_tokens: Completion token volume.
  • llm.estimated_cost_usd: Estimated spend by request, prompt, model, or feature.
  • llm.eval.correctness: Correctness eval score.
  • llm.eval.safety: Safety eval score.
  • llm.json_invalid: Invalid structured output count.
  • llm.tool_failures: Failed agent tool calls.

Be careful with metric tags. High-cardinality tags such as raw user IDs, full prompts, document IDs, or session IDs can create cost and performance problems. Prefer bounded values such as model, provider, feature, prompt name, prompt version, environment, and tenant tier.
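
A simple guard against accidental high-cardinality tags is an allowlist applied before metrics are emitted. The sketch below is one possible shape; the key names mirror the bounded values listed above, and bounded_tags is a hypothetical helper, not a DogStatsD feature.

```python
# Keys that are safe to use as metric tags because their value sets are small.
ALLOWED_TAG_KEYS = frozenset({
    "model",
    "provider",
    "feature",
    "prompt_name",
    "prompt_version",
    "env",
    "tenant_tier",
})


def bounded_tags(fields, allowed=ALLOWED_TAG_KEYS, max_value_len=64):
    """Build a DogStatsD-style tag list, dropping keys outside the allowlist."""
    tags = []
    for key in sorted(fields):
        if key not in allowed:
            continue  # e.g. user_id, session_id, document_id
        # Truncate values so a stray long string cannot become a tag.
        value = str(fields[key])[:max_value_len]
        tags.append(f"{key}:{value}")
    return tags


print(bounded_tags({
    "model": "example-large",
    "feature": "support_agent",
    "user_id": "u-48213",       # dropped: unbounded
    "session_id": "s-9f3a77",   # dropped: unbounded
}))
# ['feature:support_agent', 'model:example-large']
```

Dropped fields can still go on span tags or structured logs, where per-request detail is expected.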

Suggested screenshot

Add a Datadog custom metric dashboard showing request volume, p95 LLM latency, token usage, estimated cost, invalid JSON rate, and eval score trend grouped by prompt version.

Step 5: Connect Logs Without Storing Sensitive Prompt Data

Logging raw prompts and completions is one of the fastest ways to create a data exposure problem. LLM requests often contain customer messages, source code, support tickets, legal text, health data, credentials, or internal plans. Treat prompt and completion text as sensitive by default.

A safer logging pattern is:

  • Store request IDs and trace IDs in logs.
  • Log prompt name and prompt version, not raw prompt text.
  • Log token counts, model, provider, latency, status, and error type.
  • Redact secrets before logs leave your app.
  • Store full prompt and completion payloads only in a controlled system with retention, access control, and redaction.
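
As a rough sketch of the redaction step, the function below masks a few common secret shapes in any free-text field, such as an error message, before it is logged. The patterns are illustrative assumptions and will not catch every secret format; treat them as a starting point alongside a dedicated scanner.

```python
import re

# Illustrative patterns only; extend them for the secret formats you use.
_PATTERNS = [
    # Email addresses.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
     "[REDACTED_EMAIL]"),
    # Key-like strings with a common prefix, e.g. "sk-..." or "token_...".
    (re.compile(r"\b(?:sk|pk|api|key|token)[-_][A-Za-z0-9_-]{16,}\b"),
     "[REDACTED_KEY]"),
    # Bearer tokens in copied HTTP headers.
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]{16,}"),
     "[REDACTED_TOKEN]"),
]


def redact(text):
    """Replace likely secrets and email addresses with placeholders."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("Auth failed for alice@example.com using sk-abcdefghijklmnop1234"))
# Auth failed for [REDACTED_EMAIL] using [REDACTED_KEY]
```

Running redaction inside the application, before the log line is written, means unredacted text never reaches the log pipeline.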

Example structured log:

{
  "event": "llm_request_completed",
  "request_id": "req_9f3a",
  "trace_id": "123456789",
  "feature": "support_agent",
  "prompt_name": "support_triage",
  "prompt_version": "2026-06-01.3",
  "provider": "anthropic",
  "model": "claude-sonnet-4",
  "input_tokens": 1840,
  "output_tokens": 312,
  "latency_ms": 1430,
  "estimated_cost_usd": 0.0098,
  "status": "success"
}

If you need payload-level inspection for debugging, route sensitive LLM data to a purpose-built prompt and trace store with masking and permission controls. Datadog can keep the operational signal while the prompt platform stores the LLM payload history.

Step 6: Build Dashboards Around Decisions

A dashboard should help an engineer decide what to do. If it only shows a wall of charts, it will not help during an incident or release review.

Create separate dashboards for different operating questions.

Production health dashboard

  • LLM request volume by feature.
  • Provider error rate by model.
  • p50, p95, and p99 LLM latency.
  • Timeout rate and retry count.
  • Invalid JSON or parser failure rate.
  • Tool failure rate for agents.

Cost dashboard

  • Estimated cost per hour and per day.
  • Cost by feature.
  • Cost by prompt version.
  • Input and output tokens by model.
  • Average cost per successful request.
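
Average cost per successful request is easy to compute incorrectly. A minimal sketch, assuming each record carries estimated_cost_usd and status as in the structured log example, divides the cost of all requests by the number of successful ones, so failed calls and retries still count toward spend.

```python
def cost_per_successful_request(records):
    """Total spend divided by successful requests, or None if none succeeded."""
    total_cost = sum(r["estimated_cost_usd"] for r in records)
    successes = sum(1 for r in records if r["status"] == "success")
    if successes == 0:
        return None  # undefined; avoid dividing by zero
    return total_cost / successes


records = [
    {"estimated_cost_usd": 0.01, "status": "success"},
    {"estimated_cost_usd": 0.02, "status": "error"},    # still billed
    {"estimated_cost_usd": 0.03, "status": "success"},
]
result = cost_per_successful_request(records)  # about 0.03
```

Averaging only over successful requests would report about 0.02 here and hide the cost of the failed call.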

Quality dashboard

  • Eval score trend by prompt version.
  • User feedback rate and negative feedback rate.
  • Refusal rate.
  • Retrieval hit count and top retrieval score.
  • Task completion rate for agents.

Group charts by prompt_version and model. Many LLM incidents come from a prompt or model change, not a code deploy. If you do not tag prompt versions, you will waste time guessing which change caused the regression.

Suggested screenshot

Add a dashboard screenshot with three rows: health, cost, and quality. Include a filter for prompt version, model, feature, provider, and environment.

Step 7: Create Alerts That Trigger Action

Dashboards are passive. Monitors should tell the team when to act. Avoid alerting on every small variation. Use alerts tied to user impact, cost risk, or quality regression.

Good LLM alerts include the following. Treat the numbers as starting points and tune them against your own baseline:

  • Provider error rate: Alert when errors exceed 3% for 10 minutes on production traffic.
  • p95 LLM latency: Alert when p95 latency is above 8 seconds for 15 minutes for a user-facing feature.
  • Invalid JSON rate: Alert when structured output failures exceed 2% for 10 minutes.
  • Cost spike: Alert when hourly estimated cost is 2x higher than the same hour yesterday.
  • Eval regression: Alert when correctness score drops more than 5 percentage points after a prompt release.
  • Tool failure rate: Alert when agent tool failures exceed 5% for 10 minutes.
  • Retrieval degradation: Alert when average top retrieval score drops below an agreed threshold for a key workflow.
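
Rate-based alerts like the invalid JSON example can be kept in code as monitor definitions. The sketch below builds a payload in the general shape used by Datadog metric monitors (a "query alert" with a query string and a critical threshold). The helper name, the team tag, and the runbook URL are assumptions, and the exact query should be checked in the Datadog monitor editor before use.

```python
def build_rate_monitor(
    name,
    numerator_metric,
    denominator_metric,
    threshold,
    window="last_10m",
    scope="env:production",
    runbook_url="https://example.com/runbooks/llm",  # placeholder URL
):
    """Build a monitor definition that alerts when a ratio exceeds a threshold."""
    query = (
        f"sum({window}):"
        f"sum:{numerator_metric}{{{scope}}}.as_count() / "
        f"sum:{denominator_metric}{{{scope}}}.as_count() > {threshold}"
    )
    return {
        "name": name,
        "type": "query alert",
        "query": query,
        "message": f"{name} is above {threshold:.0%}. Runbook: {runbook_url}",
        "tags": ["team:ai-platform"],  # hypothetical owner tag
        "options": {
            # The critical threshold should match the value in the query.
            "thresholds": {"critical": threshold},
            "notify_no_data": False,
        },
    }


monitor = build_rate_monitor(
    "Invalid JSON rate", "llm.json_invalid", "llm.requests", threshold=0.02
)
print(monitor["query"])
```

Dividing the invalid count by the request count keeps the alert stable as traffic grows, which a raw count threshold would not.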

Each alert should include a runbook link or clear next step. For example:

  • Check recent prompt versions for the affected feature.
  • Compare model latency across providers or regions.
  • Inspect traces with invalid JSON output.
  • Roll back the prompt version if eval and production metrics both regressed.
  • Disable a failing tool or route to a fallback workflow.

Suggested screenshot

Add a Datadog monitor setup screenshot for invalid JSON rate or eval regression. Show the query, threshold, notification message, and runbook link.

Step 8: Track Prompt Versions as First-Class Production Metadata

Prompt versions should be visible in traces, logs, metrics, dashboards, and alerts. Without prompt versioning, you cannot reliably connect a production issue to the change that caused it.

Use tags like:

llm.prompt_name:support_triage
llm.prompt_version:2026-06-01.3
llm.prompt_environment:production
llm.prompt_variant:treatment_b

This lets you compare versions during experiments and rollouts:

  • Version 2026-06-01.2 has a 1.1% invalid JSON rate.
  • Version 2026-06-01.3 has a 6.8% invalid JSON rate.
  • The regression only appears on claude-sonnet-4 with long support tickets.
  • Rolling back the prompt version fixes the issue without a code deploy.
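
The same comparison can be scripted for a release review. This sketch assumes you have exported request records with prompt_version and json_valid fields; the function names and the 2 percentage point tolerance are illustrative choices.

```python
from collections import defaultdict


def invalid_json_rate_by_version(records):
    """Return {prompt_version: fraction of requests with invalid JSON}."""
    totals = defaultdict(int)
    invalid = defaultdict(int)
    for record in records:
        version = record["prompt_version"]
        totals[version] += 1
        if not record["json_valid"]:
            invalid[version] += 1
    return {version: invalid[version] / totals[version] for version in totals}


def regressed(rates, baseline, candidate, max_increase=0.02):
    """True if the candidate's rate exceeds the baseline by more than max_increase."""
    return rates[candidate] - rates[baseline] > max_increase


# Synthetic data: 1 of 100 invalid on the old version, 7 of 100 on the new one.
records = (
    [{"prompt_version": "2026-06-01.2", "json_valid": i != 0} for i in range(100)]
    + [{"prompt_version": "2026-06-01.3", "json_valid": i >= 7} for i in range(100)]
)
rates = invalid_json_rate_by_version(records)
print(regressed(rates, "2026-06-01.2", "2026-06-01.3"))  # True
```

An absolute tolerance is used here because relative change is noisy when the baseline rate is close to zero.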

If you manage prompts outside your application code, make sure your runtime always passes the resolved prompt version into Datadog. A prompt management platform can help by attaching version metadata to each request automatically. PromptLayer’s LLM observability tooling is built around this request-level connection between prompts, traces, evals, datasets, and production behavior.

Step 9: Add Eval Results to Production Traces

Many teams stop at tracing and metrics. That leaves a major blind spot: quality. A request can be fast, cheap, and technically successful while still being wrong.

Add eval spans or metrics when you run online checks:

with tracer.trace("eval.score", service="ai-api", resource="correctness_check") as span:
    span.set_tag("eval.name", "support_answer_correctness")
    span.set_tag("eval.type", "llm_judge")
    span.set_metric("eval.score", 0.82)
    span.set_tag("eval.pass", True)
    span.set_tag("llm.prompt_version", prompt_version)
    span.set_tag("llm.model", model)

You can run evals at different points:

  • Pre-release: Run test datasets before promoting a prompt version.
  • Shadow mode: Score production-like traffic without affecting users.
  • Online sampling: Score 1% to 10% of production requests, depending on cost and risk.
  • Post-incident: Build a dataset from failed traces and use it for regression testing.
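
For online sampling, a deterministic decision based on the request ID is often easier to debug than a random one, because the same request is always in or out of the sample. A minimal sketch, with should_sample as a hypothetical helper:

```python
import hashlib


def should_sample(request_id, sample_rate):
    """Deterministically select roughly sample_rate of requests for scoring."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    if sample_rate >= 1.0:
        return True
    # Map the request ID to a stable number in [0, 1) using its hash.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Score about 5% of production requests.
if should_sample("req_9f3a", 0.05):
    pass  # run the online eval and attach an eval.score span here
```

Because the decision depends only on the ID, retries of the same request and every service that sees the ID agree on whether it is sampled.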

Do not treat judge scores as perfect truth. Track agreement with labeled examples, watch for drift, and review low-confidence cases. For high-risk workflows, combine automated evals with domain review before changing release gates.
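
Agreement with labeled examples can be tracked with a few lines of code. The sketch below compares judge pass/fail decisions with human labels and also reports how often the judge passes an answer that a human failed, which is usually the costlier mistake for a release gate. The function name and output keys are illustrative.

```python
def judge_agreement(judge_pass, human_pass):
    """Compare judge decisions with human labels on the same examples."""
    if len(judge_pass) != len(human_pass):
        raise ValueError("judge and human label lists must be the same length")
    if not human_pass:
        raise ValueError("need at least one labeled example")

    pairs = list(zip(judge_pass, human_pass))
    agreement = sum(j == h for j, h in pairs) / len(pairs)

    # Among answers a human failed, how many did the judge let through?
    judge_on_human_fails = [j for j, h in pairs if not h]
    if judge_on_human_fails:
        false_pass_rate = sum(judge_on_human_fails) / len(judge_on_human_fails)
    else:
        false_pass_rate = 0.0

    return {"agreement": agreement, "false_pass_rate": false_pass_rate}


result = judge_agreement(
    judge_pass=[True, True, False, True],
    human_pass=[True, False, False, True],
)
print(result)  # {'agreement': 0.75, 'false_pass_rate': 0.5}
```

Recomputing these numbers on a fresh labeled sample after each judge prompt or model change is one way to watch for drift.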

Common Mistakes to Avoid

Logging sensitive prompt data

Raw prompt and completion logs can expose customer data, internal documents, credentials, or source code. Redact aggressively. Store payloads only where you have access control, retention rules, and auditability.

Tracking only infrastructure metrics

CPU, memory, and HTTP latency do not tell you whether an LLM answer was correct. Add prompt, model, token, retrieval, tool, and eval signals.

Failing to tag prompt versions

If prompt versions are missing from Datadog, every prompt regression looks like a generic production issue. Add prompt version to traces, metrics, logs, and alerts.

Ignoring eval quality signals

A 200 response from your API can still contain a bad answer. Track eval scores, user feedback, schema validity, and task completion.

Creating dashboards without actionable alerts

A dashboard that nobody checks during an incident has limited value. Add monitors with thresholds, owners, and runbooks.

Example Datadog Trace for an LLM Request

A useful annotated trace might look like this:

POST /api/support-agent
  ├── auth.check
  ├── rag.retrieval
  │   ├── rag.index: help_center_prod
  │   ├── rag.strategy: hybrid
  │   ├── rag.hit_count: 5
  │   └── rag.top_score: 0.82
  ├── prompt.render
  │   ├── llm.prompt_name: support_triage
  │   ├── llm.prompt_version: 2026-06-01.3
  │   └── prompt.input_tokens_estimate: 1820
  ├── llm.request
  │   ├── llm.provider: anthropic
  │   ├── llm.model: claude-sonnet-4
  │   ├── llm.input_tokens: 1840
  │   ├── llm.output_tokens: 312
  │   ├── llm.latency_ms: 1430
  │   └── llm.estimated_cost_usd: 0.0098
  ├── output.validation
  │   ├── json_valid: true
  │   └── schema_name: support_triage_response
  └── eval.score
      ├── eval.name: support_answer_correctness
      ├── eval.score: 0.82
      └── eval.pass: true

This trace gives an engineer a path to debug the request. If the output is wrong, they can check retrieval quality, prompt version, model behavior, validation, and eval output in one place.

Implementation Checklist

  1. Enable Datadog APM for the service that calls your LLM provider.
  2. Add custom spans for LLM calls, retrieval, prompt rendering, tools, validation, and evals.
  3. Tag every request with prompt name, prompt version, model, provider, feature, and environment.
  4. Send metrics for latency, tokens, cost, errors, invalid output, tool failures, and eval scores.
  5. Connect structured logs using request IDs and trace IDs.
  6. Redact prompt and completion payloads before sending data to general-purpose logs.
  7. Create dashboards for health, cost, and quality.
  8. Add alerts for production impact, quality regression, and cost spikes.
  9. Review high-cardinality tags before they reach production.
  10. Build release workflows that compare prompt versions before and after rollout.

Final Notes

Datadog works well for infrastructure, APM, logs, metrics, and alerting. For LLM systems, you need to add the missing application context: prompt versions, model parameters, retrieval state, tool behavior, token usage, cost, and eval outcomes.

The goal is not to collect every possible field. The goal is to collect the fields that let your team debug failures, detect regressions, control cost, and ship prompt changes with confidence.


Connect Datadog Observability to PromptLayer

PromptLayer helps AI teams manage prompts, track versions, run evals, inspect request history, and connect production behavior back to the prompt changes that caused it. Datadog can monitor the service layer, while PromptLayer gives your team the LLM-specific workflow for prompts, datasets, evaluations, and traces.

If you are building or shipping LLM applications, create a PromptLayer account here: https://dashboard.promptlayer.com/create-account
