
How to Set Up Datadog LLM Observability

Jun 02, 2026

Datadog can help your team trace LLM calls, watch latency, track token usage, monitor cost, and connect model behavior to the rest of your application stack. For AI teams shipping agents, prompt chains, RAG systems, or LLM-backed APIs, this gives you a clearer production view than infrastructure metrics alone.

You still need a prompt management and evaluation workflow. Datadog is useful for production telemetry, but it should not replace prompt versioning, experiment tracking, datasets, or LLM evaluation. Treat it as one part of your reliability stack.

What you should capture

A useful LLM observability setup captures more than request counts and server CPU. At minimum, track these fields:

  • Service context: service name, environment, app version, deployment version.
  • LLM context: provider, model name, model version when available, temperature, max tokens, tool choice.
  • Prompt context: prompt template name, prompt version, chain step, agent name, retrieval collection, dataset ID when applicable.
  • Request context: request ID, user tier, organization ID, session ID, route, background job ID.
  • Performance: latency, time to first token, total tokens, input tokens, output tokens, retries, timeout count.
  • Cost: estimated input cost, output cost, total cost, cost by model, cost by customer or workspace.
  • Quality signals: user rating, task success flag, refusal flag, groundedness score, evaluator score, moderation result.
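The cost fields in the list above are usually derived from token counts rather than reported directly. A minimal sketch, assuming a hypothetical price table in USD per million tokens; the model name and prices are placeholders, not real provider rates:

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens. Replace with your provider's rates.
PRICES_PER_MILLION = {
    "example-small-model": {"input": 0.50, "output": 1.50},
}


@dataclass(frozen=True)
class CostEstimate:
    input_cost_usd: float
    output_cost_usd: float

    @property
    def total_cost_usd(self) -> float:
        return self.input_cost_usd + self.output_cost_usd


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> CostEstimate:
    """Estimate request cost from token counts and the price table above."""
    price = PRICES_PER_MILLION[model]  # raises KeyError for unknown models
    return CostEstimate(
        input_cost_usd=input_tokens * price["input"] / 1_000_000,
        output_cost_usd=output_tokens * price["output"] / 1_000_000,
    )
```

Keeping input and output cost separate makes it easier to see whether a spike comes from larger prompts or longer completions.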

Avoid logging raw sensitive prompts or full user messages by default. Store redacted prompt text, hashes, template IDs, and metadata. If your team needs raw traces for debugging, gate access tightly, apply retention limits, and redact secrets before the data leaves your app.
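As a rough illustration of redacting before data leaves your app, the sketch below masks email addresses and simple key/value secrets, keeps only allowlisted metadata fields, and fingerprints text with a hash. The patterns, the field allowlist, and the helper names are assumptions for illustration; a real policy needs review by your security team.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches assignments such as "api_key=abc" or "password: abc".
SECRET_RE = re.compile(r"(?i)\b(api[_-]?key|password|token)\b(\s*[:=]\s*)(\S+)")

# Hypothetical allowlist: only these metadata fields may be sent.
ALLOWED_FIELDS = {"prompt_template", "prompt_version", "route", "customer_tier"}


def scrub_text(text: str, max_chars: int = 500) -> str:
    """Mask emails and simple secrets, then truncate."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SECRET_RE.sub(r"\1\2[SECRET]", text)
    return text[:max_chars]


def safe_metadata(fields: dict) -> dict:
    """Drop every field that is not explicitly allowlisted."""
    return {key: value for key, value in fields.items() if key in ALLOWED_FIELDS}


def text_fingerprint(text: str) -> str:
    """Stable identifier for a prompt without storing its content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

An allowlist fails closed: a new field is dropped until someone decides it is safe to send, which is usually the behavior you want for telemetry.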

Step 1: Create a Datadog service for your LLM application

Start by defining one Datadog service per deployable application, not one service per prompt. For example, use support-agent-api, contract-review-worker, or sales-copilot-backend.

Use consistent environment names:

  • prod
  • staging
  • dev

Do not mix staging and production traces in the same environment. Mixing them is one of the fastest ways to make dashboards noisy and monitors useless.

Screenshot callout: In Datadog, create or confirm your service in the APM Service Catalog. Check that service, env, and version appear on traces before adding LLM-specific spans.

Step 2: Install tracing in your application

The exact setup depends on your language and Datadog account configuration. For Python services using OpenAI-compatible calls, start with Datadog tracing and LLM observability instrumentation.

pip install ddtrace openai tiktoken

Set the required environment variables in your runtime, container, or deployment system:

export DD_SERVICE="support-agent-api"
export DD_ENV="staging"
export DD_VERSION="2026.06.01"
export DD_SITE="datadoghq.com"
export DD_API_KEY="YOUR_DATADOG_API_KEY"

export DD_TRACE_ENABLED="true"
export DD_LLMOBS_ENABLED="1"
export DD_LLMOBS_AGENTLESS_ENABLED="1"
export DD_LLMOBS_ML_APP="support-agent"

If you run a Datadog Agent in your cluster, you may not need agentless mode. Keep one clear path for trace delivery so your team can debug missing spans quickly.
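A small startup check can catch missing or inconsistent settings before you go looking for missing spans. This is a sketch, not part of any Datadog library: the function name is hypothetical, and the rules encode the conventions from this guide (the three environment names, and an API key when agentless mode is on).

```python
REQUIRED_VARS = ("DD_SERVICE", "DD_ENV", "DD_VERSION", "DD_LLMOBS_ML_APP")
ALLOWED_ENVS = {"prod", "staging", "dev"}


def check_datadog_env(env: dict) -> list:
    """Return a list of human-readable configuration problems."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    dd_env = env.get("DD_ENV")
    if dd_env and dd_env not in ALLOWED_ENVS:
        problems.append(f"DD_ENV={dd_env!r} is not one of {sorted(ALLOWED_ENVS)}")

    agentless = env.get("DD_LLMOBS_AGENTLESS_ENABLED", "").lower() in {"1", "true"}
    if agentless and not env.get("DD_API_KEY"):
        problems.append("agentless mode is enabled but DD_API_KEY is not set")
    return problems
```

Call it with `dict(os.environ)` at process start and log or fail on any returned problem, depending on how strict you want deployments to be.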

Step 3: Instrument your LLM calls

Wrap the part of your code that performs the LLM request. Tag every call with the prompt template name, prompt version, model, route, and environment.

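import os
import hashlib
from openai import OpenAI
from ddtrace import patch
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow

# Trace calls made through the OpenAI client.
patch(openai=True)

LLMObs.enable(
    ml_app=os.getenv("DD_LLMOBS_ML_APP", "support-agent")
)

client = OpenAI()

# Keys to strip from structured inputs (not used by the placeholder below).
SENSITIVE_KEYS = {"email", "phone", "ssn", "api_key", "password"}

def redact_text(value: str) -> str:
    if not value:
        return ""
    # Placeholder: this only truncates. Replace it with your real redaction policy.
    return value[:500]

def prompt_hash(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

# Use the decorator from ddtrace.llmobs.decorators. LLMObs.workflow() is the
# inline (context manager) form and cannot be applied as a decorator.
@workflow(name="support_answer_workflow")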
def answer_support_question(question: str, customer_tier: str):
    prompt_template_name = "support_answer_v4"
    prompt_version = "4.2.1"

    system_prompt = "You are a support assistant. Answer using the help center context."
    user_prompt = f"Customer tier: {customer_tier}\nQuestion: {question}"

    LLMObs.annotate(
        input_data={
            "prompt_template": prompt_template_name,
            "prompt_version": prompt_version,
            "question_redacted": redact_text(question),
            "prompt_hash": prompt_hash(system_prompt + user_prompt)
        },
        tags={
            "prompt_template": prompt_template_name,
            "prompt_version": prompt_version,
            "customer_tier": customer_tier,
            "model": "gpt-4o-mini"
        }
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        max_tokens=500,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    answer = response.choices[0].message.content

    LLMObs.annotate(
        output_data={
            "answer_redacted": redact_text(answer)
        },
        tags={
            "finish_reason": response.choices[0].finish_reason
        }
    )

    return answer

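This example annotates the workflow span with truncated text (a placeholder for real redaction) and a prompt hash instead of the full prompt. The automatically instrumented OpenAI span may still record the full request messages and completion, so check what your integration captures before assuming no prompt text reaches Datadog. Your security team may require stronger redaction, field allowlists, or no prompt text at all.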

Common mistake: forgetting prompt and model versions

If you only tag model:gpt-4o-mini, you cannot tell which prompt change caused a regression. Always tag prompt_template, prompt_version, model, and your app version. This makes rollback decisions much faster.
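To see why the version tags matter, compare the same records grouped by model and by prompt version. The sketch below uses a hypothetical in-memory record shape (`model`, `prompt_version`, `error`), not a Datadog API:

```python
from collections import defaultdict


def error_rate_by(records, key):
    """Group records by one tag and return the error rate for each group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        group = record.get(key, "untagged")
        totals[group] += 1
        if record["error"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


records = [
    {"model": "gpt-4o-mini", "prompt_version": "4.2.0", "error": False},
    {"model": "gpt-4o-mini", "prompt_version": "4.2.0", "error": False},
    {"model": "gpt-4o-mini", "prompt_version": "4.2.1", "error": True},
    {"model": "gpt-4o-mini", "prompt_version": "4.2.1", "error": False},
]

by_model = error_rate_by(records, "model")             # one blended rate
by_version = error_rate_by(records, "prompt_version")  # regression is visible
```

Grouped by model, the four records blend into a single 25 percent error rate. Grouped by prompt version, all errors land on 4.2.1, which is the signal you need for a rollback decision.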

Step 4: Capture async jobs, queues, and agent tool calls

Many production LLM calls do not happen inside a normal HTTP request. Agents call tools in loops. Workers process queue jobs. Batch evaluation jobs run in the background. If you miss these paths, Datadog will show only part of the system.

Instrument async functions and workers directly:

import asyncio
from ddtrace.llmobs import LLMObs
# Decorators live in ddtrace.llmobs.decorators. Recent ddtrace releases
# support them on async functions; check the version you have installed.
from ddtrace.llmobs.decorators import task, workflow

@workflow(name="nightly_summary_job")
async def run_nightly_summary_job(workspace_id: str, job_id: str):
    # Tag the workflow span so the job can be found by workspace and job ID.
    LLMObs.annotate(
        tags={
            "workspace_id": workspace_id,
            "job_id": job_id,
            "job_type": "nightly_summary",
            "prompt_version": "summary_v3.0.5"
        }
    )

    documents = await load_documents(workspace_id)

    summary = await summarize_documents(
        documents=documents,
        workspace_id=workspace_id,
        job_id=job_id
    )

    return summary

@task(name="load_documents")
async def load_documents(workspace_id: str):
    # Fetch retrieval context here.
    return ["doc 1", "doc 2"]

@task(name="summarize_documents")
async def summarize_documents(documents, workspace_id: str, job_id: str):
    # Call your LLM provider here.
    return "Summary text"

# Example entry point for a worker or scheduler:
# asyncio.run(run_nightly_summary_job("workspace-123", "job-456"))

For Celery, Sidekiq, BullMQ, Temporal, or custom workers, pass a trace ID, request ID, job ID, and workspace ID into the job payload. Confirm that the trace links the HTTP request, queue publish, worker execution, retrieval calls, tool calls, and LLM calls.
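One way to carry those identifiers is a small envelope around the job body. The sketch below is framework-agnostic and uses hypothetical helper names. It only moves correlation IDs through the queue so the worker can tag its spans; linking spans into a single trace still depends on your tracer's own context propagation.

```python
import json
import uuid

CONTEXT_KEYS = ("trace_id", "request_id", "job_id", "workspace_id")


def build_job_payload(body: dict, *, trace_id: str, request_id: str, workspace_id: str) -> str:
    """Serialize a job with the correlation IDs the worker will need."""
    context = {
        "trace_id": trace_id,
        "request_id": request_id,
        "job_id": str(uuid.uuid4()),
        "workspace_id": workspace_id,
    }
    return json.dumps({"context": context, "body": body})


def read_job_payload(raw: str):
    """Parse a job and fail loudly if any correlation ID is missing."""
    message = json.loads(raw)
    context = message.get("context", {})
    missing = [key for key in CONTEXT_KEYS if not context.get(key)]
    if missing:
        raise ValueError(f"job payload is missing context fields: {missing}")
    return context, message["body"]
```

Rejecting payloads without context is a deliberate choice here: a job that runs untagged is exactly the kind of background LLM call that disappears from dashboards.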

Trace details screenshot callout: Open a Datadog trace and check whether the LLM span sits under the correct parent request or worker job. You should see route, job ID, prompt version, model, token count, latency, and error tags in one trace.

Step 5: Add custom metrics for cost, tokens, and quality

Traces help you debug single requests. Metrics help you monitor system behavior over time. Add custom metrics for fields your team will use in dashboards and alerts.

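import os
from datadog import initialize, statsd

# Requires: pip install datadog
# DogStatsD metrics go to a Datadog Agent over UDP, so configure the Agent
# host and port here. API and app keys are not needed for statsd calls.
initialize(
    statsd_host=os.getenv("DD_AGENT_HOST", "localhost"),
    statsd_port=int(os.getenv("DD_DOGSTATSD_PORT", "8125"))
)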

def record_llm_metrics(
    model: str,
    prompt_version: str,
    input_tokens: int,
    output_tokens: int,
    estimated_cost_usd: float,
    latency_ms: int,
    success: bool
):
    tags = [
        f"model:{model}",
        f"prompt_version:{prompt_version}",
        f"env:{os.getenv('DD_ENV', 'dev')}"
    ]

    statsd.increment("llm.request.count", tags=tags)
    # Distribution metrics support percentile queries such as p95 across all hosts.
    statsd.distribution("llm.input_tokens", input_tokens, tags=tags)
    statsd.distribution("llm.output_tokens", output_tokens, tags=tags)
    statsd.distribution("llm.latency_ms", latency_ms, tags=tags)
    statsd.distribution("llm.cost_usd", estimated_cost_usd, tags=tags)

    if not success:
        statsd.increment("llm.error.count", tags=tags)

Keep tag cardinality under control. Good tags include model, prompt_version, env, route, and customer_tier. Avoid high-cardinality tags like raw user ID, full prompt hash on every metric, or document ID unless you have a specific reason and cost plan.
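A tag allowlist is one simple way to enforce this in code. The sketch below is an assumption about how you might structure it: the helper names are hypothetical, and the allowlist mirrors the "good tags" named above. The second function counts distinct tag combinations, which roughly tracks how many timeseries a metric will create.

```python
ALLOWED_TAG_KEYS = {"model", "prompt_version", "env", "route", "customer_tier"}


def metric_tags(fields: dict) -> list:
    """Build 'key:value' tags from allowlisted fields only, in a stable order."""
    return [
        f"{key}:{fields[key]}"
        for key in sorted(fields)
        if key in ALLOWED_TAG_KEYS and fields[key] is not None
    ]


def distinct_series(events) -> int:
    """Count distinct tag combinations across a batch of events."""
    return len({tuple(metric_tags(event)) for event in events})


events = [
    {"model": "gpt-4o-mini", "env": "prod", "user_id": "u-1"},
    {"model": "gpt-4o-mini", "env": "prod", "user_id": "u-2"},
    {"model": "gpt-4o-mini", "env": "prod", "user_id": "u-3"},
]
```

With `user_id` filtered out, the three events above collapse into a single series. If `user_id` were allowed as a tag, the same traffic would produce one series per user.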

Step 6: Build a Datadog dashboard for LLM health

Create one dashboard for production LLM operations. Keep it focused. A useful first dashboard usually has 8 to 12 widgets.

  • Request volume: total LLM requests by service, route, model, and environment.
  • p50, p95, and p99 latency: split by model and prompt version.
  • Error rate: provider errors, timeouts, content filter blocks, tool failures, and parser failures.
  • Token usage: input tokens and output tokens by model.
  • Estimated cost: cost per hour, cost per workspace, and cost by model.
  • Retry rate: retries by provider and endpoint.
  • Context size: retrieved document count, prompt token size, and truncation count.
  • Quality metrics: user thumbs-up rate, task success rate, evaluator score, or LLM-as-a-judge score.
  • Top failing prompt versions: prompt versions with the highest error rate or lowest quality score.
  • Slow traces table: recent traces where LLM latency is above your p95 threshold.

Dashboard widget screenshot callout: Add a timeseries widget for p95:llm.latency_ms grouped by model and prompt_version. Place it next to a query value widget for hourly estimated cost so latency and spend are visible together.

Separate staging and production dashboards if your staging traffic is noisy. If you prefer one dashboard, add a required template variable for env and default it to prod.

Step 7: Set monitor thresholds that match user impact

Do not alert on every small model fluctuation. Start with a few monitors tied to customer impact, then tune them after one or two weeks of real traffic.

Good starting thresholds

  • Latency: alert when p95 LLM latency is above 8 seconds for 10 minutes on production traffic.
  • Error rate: alert when LLM error rate is above 3 percent for 5 minutes.
  • Timeouts: alert when provider timeout count is above 20 in 10 minutes.
  • Cost spike: alert when estimated LLM cost is 2 times higher than the same hour yesterday, or above a fixed limit such as $200 per hour.
  • Token spike: alert when p95 input tokens increase by 50 percent after a deployment.
  • Quality drop: alert when evaluator pass rate drops below 90 percent for a critical workflow.

Monitor threshold screenshot callout: In Datadog Monitors, create a metric monitor for production p95 latency. Group by model and prompt_version, set warning at 6 seconds, alert at 8 seconds, and require 10 minutes before notifying your on-call channel.

Use warning thresholds for early investigation and alert thresholds for user-facing risk. If every release triggers an alert, your thresholds are too sensitive or your tags are too broad.
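It can help to write the threshold logic down before configuring monitors, so the team agrees on what each state means. The sketch below is a simplified model for reasoning about the starting thresholds above. It is not how Datadog evaluates monitors, and the function names are hypothetical.

```python
def latency_state(p95_seconds_per_minute, warning=6.0, alert=8.0, window=10):
    """Classify latency from per-minute p95 values, newest last."""
    recent = p95_seconds_per_minute[-window:]
    if len(recent) < window:
        return "no_data"
    if all(value > alert for value in recent):
        return "alert"
    if all(value > warning for value in recent):
        return "warning"
    return "ok"


def is_cost_spike(current_hour_usd, same_hour_yesterday_usd,
                  ratio=2.0, fixed_limit_usd=200.0):
    """True if spend exceeds the fixed limit or is more than ratio x yesterday."""
    if current_hour_usd > fixed_limit_usd:
        return True
    # Skip the relative check when there is no baseline to compare against.
    return same_hour_yesterday_usd > 0 and current_hour_usd > ratio * same_hour_yesterday_usd
```

Requiring the whole window to breach a threshold keeps a single slow minute from paging anyone, which matches the intent of the 10-minute requirement in the callout above.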

Step 8: Connect traces to prompt evaluation

Datadog tells you what happened in production. It does not tell you which prompt candidate should ship next, which dataset cases failed, or whether a prompt improved factuality before release. Use a dedicated prompt workflow for that.

A practical setup looks like this:

  1. Use PromptLayer or your prompt platform to version prompts and run evaluations before release.
  2. Deploy a prompt version with a clear ID, such as support_answer_v4.2.1.
  3. Tag every Datadog trace and metric with that prompt version.
  4. Watch production latency, cost, errors, and quality signals in Datadog.
  5. Send failed or low-rated production cases back into your evaluation dataset after redaction.

This keeps production monitoring connected to prompt development. For more complete tracing and prompt workflow coverage, review PromptLayer observability alongside Datadog.
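The last step of that loop, sending failed or low-rated cases back to evaluation, can be sketched as a filter over already-redacted production records. The record fields (`user_rating`, `task_success`, `input_redacted`, `output_redacted`) and the JSONL output are assumptions for illustration; adapt them to whatever your evaluation tooling imports.

```python
import json


def select_eval_candidates(records, min_rating=3):
    """Pick failed or low-rated cases and keep only redacted fields."""
    rows = []
    for record in records:
        rating = record.get("user_rating")
        low_rated = rating is not None and rating < min_rating
        failed = record.get("task_success") is False
        if low_rated or failed:
            rows.append({
                "prompt_version": record["prompt_version"],
                "input": record["input_redacted"],
                "output": record["output_redacted"],
                "reason": "task_failed" if failed else "low_rating",
            })
    return rows


def to_jsonl(rows):
    """Serialize rows as JSON Lines, one case per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)
```

Carrying `prompt_version` into each dataset row lets you later check whether a newer prompt version fixes the cases that failed in production.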

Step 9: Check your setup with a test release

Before you rely on your Datadog setup, run a small test release in staging and then production. Use 10 to 20 known requests that cover your main workflows.

Validation checklist

  • Can you find a single request by request ID?
  • Does the trace include the HTTP request, retrieval step, tool calls, LLM call, and response parser?
  • Do spans include service, env, version, model, and prompt_version?
  • Are staging and production separated?
  • Are raw sensitive prompts excluded or redacted?
  • Do async jobs and background workers appear in traces?
  • Do dashboard widgets show p95 latency, cost, token usage, and errors?
  • Do monitors notify the right channel with enough context?
  • Can you tie a production issue back to a prompt version and deployment?
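Parts of this checklist can be automated against traces from your test release. The sketch below assumes you have exported a trace as a list of span dictionaries with `name` and `tags` keys. That shape and the span names are hypothetical, so map them to whatever your export actually produces.

```python
REQUIRED_TAGS = {"service", "env", "version", "model", "prompt_version"}
EXPECTED_STEPS = {"http_request", "retrieval", "llm_call", "response_parser"}


def check_trace(spans):
    """Return a list of problems found in one exported trace."""
    problems = []
    seen_steps = {span["name"] for span in spans}
    for step in sorted(EXPECTED_STEPS - seen_steps):
        problems.append(f"missing span: {step}")

    for span in spans:
        if span["name"] == "llm_call":
            missing = sorted(REQUIRED_TAGS - set(span.get("tags", {})))
            if missing:
                problems.append(f"llm_call is missing tags: {missing}")
    return problems
```

Running a check like this over the 10 to 20 known requests gives you a repeatable pass/fail result instead of a one-time manual inspection.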

Common mistakes to avoid

Logging raw sensitive prompts

Prompt text often contains customer data, internal documents, credentials, or regulated information. Use redaction, field allowlists, prompt hashes, and short retention windows. Do not send raw prompts to Datadog unless your legal, security, and customer commitments allow it.

Missing background LLM calls

Teams often instrument the API route and forget queue workers, scheduled jobs, retries, and agent tool loops. Trace every path that can call a model.

Failing to tag model and prompt versions

Without version tags, you cannot compare releases. Add prompt version, model, provider, app version, and environment to every span and metric.

Mixing staging and production

Staging traffic has test prompts, fake users, debug settings, and unstable branches. Keep it separate with strict env tags and dashboard filters.

Tracking only infrastructure metrics

CPU, memory, and container restarts matter, but they will not explain a model cost spike, a prompt regression, or a bad retrieval result. Track LLM-specific metrics.

Treating Datadog as a prompt evaluation system

Datadog monitors production behavior. It does not replace evaluation datasets, prompt comparison, release gates, or experiment tracking. Use Datadog for runtime telemetry and use a prompt platform for prompt iteration and evaluation.

Production-ready setup pattern

A strong LLM observability setup usually follows this pattern:

  • Datadog APM: request traces, services, errors, latency, queue jobs, infrastructure context.
  • Datadog LLM observability: model calls, token usage, cost, LLM spans, provider performance.
  • PromptLayer: prompt versions, prompt history, datasets, evaluations, release comparison, prompt traces.
  • Evaluation jobs: regression tests, golden datasets, judge scores, task-specific quality checks.
  • Release gates: block prompt or model changes that fail latency, cost, or quality thresholds.
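A release gate from the pattern above can start as a plain comparison between a baseline and a candidate. This is a sketch under stated assumptions: the metric names and the allowed regression margins are hypothetical defaults, and the 90 percent pass rate reuses the quality threshold suggested earlier.

```python
def release_gate(baseline: dict, candidate: dict,
                 max_latency_increase=0.20,
                 max_cost_increase=0.25,
                 min_pass_rate=0.90):
    """Return the reasons a candidate should be blocked (empty list = pass)."""
    failures = []
    if candidate["p95_latency_s"] > baseline["p95_latency_s"] * (1 + max_latency_increase):
        failures.append("p95 latency regressed")
    if candidate["cost_per_request_usd"] > baseline["cost_per_request_usd"] * (1 + max_cost_increase):
        failures.append("cost per request regressed")
    if candidate["eval_pass_rate"] < min_pass_rate:
        failures.append("evaluation pass rate below minimum")
    return failures
```

Returning reasons rather than a boolean makes the gate's output useful in a CI log or a release review.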

If your team is building agents or prompt chains, add step-level tracing. Each retrieval call, tool call, model call, and parser step should have its own span. This makes failures easier to isolate when an agent loops, selects the wrong tool, or fills the context window with low-value documents.
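Once each step has its own span, simple checks over the step list become possible. As one example, the sketch below flags an agent that keeps calling the same tool with the same arguments. The step shape (`kind`, `name`, `arguments`) is a hypothetical export format, not a Datadog schema.

```python
from collections import Counter


def repeated_tool_calls(steps, max_repeats=3):
    """Return (tool, arguments) pairs that occur more than max_repeats times."""
    counts = Counter(
        (step["name"], step["arguments"])
        for step in steps
        if step["kind"] == "tool"
    )
    return sorted(call for call, count in counts.items() if count > max_repeats)
```

A non-empty result is a reasonable trigger for a tag such as a loop flag on the trace, so looping runs can be filtered and reviewed together.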


PromptLayer helps AI teams manage prompt versions, run evaluations, trace LLM workflows, and connect production behavior back to prompt development. To start building a stronger LLM engineering workflow, create a PromptLayer account.
