Tracking LLM Analytics in PostHog: A Guide for AI Teams

How to Track LLM Analytics in PostHog

PostHog is useful when you want to connect LLM behavior to product behavior. If your app uses prompts, agents, retrieval, or generated responses, PostHog can help you answer questions like:

Which prompt version leads to higher user completion?
Which model has the worst latency for paying customers?
Do users abandon the flow after a refusal, hallucination report, or tool failure?
Which agent step causes the most retries?
Did the new prompt improve activation, support resolution, or task success?

PostHog should sit next to your LLM logging, tracing, and evaluation system. It is strong at product analytics, funnels, cohorts, feature flags, and user behavior. It is not enough by itself for full LLM observability, where you also need request-level traces, prompt versions, model inputs, outputs, tool calls, evaluator scores, datasets, and debugging workflows.

The best setup is simple: log safe LLM events in PostHog, keep detailed traces in your LLM platform, and connect the two with stable IDs.

Start with the analytics questions

Do not begin by logging every possible LLM field. Start with the decisions your team needs to make.

For example, a support agent team might care about:

Resolution rate by prompt version
Escalation rate after an AI answer
Average model cost per resolved ticket
p95 latency for the first response
User satisfaction after AI-assisted support

A coding assistant team might care about:

Accepted suggestions by prompt version
Retries per task
Tool call failures by agent step
Completion rate for multi-step workflows
Cost per successful task

A document analysis product might care about:

Extraction accuracy by document type
Manual correction rate
Time saved per workflow
Failure rate by model and file size bucket
Customer churn risk after repeated low-confidence outputs

Once you know the questions, define the events and properties that answer them.

Use a small set of LLM events

You do not need 30 event types. A small, consistent event model is easier to query and maintain.

Recommended event names

llm_request_started: Fired before the model call begins.
llm_request_completed: Fired after a successful response.
llm_request_failed: Fired when the provider, tool call, timeout, or validation step fails.
llm_output_rated: Fired when a user gives feedback, such as thumbs up or thumbs down.
llm_task_completed: Fired when the user or agent completes the actual product task.

This distinction matters: a model response is not the same as a successful user outcome. If you only track model completions, you cannot tell whether the output helped the user finish the task.

Track the right properties

Every LLM analytics event should include enough context to segment behavior without exposing sensitive data.

Core properties

trace_id: Shared ID used to connect PostHog events to your LLM trace.
request_id: Unique ID for the model request.
prompt_version_id: Stable prompt version used for the request.
prompt_name: Human-readable prompt name, such as support_answer_v3.
model: Model name, such as gpt-4.1-mini, claude-3-5-sonnet, or gemini-1.5-pro.
provider: OpenAI, Anthropic, Google, Azure OpenAI, or another provider.
environment: production, staging, or development.
latency_ms: End-to-end model latency.
input_tokens: Prompt and context token count.
output_tokens: Completion token count.
estimated_cost_usd: Estimated request cost.
status: success, error, timeout, blocked, or validation_failed.
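
One way to keep these properties consistent is to check them before an event is sent. The sketch below is a minimal example under stated assumptions: validateCoreProperties is a hypothetical helper (not part of the PostHog SDK), and the choice of which fields are required is illustrative. status is only checked when present, because a request that has just started has no status yet.

```javascript
// Hypothetical helper: returns a list of problems found in an event's
// core properties. An empty list means the properties look consistent.
const REQUIRED_CORE_PROPERTIES = [
  "trace_id",
  "request_id",
  "prompt_version_id",
  "prompt_name",
  "model",
  "provider",
  "environment",
];

const ALLOWED_STATUSES = ["success", "error", "timeout", "blocked", "validation_failed"];

const NUMERIC_PROPERTIES = ["latency_ms", "input_tokens", "output_tokens", "estimated_cost_usd"];

function validateCoreProperties(properties) {
  const problems = [];

  // Identifiers and labels must be present and non-empty.
  for (const key of REQUIRED_CORE_PROPERTIES) {
    const value = properties[key];
    if (value === undefined || value === null || value === "") {
      problems.push(`missing ${key}`);
    }
  }

  // status is optional, but when set it must be one of the known values.
  if (properties.status !== undefined && !ALLOWED_STATUSES.includes(properties.status)) {
    problems.push(`unknown status ${properties.status}`);
  }

  // Metrics are optional (a failed request may have no token counts),
  // but when present they must be non-negative finite numbers.
  for (const key of NUMERIC_PROPERTIES) {
    if (key in properties) {
      const value = properties[key];
      if (typeof value !== "number" || !Number.isFinite(value) || value < 0) {
        problems.push(`invalid ${key}`);
      }
    }
  }

  return problems;
}
```

In development you might throw when the list is non-empty; in production it is usually safer to log the problems and still send the event.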

Product outcome properties

task_type: support_reply, code_generation, document_extraction, or another workflow type.
task_completed: Boolean.
user_feedback: positive, negative, neutral, or empty.
escalated_to_human: Boolean for support or review workflows.
retry_count: Number of retries in the same task.
accepted_output: Boolean for generated content, code, or suggestions.
edited_output: Boolean if the user modified the response before using it.

Evaluation properties

If you run automated checks, include their results as event properties. Keep the fields compact and easy to chart.

eval_passed: Boolean.
eval_score: Numeric score, usually 0 to 1 or 0 to 100.
eval_name: Such as support_accuracy_v2 or json_schema_validity.
eval_reason_code: Such as missing_citation, unsafe_advice, or wrong_format.

If your team is still defining scoring methods, start with a clear LLM evaluation process before treating eval scores as product metrics. You can also use LLM-as-a-judge for review tasks, but you should calibrate it against human-reviewed examples before using it to make release decisions.
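
Calibrating a judge against human review can be as simple as measuring how often the two agree on the same examples. The function below is a rough sketch, assuming each reviewed example is reduced to two booleans; the field names human_passed and judge_passed and the judgeAgreement helper are hypothetical.

```javascript
// Compares LLM-judge verdicts with human labels on the same examples.
// Each example is { human_passed: boolean, judge_passed: boolean }.
function judgeAgreement(examples) {
  if (examples.length === 0) {
    throw new Error("need at least one human-reviewed example");
  }

  let agree = 0;
  let falsePass = 0; // judge passed an output that the human failed
  let falseFail = 0; // judge failed an output that the human passed

  for (const { human_passed, judge_passed } of examples) {
    if (human_passed === judge_passed) {
      agree += 1;
    } else if (judge_passed) {
      falsePass += 1;
    } else {
      falseFail += 1;
    }
  }

  return {
    agreement_rate: agree / examples.length,
    false_pass_count: falsePass,
    false_fail_count: falseFail,
  };
}
```

False passes are usually the more dangerous disagreement for release decisions, because they let bad outputs through, so it is worth looking at them separately rather than only at the overall agreement rate.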

Do not log raw prompts or raw outputs by default

One of the easiest mistakes is sending raw prompts, retrieved documents, user messages, or model outputs into product analytics. That can create privacy, security, and compliance problems quickly.

Instead, send safe references and derived fields:

Use prompt_version_id instead of raw prompt text.
Use prompt_hash if you need to detect unexpected prompt changes.
Use document_type instead of document content.
Use input_length_bucket instead of raw user input.
Use output_category instead of raw model output.
Use trace_id to open the full trace in your LLM logging system when an authorized engineer needs to debug.

A safe payload can still answer most analytics questions.

{
  "event": "llm_request_completed",
  "distinct_id": "user_8f3a_hash",
  "properties": {
    "trace_id": "trace_01J7ZP8E9K4VQ2",
    "request_id": "req_01J7ZP8F3N1A6",
    "prompt_name": "support_answer",
    "prompt_version_id": "pv_2026_06_04_003",
    "prompt_hash": "sha256:9f2c...",
    "provider": "openai",
    "model": "gpt-4.1-mini",
    "environment": "production",
    "task_type": "support_reply",
    "latency_ms": 1840,
    "input_tokens": 1284,
    "output_tokens": 312,
    "estimated_cost_usd": 0.0048,
    "status": "success",
    "input_length_bucket": "1k_2k_tokens",
    "output_category": "answer_with_citation",
    "trace_url": "https://your-llm-platform.example/traces/trace_01J7ZP8E9K4VQ2"
  }
}

Notice what is missing: the raw user message, retrieved context, prompt body, and model output. Keep those in a purpose-built trace store with access controls and retention rules.
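
A small allowlist step in the backend can enforce this before anything is sent. The following is a minimal sketch, not a definitive implementation: toSafeProperties, the SAFE_KEYS list, and the bucket cut-offs are assumptions made for illustration. Only the 1k_2k_tokens label comes from the payload above.

```javascript
// Illustrative bucket boundaries. The cut-offs are assumptions; pick ones
// that match the context sizes your product actually sees.
function inputLengthBucket(inputTokens) {
  if (inputTokens < 1000) return "under_1k_tokens";
  if (inputTokens < 2000) return "1k_2k_tokens";
  if (inputTokens < 4000) return "2k_4k_tokens";
  return "over_4k_tokens";
}

// Allowlist of properties that may be sent to product analytics.
// Anything not listed here (raw prompt, user message, output) is dropped.
const SAFE_KEYS = [
  "trace_id",
  "request_id",
  "prompt_name",
  "prompt_version_id",
  "prompt_hash",
  "provider",
  "model",
  "environment",
  "task_type",
  "latency_ms",
  "input_tokens",
  "output_tokens",
  "estimated_cost_usd",
  "status",
  "output_category",
];

function toSafeProperties(record) {
  const safe = {};
  for (const key of SAFE_KEYS) {
    if (record[key] !== undefined) safe[key] = record[key];
  }
  // Derive a coarse bucket instead of forwarding the raw input.
  if (typeof record.input_tokens === "number") {
    safe.input_length_bucket = inputLengthBucket(record.input_tokens);
  }
  return safe;
}
```

An allowlist is used here rather than a denylist so that a newly added raw field is excluded by default instead of leaking until someone remembers to block it.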

Send events from your backend

For most LLM applications, capture LLM events on the server. The backend has the model name, token usage, latency, provider errors, prompt version, and trace ID. The browser usually does not.

Here is a simple Node.js example using PostHog’s server-side SDK pattern:

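import { randomUUID } from "node:crypto";
import { PostHog } from "posthog-node";

// Use the host for your PostHog region or self-hosted instance.
const posthog = new PostHog(process.env.POSTHOG_API_KEY, {
  host: "https://app.posthog.com"
});

// callModel is a placeholder for your own provider wrapper. It is assumed to
// return { answer, usage: { input_tokens, output_tokens }, estimated_cost_usd }.
async function answerSupportQuestion({ userId, question, accountId }) {
  const traceId = randomUUID();
  const requestId = randomUUID();
  const startedAt = Date.now();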

  posthog.capture({
    distinctId: userId,
    event: "llm_request_started",
    properties: {
      trace_id: traceId,
      request_id: requestId,
      account_id: accountId,
      prompt_name: "support_answer",
      prompt_version_id: "pv_2026_06_04_003",
      provider: "openai",
      model: "gpt-4.1-mini",
      environment: process.env.NODE_ENV
    }
  });

  try {
    const result = await callModel({
      question,
      traceId,
      promptVersionId: "pv_2026_06_04_003"
    });

    posthog.capture({
      distinctId: userId,
      event: "llm_request_completed",
      properties: {
        trace_id: traceId,
        request_id: requestId,
        account_id: accountId,
        prompt_name: "support_answer",
        prompt_version_id: "pv_2026_06_04_003",
        provider: "openai",
        model: "gpt-4.1-mini",
        environment: process.env.NODE_ENV,
        task_type: "support_reply",
        latency_ms: Date.now() - startedAt,
        input_tokens: result.usage.input_tokens,
        output_tokens: result.usage.output_tokens,
        estimated_cost_usd: result.estimated_cost_usd,
        status: "success",
        trace_url: `https://your-llm-platform.example/traces/${traceId}`
      }
    });

    return result.answer;
  } catch (error) {
    posthog.capture({
      distinctId: userId,
      event: "llm_request_failed",
      properties: {
        trace_id: traceId,
        request_id: requestId,
        account_id: accountId,
        prompt_name: "support_answer",
        prompt_version_id: "pv_2026_06_04_003",
        provider: "openai",
        model: "gpt-4.1-mini",
        environment: process.env.NODE_ENV,
        latency_ms: Date.now() - startedAt,
        status: "error",
        error_type: error.name,
        error_code: error.code || "unknown"
      }
    });

    throw error;
  }
}

In production, hash or alias user IDs according to your privacy model. Avoid sending emails, names, company secrets, API keys, source code, or raw documents as event properties.
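
The example above repeats the same properties in three capture calls. One possible way to reduce that duplication is a small tracker object that holds the shared properties. This is a sketch only: createRequestTracker is a hypothetical helper, and it assumes nothing about the client beyond a capture({ distinctId, event, properties }) method, which is the call shape used above.

```javascript
// Hypothetical helper: wraps any client that exposes
// capture({ distinctId, event, properties }).
function createRequestTracker(client, distinctId, baseProperties) {
  const startedAt = Date.now();

  // Every event carries the shared base properties plus event-specific ones.
  const send = (event, extra = {}) =>
    client.capture({
      distinctId,
      event,
      properties: { ...baseProperties, ...extra },
    });

  return {
    started() {
      send("llm_request_started");
    },
    // metrics: token counts, cost, and other success-only fields.
    completed(metrics = {}) {
      send("llm_request_completed", {
        ...metrics,
        latency_ms: Date.now() - startedAt,
        status: "success",
      });
    },
    failed(error) {
      send("llm_request_failed", {
        latency_ms: Date.now() - startedAt,
        status: "error",
        error_type: error.name,
        error_code: error.code || "unknown",
      });
    },
  };
}
```

Inside answerSupportQuestion you would create one tracker per request, call started() before the model call, completed(...) after it, and failed(error) in the catch block. Because posthog-node queues events and sends them in batches, short-lived processes such as scripts or serverless handlers should call await posthog.shutdown() before exiting so queued events are flushed.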

Connect LLM requests to user outcomes

The most useful LLM analytics work happens when you connect model behavior to product outcomes.

For example, do not stop at this event:

{
  "event": "llm_request_completed",
  "properties": {
    "prompt_version_id": "pv_2026_06_04_003",
    "model": "gpt-4.1-mini",
    "latency_ms": 1840,
    "status": "success"
  }
}

Add downstream events that tell you whether the response worked:

{
  "event": "support_reply_sent",
  "distinct_id": "user_8f3a_hash",
  "properties": {
    "trace_id": "trace_01J7ZP8E9K4VQ2",
    "prompt_version_id": "pv_2026_06_04_003",
    "ticket_id_hash": "ticket_91ab_hash",
    "edited_output": true,
    "time_to_send_seconds": 74
  }
}

{
  "event": "support_ticket_resolved",
  "distinct_id": "user_8f3a_hash",
  "properties": {
    "trace_id": "trace_01J7ZP8E9K4VQ2",
    "prompt_version_id": "pv_2026_06_04_003",
    "resolved": true,
    "escalated_to_human": false,
    "customer_satisfaction_score": 5
  }
}

Now you can ask better questions:

Which prompt version has the highest ticket resolution rate?
Do longer outputs lead to more edits?
Do slow responses, such as those above the p95 latency, reduce completion?
Does a cheaper model increase escalation rate?
Do users trust answers with citations more than answers without citations?
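
To make the first question concrete, here is a rough sketch of the join that such a chart performs, run locally over exported events. It assumes every event carries trace_id and prompt_version_id as in the payloads above; resolutionRateByPromptVersion is a hypothetical helper, and PostHog's own funnel or query tools would normally do this work for you.

```javascript
// Computes ticket resolution rate per prompt version.
// Requests and outcomes are matched on trace_id.
function resolutionRateByPromptVersion(events) {
  // First pass: collect the traces that ended in a resolved ticket.
  const resolvedTraces = new Set();
  for (const e of events) {
    if (e.event === "support_ticket_resolved" && e.properties.resolved === true) {
      resolvedTraces.add(e.properties.trace_id);
    }
  }

  // Second pass: count completed requests and resolved ones per version.
  const stats = {};
  for (const e of events) {
    if (e.event !== "llm_request_completed") continue;
    const version = e.properties.prompt_version_id;
    stats[version] = stats[version] || { requests: 0, resolved: 0 };
    stats[version].requests += 1;
    if (resolvedTraces.has(e.properties.trace_id)) {
      stats[version].resolved += 1;
    }
  }

  const rates = {};
  for (const [version, s] of Object.entries(stats)) {
    rates[version] = s.resolved / s.requests;
  }
  return rates;
}
```

Without a shared trace_id on both the request event and the outcome event, this join is not possible, which is why the downstream events above repeat it.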

Use prompt version IDs everywhere

Missing prompt version IDs make LLM analytics hard to trust. If you cannot segment by prompt version, you cannot tell whether a behavior change came from the prompt, the model, retrieval, routing, or the product UI.

At minimum, track these fields together:

prompt_name: Stable logical name.
prompt_version_id: Immutable version ID.
model: Exact model used.
temperature: If relevant to the workflow.
retrieval_config_id: If the prompt uses RAG.
agent_config_id: If the request is part of an agent workflow.

If your system composes prompts, tools, and steps into larger workflows, treat the chain or agent configuration as a versioned object too. For more complex execution plans, concepts like an LLM compiler can help teams reason about how prompts and calls are structured before runtime.

Build dashboards that engineers can act on

A dashboard should help you decide what to fix. A chart that shows total LLM calls is usually less useful than a chart that shows which prompt version caused a failure spike.

Dashboard 1: LLM traffic and reliability

Chart: Count of llm_request_completed by day.
Breakdown: model, provider, and prompt_name.
Chart: Error rate, computed as llm_request_failed divided by total LLM requests (llm_request_completed plus llm_request_failed).
Breakdown: error_type, provider, and environment.

Use this dashboard to catch provider incidents, timeout spikes, and accidental traffic changes after releases.

Dashboard 2: Latency and cost

Chart: p50, p95, and p99 latency_ms for llm_request_completed.
Breakdown: model, prompt_version_id, and input_length_bucket.
Chart: Sum of estimated_cost_usd by day.
Chart: Average cost per llm_task_completed.

This helps you see when a prompt change added too much context, when a model swap increased latency, or when an agent loop started making extra calls.
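
For readers who want to see the arithmetic behind these charts, the sketch below computes percentiles and cost per completed task from a list of exported events. It uses the nearest-rank percentile method, which is an assumption; an analytics tool may interpolate and return slightly different values. Both function names are hypothetical.

```javascript
// Nearest-rank percentile: the smallest value such that at least
// p percent of the samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) throw new Error("no values");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Total estimated model cost divided by the number of completed tasks.
// Returns null when no task was completed, to avoid dividing by zero.
function costPerCompletedTask(events) {
  let totalCost = 0;
  let completedTasks = 0;
  for (const e of events) {
    if (e.event === "llm_request_completed") {
      totalCost += e.properties.estimated_cost_usd || 0;
    } else if (e.event === "llm_task_completed") {
      completedTasks += 1;
    }
  }
  return completedTasks === 0 ? null : totalCost / completedTasks;
}
```

Note that cost per completed task counts the cost of every request, including those in tasks that never finished. That is usually what you want: an agent loop that burns calls without completing still shows up as a higher cost per success.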

Dashboard 3: Quality and outcomes

Chart: task_completed rate by prompt_version_id.
Chart: Negative feedback rate from llm_output_rated.
Chart: Escalation rate after AI answer.
Chart: Average eval_score by prompt version.

This dashboard connects model behavior to user outcomes. It is where you check whether a prompt change actually helped.

Dashboard 4: Agent workflow health

Chart: Tool call failure rate by tool_name.
Chart: Retry count by agent_step_name.
Chart: Completion rate by agent_config_id.
Chart: Cost per completed agent task.

For agents, request-level analytics are not enough. You need step-level events so you can see whether failures come from planning, retrieval, tool use, validation, or final response generation.
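
A step-level event might look like the sketch below. The event name llm_agent_step_completed and the agent_step_index property are assumptions introduced for illustration; tool_name, agent_step_name, and agent_config_id follow the chart fields above. The second function shows the calculation behind a tool failure rate chart.

```javascript
// Builds one analytics event per agent step. Event name is hypothetical.
function buildAgentStepEvent({
  traceId,
  agentConfigId,
  stepName,
  stepIndex,
  toolName,
  status,
  retryCount = 0,
  latencyMs,
}) {
  return {
    event: "llm_agent_step_completed",
    properties: {
      trace_id: traceId,
      agent_config_id: agentConfigId,
      agent_step_name: stepName,
      agent_step_index: stepIndex,
      tool_name: toolName ?? null, // null for steps that call no tool
      status,
      retry_count: retryCount,
      latency_ms: latencyMs,
    },
  };
}

// Share of tool-calling steps that did not succeed, grouped by tool_name.
function toolFailureRateByTool(stepEvents) {
  const stats = {};
  for (const { properties: p } of stepEvents) {
    if (!p.tool_name) continue; // skip steps without a tool call
    stats[p.tool_name] = stats[p.tool_name] || { calls: 0, failures: 0 };
    stats[p.tool_name].calls += 1;
    if (p.status !== "success") stats[p.tool_name].failures += 1;
  }
  return Object.fromEntries(
    Object.entries(stats).map(([tool, s]) => [tool, s.failures / s.calls])
  );
}
```

All steps in one agent run share the same trace_id, so a failing step can be opened in the trace system alongside the steps that came before it.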

Add debugging links to every useful event

A common failure mode is creating dashboards that show a spike but give engineers no way to inspect examples. If a chart shows that support_answer version pv_2026_06_04_003 has a 12% negative feedback rate, the next click should take you to real traces.

Add one or more of these fields to your PostHog events:

trace_id
trace_url
request_id
prompt_version_id
dataset_example_id, if the issue came from an evaluation set

A trace-to-analytics workflow should look like this:

User asks a question in your app.
Your backend creates trace_id.
Your LLM logging system records the prompt, inputs, outputs, tool calls, and eval results under that trace_id.
Your backend sends safe event properties to PostHog with the same trace_id.
A PostHog dashboard shows a spike in failures for one prompt version.
An engineer filters by that prompt version, opens a few trace_url values, and inspects the exact failing requests in the LLM trace system.
The team fixes the prompt, retrieval config, tool schema, or model route.
The new version ships behind a feature flag or controlled rollout.

This loop keeps PostHog focused on product analytics while your LLM platform handles the request-level debugging details.

Use PostHog feature flags for prompt rollouts

PostHog feature flags work well for controlled LLM changes. You can route 5% of traffic to a new prompt version, compare outcomes, then increase rollout if the metrics look safe.

For example:

Control: support_answer version pv_2026_05_21_001
Treatment: support_answer version pv_2026_06_04_003
Primary metric: Ticket resolution rate
Guardrail metric: Escalation rate must not increase by more than 2%
Guardrail metric: p95 latency must stay below 4 seconds
Guardrail metric: Average cost per ticket must stay below $0.03
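
These guardrails can be written down as a small check that runs before each rollout increase. The sketch below is one possible reading of the example, not a definitive rule: it treats the 2% escalation limit as an absolute difference of 2 percentage points, which is an assumption, and checkGuardrails and its input shape are hypothetical.

```javascript
// Thresholds mirror the example guardrails. The escalation limit is
// interpreted as percentage points (absolute), which is an assumption.
const GUARDRAILS = {
  maxEscalationIncrease: 0.02,
  maxP95LatencyMs: 4000,
  maxCostPerTicketUsd: 0.03,
};

// control and treatment are summaries such as
// { escalationRate: 0.10, p95LatencyMs: 3500, costPerTicketUsd: 0.02 }.
function checkGuardrails(control, treatment, limits = GUARDRAILS) {
  const violations = [];

  if (treatment.escalationRate - control.escalationRate > limits.maxEscalationIncrease) {
    violations.push("escalation_rate");
  }
  // "Must stay below" means reaching the limit already counts as a violation.
  if (treatment.p95LatencyMs >= limits.maxP95LatencyMs) {
    violations.push("p95_latency");
  }
  if (treatment.costPerTicketUsd >= limits.maxCostPerTicketUsd) {
    violations.push("cost_per_ticket");
  }

  return { safeToExpand: violations.length === 0, violations };
}
```

At 5% of traffic the treatment sample can be small, so a check like this is a screen for obvious regressions rather than a statistical test.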

Make sure your event payload includes both the feature flag variant and the prompt version ID. Flags tell you which experiment the user entered. Prompt version IDs tell you what actually ran.

{
  "event": "llm_task_completed",
  "distinct_id": "user_8f3a_hash",
  "properties": {
    "trace_id": "trace_01J7ZP8E9K4VQ2",
    "task_type": "support_reply",
    "task_completed": true,
    "prompt_name": "support_answer",
    "prompt_version_id": "pv_2026_06_04_003",
    "posthog_flag": "support-answer-prompt-test",
    "posthog_flag_variant": "treatment",
    "model": "gpt-4.1-mini",
    "estimated_cost_usd": 0.0048,
    "latency_ms": 1840
  }
}
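
On the backend, the flag variant has to be translated into the prompt version that actually runs. The sketch below shows one way to do that; the mapping table and resolvePromptRollout are hypothetical, and the variant argument is assumed to come from your feature flag lookup (for example, the SDK's getFeatureFlag call).

```javascript
// Hypothetical mapping from flag variant to the prompt version to run.
const PROMPT_VERSION_BY_VARIANT = {
  control: "pv_2026_05_21_001",
  treatment: "pv_2026_06_04_003",
};

function resolvePromptRollout(flagKey, variant) {
  // When the flag is off, unknown, or failed to load, run the control
  // prompt but record that the user was not in a known variant.
  const known = Object.prototype.hasOwnProperty.call(PROMPT_VERSION_BY_VARIANT, variant);
  return {
    prompt_name: "support_answer",
    prompt_version_id: PROMPT_VERSION_BY_VARIANT[known ? variant : "control"],
    posthog_flag: flagKey,
    posthog_flag_variant: known ? variant : "none",
  };
}
```

Spreading the returned object into every event for the request keeps the two fields in sync: the variant records which arm the user was assigned to, and prompt_version_id records what ran, even when the flag lookup fell back to control.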

Watch out for common mistakes

Logging raw prompts with sensitive data

Raw prompts often contain user input, private documents, internal policies, support ticket text, code, or credentials. Keep raw data in a secure LLM trace system with clear retention and access controls. Send only IDs, hashes, categories, and metrics to PostHog.

Forgetting prompt version IDs

Without prompt version IDs, your dashboard may show that quality dropped last Thursday, but you will struggle to connect the drop to a specific prompt edit. Treat the version ID as required metadata for every LLM event.

Tracking only page events

Page views and button clicks will not explain LLM behavior. Track model requests, agent steps, validation failures, user feedback, and task outcomes.

Measuring model output without measuring user outcome

A response can be well-formed and still fail the user. Track whether the user accepted it, edited it, retried, escalated, converted, resolved the issue, or abandoned the flow.

Building dashboards without debugging links

A chart without trace links slows down incident response. Include trace_id and trace_url in the event properties that matter most.

A practical implementation checklist

Define 3 to 5 product questions your LLM analytics should answer.
Create a small event schema for requests, failures, feedback, and task outcomes.
Generate a trace_id for each LLM workflow.
Send safe server-side events to PostHog.
Include prompt_version_id, model, provider, latency, tokens, cost, and status.
Do not send raw prompts, retrieved context, private user input, or raw outputs to PostHog by default.
Connect LLM events to downstream outcomes like completion, acceptance, escalation, and revenue.
Add trace_url so engineers can debug chart spikes quickly.
Create dashboards for reliability, latency, cost, quality, and agent workflow health.
Use feature flags for prompt and model rollouts.

Final thought

PostHog can give your team a clear view of how LLM behavior affects your product. The important part is the data model. Track safe, structured LLM events. Tie every request to a prompt version. Connect outputs to user outcomes. Add trace links so engineers can move from a chart to a real example.

That combination gives you analytics your product and engineering teams can trust without turning PostHog into a raw prompt log.

PromptLayer helps teams manage prompt versions, trace LLM requests, run evaluations, and connect production behavior back to the prompts and workflows that caused it. If you are building LLM analytics in PostHog and want request-level debugging, prompt management, and evals alongside it, create a PromptLayer account.
