---
layout: default
title: "Langfuse Tutorial - Chapter 5: Analytics & Metrics"
nav_order: 5
has_children: false
parent: Langfuse Tutorial
---
Welcome to Chapter 5: Analytics & Metrics. In this part of the Langfuse tutorial on LLM observability, evaluation, and prompt operations, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Track costs, latency, usage patterns, and ROI of your LLM applications.
Previous: Chapter 4: Evaluation
Once your LLM application is running and traces are flowing into Langfuse, the next question is: how well is it actually performing? Analytics and metrics give you the answers. Langfuse provides a rich dashboard for monitoring costs, latency, error rates, and usage patterns -- all in real time. In this chapter, you will learn how to use those built-in tools, build custom dashboards on top of the data, set up alerts, and run cost optimization strategies that can save you real money.
Before diving in, it helps to understand how analytics data moves through Langfuse:
flowchart LR
A[Your LLM App] -->|Traces, Spans, Generations| B[Langfuse Ingestion API]
B --> C[Event Processing]
C --> D[Storage - Postgres and ClickHouse]
D --> E[Analytics Engine]
E --> F[Built-in Dashboard]
E --> G[API / Export]
G --> H[Custom Dashboards]
G --> I[BI Tools - Tableau, Looker]
G --> J[Alerting Systems]
F --> K[Cost Charts]
F --> L[Latency Charts]
F --> M[Usage Charts]
Every trace, span, and generation that your application sends is ingested, processed, and stored. The analytics engine then aggregates this data and makes it available through the UI dashboard and the API. You can consume the data directly in Langfuse, export it for external analysis, or pipe it into alerting systems.
Monitoring spending is one of the most immediately useful features of Langfuse. Costs are automatically captured whenever your traces include usage metadata:
# Costs are automatically captured from usage metadata
trace = langfuse.trace(name="customer-query", user_id="user_42")
generation = trace.generation(
name="llm-call",
model="gpt-4o-mini",
usage={
"input": 150, # input tokens
"output": 50, # output tokens
"total": 200, # total tokens
"input_cost": 0.0001, # cost in USD for input
"output_cost": 0.0002,# cost in USD for output
"total_cost": 0.0003, # total cost in USD
}
)
If you are using a supported integration (OpenAI SDK wrapper, LangChain callback, etc.), token counts and costs are captured automatically -- you do not need to compute them yourself.
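For example, with the OpenAI drop-in wrapper a traced call looks like a normal OpenAI call. A minimal sketch (the import path follows the v2 Python SDK; check the Langfuse docs for your SDK version):

```python
# Drop-in replacement for the OpenAI client; generations are traced automatically.
from langfuse.openai import OpenAI  # instead of: from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
# No manual usage reporting needed: input/output tokens are recorded and cost is
# derived from Langfuse's model price table.
```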
A common first step is understanding which models are eating up your budget:
from collections import defaultdict
from langfuse import Langfuse
langfuse = Langfuse()
traces = langfuse.get_traces(
limit=1000,
filters={"tags": ["production"]}
)
cost_by_model = defaultdict(float)
for trace in traces:
for observation in trace.observations:
if observation.type == "GENERATION" and observation.usage:
model = observation.model or "unknown"
cost_by_model[model] += observation.usage.get("total_cost", 0)
for model, cost in sorted(cost_by_model.items(), key=lambda x: -x[1]):
print(f"{model}: ${cost:.4f}")This gives you a clear picture of which models to focus on when optimizing.
The Langfuse UI provides a built-in dashboard with the most important metrics at a glance:
- Total Traces: Request volume over time, shown as a time-series chart
- Latency: P50, P90, and P95 response times broken down by span type
- Cost: Daily and weekly spending with breakdowns by model and user
- Token Usage: Input versus output tokens, helping you spot verbose prompts
- Error Rate: Percentage of traces that ended in failure
- User Activity: Active users, session counts, and engagement trends
- Score Distribution: How your evaluation scores are distributed over time
You can filter the dashboard by date range, tags, user segments, and more. This makes it easy to compare different time periods or isolate a specific feature for analysis.
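You can make the same comparison programmatically. A sketch reusing the get_traces pattern from the rest of this chapter to compare this week's spend against last week's (the "lt" bound on date_range is an assumption that follows the chapter's filter style):

```python
from datetime import datetime, timedelta

def cost_between(start, end):
    """Total cost of production traces with timestamps in [start, end)."""
    traces = langfuse.get_traces(
        limit=5000,
        filters={
            "tags": ["production"],
            "date_range": {"gte": start.isoformat(), "lt": end.isoformat()},
        },
    )
    return sum(t.total_cost or 0 for t in traces)

now = datetime.now()
this_week = cost_between(now - timedelta(days=7), now)
last_week = cost_between(now - timedelta(days=14), now - timedelta(days=7))
change = (this_week - last_week) / last_week if last_week else 0
print(f"This week: ${this_week:.2f} ({change:+.1%} vs last week)")
```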
Latency is often the most visible metric to your end users. A slow response can ruin the experience, no matter how accurate the answer is.
Langfuse reports latency using percentiles rather than simple averages, because averages can hide problems. Here is what each percentile tells you:
- P50 (median): Half of all requests complete faster than this. This represents the "typical" user experience.
- P90: 90% of requests complete faster than this. This catches the slower tail that affects a meaningful chunk of users.
- P95: 95% of requests complete faster than this. This is the standard SLA metric for most production systems.
- P99: 99% of requests complete faster than this. This reveals extreme outliers that might indicate infrastructure problems.
For example, if your P50 is 1.2 seconds but your P95 is 8.5 seconds, that tells you most users are happy but a significant minority is having a terrible experience. Investigating those slow requests often reveals issues like cold starts, model congestion, or unnecessarily long prompts.
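To investigate that slow tail directly, you can pull the traces above your P95 and open them in the Langfuse UI. A minimal sketch using the same trace fields this chapter uses elsewhere (latency assumed to be in seconds):

```python
import numpy as np

traces = langfuse.get_traces(limit=500, filters={"tags": ["production"]})
latencies = [(t.id, t.latency) for t in traces if t.latency]

if latencies:
    p95 = np.percentile([lat for _, lat in latencies], 95)
    slow = [(tid, lat) for tid, lat in latencies if lat > p95]
    # The ten slowest traces above P95 -- open these IDs in the UI to inspect spans
    for trace_id, latency in sorted(slow, key=lambda x: -x[1])[:10]:
        print(f"{trace_id}: {latency:.2f}s")
```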
You can break down latency by span type to find bottlenecks:
traces = langfuse.get_traces(
limit=500,
filters={"tags": ["production"]}
)
latencies_by_span = defaultdict(list)
for trace in traces:
for observation in trace.observations:
if observation.start_time and observation.end_time:
duration = (observation.end_time - observation.start_time).total_seconds()
latencies_by_span[observation.name].append(duration)
# Calculate percentiles
import numpy as np
for span_name, durations in latencies_by_span.items():
p50 = np.percentile(durations, 50)
p90 = np.percentile(durations, 90)
p95 = np.percentile(durations, 95)
print(f"{span_name}: P50={p50:.2f}s P90={p90:.2f}s P95={p95:.2f}s")This helps you answer questions like: "Is it the retrieval step or the generation step that is slow?"
While the built-in dashboard covers the basics, you may want custom views tailored to your application. Langfuse's API makes this straightforward.
from collections import defaultdict
from datetime import datetime, timedelta
from langfuse import Langfuse
langfuse = Langfuse()
# Get traces from the last 7 days
one_week_ago = (datetime.now() - timedelta(days=7)).isoformat()
traces = langfuse.get_traces(
limit=2000,
filters={
"tags": ["production"],
"date_range": {"gte": one_week_ago},
}
)
# Calculate daily cost breakdown
daily_costs = defaultdict(float)
for trace in traces:
day = trace.timestamp.strftime("%Y-%m-%d")
daily_costs[day] += trace.total_cost or 0
for day in sorted(daily_costs.keys()):
print(f"{day}: ${daily_costs[day]:.2f}")If you want a quick interactive dashboard, Streamlit works well:
import streamlit as st
import pandas as pd
from langfuse import Langfuse
from collections import defaultdict
from datetime import datetime, timedelta
langfuse = Langfuse()
st.title("LLM Application Dashboard")
# Date range selector
days = st.slider("Days to analyze", 1, 30, 7)
traces = langfuse.get_traces(limit=5000, filters={
"date_range": {"gte": (datetime.now() - timedelta(days=days)).isoformat()}
})
# Cost over time
df = pd.DataFrame([{
"date": t.timestamp.date(),
"cost": t.total_cost or 0,
"latency": t.latency or 0,
} for t in traces])
st.subheader("Daily Cost")
st.line_chart(df.groupby("date")["cost"].sum())
st.subheader("Latency Distribution")
st.bar_chart(df["latency"].describe())
st.subheader("Top Users by Cost")
user_costs = defaultdict(float)
for t in traces:
user_costs[t.user_id or "anonymous"] += t.total_cost or 0
st.table(sorted(user_costs.items(), key=lambda x: -x[1])[:10])
For larger organizations, you may want to export data into a data warehouse and use BI tools like Tableau, Looker, or Metabase. Langfuse supports data export:
# Export traces to a pandas DataFrame for further analysis
import pandas as pd
traces = langfuse.get_traces(limit=10000, filters={"tags": ["production"]})
df = pd.DataFrame([{
"trace_id": t.id,
"user_id": t.user_id,
"timestamp": t.timestamp,
"latency": t.latency,
"cost": t.total_cost,
"model": t.metadata.get("model", "unknown") if t.metadata else "unknown",
"tags": ",".join(t.tags or []),
} for t in traces])
# Export to CSV for BI tools
df.to_csv("langfuse_export.csv", index=False)
# Or push directly to a data warehouse
# df.to_sql("langfuse_traces", engine, if_exists="append")
LLM costs can grow quickly, especially at scale. Here are practical strategies to bring them down, along with concrete examples.
Not every request needs your most expensive model. Route simple queries to cheaper models:
def route_to_model(query: str, complexity_score: float) -> str:
"""Route queries to appropriate models based on complexity."""
if complexity_score < 0.3:
return "gpt-4o-mini" # ~$0.15 per 1M input tokens
elif complexity_score < 0.7:
return "gpt-4o" # ~$2.50 per 1M input tokens
else:
return "gpt-4o" # Use the best model for hard queries
# Track the routing decision in your trace
trace = langfuse.trace(name="routed-query", metadata={
"complexity_score": complexity_score,
"selected_model": model,
"routing_reason": "complexity-based"
})
Shorter prompts cost less. Analyze your token usage to find bloated prompts:
# Find traces with high input-to-output token ratios
traces = langfuse.get_traces(limit=1000, filters={"tags": ["production"]})
bloated_prompts = []
for trace in traces:
for obs in trace.observations:
if obs.type == "GENERATION" and obs.usage:
input_tokens = obs.usage.get("input", 0)
output_tokens = obs.usage.get("output", 0)
if input_tokens > 0 and output_tokens > 0:
ratio = input_tokens / output_tokens
if ratio > 10: # Input is 10x the output -- likely bloated
bloated_prompts.append({
"trace_id": trace.id,
"ratio": ratio,
"input_tokens": input_tokens,
})
print(f"Found {len(bloated_prompts)} traces with bloated prompts")If users frequently ask similar questions, cache the responses:
import hashlib
def get_cached_or_generate(query: str, cache: dict) -> str:
cache_key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
if cache_key in cache:
# Log cache hit in Langfuse
trace = langfuse.trace(name="cached-response", metadata={"cache_hit": True})
return cache[cache_key]
# Generate and cache
response = generate_response(query)
cache[cache_key] = response
trace = langfuse.trace(name="generated-response", metadata={"cache_hit": False})
return response
Set hard limits on token usage to prevent runaway costs:
MAX_TOKENS_PER_USER_PER_DAY = 50000
def check_budget(user_id: str) -> bool:
"""Check if user is within their daily token budget."""
today = datetime.now().strftime("%Y-%m-%d")
traces = langfuse.get_traces(
limit=500,
filters={
"user_id": user_id,
"date_range": {"gte": today},
}
)
total_tokens = sum(t.total_tokens or 0 for t in traces)
return total_tokens < MAX_TOKENS_PER_USER_PER_DAY
Beyond cost tracking, you want to understand the return on your LLM investment. Here are practical formulas you can implement.
# Cost per successful interaction
traces = langfuse.get_traces(limit=1000, filters={"tags": ["production"]})
total_cost = sum(t.total_cost or 0 for t in traces)
successful = sum(1 for t in traces if t.scores and t.scores.get("success", 0) >= 0.8)
total_traces = len(traces)
cost_per_success = total_cost / successful if successful > 0 else float("inf")
success_rate = successful / total_traces if total_traces > 0 else 0
print(f"Total cost: ${total_cost:.2f}")
print(f"Success rate: {success_rate:.1%}")
print(f"Cost per successful interaction: ${cost_per_success:.4f}")If you can quantify the value of each successful interaction (for example, a resolved support ticket saves $15 in human agent time), you can compute ROI:
# ROI calculation
VALUE_PER_SUCCESS = 15.00 # dollars saved per successful AI resolution
total_value = successful * VALUE_PER_SUCCESS
roi = (total_value - total_cost) / total_cost * 100 if total_cost > 0 else 0
print(f"Value generated: ${total_value:.2f}")
print(f"LLM cost: ${total_cost:.2f}")
print(f"Net value: ${total_value - total_cost:.2f}")
print(f"ROI: {roi:.0f}%")Attach business metadata to your traces so you can correlate LLM performance with business outcomes:
trace = langfuse.trace(
name="customer-support",
user_id="user_42",
metadata={
"ticket_id": "12345",
"resolution_time_minutes": 15,
"customer_satisfaction": 4.5,
"was_escalated": False,
"estimated_value_saved": 15.00,
}
)
Over time, you can aggregate this data to show stakeholders exactly how much value the LLM application is delivering.
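A minimal sketch of that aggregation, reading back the metadata fields set in the example above (these field names are the example's own convention, not a built-in Langfuse schema):

```python
traces = langfuse.get_traces(limit=5000, filters={"tags": ["production"]})

total_value_saved = 0.0
escalations = 0
for t in traces:
    meta = t.metadata or {}
    total_value_saved += meta.get("estimated_value_saved", 0)
    escalations += 1 if meta.get("was_escalated") else 0

print(f"Estimated value saved: ${total_value_saved:.2f}")
if traces:
    print(f"Escalation rate: {escalations / len(traces):.1%}")
```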
Setting up alerts ensures that you catch problems before your users do. Here are recommended thresholds for common metrics, along with how to implement them.
| Metric | Warning Threshold | Critical Threshold | Why It Matters |
|---|---|---|---|
| Daily cost | 120% of 7-day average | 200% of 7-day average | Catches unexpected cost spikes |
| P95 latency | > 5 seconds | > 10 seconds | Users abandon slow responses |
| Error rate | > 3% | > 10% | Indicates model or API issues |
| Token usage | > 150% of baseline | > 300% of baseline | May signal prompt injection or loops |
| Success score | < 0.75 average | < 0.5 average | Quality is degrading |
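The cost thresholds above are relative to a rolling seven-day average rather than a fixed dollar amount. A sketch of that comparison, reusing the chapter's get_traces pattern and a send_alert helper like the one in the script below (the 1.2x and 2.0x multipliers mirror the table):

```python
from datetime import datetime, timedelta

def cost_for_day(day):
    """Total production cost for a single calendar day."""
    traces = langfuse.get_traces(
        limit=5000,
        filters={
            "tags": ["production"],
            "date_range": {
                "gte": day.isoformat(),
                "lt": (day + timedelta(days=1)).isoformat(),
            },
        },
    )
    return sum(t.total_cost or 0 for t in traces)

today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
baseline = sum(cost_for_day(today - timedelta(days=i)) for i in range(1, 8)) / 7
todays_cost = cost_for_day(today)

if baseline > 0 and todays_cost > 2.0 * baseline:
    send_alert(f"CRITICAL: daily cost ${todays_cost:.2f} is over 200% of the 7-day average")
elif baseline > 0 and todays_cost > 1.2 * baseline:
    send_alert(f"WARNING: daily cost ${todays_cost:.2f} is over 120% of the 7-day average")
```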
You can build a simple alerting script that runs on a schedule:
import smtplib
from datetime import datetime, timedelta
def check_alerts():
"""Check key metrics and send alerts if thresholds are exceeded."""
yesterday = (datetime.now() - timedelta(days=1)).isoformat()
traces = langfuse.get_traces(
limit=5000,
filters={"date_range": {"gte": yesterday}, "tags": ["production"]}
)
if not traces:
return
# Cost alert
daily_cost = sum(t.total_cost or 0 for t in traces)
if daily_cost > 100: # $100/day threshold
send_alert(f"High daily cost: ${daily_cost:.2f}")
# Latency alert
latencies = [t.latency for t in traces if t.latency]
if latencies:
p95 = sorted(latencies)[int(len(latencies) * 0.95)]
if p95 > 5.0:
send_alert(f"High P95 latency: {p95:.1f}s")
# Error rate alert
errors = sum(1 for t in traces if t.status == "ERROR")
error_rate = errors / len(traces)
if error_rate > 0.05:
send_alert(f"High error rate: {error_rate:.1%}")
def send_alert(message: str):
"""Send alert via email, Slack, PagerDuty, etc."""
print(f"ALERT: {message}")
# In production, integrate with your alerting system:
# - Slack webhook
# - PagerDuty API
# - Email via SMTP
# - Opsgenie, etc.
For quick Slack alerts:
import requests
def send_slack_alert(message: str, webhook_url: str):
requests.post(webhook_url, json={
"text": f":warning: Langfuse Alert: {message}",
"channel": "#llm-alerts",
})
Export metrics for external analysis, archival, or compliance:
# Export to CSV/JSON for archival
import json
traces = langfuse.get_traces(
limit=10000,
filters={"date_range": {"gte": "2024-12-01"}}
)
# JSON export
export_data = [{
"id": t.id,
"timestamp": t.timestamp.isoformat(),
"user_id": t.user_id,
"cost": t.total_cost,
"latency": t.latency,
"status": t.status,
"tags": t.tags,
} for t in traces]
with open("langfuse_export.json", "w") as f:
json.dump(export_data, f, indent=2)
- Set budgets and alerts early. It is much easier to prevent cost overruns than to explain them after the fact.
- Track cost per user and per feature to identify which parts of your application are the most expensive (see the per-tag sketch after this list).
- Compare A/B test variants by cost and quality -- a slightly cheaper model that performs nearly as well may be the better choice.
- Archive old traces to control storage costs. Langfuse supports data retention policies.
- Review latency percentiles weekly. Averages hide problems; percentiles reveal them.
- Automate ROI reporting so stakeholders can see the value of your LLM investment without asking.
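For the per-feature breakdown mentioned above, grouping cost by tag works well when each feature tags its traces. A sketch (the tag names and the "production" environment tag follow this chapter's conventions):

```python
from collections import defaultdict

traces = langfuse.get_traces(limit=5000, filters={"tags": ["production"]})

cost_by_tag = defaultdict(float)
for t in traces:
    for tag in (t.tags or []):
        if tag != "production":          # skip the environment tag itself
            cost_by_tag[tag] += t.total_cost or 0

for tag, cost in sorted(cost_by_tag.items(), key=lambda x: -x[1]):
    print(f"{tag}: ${cost:.2f}")
```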
Analytics and metrics are not just nice-to-have -- they are essential for running an LLM application responsibly. In this chapter, you learned how to track costs across models, analyze latency with percentiles, build custom dashboards, implement cost optimization strategies, calculate ROI with real formulas, and set up alerts to catch problems early. With these tools in place, you can confidently scale your application while keeping costs under control.
Next: Chapter 6: Datasets & Testing -- create test datasets from production traces and run regression tests.
For reference, the root package.json of the Langfuse repository shows how the codebase behind these analytics is built and run locally:
{
"name": "langfuse",
"version": "3.163.0",
"author": "engineering@langfuse.com",
"license": "MIT",
"private": true,
"engines": {
"node": "24"
},
"scripts": {
"agents:check": "node scripts/agents/sync-agent-shims.mjs --check",
"agents:sync": "node scripts/agents/sync-agent-shims.mjs",
"postinstall": "node -e \"const fs = require('node:fs'); const cp = require('node:child_process'); if (!fs.existsSync('scripts/postinstall.sh')) { console.log('Skipping repo postinstall helper: scripts/postinstall.sh is not present in this install context.'); process.exit(0); } cp.execSync('bash scripts/postinstall.sh', { stdio: 'inherit' });\"",
"preinstall": "npx only-allow pnpm",
"infra:dev:up": "docker compose -f ./docker-compose.dev.yml up -d --wait",
"infra:dev:down": "docker compose -f ./docker-compose.dev.yml down",
"infra:dev:prune": "docker compose -f ./docker-compose.dev.yml down -v",
"db:generate": "turbo run db:generate",
"db:migrate": "turbo run db:migrate",
"db:seed": "turbo run db:seed",
"db:seed:examples": "turbo run db:seed:examples",
"nuke": "bash ./scripts/nuke.sh",
"dx": "pnpm i && pnpm run infra:dev:prune && pnpm run infra:dev:up --pull always && pnpm --filter=shared run db:reset:test && pnpm --filter=shared run db:reset && pnpm --filter=shared run ch:reset && pnpm --filter=shared run db:seed:examples && pnpm run dev",
"dx-f": "pnpm i && pnpm run infra:dev:prune && pnpm run infra:dev:up --pull always && pnpm --filter=shared run db:reset:test && pnpm --filter=shared run db:reset -f && SKIP_CONFIRM=1 pnpm --filter=shared run ch:reset && pnpm --filter=shared run db:seed:examples && pnpm run dev",
"dx:skip-infra": "pnpm i && pnpm --filter=shared run db:reset:test && pnpm --filter=shared run db:reset && pnpm --filter=shared run ch:reset && pnpm --filter=shared run db:seed:examples && pnpm run dev",
"build": "turbo run build",
"build:check": "turbo run build:check",
"typecheck": "turbo run typecheck",
"tc": "turbo run typecheck",
"start": "turbo run start",
"dev": "turbo run dev",
"dev:worker": "turbo run dev --filter=worker",
"dev:web": "turbo run dev --filter=web",
"dev:web-webpack": "turbo run dev --filter=web -- --webpack",
"lint": "turbo run lint",This module is important because it defines how Langfuse Tutorial: LLM Observability, Evaluation, and Prompt Operations implements the patterns covered in this chapter.
The development compose file docker-compose.dev-azure.yml is more directly relevant to this chapter: it provisions ClickHouse, the analytical database that backs Langfuse's dashboards and aggregations, alongside Azurite and MinIO for blob storage:
services:
clickhouse:
image: docker.io/clickhouse/clickhouse-server:24.3
user: "101:101"
environment:
CLICKHOUSE_DB: default
CLICKHOUSE_USER: ${CLICKHOUSE_USER:-clickhouse}
CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD:-clickhouse}
volumes:
- langfuse_clickhouse_data:/var/lib/clickhouse
- langfuse_clickhouse_logs:/var/log/clickhouse-server
ports:
- "8123:8123"
- "9000:9000"
healthcheck:
test: wget --no-verbose --tries=1 --spider http://localhost:8123/ping || exit 1
interval: 5s
timeout: 5s
retries: 10
start_period: 1s
depends_on:
- postgres
azurite:
image: mcr.microsoft.com/azure-storage/azurite
command: azurite-blob --blobHost 0.0.0.0
ports:
- "10000:10000"
volumes:
- langfuse_azurite_data:/data
minio:
image: cgr.dev/chainguard/minio
container_name: ${MINIO_CONTAINER_NAME:-langfuse-minio}
entrypoint: sh
Running these services locally gives you the same storage and analytics backends that a self-hosted Langfuse deployment uses for the metrics covered in this chapter.
The production docker-compose.yml wires the same components together for a self-hosted deployment. Note the langfuse-worker service and its ClickHouse settings: the worker processes incoming events and writes them to the stores that the analytics are built on:
# Make sure to update the credential placeholders with your own secrets.
# We mark them with # CHANGEME in the file below.
# In addition, we recommend to restrict inbound traffic on the host to langfuse-web (port 3000) and minio (port 9090) only.
# All other components are bound to localhost (127.0.0.1) to only accept connections from the local machine.
# External connections from other machines will not be able to reach these services directly.
services:
langfuse-worker:
image: docker.io/langfuse/langfuse-worker:3
restart: always
depends_on: &langfuse-depends-on
postgres:
condition: service_healthy
minio:
condition: service_healthy
redis:
condition: service_healthy
clickhouse:
condition: service_healthy
ports:
- 127.0.0.1:3030:3030
environment: &langfuse-worker-env
NEXTAUTH_URL: ${NEXTAUTH_URL:-http://localhost:3000}
DATABASE_URL: ${DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres} # CHANGEME
SALT: ${SALT:-mysalt} # CHANGEME
ENCRYPTION_KEY: ${ENCRYPTION_KEY:-0000000000000000000000000000000000000000000000000000000000000000} # CHANGEME: generate via `openssl rand -hex 32`
TELEMETRY_ENABLED: ${TELEMETRY_ENABLED:-true}
LANGFUSE_ENABLE_EXPERIMENTAL_FEATURES: ${LANGFUSE_ENABLE_EXPERIMENTAL_FEATURES:-false}
CLICKHOUSE_MIGRATION_URL: ${CLICKHOUSE_MIGRATION_URL:-clickhouse://clickhouse:9000}
CLICKHOUSE_URL: ${CLICKHOUSE_URL:-http://clickhouse:8123}
CLICKHOUSE_USER: ${CLICKHOUSE_USER:-clickhouse}
CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD:-clickhouse} # CHANGEME
CLICKHOUSE_CLUSTER_ENABLED: ${CLICKHOUSE_CLUSTER_ENABLED:-false}
LANGFUSE_USE_AZURE_BLOB: ${LANGFUSE_USE_AZURE_BLOB:-false}
LANGFUSE_S3_EVENT_UPLOAD_BUCKET: ${LANGFUSE_S3_EVENT_UPLOAD_BUCKET:-langfuse}
LANGFUSE_S3_EVENT_UPLOAD_REGION: ${LANGFUSE_S3_EVENT_UPLOAD_REGION:-auto}
If you self-host, keeping these storage and ClickHouse settings correct is a prerequisite for everything covered in this chapter -- without a healthy worker and ClickHouse, traces will not be processed into the dashboards.