310 changes: 310 additions & 0 deletions data/blogs/tracking-llm-costs-in-production.mdx
@@ -0,0 +1,310 @@
---
title: 'Tracking LLM Costs in Production: Per-Model, Per-Request, Per-User Attribution'
date: '2026-03-31'
lastmod: '2026-03-31'
tags: ['openlit', 'cost-tracking', 'finops', 'llm', 'opentelemetry', 'production']
draft: false
summary: Break down LLM costs by model, service, user, and environment using OpenLIT. Auto-calculated token costs exported as OpenTelemetry metrics to Grafana, Datadog, or any backend.
authors: ['OpenLIT']
images: ['/static/images/llm-cost-tracking-production.png']
---

# Tracking LLM Costs in Production: Per-Model, Per-Request, Per-User Attribution

**TL;DR:** OpenLIT auto-calculates the cost of every LLM call based on model, token count, and a configurable pricing table. Costs are exported as OpenTelemetry metrics, so you can break them down by service, environment, model, or any custom attribute — and send the data to Grafana, Datadog, or wherever you already monitor things.

---

## The Problem: Your LLM Bill Is a Black Box

You get an invoice from OpenAI at the end of the month. It says $4,200. You have questions:

- Which service spent the most?
- Was it the summarization feature or the chatbot?
- Did someone's runaway test loop burn $800 over the weekend?
- Is GPT-4o actually worth the premium over GPT-4o-mini for this use case?

The provider dashboard gives you total tokens and total cost. It doesn't tell you which part of your application consumed what. And if you're using multiple providers (OpenAI + Anthropic + Bedrock), you're reconciling across three different billing dashboards.

You need per-request cost attribution at the application level.

## How OpenLIT Tracks Costs

When you call `openlit.init()`, every LLM request is automatically traced with:

- **Model name** (e.g., `gpt-4o`, `claude-sonnet-4-20250514`)
- **Input tokens** (prompt tokens)
- **Output tokens** (completion tokens)
- **Calculated cost** (based on the model's pricing)

The cost calculation happens inside the SDK using a pricing table. Here's the flow:

```
LLM Call → SDK intercepts → counts tokens → looks up price → emits span + metric
```

### A Minimal Example

```python
import openlit
from openai import OpenAI

openlit.init(
    otlp_endpoint="http://localhost:4318",
    application_name="summarizer",
    environment="production",
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this quarterly report..."}],
)
```

The trace span for this request will include attributes like:

```
gen_ai.usage.input_tokens: 1420
gen_ai.usage.output_tokens: 380
gen_ai.usage.cost: 0.00735
gen_ai.request.model: gpt-4o
gen_ai.system: openai
deployment.environment: production
service.name: summarizer
```

That `gen_ai.usage.cost` value is auto-calculated. You didn't have to look up pricing or do any math.

## How Pricing Works

OpenLIT ships with a built-in pricing table that covers major models from OpenAI, Anthropic, Cohere, Mistral, Google, and others. The table is maintained in a `pricing.json` file that maps model names to per-token costs.

A simplified version looks like this:

```json
{
  "chat": {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015}
  },
  "embeddings": {
    "text-embedding-3-small": 0.00002,
    "text-embedding-ada-002": 0.0001
  },
  "images": {
    "dall-e-3": {
      "standard": {"1024x1024": 0.040}
    }
  }
}
```

Prices are per 1,000 tokens for chat and embedding models, and per image for image models (keyed by quality and size).
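
To make the arithmetic concrete, here is a minimal sketch of the lookup-and-multiply step, assuming the simplified table shape shown above. The function name `chat_cost` is hypothetical, and this is not OpenLIT's internal implementation; it only illustrates how a per-1,000-token price turns into a per-request cost.

```python
# Illustrative only: mirrors the simplified pricing table above, not the SDK's internals.
PRICING = {
    "chat": {
        "gpt-4o": {"input": 0.0025, "output": 0.01},
        "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    }
}


def chat_cost(model: str, input_tokens: int, output_tokens: int, pricing: dict = PRICING) -> float:
    """Return the USD cost of one chat call; prices are per 1,000 tokens."""
    prices = pricing["chat"].get(model)
    if prices is None:
        # Unknown model: report zero rather than guessing a price.
        return 0.0
    return (input_tokens / 1000) * prices["input"] + (output_tokens / 1000) * prices["output"]


cost = chat_cost("gpt-4o", input_tokens=2000, output_tokens=500)
print(f"${cost:.4f}")  # prints $0.0100
```

Returning zero for an unknown model is a design choice in this sketch: it keeps the request flowing and makes the gap visible as a suspiciously cheap model in your dashboard.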

### Using Custom Pricing

If you're using a model that isn't in the default table — say a fine-tuned model or a provider with custom pricing — you can supply your own:

```python
openlit.init(
    pricing_json="/path/to/my-pricing.json",
)
```

Or pass a URL:

```python
openlit.init(
    pricing_json="https://internal.example.com/llm-pricing.json",
)
```

The SDK fetches and caches it at startup. Use the same JSON structure as the default table, and your custom models will get accurate cost tracking.
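
If you'd rather generate that file than hand-edit it, a small script can do it. The sketch below assumes the simplified structure shown earlier; the model name and prices are made-up placeholders, and `write_pricing_file` is a hypothetical helper, not part of OpenLIT.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical entry: the model name and prices below are placeholders, not real list prices.
custom_pricing = {
    "chat": {
        "ft:gpt-4o-mini:acme::example": {"input": 0.0003, "output": 0.0012},
    }
}


def write_pricing_file(pricing: dict, directory: str) -> Path:
    """Check that every chat model has both prices, then write my-pricing.json."""
    for model, prices in pricing.get("chat", {}).items():
        missing = {"input", "output"} - prices.keys()
        if missing:
            raise ValueError(f"{model} is missing price keys: {sorted(missing)}")
    path = Path(directory) / "my-pricing.json"
    path.write_text(json.dumps(pricing, indent=2))
    return path


pricing_path = write_pricing_file(custom_pricing, tempfile.mkdtemp())
# The resulting path could then be passed as openlit.init(pricing_json=str(pricing_path)).
print(pricing_path.name)  # prints my-pricing.json
```

Validating before writing catches the most common mistake (a model with only one of the two prices) at build time instead of as silently wrong cost numbers later.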

## Breaking Down Costs by Dimension

Once cost data is flowing, you can slice it by any attribute attached to the trace or metric. The most useful breakdowns:

### By Service / Application

If you set `application_name` per service, costs naturally break down:

```python
# Service A
openlit.init(application_name="chatbot")

# Service B
openlit.init(application_name="summarizer")

# Service C
openlit.init(application_name="code-review-agent")
```

Now you can answer: "The chatbot costs $2,100/month, the summarizer costs $1,400/month, and the code-review agent costs $700/month."

### By Model

Every span includes the model name, so you can aggregate cost by model:

- GPT-4o: $2,800/month
- GPT-4o-mini: $600/month
- Claude Sonnet: $800/month

This helps you decide when to downgrade. At the list prices in the table above, GPT-4o-mini costs about 6% of what GPT-4o does per token. If it gives 90% of the quality on your summarization task, the numbers make the decision obvious.
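
Under the hood, every one of these breakdowns is the same group-and-sum. Here is a minimal sketch, assuming you have exported span attributes as plain dictionaries; the sample rows and the `cost_by` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical exported span attributes; in practice these come from your trace backend.
spans = [
    {"gen_ai.request.model": "gpt-4o", "service.name": "chatbot", "gen_ai.usage.cost": 0.012},
    {"gen_ai.request.model": "gpt-4o-mini", "service.name": "summarizer", "gen_ai.usage.cost": 0.001},
    {"gen_ai.request.model": "gpt-4o", "service.name": "summarizer", "gen_ai.usage.cost": 0.008},
]


def cost_by(spans: list[dict], attribute: str) -> dict[str, float]:
    """Sum gen_ai.usage.cost grouped by the given span attribute."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        # Spans missing the attribute are bucketed as "unknown" instead of dropped.
        totals[span.get(attribute, "unknown")] += span.get("gen_ai.usage.cost", 0.0)
    return dict(totals)


print(cost_by(spans, "gen_ai.request.model"))
print(cost_by(spans, "service.name"))
```

Swapping the attribute name is all it takes to pivot from a per-model view to a per-service one.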

### By Environment

```python
openlit.init(environment="production") # vs "staging" vs "development"
```

If your staging environment is burning $500/month on LLM calls, you probably want to know about it. Common fix: use a cheaper model in staging or add rate limits.
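
One way to apply the cheaper-model fix is a small lookup keyed by environment. This is a sketch under assumptions: the mapping, the `model_for` helper, and the `APP_ENV` variable name are all hypothetical.

```python
import os

# Hypothetical mapping: which model each environment is allowed to call.
MODEL_BY_ENVIRONMENT = {
    "production": "gpt-4o",
    "staging": "gpt-4o-mini",
    "development": "gpt-4o-mini",
}


def model_for(environment: str) -> str:
    """Return the model for an environment, defaulting to the cheapest option."""
    return MODEL_BY_ENVIRONMENT.get(environment, "gpt-4o-mini")


# APP_ENV is an assumed variable name; use whatever your deployment already sets.
environment = os.environ.get("APP_ENV", "development")
print(environment, "->", model_for(environment))
```

Defaulting unknown environments to the cheap model means a misconfigured deploy fails toward lower spend rather than higher.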

### By User (Custom Attributes)

To track costs per user, you need to add the user ID as a span attribute. OpenTelemetry makes this straightforward:

```python
from openai import OpenAI
from opentelemetry import trace

client = OpenAI()
tracer = trace.get_tracer(__name__)

def handle_request(user_id: str, message: str) -> str:
    with tracer.start_as_current_span("user-request") as span:
        # Tag the parent span so the request can be attributed to a user.
        span.set_attribute("user.id", user_id)

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": message}],
        )
        return response.choices[0].message.content
```

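Now the LLM span (auto-created by OpenLIT) is a child of your `user-request` span. Because `user.id` is set on the parent span rather than on the LLM span itself, group costs by `user.id` at the trace level in your dashboard, joining each LLM span to its parent.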
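
If your backend can't do that parent-child join for you, the roll-up is easy to do on exported data. The sketch below assumes flattened span records with `span_id`, `parent_id`, and `attributes` fields; that record shape and the `cost_per_user` helper are hypothetical, so adapt them to whatever your trace store returns.

```python
from collections import defaultdict

# Hypothetical flattened span records, e.g. rows exported from a trace store.
spans = [
    {"span_id": "a1", "parent_id": None, "attributes": {"user.id": "u-1"}},
    {"span_id": "a2", "parent_id": "a1", "attributes": {"gen_ai.usage.cost": 0.004}},
    {"span_id": "b1", "parent_id": None, "attributes": {"user.id": "u-2"}},
    {"span_id": "b2", "parent_id": "b1", "attributes": {"gen_ai.usage.cost": 0.010}},
    {"span_id": "b3", "parent_id": "b1", "attributes": {"gen_ai.usage.cost": 0.002}},
]


def cost_per_user(spans: list[dict]) -> dict[str, float]:
    """Attribute each span's cost to the nearest ancestor that carries user.id."""
    by_id = {span["span_id"]: span for span in spans}
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        cost = span["attributes"].get("gen_ai.usage.cost")
        if cost is None:
            continue
        # Walk up the parent chain until a span with user.id is found.
        current = span
        while current is not None and "user.id" not in current["attributes"]:
            current = by_id.get(current["parent_id"])
        user = current["attributes"]["user.id"] if current else "unknown"
        totals[user] += cost
    return dict(totals)


print(cost_per_user(spans))
```

Costs with no tagged ancestor land in an `unknown` bucket, which is itself a useful signal that some code path is calling the LLM outside a user-scoped span.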

## Building Cost Dashboards

### In the OpenLIT Platform

The self-hosted OpenLIT dashboard includes built-in cost views:

1. **Total cost over time** — see daily/weekly/monthly trends
2. **Cost by model** — bar chart breaking down spend per model
3. **Cost by application** — which service is costing you the most
4. **Individual request costs** — drill into specific expensive calls

You can also build custom dashboards with the dashboard builder, adding widgets that query ClickHouse directly.

### In Grafana

Since OpenLIT exports OTLP, you can build Grafana dashboards with:

- **Prometheus/Mimir** for cost metrics (histograms, counters)
- **Tempo** for trace details with cost attributes

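Example PromQL for cost by model over the last 24 hours (the exact metric name depends on how your pipeline maps OTLP metric names to Prometheus). Use `increase()` rather than `rate()` here, since `rate()` returns a per-second average instead of the total for the window:

```
sum by (gen_ai_request_model) (
  increase(gen_ai_usage_cost_total[24h])
)
```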

### In Datadog

Send OTLP to Datadog's OTLP endpoint. Cost data shows up as custom metrics. Create monitors like:

- Alert if daily cost exceeds $X
- Alert if a single request costs more than $Y (indicates a runaway prompt)
- Weekly cost trend report by service

## Setting Up Budget Alerts

The combination of per-request cost tracking and standard metrics backends gives you alerting for free:

**Alert: Daily spend exceeds budget**

Set up a Grafana or Datadog alert on the cumulative daily cost metric. If `sum(gen_ai.usage.cost)` over the last 24 hours exceeds your threshold, fire an alert.
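
The rule itself is just a windowed sum compared against a threshold. Here is a minimal sketch of that logic, assuming you can pull `(timestamp, cost)` pairs from your backend; `over_daily_budget` and the sample data are hypothetical, and in practice your alerting tool would evaluate this for you.

```python
from datetime import datetime, timedelta, timezone


def over_daily_budget(requests: list[tuple[datetime, float]], budget_usd: float, now: datetime) -> bool:
    """Return True if the cost of requests in the 24 hours before `now` exceeds the budget."""
    window_start = now - timedelta(hours=24)
    spent = sum(cost for timestamp, cost in requests if window_start <= timestamp <= now)
    return spent > budget_usd


now = datetime(2026, 3, 31, 12, 0, tzinfo=timezone.utc)
# Hypothetical (timestamp, cost) pairs pulled from your metrics or trace backend.
requests = [
    (now - timedelta(hours=30), 50.0),  # outside the window, ignored
    (now - timedelta(hours=5), 90.0),
    (now - timedelta(hours=1), 70.0),
]
print(over_daily_budget(requests, budget_usd=150.0, now=now))  # prints True (160 > 150)
```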

**Alert: Anomalous request cost**

Some requests cost 100x the average because of unexpectedly long prompts or completions. Track the p99 of `gen_ai.usage.cost` and alert if it suddenly spikes.
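
As a rough illustration of what "suddenly spikes" can mean, the sketch below compares the p99 of a recent window against a baseline window. The `factor` threshold and both helper names are assumptions; tune them to your own traffic.

```python
import statistics


def p99(costs: list[float]) -> float:
    """Return the 99th percentile of per-request costs."""
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    return statistics.quantiles(costs, n=100, method="inclusive")[-1]


def p99_spiked(baseline: list[float], recent: list[float], factor: float = 3.0) -> bool:
    """Flag a spike when the recent p99 exceeds the baseline p99 by `factor`."""
    return p99(recent) > factor * p99(baseline)


# Hypothetical per-request costs: a quiet baseline, then a window with a few huge calls.
baseline = [0.01] * 100
recent = [0.01] * 95 + [1.0] * 5
print(p99_spiked(baseline, recent))  # prints True
```

Comparing against a baseline rather than a fixed dollar amount keeps the alert meaningful as your normal request size drifts over time.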

**Alert: New model appeared**

If someone deploys code that uses an expensive model you didn't approve, you'll see a new `gen_ai.request.model` value in your metrics. Alert on new label values.
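
The check behind this alert is a set difference between what you've approved and what shows up in telemetry. A minimal sketch, with a hypothetical allowlist and a hypothetical unapproved model name:

```python
def unapproved_models(observed: set[str], approved: set[str]) -> set[str]:
    """Return model names seen in telemetry that are not on the approved list."""
    return observed - approved


# Hypothetical allowlist and a hypothetical set of label values seen in the last hour.
approved = {"gpt-4o", "gpt-4o-mini", "claude-sonnet-4-20250514"}
observed = {"gpt-4o", "gpt-4o-mini", "some-expensive-model"}
print(unapproved_models(observed, approved))  # prints {'some-expensive-model'}
```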

## Cost Optimization Strategies

Once you have visibility, optimization follows naturally:

**Switch models where quality allows.** Compare `gpt-4o` and `gpt-4o-mini` on both cost and quality side by side (use OpenLIT's OpenGround for this). If quality is similar, switch: at the list prices above, that cuts per-token cost by about 94%.

**Cache repeated prompts.** If you see the same prompt pattern in traces (e.g., summarization of the same document), add a cache layer. Zero LLM cost for cache hits.
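
A cache layer can be as simple as a dictionary keyed by a hash of the request. The sketch below is an in-memory, exact-match cache; `cached_completion` and `fake_llm` are hypothetical names, and the stand-in LLM function exists only so the example runs without network access. A production version would likely use a shared store with expiry.

```python
import hashlib
import json
from typing import Callable

_cache: dict[str, str] = {}


def cache_key(model: str, messages: list[dict]) -> str:
    """Build a stable key from the model name and the exact message list."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(model: str, messages: list[dict], call_llm: Callable[[str, list[dict]], str]) -> str:
    """Return a cached answer when the same prompt was seen before, else call the model."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = call_llm(model, messages)
    return _cache[key]


# Stand-in for a real provider call so the sketch runs without network access.
calls = []


def fake_llm(model: str, messages: list[dict]) -> str:
    calls.append(model)
    return f"summary of {len(messages)} message(s)"


messages = [{"role": "user", "content": "Summarize this quarterly report..."}]
cached_completion("gpt-4o-mini", messages, fake_llm)
cached_completion("gpt-4o-mini", messages, fake_llm)
print(len(calls))  # prints 1: the second call was served from the cache
```

Including the model name in the key matters: the same prompt sent to a different model should not return the other model's answer.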

**Reduce context length.** If your RAG pipeline stuffs 10 documents into context but the LLM only uses 2, retrieve fewer documents or trim them before building the prompt. Fewer input tokens = lower cost.

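**Set max_tokens.** If your completion only needs 100 tokens, set `max_tokens=100`. This caps the output tokens you can be billed for on a single call. Keep in mind that it truncates the response at the limit rather than making the model more concise, so pair it with a prompt that asks for a short answer.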
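
A side benefit is that each request now has a known worst-case cost. Here is a quick sketch of that bound, using the `gpt-4o` prices from the simplified table earlier in this post; `worst_case_cost` is a hypothetical helper.

```python
def worst_case_cost(input_tokens: int, max_tokens: int, input_price: float, output_price: float) -> float:
    """Upper bound on one request's cost; prices are USD per 1,000 tokens."""
    return (input_tokens / 1000) * input_price + (max_tokens / 1000) * output_price


# gpt-4o prices from the simplified table earlier in this post.
bound = worst_case_cost(input_tokens=1000, max_tokens=100, input_price=0.0025, output_price=0.01)
print(f"${bound:.4f}")  # prints $0.0035
```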

**Batch where possible.** Some providers offer lower per-token pricing for batch API calls. If latency isn't critical, batch requests.

## A Complete Cost Tracking Setup

Here's a production-ready setup with cost tracking, custom attributes, and Grafana export:

```python
import openlit
from openai import OpenAI
from opentelemetry import trace

openlit.init(
    otlp_endpoint="https://grafana-otlp.example.com/otlp",
    otlp_headers={"Authorization": "Bearer YOUR_GRAFANA_TOKEN"},
    application_name="my-saas-api",
    environment="production",
)

client = OpenAI()
tracer = trace.get_tracer(__name__)

def generate_response(user_id: str, tier: str, prompt: str) -> str:
    model = "gpt-4o" if tier == "enterprise" else "gpt-4o-mini"

    with tracer.start_as_current_span("generate") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("user.tier", tier)

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```

Now you can answer: "Enterprise users cost us $X/month on GPT-4o, free-tier users cost $Y/month on GPT-4o-mini."

---

## FAQ

**How do I add custom model pricing?**

Create a JSON file following the same structure as the default `pricing.json` and pass it to `openlit.init(pricing_json="/path/to/custom.json")`. You can also pass a URL to load pricing from a remote server.

**Does it work with fine-tuned models?**

Yes. Add your fine-tuned model's name and pricing to a custom pricing JSON. The model name in the JSON must match the model name you pass to the provider's API.

**What if the pricing table is outdated?**

The default pricing table is updated with each SDK release. Between releases, you can override it by pointing `pricing_json` at a URL that you control and update as needed.

**How accurate is the cost calculation?**

It's based on the token count reported by the provider and the per-token price in the pricing table. For chat/completion models, accuracy is very high. For image and embedding models, it depends on the pricing model (per-image, per-token, etc.).

**Can I track costs across multiple providers?**

Yes. OpenLIT instruments all providers uniformly. If a request goes to OpenAI and another to Anthropic, both get cost attributes. Aggregate them in your dashboard for a unified view.