This section walks through the included example database (`examples/requests.db`) to show how a real Claude Code conversation generates API requests and how the KV cache behaves.
Note on streaming duplicates: Older `requests.db` files from claude-code-proxy contain paired streaming + non-streaming requests per turn -- the proxy's broken SSE forwarding caused Claude Code to replay each turn as a non-streaming request. The streaming requests are artifacts with incomplete metadata. This example database has been cleaned to contain only the non-streaming requests (which have complete `stop_reason`, content types, and usage data). See seifghazi/claude-code-proxy#33 for the proxy fix.
"Can you clone a version of vllm locally and then fully analyze its full code path and write me a markdown file about key components."
This single prompt generated ~92 streaming API requests over ~17 minutes. The parent conversation (claude-opus-4-6) spawned 4 sub-agents (claude-haiku-4-5) to explore the codebase in parallel.
STREAM opus 25,407 tokens in → 205 out | cache: 83.6% | api: 5.8s
The parent sends the user's prompt. The 25K input tokens are: tools (~15K) + system prompt (~6K) + user message.
The first request already shows 83.6% cache hit because the API's global cache keeps tool definitions warm from other sessions.
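The hit percentages in these traces can be recomputed from the API's usage block. Below is a minimal sketch, assuming the standard Anthropic usage fields (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); whether this repo's scripts use exactly this formula is an assumption:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of the prompt served from the prefix cache.

    The total prompt is the sum of uncached tokens (input_tokens),
    tokens written to cache this request (cache_creation_input_tokens),
    and tokens read from cache (cache_read_input_tokens).
    """
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + created + uncached
    return read / total if total else 0.0

# Illustrative split for the first parent request: only the 25,407
# total and the 83.6% rate come from the trace; the created/uncached
# split is made up for the example.
usage = {"input_tokens": 2_355,
         "cache_creation_input_tokens": 1_812,
         "cache_read_input_tokens": 21_240}
print(f"{cache_hit_rate(usage):.1%}")  # 83.6%
```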
STREAM opus 25,627 tokens in → 123 out | cache: 99.1% | api: 8.6s | think: 4.0s
STREAM opus 25,830 tokens in → 822 out | cache: 99.2% | api: 14.8s | stop: tool_use
The parent clones the vLLM repo and inspects it. Context grows from 25K to 26K as tool results are added. Cache stays at 99% -- only the new tool result tokens at the end need processing. The 4-second think_time is the client executing the tool (git clone) before sending the next request.
STREAM haiku 13,206 tokens in → 148 out | cache: 0.0% | api: 2.1s <- SA1
STREAM haiku 13,249 tokens in → 101 out | cache: 0.0% | api: 1.5s <- SA2
STREAM haiku 13,255 tokens in → 107 out | cache: 0.0% | api: 1.4s <- SA3
STREAM haiku 13,215 tokens in → 109 out | cache: 0.0% | api: 1.6s <- SA4
All 4 Explore agents fire within the same second (15:43:05). Their first requests all show 0% cache -- not only because they use different tools than the parent, but because all 4 are in-flight simultaneously. The cache hasn't been written by any of them yet when the others arrive.
On the second turn, the sub-agents benefit from each other -- all 4 share identical tool definitions (~10K tokens), so the cache created by each sub-agent's first request is reusable by the others:
STREAM haiku 13,397 tokens in → 91 out | cache: 98.6% <- SA1 turn 2
STREAM haiku 13,868 tokens in → 91 out | cache: 95.3% <- SA4 turn 2
STREAM haiku 13,500 tokens in → 83 out | cache: 98.2% <- SA3 turn 2
STREAM haiku 13,871 tokens in → 86 out | cache: 91.5% <- SA2 turn 2
The ~10K shared tool prefix is now warm from turn 1, giving 91-98% cache hits.
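A toy illustration of why the timing matters (this models the observable behavior, not the API's actual cache internals): cache writes only land when a request finishes, so four simultaneous arrivals all miss, and the same four agents hit once any turn-1 prefill has completed.

```python
# Toy model of the race: a prefix only becomes reusable after the
# request that prefilled it completes.
cache: set[str] = set()

def serve(prefix: str) -> str:
    return "hit" if prefix in cache else "miss"

shared_prefix = "sub-agent tools + system (~10K tokens)"

# Turn 1: all four requests are in flight at 15:43:05, so no one's
# cache write has landed when the others arrive.
print([serve(shared_prefix) for _ in range(4)])  # ['miss', 'miss', 'miss', 'miss']

cache.add(shared_prefix)  # turn-1 responses complete; the prefix is now warm

# Turn 2: every sub-agent reuses the shared prefix written in turn 1.
print([serve(shared_prefix) for _ in range(4)])  # ['hit', 'hit', 'hit', 'hit']
```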
~84 streaming requests from 4 sub-agents running in parallel, each exploring different parts of the vLLM codebase. The pattern is consistent:
- Context grows from ~13K to ~90K tokens as each agent reads files and accumulates conversation history
- Within each sub-agent: 93-100% cache hit (prefix preserved, only new tool results added -- see the sketch after the per-agent stats)
- Across sub-agents: shared tool prefix (~10K tokens) stays warm in global cache
SA1: ~21 requests, grows to 73K tokens
SA2: ~14 requests, grows to 89K tokens
SA3: ~26 requests, grows to 83K tokens
SA4: ~23 requests, grows to 73K tokens
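One way to see why within-agent hit rates stay in the 93-100% band: with block-aligned prefix matching, the hit rate for a turn is just the reusable prefix divided by the new total. The 64-token block size below mirrors the `--block-size 64` option used in the setup section; the API's real cache granularity is an assumption.

```python
BLOCK = 64  # tokens per cache block; mirrors --block-size 64 below

def turn_hit_rate(prev_ctx: int, new_ctx: int) -> float:
    """Hit rate when a conversation grows from prev_ctx to new_ctx tokens.

    Complete blocks of the preserved prefix are reusable; the tool
    results appended at the end must be prefilled from scratch.
    """
    reusable = (prev_ctx // BLOCK) * BLOCK
    return reusable / new_ctx

# SA3-style growth: each turn appends a few hundred tokens of results.
ctx = 13_255
for appended in (245, 480, 900):
    new_ctx = ctx + appended
    print(f"{ctx:>6} -> {new_ctx:>6} tokens: {turn_hit_rate(ctx, new_ctx):.1%}")
    ctx = new_ctx
```

At 13,500 tokens this model predicts a 98.1% hit rate, close to SA3's measured 98.2% on its second turn.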
STREAM opus 44,604 tokens in → 6,421 out | cache: 57.9% | api: 120s
The parent resumes after all sub-agents complete. The cache hit drops to 58% -- but not because the cache expired. The breakdown:
| Component | Tokens | Status |
|---|---|---|
| Original prefix (tools + system + early messages) | 25,829 | Cached -- still warm from Phase 2 |
| Sub-agent results (new content injected) | 5,954 | Created -- never cached before |
| Request overhead | 12,821 | Uncached |
The context grew from 26K to 45K tokens because the sub-agent results were injected into the conversation. The old prefix is 100% cached; the "low" percentage is simply the ratio of old-to-new content.
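The 57.9% falls straight out of the table:

```python
cached, created, overhead = 25_829, 5_954, 12_821
total = cached + created + overhead     # 44,604 tokens in
print(f"{cached / total:.1%}")          # 57.9% -- a ratio effect, not expiry
```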
By the next request, the new content is also cached:
STREAM opus 50,916 tokens in → 441 out | cache: 62.4% | api: 11s | stop: end_turn
The parent writes the architecture summary (441 output tokens) and the conversation ends with end_turn.
STREAM opus 51,407 tokens in → 108 out | cache: 99.0% | api: 3.3s | think: 315s
The user reads the output for ~5 minutes (think_time: 315s), then asks "Rewrite this in pirate speak." The cache is still warm -- 99% hit despite the long pause. The think_time field captures this real-world user behavior pattern.
Each request records three timing components:
| Metric | Range | What It Measures |
|---|---|---|
| `api_time` | 0.9s - 197s | Total response time (prefill + decode + streaming) |
| `ttft` | 0.6s - 3.5s | Time to first token -- server-side latency before the first token arrives |
| `think_time` | 0s - 315s | Client delay before sending the request |

`ttft` is derived from the `Server-Timing` response header (or `X-Envoy-Upstream-Service-Time` for older proxy versions). `think_time` captures three distinct client behaviors:

- Tool call results: `think_time = 0-4s` (tool execution time)
- User reading: `think_time = 315s` (user thinking/reading)
- Sub-agent wait: `think_time = 284s` (parent waiting for sub-agents)
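A hedged sketch of extracting `ttft` from those headers: both formats are standard (`Server-Timing` carries `dur=` values in milliseconds, `X-Envoy-Upstream-Service-Time` is a bare millisecond count), but the metric name the API reports, and how this repo's scripts parse it, are assumptions.

```python
import re

def ttft_seconds(headers: dict) -> float | None:
    """Best-effort time-to-first-token from response headers."""
    # Server-Timing: e.g. 'req;dur=612.5' -- dur is in milliseconds.
    match = re.search(r"dur=([\d.]+)", headers.get("Server-Timing", ""))
    if match:
        return float(match.group(1)) / 1000.0
    # Envoy fallback: a plain millisecond count.
    envoy = headers.get("X-Envoy-Upstream-Service-Time")
    return int(envoy) / 1000.0 if envoy else None

print(ttft_seconds({"Server-Timing": "req;dur=612.5"}))          # 0.6125
print(ttft_seconds({"X-Envoy-Upstream-Service-Time": "3500"}))   # 3.5
```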
These allow the trace replay tester to simulate different server speeds while preserving real client behavior patterns.
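A minimal sketch of that replay loop (the JSONL trace format and the `think_time` field name are assumptions about what `build_minimal_traces.py` emits):

```python
import json
import time

def replay(trace_path: str, send) -> None:
    """Replay trace records against a server under test.

    Sleeping for each record's think_time preserves the real client
    gaps (tool execution, user reading, sub-agent waits) no matter how
    fast the target server responds.
    """
    with open(trace_path) as f:
        for line in f:
            rec = json.loads(line)
            time.sleep(rec.get("think_time", 0))   # real client-side delay
            start = time.monotonic()
            send(rec)                              # e.g. POST to the target server
            print(f"api_time={time.monotonic() - start:.2f}s")
```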
| Scenario | Cache Hit Rate | Why |
|---|---|---|
| Within-conversation growth | 93-99% | Prefix preserved, only new suffix tokens |
| First sub-agent request (all 4 concurrent) | 0% | All in-flight simultaneously, cache not yet written |
| Sub-agent turn 2+ | 91-98% | Shared tool prefix warm from other sub-agents' turn 1 |
| Parent after sub-agents | 58% -> 100% | Old prefix cached, new results created, then warm |
| After 5-minute user pause | 99% | Cache lives beyond typical session TTL |
# Setup
python3 build_message_index.py examples/requests.db
python3 build_conversation_index.py examples/requests.db --projects-path examples/jsonl/
# Generate trace with timing data (--anonymize strips IDs and salts hashes)
python3 build_minimal_traces.py examples/requests.db \
--jsonl-dir examples/jsonl/ \
--output-dir examples/traces/ \
--block-size 64 \
--include-subagents \
--anonymize
# Validate against API metrics
python3 validate_trace_cache.py examples/traces/ \
--db examples/requests.db \
--jsonl-dir examples/jsonl/
# Visualize
python3 complete_cache_visualizer.py examples/requests.db \
--conversation-id combined \
--output examples/visualizations/combined.html