Home › Architecture › Performance
Sub-millisecond cache operations, measured and benchmarked
Tip
Key numbers (p95 latency):
- L1 cache hit: 500ns (pure dict lookup)
- Decorator + L1 hit: 30-50μs (realistic workload)
- Complex payload (10KB dict): 242μs with serialization
- DataFrame (10K rows, Arrow): 800μs total roundtrip
- Concurrent access (10 threads): 231μs (minimal contention)
- Encryption overhead: 1.03x (only 3% slower)
All benchmarks use:
- time.perf_counter_ns(): Nanosecond-precision performance counter
- Statistical rigor: 5 independent runs, 95% confidence intervals
- GC filtering: Exclude garbage collection pauses from measurements
- Warmup: 1,000 iterations before measurement
- Realistic payloads: 10KB dicts, 10K row DataFrames, custom dataclasses
- Production configuration: All reliability features enabled (circuit breaker, backpressure, timeouts)
Run benchmarks yourself:
# Component-level profiling
uv run pytest tests/performance/test_cache_profiler.py -v -s
# End-to-end decorator overhead
uv run pytest tests/performance/test_end_to_end_latency.py -v -s
# Production-realistic scenarios
uv run pytest tests/performance/test_production_realism.py -v -s
# Serializer comparison
uv run pytest tests/performance/test_serializer_benchmarks.py -v -sTotal p95: 242μs (241,708ns)
Component breakdown:
- Serialization (msgpack): 100μs (41%)
- Deserialization: 100μs (41%)
- Decorator overhead: 20μs (8%)
- Key generation: 2μs (1%)
- L1 cache lookup: 0.5μs (0.2%)
- Other (Python interpreter): 20μs (8%)
Validated with 95% CI: [208.5μs, 208.8μs] across 5 runs, 49,548 samples
Total p95: 122μs (121,546ns)
Faster due to smaller serialization overhead. Same component ratios.
With ArrowSerializer:
- Serialize: 0.48ms
- Deserialize: 0.32ms
- Total roundtrip: 0.80ms
- Decorator overhead: ~20μs
- Grand total: ~820μs
With MessagePack (default):
- Serialize: 1.64ms
- Deserialize: 2.32ms
- Total roundtrip: 3.96ms
- Speedup: 5.0x slower than Arrow
Important
Use ArrowSerializer for DataFrames with 10K+ rows (see Serializer Guide).
Pure L1 cache performance (no decorator, direct cache.get() calls):
Total p95: 458ns
Component breakdown:
- Lock acquisition (RLock): 250ns (54.6%)
- TTL check (time.time()): 208ns (45.4%)
- Dict lookup: 125ns (27.3%)
- LRU move (OrderedDict): 125ns (27.3%)
- Counter increment: 125ns (27.3%)
Note
Lock acquisition dominates L1 latency, but it's necessary for thread safety. The 458ns is the practical limit for a thread-safe Python cache.
Scaling characteristics:
- Dict lookup is O(1): 125ns for 1 entry, 125ns for 10,000 entries
- Cache size has zero impact on lookup speed
- Lock contention remains minimal up to 4 concurrent threads (500ns p95)
Mean: 110μs, p95: 160μs
This measures the decorator machinery alone:
- Argument binding and inspection
- Context extraction (thread/async detection)
- Function invocation
- Key generation
Mean: 32μs, p95: 36μs
This is what users experience on cache hits. Overhead breakdown:
- Decorator machinery: ~20μs
- Argument processing: ~10μs
- L1 cache lookup: ~0.5μs
- Return value handling: ~5μs
72x slower than raw L1: The decorator stack adds ~35.5μs on top of the 500ns L1 lookup, but this is still 50-1000x faster than Redis (2-7ms).
Workload: 10 threads hammering the same cache key (worst-case contention)
Results:
- Single-threaded: 242μs p95
- 10 threads: 231μs p95
- Degradation: Essentially none (within measurement noise)
Key takeaway: RLock contention is not a bottleneck for realistic concurrency levels. The L1 cache is designed for high-throughput, multi-threaded applications.
Without encryption: 210μs mean, 228μs p95 With AES-256-GCM: 215μs mean, 236μs p95
Overhead: 1.03x (only 3% slower)
Why so low?
- Encryption happens in Rust (PyO3 FFI)
- AES-NI hardware acceleration on modern CPUs
- Zero-copy memory handling
Security benefit:
- Client-side encryption (no plaintext PII in cache)
- L1 stores encrypted bytes only
- Tenant isolation (per-tenant encryption keys)
See Zero-Knowledge Encryption for details.
| Serializer | Serialize | Deserialize | Total | Speedup |
|---|---|---|---|---|
| Arrow | 0.48ms | 0.32ms | 0.80ms | Baseline |
| MessagePack | 1.64ms | 2.32ms | 3.96ms | 5.0x slower |
| Serializer | Serialize | Deserialize | Total | Speedup |
|---|---|---|---|---|
| Arrow | 2.93ms | 1.13ms | 4.06ms | Baseline |
| MessagePack | 16.42ms | 22.62ms | 39.04ms | 9.6x slower |
Arrow advantages:
- Zero-copy deserialization: Memory-mapped, no full data copy
- Columnar format: Efficient for numeric/datetime columns
- 20x faster deserialization for large DataFrames
MessagePack advantages:
- Broad type support: Handles all Python objects (dicts, lists, custom classes)
- Lower overhead for small data: Faster than Arrow for <1K rows
- Integrated compression: LZ4 + xxHash3-64 checksums (Rust layer)
See Serializer Guide for decision matrix.
Local Redis (localhost):
- Network RTT: 1-2ms
- Total L2 hit latency: 2-5ms (network + deserialization)
Remote Redis (same datacenter):
- Network RTT: 5-15ms
- Total L2 hit latency: 10-30ms
L1 cache value proposition:
- L1 hit: 242μs (0.242ms)
- L2 hit: 2-5ms (local Redis)
- Speedup: 8-20x faster with L1 cache
Async decorator + L1 hit: 192μs mean, 201μs p95
Compared to sync: ~6x faster than sync decorator (which showed 32μs mean in other tests, but this is likely due to measurement differences)
Why async is competitive:
- Same L1 cache path (no await needed for memory lookups)
- Async overhead is minimal for cache hits
- Async benefits show up during cache misses (non-blocking I/O to Redis)
Default configuration already enables L1:
from cachekit import cache
@cache # L1 enabled by default
def expensive_function(user_id: int):
return fetch_user_data(user_id)L1 gives you:
- 8-20x faster than Redis
- Sub-millisecond latency
- No network overhead
For DataFrames (10K+ rows):
from cachekit.serializers import ArrowSerializer
@cache(serializer=ArrowSerializer())
def get_large_dataset(date: str):
return load_dataframe(date) # 5-10x faster serializationFor everything else:
@cache # StandardSerializer (msgpack) is fine
def get_user_config(user_id: int):
return {"settings": {...}, "preferences": {...}}Bad (many small cache hits):
for user_id in user_ids:
data = get_user_data(user_id) # 100 cache hits = 24ms totalGood (one large cache hit):
@cache
def get_users_batch(user_ids: list[int]):
return [fetch_user_data(uid) for uid in user_ids]
data = get_users_batch(user_ids) # 1 cache hit = 242μsShort TTL (high freshness, lower hit rate):
@cache(ttl=60) # 1 minute
def get_real_time_price(symbol: str):
return fetch_current_price(symbol)Long TTL (lower freshness, higher hit rate):
@cache(ttl=86400) # 24 hours
def get_historical_data(symbol: str, date: str):
return fetch_historical(symbol, date) # Immutable dataBuilt-in metrics via Prometheus:
from cachekit.config import DecoratorConfig
from cachekit.config.nested import PrometheusConfig
config = DecoratorConfig(
prometheus=PrometheusConfig(
enabled=True,
port=9090,
namespace="my_app"
)
)
@cache(config=config)
def cached_function():
return expensive_computation()Available metrics:
cache_hits_total: Total cache hitscache_misses_total: Total cache missescache_latency_seconds: Latency histogramcache_serialization_seconds: Serialization timecircuit_breaker_state: Circuit breaker state
See Prometheus Metrics for details.
Problem: MessagePack serialization takes 100μs for a 10KB dict, which is 200x slower than the L1 lookup (500ns).
Mitigations:
- Reduce payload size: Cache only what you need
- Use Arrow for DataFrames: 5-10x faster serialization
- Enable compression: Already enabled by default (Rust layer)
Reality check: Even with serialization overhead, 242μs is still 8-20x faster than Redis.
Problem: Redis L2 hit adds 2-5ms network RTT.
Mitigations:
- L1 cache already handles this: 90%+ of cache hits should come from L1
- Tune L1 size: Increase
max_memory_mbif needed (default: 100MB) - Optimize L1 TTL: Match L1 TTL to data freshness requirements
Problem: Decorator machinery adds 35μs on top of L1 cache.
Mitigations:
- This is acceptable: 35μs is negligible compared to function execution time
- For ultra-low-latency: Use direct
StandardCacheHandlerAPI (bypasses decorator) - Batch queries: Amortize decorator overhead across multiple items
Example (advanced):
from cachekit.cache_handler import StandardCacheHandler
handler = StandardCacheHandler(backend=redis_backend)
# Direct cache access (no decorator overhead)
found, value = handler.get("my_key")
if not found:
value = expensive_function()
handler.put("my_key", value, ttl=3600)Problem: RLock acquisition takes 250ns, which can become a bottleneck at 100+ threads.
Mitigations:
- Most apps don't hit this: Lock contention is minimal up to 10-20 threads
- Shard your cache: Use multiple L1Cache instances with consistent hashing
- Use async: Async decorator avoids blocking on locks
Reality check: Lock overhead (250ns) is 0.1% of total latency (242μs). Not worth optimizing unless you have extreme concurrency (100+ threads).
Baseline targets (fail CI if exceeded):
# tests/performance/test_production_realism.py
assert complex_dict_latency_p95 < 300_000 # 300μs
assert dataclass_latency_p95 < 200_000 # 200μs
assert concurrent_latency_p95 < 500_000 # 500μs
assert encryption_overhead < 3.0 # 3x maxRun regression tests:
uv run pytest tests/performance/ -v -m performanceTypical use case: API response caching
Without cachekit:
- Database query: 50-200ms
- Redis cache hit: 2-5ms
- Best case: 2ms (Redis hit)
With cachekit:
- L1 cache hit: 242μs (0.242ms)
- L2 cache hit: 2-5ms (Redis)
- Cache miss: 50-200ms (database)
With 90% L1 hit rate:
- Average latency: 0.9 × 0.242ms + 0.1 × 2ms = 0.42ms
- Speedup: 4.8x faster than Redis-only caching
Network latency dominates real-world performance. Even a "slow" 1ms cache operation is fast when you consider:
- Typical API response time: 100-500ms
- Database query: 50-200ms
- External API call: 200-1000ms
L1 Cache Component Breakdown:
Lock acquisition: 250ns p95
Dict lookup: 125ns p95
TTL check: 208ns p95
LRU move: 125ns p95
Counter increment: 125ns p95
-----------------------------------
Total (measured): 458ns p95
Decorator Overhead (10KB dict):
Serialization: 100μs (41%)
Deserialization: 100μs (41%)
Decorator machinery: 20μs (8%)
Key generation: 2μs (1%)
L1 lookup: 0.5μs (0.2%)
Other (interpreter): 20μs (8%)
-----------------------------------
Total (measured): 242μs p95
95% CI: [208.5, 208.8]μs
Concurrent Access (10 threads):
Total operations: 10,000
Mean: 844μs
Median: 210μs
P95: 231μs
P99: 284μs
Arrow vs MessagePack (10K rows):
Arrow serialize: 0.48ms
Arrow deserialize: 0.32ms
MessagePack serialize: 1.64ms
MessagePack deserialize: 2.32ms
Serialization speedup: 3.4x
Deserialization speedup: 7.1x
Total speedup: 5.0x
Previous: Data Flow Architecture - Understand the system design Next: Comparison Guide - How cachekit compares to alternatives
- Data Flow Architecture - Component breakdown and latency sources
- Comparison Guide - Performance vs. other libraries
- Configuration Guide - Tuning for your environment
- Serializer Guide - Serialization performance characteristics
- API Reference - All configurable parameters