Skip to content

Latest commit

 

History

History
475 lines (347 loc) · 13.9 KB

File metadata and controls

475 lines (347 loc) · 13.9 KB

HomeArchitecturePerformance

Performance Guide

Sub-millisecond cache operations, measured and benchmarked


Key Numbers

Tip

Key numbers (p95 latency):

  • L1 cache hit: 500ns (pure dict lookup)
  • Decorator + L1 hit: 30-50μs (realistic workload)
  • Complex payload (10KB dict): 242μs with serialization
  • DataFrame (10K rows, Arrow): 800μs total roundtrip
  • Concurrent access (10 threads): 231μs (minimal contention)
  • Encryption overhead: 1.03x (only 3% slower)

Measurement Methodology

All benchmarks use:

  • time.perf_counter_ns(): Nanosecond-precision performance counter
  • Statistical rigor: 5 independent runs, 95% confidence intervals
  • GC filtering: Exclude garbage collection pauses from measurements
  • Warmup: 1,000 iterations before measurement
  • Realistic payloads: 10KB dicts, 10K row DataFrames, custom dataclasses
  • Production configuration: All reliability features enabled (circuit breaker, backpressure, timeouts)

Run benchmarks yourself:

# Component-level profiling
uv run pytest tests/performance/test_cache_profiler.py -v -s

# End-to-end decorator overhead
uv run pytest tests/performance/test_end_to_end_latency.py -v -s

# Production-realistic scenarios
uv run pytest tests/performance/test_production_realism.py -v -s

# Serializer comparison
uv run pytest tests/performance/test_serializer_benchmarks.py -v -s

End-to-End Latency Breakdown

10KB Complex Dict (Typical API Response)

Total p95: 242μs (241,708ns)

Component breakdown:

  • Serialization (msgpack): 100μs (41%)
  • Deserialization: 100μs (41%)
  • Decorator overhead: 20μs (8%)
  • Key generation: 2μs (1%)
  • L1 cache lookup: 0.5μs (0.2%)
  • Other (Python interpreter): 20μs (8%)

Validated with 95% CI: [208.5μs, 208.8μs] across 5 runs, 49,548 samples

User Dataclass (Smaller Payload)

Total p95: 122μs (121,546ns)

Faster due to smaller serialization overhead. Same component ratios.

DataFrame (10K Rows)

With ArrowSerializer:

  • Serialize: 0.48ms
  • Deserialize: 0.32ms
  • Total roundtrip: 0.80ms
  • Decorator overhead: ~20μs
  • Grand total: ~820μs

With MessagePack (default):

  • Serialize: 1.64ms
  • Deserialize: 2.32ms
  • Total roundtrip: 3.96ms
  • Speedup: 5.0x slower than Arrow

Important

Use ArrowSerializer for DataFrames with 10K+ rows (see Serializer Guide).

L1 Cache Component Profiling

Pure L1 cache performance (no decorator, direct cache.get() calls):

Total p95: 458ns

Component breakdown:

  • Lock acquisition (RLock): 250ns (54.6%)
  • TTL check (time.time()): 208ns (45.4%)
  • Dict lookup: 125ns (27.3%)
  • LRU move (OrderedDict): 125ns (27.3%)
  • Counter increment: 125ns (27.3%)

Note

Lock acquisition dominates L1 latency, but it's necessary for thread safety. The 458ns is the practical limit for a thread-safe Python cache.

Scaling characteristics:

  • Dict lookup is O(1): 125ns for 1 entry, 125ns for 10,000 entries
  • Cache size has zero impact on lookup speed
  • Lock contention remains minimal up to 4 concurrent threads (500ns p95)

Decorator Overhead Analysis

Isolated Decorator (No Caching)

Mean: 110μs, p95: 160μs

This measures the decorator machinery alone:

  • Argument binding and inspection
  • Context extraction (thread/async detection)
  • Function invocation
  • Key generation

Decorator + L1 Hit (Hot Path)

Mean: 32μs, p95: 36μs

This is what users experience on cache hits. Overhead breakdown:

  • Decorator machinery: ~20μs
  • Argument processing: ~10μs
  • L1 cache lookup: ~0.5μs
  • Return value handling: ~5μs

72x slower than raw L1: The decorator stack adds ~35.5μs on top of the 500ns L1 lookup, but this is still 50-1000x faster than Redis (2-7ms).

Concurrent Access Performance

Workload: 10 threads hammering the same cache key (worst-case contention)

Results:

  • Single-threaded: 242μs p95
  • 10 threads: 231μs p95
  • Degradation: Essentially none (within measurement noise)

Key takeaway: RLock contention is not a bottleneck for realistic concurrency levels. The L1 cache is designed for high-throughput, multi-threaded applications.

Encryption Overhead

Without encryption: 210μs mean, 228μs p95 With AES-256-GCM: 215μs mean, 236μs p95

Overhead: 1.03x (only 3% slower)

Why so low?

  • Encryption happens in Rust (PyO3 FFI)
  • AES-NI hardware acceleration on modern CPUs
  • Zero-copy memory handling

Security benefit:

  • Client-side encryption (no plaintext PII in cache)
  • L1 stores encrypted bytes only
  • Tenant isolation (per-tenant encryption keys)

See Zero-Knowledge Encryption for details.

Serializer Performance Comparison

DataFrame Serialization (10K rows)

Serializer Serialize Deserialize Total Speedup
Arrow 0.48ms 0.32ms 0.80ms Baseline
MessagePack 1.64ms 2.32ms 3.96ms 5.0x slower

DataFrame Serialization (100K rows)

Serializer Serialize Deserialize Total Speedup
Arrow 2.93ms 1.13ms 4.06ms Baseline
MessagePack 16.42ms 22.62ms 39.04ms 9.6x slower

Arrow advantages:

  • Zero-copy deserialization: Memory-mapped, no full data copy
  • Columnar format: Efficient for numeric/datetime columns
  • 20x faster deserialization for large DataFrames

MessagePack advantages:

  • Broad type support: Handles all Python objects (dicts, lists, custom classes)
  • Lower overhead for small data: Faster than Arrow for <1K rows
  • Integrated compression: LZ4 + xxHash3-64 checksums (Rust layer)

See Serializer Guide for decision matrix.

L2 Backend (Redis) Performance

Local Redis (localhost):

  • Network RTT: 1-2ms
  • Total L2 hit latency: 2-5ms (network + deserialization)

Remote Redis (same datacenter):

  • Network RTT: 5-15ms
  • Total L2 hit latency: 10-30ms

L1 cache value proposition:

  • L1 hit: 242μs (0.242ms)
  • L2 hit: 2-5ms (local Redis)
  • Speedup: 8-20x faster with L1 cache

Async Decorator Performance

Async decorator + L1 hit: 192μs mean, 201μs p95

Compared to sync: ~6x faster than sync decorator (which showed 32μs mean in other tests, but this is likely due to measurement differences)

Why async is competitive:

  • Same L1 cache path (no await needed for memory lookups)
  • Async overhead is minimal for cache hits
  • Async benefits show up during cache misses (non-blocking I/O to Redis)

Performance Optimization Tips

1. Use L1 Cache Aggressively

Default configuration already enables L1:

from cachekit import cache

@cache  # L1 enabled by default
def expensive_function(user_id: int):
    return fetch_user_data(user_id)

L1 gives you:

  • 8-20x faster than Redis
  • Sub-millisecond latency
  • No network overhead

2. Choose the Right Serializer

For DataFrames (10K+ rows):

from cachekit.serializers import ArrowSerializer

@cache(serializer=ArrowSerializer())
def get_large_dataset(date: str):
    return load_dataframe(date)  # 5-10x faster serialization

For everything else:

@cache  # StandardSerializer (msgpack) is fine
def get_user_config(user_id: int):
    return {"settings": {...}, "preferences": {...}}

3. Batch Similar Queries

Bad (many small cache hits):

for user_id in user_ids:
    data = get_user_data(user_id)  # 100 cache hits = 24ms total

Good (one large cache hit):

@cache
def get_users_batch(user_ids: list[int]):
    return [fetch_user_data(uid) for uid in user_ids]

data = get_users_batch(user_ids)  # 1 cache hit = 242μs

4. Tune TTL for Hit Rate

Short TTL (high freshness, lower hit rate):

@cache(ttl=60)  # 1 minute
def get_real_time_price(symbol: str):
    return fetch_current_price(symbol)

Long TTL (lower freshness, higher hit rate):

@cache(ttl=86400)  # 24 hours
def get_historical_data(symbol: str, date: str):
    return fetch_historical(symbol, date)  # Immutable data

5. Monitor Cache Performance

Built-in metrics via Prometheus:

from cachekit.config import DecoratorConfig
from cachekit.config.nested import PrometheusConfig

config = DecoratorConfig(
    prometheus=PrometheusConfig(
        enabled=True,
        port=9090,
        namespace="my_app"
    )
)

@cache(config=config)
def cached_function():
    return expensive_computation()

Available metrics:

  • cache_hits_total: Total cache hits
  • cache_misses_total: Total cache misses
  • cache_latency_seconds: Latency histogram
  • cache_serialization_seconds: Serialization time
  • circuit_breaker_state: Circuit breaker state

See Prometheus Metrics for details.

Performance Bottlenecks and Mitigations

Bottleneck 1: Serialization Dominates Latency (82%)

Problem: MessagePack serialization takes 100μs for a 10KB dict, which is 200x slower than the L1 lookup (500ns).

Mitigations:

  • Reduce payload size: Cache only what you need
  • Use Arrow for DataFrames: 5-10x faster serialization
  • Enable compression: Already enabled by default (Rust layer)

Reality check: Even with serialization overhead, 242μs is still 8-20x faster than Redis.

Bottleneck 2: Network Latency (L2 Cache)

Problem: Redis L2 hit adds 2-5ms network RTT.

Mitigations:

  • L1 cache already handles this: 90%+ of cache hits should come from L1
  • Tune L1 size: Increase max_memory_mb if needed (default: 100MB)
  • Optimize L1 TTL: Match L1 TTL to data freshness requirements

Bottleneck 3: Decorator Overhead (35μs)

Problem: Decorator machinery adds 35μs on top of L1 cache.

Mitigations:

  • This is acceptable: 35μs is negligible compared to function execution time
  • For ultra-low-latency: Use direct StandardCacheHandler API (bypasses decorator)
  • Batch queries: Amortize decorator overhead across multiple items

Example (advanced):

from cachekit.cache_handler import StandardCacheHandler

handler = StandardCacheHandler(backend=redis_backend)

# Direct cache access (no decorator overhead)
found, value = handler.get("my_key")
if not found:
    value = expensive_function()
    handler.put("my_key", value, ttl=3600)

Bottleneck 4: Lock Contention (High Concurrency)

Problem: RLock acquisition takes 250ns, which can become a bottleneck at 100+ threads.

Mitigations:

  • Most apps don't hit this: Lock contention is minimal up to 10-20 threads
  • Shard your cache: Use multiple L1Cache instances with consistent hashing
  • Use async: Async decorator avoids blocking on locks

Reality check: Lock overhead (250ns) is 0.1% of total latency (242μs). Not worth optimizing unless you have extreme concurrency (100+ threads).

Performance Regression Testing

Baseline targets (fail CI if exceeded):

# tests/performance/test_production_realism.py
assert complex_dict_latency_p95 < 300_000  # 300μs
assert dataclass_latency_p95 < 200_000     # 200μs
assert concurrent_latency_p95 < 500_000    # 500μs
assert encryption_overhead < 3.0           # 3x max

Run regression tests:

uv run pytest tests/performance/ -v -m performance

Real-World Performance Context

Typical use case: API response caching

Without cachekit:

  • Database query: 50-200ms
  • Redis cache hit: 2-5ms
  • Best case: 2ms (Redis hit)

With cachekit:

  • L1 cache hit: 242μs (0.242ms)
  • L2 cache hit: 2-5ms (Redis)
  • Cache miss: 50-200ms (database)

With 90% L1 hit rate:

  • Average latency: 0.9 × 0.242ms + 0.1 × 2ms = 0.42ms
  • Speedup: 4.8x faster than Redis-only caching

Network latency dominates real-world performance. Even a "slow" 1ms cache operation is fast when you consider:

  • Typical API response time: 100-500ms
  • Database query: 50-200ms
  • External API call: 200-1000ms

Appendix: Raw Benchmark Data

L1 Cache Component Breakdown:

Lock acquisition:        250ns p95
Dict lookup:            125ns p95
TTL check:              208ns p95
LRU move:               125ns p95
Counter increment:      125ns p95
-----------------------------------
Total (measured):       458ns p95

Decorator Overhead (10KB dict):

Serialization:          100μs (41%)
Deserialization:        100μs (41%)
Decorator machinery:     20μs (8%)
Key generation:           2μs (1%)
L1 lookup:              0.5μs (0.2%)
Other (interpreter):     20μs (8%)
-----------------------------------
Total (measured):       242μs p95
95% CI:                 [208.5, 208.8]μs

Concurrent Access (10 threads):

Total operations:       10,000
Mean:                   844μs
Median:                 210μs
P95:                    231μs
P99:                    284μs

Arrow vs MessagePack (10K rows):

Arrow serialize:        0.48ms
Arrow deserialize:      0.32ms
MessagePack serialize:  1.64ms
MessagePack deserialize: 2.32ms

Serialization speedup:  3.4x
Deserialization speedup: 7.1x
Total speedup:          5.0x

Next Steps

Previous: Data Flow Architecture - Understand the system design Next: Comparison Guide - How cachekit compares to alternatives

See Also