
feat: LazyPlatformRouter for tenant-on-first-request platform construction #547

@bokelley

Description


Summary

PlatformRouter(platforms={...}) requires every per-tenant DecisioningPlatform to be eagerly constructed at boot. For adopters with N tenants, each needing its own SDK auth handshake (GAM service-account auth, Kevel API key handshake, etc.), boot time scales O(N), and adding or removing tenants requires a process restart.

A lazy variant resolves the platform on first request per tenant, caches the result, and supports invalidation for hot-reload.

Concrete shape (answering @bokelley's pre-implementation question)

Async factory + SDK-owned bounded LRU. This mirrors the pattern that just landed in CallableSubdomainTenantRouter (PR #544): the adopter writes a single async callable, the framework owns the cache with sane bounded defaults, and the adopter overrides the bounds when their N is unusual.

from adcp.decisioning import LazyPlatformRouter

async def build_platform(tenant_id: str) -> DecisioningPlatform:
    cfg = await load_tenant_config(tenant_id)
    if cfg.adapter == "google_ad_manager":
        return WonderstruckGamPlatform(cfg)
    elif cfg.adapter == "kevel":
        return KevelPlatform(cfg)
    return MockSellerPlatform(cfg)

router = LazyPlatformRouter(
    accounts=...,
    factory=build_platform,
    capabilities=...,
    cache_size=256,             # default; adopters with more tenants override
    cache_ttl_seconds=3600.0,   # default 1 hour; 0 = no expiry (still bounded by size)
)

Why SDK-owned cache (not pluggable):

  • The dominant footgun is forgetting to bound the cache, leaking adapters as tenants churn. The SDK enforces bounds; adopters can't accidentally ship an unbounded lru_cache(maxsize=None) knock-off.
  • The pluggable cache surface is real complexity for a real-but-rare need (e.g. cross-process platform sharing, which today's PlatformRouter doesn't support either).
  • If a meaningful adopter need surfaces, add cache: PlatformCache | None = None later — the bounded-LRU default stays the same.
  • TTL is allowed to be 0 (size-only eviction) since per-tenant adapters are typically expensive to rebuild and there's no equivalent "stale tenant data" footgun (CallableSubdomainTenantRouter rejected ttl=0 because tenants go stale; platform adapters don't).

Why async factory:

Building per-tenant adapters often involves network I/O — GAM auth handshake (~50-200ms), Kevel API ping, signed-request handshake against a remote endpoint. Forcing sync would either block the boot path (eager) or block the request thread (lazy + sync), neither of which is acceptable for the "looks fast on the wire" property.

Invalidation:

router.invalidate(tenant_id)   # specific tenant — tenant config rotated, force rebuild
router.invalidate()            # all platforms — useful for ops "drop everything" debugging

Behavior of invalidate() with a request in flight: a request that has already grabbed the platform reference completes normally (the caller holds the ref); the next request gets a fresh build. No request cancellation. This matches CallableSubdomainTenantRouter's contract.

Memory profile

  • Cache holds up to cache_size DecisioningPlatform instances.
  • Each instance's memory profile is adopter-defined (an instance with a cached httpx.AsyncClient could hold a connection pool).
  • Bounded LRU + TTL means platforms for inactive tenants get released; adopter __del__ / aclose() runs through normal GC.
  • Critical for the salesagent slow-leak triage: the alternative (today's eager PlatformRouter) holds every platform forever — cache here is strictly an improvement.

Thundering herd note

First request for a cold tenant builds the platform. If two concurrent requests for the same tenant hit a cold cache, both await the factory; asyncio's cooperative scheduling means no corruption (last write wins, and both refs are equivalent), but the auth handshake runs twice. Singleflight (one build per tenant under contention) is overkill for v1; revisit if adopters report DB-pressure or API-rate-limit spikes.
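For reference, the deferred singleflight variant is small: a per-tenant asyncio.Lock with a re-check after acquisition ensures one factory call per cold tenant. A hypothetical sketch (class and method names are illustrative, not the SDK API):

```python
import asyncio


class SingleflightCache:
    """One factory call per cold tenant, even under concurrent requests."""

    def __init__(self, factory):
        self._factory = factory
        self._cache: dict[str, object] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def get(self, tenant_id: str):
        if tenant_id in self._cache:             # fast path: warm cache
            return self._cache[tenant_id]
        lock = self._locks.setdefault(tenant_id, asyncio.Lock())
        async with lock:
            # Re-check: a concurrent request may have built it while we waited.
            if tenant_id not in self._cache:
                self._cache[tenant_id] = await self._factory(tenant_id)
        return self._cache[tenant_id]
```

The cost is the per-tenant lock bookkeeping, which is why the note above treats it as a v2 concern rather than a v1 requirement.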

Drop-in shape with existing PlatformRouter

LazyPlatformRouter IS a DecisioningPlatform (it satisfies the same Protocol as today's eager PlatformRouter), so serve() accepts either. Adopters migrate by swapping the constructor; no changes to accounts, capabilities, or serve() wiring.
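Structural typing is what makes the swap constructor-only. A minimal sketch with a hypothetical stand-in Protocol (the real DecisioningPlatform Protocol has more methods; this just shows the mechanism):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class PlatformLike(Protocol):
    """Hypothetical stand-in for the SDK's DecisioningPlatform Protocol."""

    async def decide(self, request: dict) -> dict: ...


class EagerRouterSketch:
    async def decide(self, request: dict) -> dict:
        return {"router": "eager", **request}


class LazyRouterSketch:
    async def decide(self, request: dict) -> dict:
        return {"router": "lazy", **request}
```

Both classes satisfy PlatformLike structurally, with no shared base class, so any serve()-style entry point typed against the Protocol accepts either router unchanged.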

What this unblocks

salesagent's core/main.py::_load_platforms() currently iterates the active tenants table and builds every platform at boot — for the Wonderstruck demo (2 tenants) that's fine, but for production-shaped deployments with 50-500 tenants the boot cost compounds. Hot-add of a new tenant requires a restart today; with the lazy router, the new tenant works on first request after router.invalidate() (or naturally after TTL).

The working pattern in salesagent's core/main.py may be useful as a cross-reference for the implementation.

Memory-leak lens

The salesagent production memory leak (linear ramp to ~12 GB ceiling, ~3-4 day OOM cycle) makes per-tenant resource tracking a current priority. LazyPlatformRouter with bounded cache is strictly safer than today's eager router for that profile — ship it bounded by default.

🤖 Filed via Claude Code
