Summary
PlatformRouter(platforms={...}) requires every per-tenant DecisioningPlatform to be eagerly constructed at boot. For adopters with N tenants × per-tenant SDK auth handshakes (GAM service-account auth, Kevel API key handshake, etc.), boot time scales O(N) and adding/removing tenants requires a process restart.
A lazy variant resolves the platform on first request per tenant, caches the result, and supports invalidation for hot-reload.
Concrete shape (answering @bokelley's pre-implementation question)
Async factory + SDK-owned bounded LRU. Mirrors the pattern just landed in CallableSubdomainTenantRouter (PR #544): adopter writes a single async callable, framework owns the cache with sane bounded defaults, adopter overrides bounds when their N is unusual.
from adcp.decisioning import LazyPlatformRouter

# Adopter-supplied async factory: builds one platform per tenant on demand.
async def build_platform(tenant_id: str) -> DecisioningPlatform:
    cfg = await load_tenant_config(tenant_id)
    if cfg.adapter == "google_ad_manager":
        return WonderstruckGamPlatform(cfg)
    elif cfg.adapter == "kevel":
        return KevelPlatform(cfg)
    return MockSellerPlatform(cfg)

router = LazyPlatformRouter(
    accounts=...,
    factory=build_platform,
    capabilities=...,
    cache_size=256,  # default; adopters with more tenants override
    cache_ttl_seconds=3600.0,  # default 1 hour; 0 = no expiry (still bounded by size)
)
Why SDK-owned cache (not pluggable):
- The dominant footgun is forgetting to bound the cache, leaking adapters as tenants churn. The SDK enforces bounds; adopters can't accidentally ship an unbounded lru_cache(maxsize=None) knock-off.
- The pluggable cache surface is real complexity for a real-but-rare need (e.g. cross-process platform sharing, which today's PlatformRouter doesn't support either).
- If a meaningful adopter need surfaces, add cache: PlatformCache | None = None later; the bounded-LRU default stays the same.
- TTL is allowed to be 0 (size-only eviction) since per-tenant adapters are typically expensive to rebuild and there's no equivalent "stale tenant data" footgun (CallableSubdomainTenantRouter rejected ttl=0 because tenants go stale; platform adapters don't).
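To make the bounding rules above concrete, here is a minimal sketch of a size- and TTL-bounded LRU. It is an illustration only, not the SDK's cache: the class name BoundedTtlLru, its method names, and the injectable clock parameter are all assumptions made for this example. It follows the semantics described above, where ttl_seconds=0 disables expiry but the size bound still applies.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class BoundedTtlLru:
    """Hypothetical size- and TTL-bounded LRU (illustrative, not the SDK's)."""

    def __init__(
        self,
        max_size: int = 256,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if max_size < 1:
            raise ValueError("max_size must be >= 1")
        if ttl_seconds < 0:
            raise ValueError("ttl_seconds must be >= 0")
        self._max_size = max_size
        self._ttl = ttl_seconds  # 0 means size-only eviction
        self._clock = clock
        # Maps key -> (stored_at, value); insertion order tracks recency.
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._ttl > 0 and self._clock() - stored_at >= self._ttl:
            # Expired: drop the entry so the caller rebuilds it.
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        # Evict least recently used entries until within the size bound.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)

    def __len__(self) -> int:
        return len(self._entries)
```

Because the size bound is enforced in put() on every write, there is no configuration of this sketch that grows without limit, which is the property the SDK-owned design is meant to guarantee.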
Why async factory:
Building per-tenant adapters often involves network I/O — GAM auth handshake (~50-200ms), Kevel API ping, signed-request handshake against a remote endpoint. Forcing sync would either block the boot path (eager) or block the request thread (lazy + sync), neither of which is acceptable for the "looks fast on the wire" property.
Invalidation:
router.invalidate(tenant_id) # specific tenant — tenant config rotated, force rebuild
router.invalidate() # all platforms — useful for ops "drop everything" debugging
Behavior when invalidate() is called while a request is in flight: a request that has already obtained the platform reference completes normally (the caller holds the ref); the next request for that tenant gets a fresh build. No request cancellation. Matches CallableSubdomainTenantRouter's contract.
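The invalidation contract can be sketched as follows. This is a hypothetical stand-in, not the proposed LazyPlatformRouter: the class LazyCacheSketch and its get() method are assumptions for illustration, and the size and TTL bounds are omitted for brevity. It shows that a caller holding a reference is unaffected by invalidate(), while the next lookup triggers a rebuild.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional


class LazyCacheSketch:
    """Hypothetical lazy per-tenant cache (bounds omitted for brevity)."""

    def __init__(self, factory: Callable[[str], Awaitable[Any]]) -> None:
        self._factory = factory
        self._platforms: Dict[str, Any] = {}

    async def get(self, tenant_id: str) -> Any:
        platform = self._platforms.get(tenant_id)
        if platform is None:
            # Cold tenant: build on first request, then cache.
            platform = await self._factory(tenant_id)
            self._platforms[tenant_id] = platform
        return platform

    def invalidate(self, tenant_id: Optional[str] = None) -> None:
        # Only drops the cache's own reference; callers that already
        # hold a platform keep using it until they finish.
        if tenant_id is None:
            self._platforms.clear()
        else:
            self._platforms.pop(tenant_id, None)


async def demo():
    builds = []

    async def factory(tenant_id: str) -> object:
        builds.append(tenant_id)
        return object()  # stand-in for a DecisioningPlatform

    cache = LazyCacheSketch(factory)
    held = await cache.get("tenant-a")   # in-flight request holds this ref
    cache.invalidate("tenant-a")         # config rotated mid-request
    fresh = await cache.get("tenant-a")  # next request rebuilds
    return held is not fresh, builds
```

Running asyncio.run(demo()) shows two factory calls for the same tenant and two distinct platform objects, with the first still usable by whoever holds it.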
Memory profile
- Cache holds up to cache_size DecisioningPlatform instances.
- Each instance's memory profile is adopter-defined (an instance with a cached httpx.AsyncClient could hold a connection pool).
- Bounded LRU + TTL means platforms for inactive tenants get released; adopter __del__ / aclose() runs through normal GC.
- Critical for the salesagent slow-leak triage: the alternative (today's eager PlatformRouter) holds every platform forever, so the cache here is strictly an improvement.
Thundering herd note
First request for a cold tenant builds the platform. If two concurrent requests for the same tenant hit a cold cache, both await the factory; asyncio cooperative scheduling means no corruption (last-write wins, both refs are equivalent), but the auth handshake runs 2x. Singleflight (one-build-per-tenant under contention) is overkill for v1; flag if adopters report DB-pressure / API-rate-limit spikes.
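The duplicate build described above, and what singleflight would change if it were added later, can be sketched with plain asyncio. Everything here is hypothetical: slow_factory stands in for an adopter's auth handshake, and get_naive / get_singleflight are illustrative helpers, not SDK functions. The singleflight variant does not handle caller cancellation and is only meant to show the shape of the idea.

```python
import asyncio
from typing import Dict, List


async def slow_factory(tenant_id: str, calls: List[str]) -> dict:
    calls.append(tenant_id)    # count how many builds actually run
    await asyncio.sleep(0.01)  # stand-in for an auth handshake
    return {"tenant": tenant_id}


async def get_naive(cache: Dict[str, dict], tenant_id: str, calls: List[str]) -> dict:
    # v1 behavior: concurrent cold requests each run the factory;
    # the last write wins and both results are equivalent.
    if tenant_id not in cache:
        cache[tenant_id] = await slow_factory(tenant_id, calls)
    return cache[tenant_id]


async def get_singleflight(
    cache: Dict[str, dict],
    inflight: Dict[str, "asyncio.Task"],
    tenant_id: str,
    calls: List[str],
) -> dict:
    if tenant_id in cache:
        return cache[tenant_id]
    task = inflight.get(tenant_id)
    if task is None:
        # First cold request starts the build; later ones share the task.
        task = asyncio.create_task(slow_factory(tenant_id, calls))
        inflight[tenant_id] = task
    try:
        platform = await task
    finally:
        inflight.pop(tenant_id, None)  # allow a retry after failure
    cache[tenant_id] = platform
    return platform


async def demo():
    naive_calls: List[str] = []
    sf_calls: List[str] = []
    naive_cache: Dict[str, dict] = {}
    sf_cache: Dict[str, dict] = {}
    inflight: Dict[str, asyncio.Task] = {}
    await asyncio.gather(
        get_naive(naive_cache, "t1", naive_calls),
        get_naive(naive_cache, "t1", naive_calls),
    )
    await asyncio.gather(
        get_singleflight(sf_cache, inflight, "t1", sf_calls),
        get_singleflight(sf_cache, inflight, "t1", sf_calls),
    )
    return len(naive_calls), len(sf_calls)
```

With two concurrent cold requests, the naive path runs the factory twice and the singleflight path runs it once. The extra bookkeeping (an in-flight map plus failure and cancellation handling) is the cost that the v1 proposal defers.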
Drop-in shape with existing PlatformRouter
LazyPlatformRouter IS a DecisioningPlatform (same Protocol satisfaction as today's eager PlatformRouter). serve() accepts either. Adopters migrate by swapping the constructor; no changes to accounts, capabilities, serve() wiring.
What this unblocks
salesagent's core/main.py::_load_platforms() currently iterates the active tenants table and builds every platform at boot — for the Wonderstruck demo (2 tenants) that's fine, but for production-shaped deployments with 50-500 tenants the boot cost compounds. Hot-add of a new tenant requires a restart today; with the lazy router, the new tenant works on first request after router.invalidate() (or naturally after TTL).
A working reference for the pattern is in salesagent's core/main.py, if useful as a cross-reference during implementation.
Memory-leak lens
The salesagent production memory leak (linear ramp to ~12 GB ceiling, ~3-4 day OOM cycle) makes per-tenant resource tracking a current priority. LazyPlatformRouter with bounded cache is strictly safer than today's eager router for that profile — ship it bounded by default.
🤖 Filed via Claude Code