
feat: LazyPlatformRouter for tenant-on-first-request platform construction #547

@bokelley

Description


Summary

PlatformRouter(platforms={...}) requires every per-tenant DecisioningPlatform to be eagerly constructed at boot. For adopters with N tenants, each needing its own SDK auth handshake (GAM service-account auth, Kevel API key handshake, etc.), boot time scales O(N), and adding or removing tenants requires a process restart.

A lazy variant resolves the platform on first request per tenant, caches the result, and supports invalidation for hot-reload.

Concrete shape (answering @bokelley's pre-implementation question)

Async factory + SDK-owned bounded LRU. This mirrors the pattern that just landed in CallableSubdomainTenantRouter (PR #544): the adopter writes a single async callable, the framework owns the cache with sane bounded defaults, and the adopter overrides the bounds when their N is unusual.

from adcp.decisioning import LazyPlatformRouter

async def build_platform(tenant_id: str) -> DecisioningPlatform:
    cfg = await load_tenant_config(tenant_id)
    if cfg.adapter == "google_ad_manager":
        return WonderstruckGamPlatform(cfg)
    elif cfg.adapter == "kevel":
        return KevelPlatform(cfg)
    return MockSellerPlatform(cfg)

router = LazyPlatformRouter(
    accounts=...,
    factory=build_platform,
    capabilities=...,
    cache_size=256,             # default; adopters with more tenants override
    cache_ttl_seconds=3600.0,   # default 1 hour; 0 = no expiry (still bounded by size)
)

Why SDK-owned cache (not pluggable):

  • The dominant footgun is forgetting to bound the cache, leaking adapters as tenants churn. The SDK enforces bounds; adopters can't accidentally ship an unbounded lru_cache(maxsize=None) knock-off.
  • The pluggable cache surface is real complexity for a real-but-rare need (e.g. cross-process platform sharing, which today's PlatformRouter doesn't support either).
  • If a meaningful adopter need surfaces, add cache: PlatformCache | None = None later — the bounded-LRU default stays the same.
  • TTL is allowed to be 0 (size-only eviction) since per-tenant adapters are typically expensive to rebuild and there's no equivalent "stale tenant data" footgun (CallableSubdomainTenantRouter rejected ttl=0 because tenants go stale; platform adapters don't).

Why async factory:

Building per-tenant adapters often involves network I/O — GAM auth handshake (~50-200ms), Kevel API ping, signed-request handshake against a remote endpoint. Forcing sync would either block the boot path (eager) or block the request thread (lazy + sync), neither of which is acceptable for the "looks fast on the wire" property.

Invalidation:

router.invalidate(tenant_id)   # specific tenant — tenant config rotated, force rebuild
router.invalidate()            # all platforms — useful for ops "drop everything" debugging

Behavior of invalidate() with a request in flight: a request that has already grabbed the platform reference completes normally (the caller holds the ref); the next request gets a fresh build. No request cancellation. This matches CallableSubdomainTenantRouter's contract.

Memory profile

  • Cache holds up to cache_size DecisioningPlatform instances.
  • Each instance's memory profile is adopter-defined (an instance with a cached httpx.AsyncClient could hold a connection pool).
  • Bounded LRU + TTL means platforms for inactive tenants get released; adopter __del__ / aclose() runs through normal GC.
  • Critical for the salesagent slow-leak triage: the alternative (today's eager PlatformRouter) holds every platform forever — cache here is strictly an improvement.

Thundering herd note

First request for a cold tenant builds the platform. If two concurrent requests for the same tenant hit a cold cache, both await the factory; asyncio's cooperative scheduling means no corruption (last write wins, and both refs are equivalent), but the auth handshake runs twice. Singleflight (one build per tenant under contention) is overkill for v1; revisit if adopters report DB-pressure or API-rate-limit spikes.
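For reference, the deferred singleflight variant is small: a per-tenant asyncio.Lock with a re-check after acquisition ensures one factory call per cold tenant. A hypothetical sketch (class and method names are illustrative, not the SDK API):

```python
import asyncio


class SingleflightCache:
    """One factory call per cold tenant, even under concurrent requests."""

    def __init__(self, factory):
        self._factory = factory
        self._cache: dict[str, object] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def get(self, tenant_id: str):
        if tenant_id in self._cache:             # fast path: warm cache
            return self._cache[tenant_id]
        lock = self._locks.setdefault(tenant_id, asyncio.Lock())
        async with lock:
            # Re-check: a concurrent request may have built it while we waited.
            if tenant_id not in self._cache:
                self._cache[tenant_id] = await self._factory(tenant_id)
        return self._cache[tenant_id]
```

The cost is the per-tenant lock bookkeeping, which is why the note above treats it as a v2 concern rather than a v1 requirement.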

Drop-in shape with existing PlatformRouter

LazyPlatformRouter IS a DecisioningPlatform (it satisfies the same Protocol as today's eager PlatformRouter), so serve() accepts either. Adopters migrate by swapping the constructor; no changes to accounts, capabilities, or serve() wiring.
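Structural typing is what makes the swap constructor-only. A minimal sketch with a hypothetical stand-in Protocol (the real DecisioningPlatform Protocol has more methods; this just shows the mechanism):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class PlatformLike(Protocol):
    """Hypothetical stand-in for the SDK's DecisioningPlatform Protocol."""

    async def decide(self, request: dict) -> dict: ...


class EagerRouterSketch:
    async def decide(self, request: dict) -> dict:
        return {"router": "eager", **request}


class LazyRouterSketch:
    async def decide(self, request: dict) -> dict:
        return {"router": "lazy", **request}
```

Both classes satisfy PlatformLike structurally, with no shared base class, so any serve()-style entry point typed against the Protocol accepts either router unchanged.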

What this unblocks

salesagent's core/main.py::_load_platforms() currently iterates the active tenants table and builds every platform at boot — for the Wonderstruck demo (2 tenants) that's fine, but for production-shaped deployments with 50-500 tenants the boot cost compounds. Hot-add of a new tenant requires a restart today; with the lazy router, the new tenant works on first request after router.invalidate() (or naturally after TTL).

The working pattern in salesagent's core/main.py may be useful as a cross-reference for the implementation.

Memory-leak lens

The salesagent production memory leak (linear ramp to ~12 GB ceiling, ~3-4 day OOM cycle) makes per-tenant resource tracking a current priority. LazyPlatformRouter with bounded cache is strictly safer than today's eager router for that profile — ship it bounded by default.

🤖 Filed via Claude Code
