Skip to content

feat(decisioning): WebhookDeliverySupervisor + SQLAlchemy A2A stores example#348

Merged
bokelley merged 2 commits intomainfrom
bokelley/webhook-supervisor
May 2, 2026
Merged

feat(decisioning): WebhookDeliverySupervisor + SQLAlchemy A2A stores example#348
bokelley merged 2 commits intomainfrom
bokelley/webhook-supervisor

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

@bokelley bokelley commented May 2, 2026

Summary

Round-2 P1: Protocol seam for webhook delivery reliability. The existing `WebhookSender` is the transport (HTTP-Signatures POST, one attempt, no retry); production sellers wrap it with retry / circuit breaker / per-attempt audit. This adds the SDK-side seam so adopters compose existing primitives instead of writing ~1,000 LOC of reliability themselves.

Plus a SQLAlchemy A2A stores example (companion to the existing SQLite reference) so adopters with existing SQLAlchemy schemas wrap their models behind `TaskStore` + `PushNotificationConfigStore`.

What's new

`src/adcp/webhook_supervisor.py` (522 LOC):

  • `WebhookDeliverySupervisor` Protocol — async `send_mcp` surface. Adopters with infra-side retry (Celery, Kafka, durable outbox) implement against their queue.
  • `InMemoryWebhookDeliverySupervisor` — reference impl. Wraps a `WebhookSender` with:
    • Per-endpoint `CircuitBreaker` (CLOSED / OPEN / HALF_OPEN state machine, 5-failure threshold, 60s recovery, 2-success close)
    • `RetryPolicy` (exponential backoff with jitter, max 3 attempts, 30s cap)
    • Per-`sequence_key` monotonic counter for delivery-report sequence numbers
    • Optional `DeliveryLogSink` Protocol for BYO persistence
  • `DeliveryAttempt` frozen dataclass — one record per attempt (success / failure / circuit_open) for audit-log persistence
  • `DeliveryLogSink` Protocol — adopters wire to their existing `webhook_delivery_log` tables; sink failures swallowed (broken sink must not cascade into delivery loss)

Wire-through

  • `PlatformHandler` accepts `webhook_supervisor=`; F12 auto-emit routes through supervisor when configured, falls back to bare sender otherwise. Backward-compat: existing `webhook_sender=` path unchanged.
  • `serve()` + `create_adcp_server_from_platform()` forward the new param.
  • Boot-time `validate_webhook_sender_for_platform` now accepts either a sender or a supervisor (missing-sender error code: `webhook_sender_or_supervisor`).

SQLAlchemy A2A stores example

`examples/a2a_sqlalchemy_tasks.py` (450 LOC) — companion to `examples/a2a_db_tasks.py` (raw-SQLite reference). Same Protocol surface backed by SQLAlchemy ORM so the same code runs against any backend SQLA supports — SQLite for the demo, Postgres / MySQL in production. Salesagent and other SQLAlchemy-based sellers wrap their existing models behind the Protocols.

Salesagent impact

  • `webhook_delivery_service.py` (567 LOC) → wraps existing HMAC sender with `InMemoryWebhookDeliverySupervisor`, plug `WebhookDeliveryLog` table into `DeliveryLogSink`. Net delete: ~500 LOC.
  • `protocol_webhook_service.py` (474 LOC) → similarly composes against the Protocol.
  • A2A stores: their `Task` + `PushNotificationConfig` models wrap behind the SQLAlchemy reference's pattern (~50 LOC adopter code, replacing 200+ LOC of bespoke A2A-store wiring).

Test plan

  • `pytest tests/test_webhook_supervisor.py` — 22 new tests pass
  • `pytest tests/test_decisioning_serve.py tests/test_decisioning_webhook_emit.py` — 57 affected tests pass (existing tests updated for supervisor parameter)
  • Full suite passes
  • `ruff check` clean
  • `mypy` clean

Cross-links

🤖 Generated with Claude Code

bokelley and others added 2 commits May 2, 2026 13:02
…res example

Round-2 P1 reliability layer for F12 sync-completion webhooks. The
existing WebhookSender is the transport (HTTP-Signatures POST, one
attempt, no retry); production sellers wrap it with retry, circuit
breaker, and per-attempt audit. This adds the SDK-side seam.

Tracks adopter feedback that salesagent's webhook reliability layer
(~1,041 LOC across webhook_delivery_service.py + protocol_webhook_service.py)
has no SDK home. After this PR, those LOC compose against the Protocol
seam instead of being adopter-rolled.

Components in src/adcp/webhook_supervisor.py:

* WebhookDeliverySupervisor Protocol — async send_mcp surface,
  Protocol-conformant; adopters with infra-side retry (Celery, Kafka,
  durable outbox) implement against their queue.
* InMemoryWebhookDeliverySupervisor reference impl — wraps a
  WebhookSender with per-endpoint CircuitBreaker (CLOSED/OPEN/HALF_OPEN
  state machine, 5-failure threshold, 60s recovery), exponential-
  backoff RetryPolicy with jitter, per-sequence_key monotonic counter
  for delivery-report sequence numbers, optional DeliveryLogSink
  Protocol for BYO persistence.
* DeliveryAttempt frozen dataclass — one record per attempt
  (success / failure / circuit_open) for audit-log persistence.
* DeliveryLogSink Protocol — adopters wire sinks to their existing
  webhook_delivery_log tables; sink failures swallowed (broken sink
  must not cascade into delivery loss).

Wire-through:

* PlatformHandler accepts webhook_supervisor=; F12 auto-emit routes
  through supervisor when configured, falls back to bare sender
  otherwise. Backward-compat: existing webhook_sender= path unchanged.
* serve() and create_adcp_server_from_platform() forward the new
  param. Boot-time validate_webhook_sender_for_platform now accepts
  either a sender or a supervisor (changed missing-sender error code
  from "webhook_sender" to "webhook_sender_or_supervisor").

A2A stores example: examples/a2a_sqlalchemy_tasks.py. Companion to
the SQLite reference (examples/a2a_db_tasks.py). Same Protocol
surface (TaskStore + PushNotificationConfigStore) backed by
SQLAlchemy ORM so the same code runs against any backend SQLA
supports — SQLite for the demo, Postgres / MySQL in production.
Salesagent and other SQLAlchemy-based sellers wrap their existing
models behind the Protocols.

Tests: 22 new in test_webhook_supervisor.py (retry math, circuit
breaker state machine, supervisor success/retry/exception paths,
sink failure isolation, F12 wire-through, boot-validation).
Existing F12 + serve tests updated for the supervisor parameter.
79 affected tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aker_key, sink timeout

Ten findings from four expert reviews on the WebhookDeliverySupervisor:

**Critical (spec compliance):**
- Idempotency-key reuse on retry — spec requires reusing the same key
  on every retry. Refactored to use sender.resend(last_result) for
  attempts 2+; only attempt 1 (or after exception with no result) calls
  send_mcp fresh.

**High:**
- Cross-tenant breaker collision — added optional breaker_key parameter;
  multi-tenant sellers scope via f"{tenant_id}:{url}".
- Sink unbounded on hot path — wrapped in asyncio.wait_for with
  configurable RetryPolicy.sink_timeout_seconds (default 5s).
- Sequence number burned on circuit-open — allocated only after
  can_attempt() returns True.
- InMemoryWebhookDeliverySupervisor.__init__ now raises ValueError when
  sender is None (preserves F12 boot fail-fast).
- UTC = timezone.utc moved below all imports for ruff isort compliance.

**Medium:**
- response_time_ms switched from datetime deltas to time.monotonic().
- record_success while OPEN now warns + transitions to HALF_OPEN with
  success_count=1 (was silently flipping to CLOSED).
- DeliveryAttempt.notification_type new optional field for delivery
  reports parity with salesagent's WebhookDeliveryLog.
- sequence_key docstring clarified per-stream recommendation
  (f"{media_buy_id}:{url}").

Tests: 30 (up from 22). 8 new tests cover idempotency-key reuse via
resend, breaker_key tenant isolation, sequence-number no-burn on
circuit-open, sink timeout, notification_type passthrough, monotonic
clock, init-time None-sender rejection, sender-only backward-compat.

2,990 total tests pass. ruff + mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley
Copy link
Copy Markdown
Contributor Author

bokelley commented May 2, 2026

Expert review feedback addressed (commit d267fb3)

Four reviews ran (code-reviewer, security-reviewer, ad-tech-protocol-expert, python-expert). Ten findings folded in.

Critical (spec compliance):

  • Idempotency-key reuse on retry — spec mcp-webhook-payload.json:9 requires publishers reuse the same key on every retry. Refactored retry loop to use sender.resend(last_result) for attempts 2+ (replays exact bytes including idempotency_key). Only attempt 1, or attempts after a predecessor that raised before producing bytes, call send_mcp fresh.

High:

  • Cross-tenant breaker collision (review M2) — added breaker_key param to send_mcp. Multi-tenant sellers whose buyers share a SaaS URL (Zapier, Make, etc.) scope via f"{tenant_id}:{url}".
  • Sink unbounded on hot path (review M1) — wrapped in asyncio.wait_for with configurable RetryPolicy.sink_timeout_seconds (default 5s). Misbehaving sink can no longer freeze the supervisor.
  • Sequence number burned on circuit-open — allocated only after can_attempt() returns True. No gaps in the buyer-facing stream from circuit-open skips.
  • Boot validator: supervisor wrapping None senderInMemoryWebhookDeliverySupervisor.__init__ now raises ValueError. Preserves F12 boot fail-fast.
  • UTC = timezone.utc placement — moved below all imports (was wedged mid-import-block, breaking ruff isort).

Medium:

  • response_time_ms uses time.monotonic() — NTP-step-resilient.
  • record_success while OPEN — was silently flipping to CLOSED; now logs WARNING + transitions to HALF_OPEN with success_count=1 (single racy success isn't enough to declare healthy).
  • DeliveryAttempt.notification_type — new optional passthrough for delivery reports (scheduled / final / adjusted / delayed / window_update). Parity with salesagent's WebhookDeliveryLog.
  • sequence_key docstring — clarified per-stream recommendation: f"{media_buy_id}:{url}" so multiple subscribers per media buy each get monotonic numbering.

Tests: 30 (up from 22). 8 new covering: idempotency-key reuse via resend, breaker_key tenant isolation + URL fallback, sequence-number no-burn on circuit-open, sink timeout bound, notification_type passthrough, monotonic clock, init-time None-sender rejection, sender-only backward-compat boot validation.

Verification:

  • 30 supervisor tests pass.
  • 2,990 total tests pass, 0 regressions.
  • ruff + mypy clean.

Findings deferred (low/info, not merge-blockers):

  • Async-prefix naming convention (InMemory* vs AsyncInMemory*) — consistent with existing InMemoryTaskRegistry etc., per code-reviewer's "consistent with adjacent files."
  • HALF_OPEN allows one test attempt at a time docstring updated to note no in-flight counter (fix #L1).
  • supervisor_or_sender helper removed unused import comment (was unused; webhook_emit inlines the precedence).
  • examples/a2a_sqlalchemy_tasks.py docstring filename inconsistency — left for follow-up.
  • Sequence-number cross-process collision warning surfaced once at first allocation (review L1) ✅ added.
  • Buyer-supplied response bytes in error_message documented in DeliveryAttempt docstring (review L2) ✅ added.

@bokelley
Copy link
Copy Markdown
Contributor Author

bokelley commented May 2, 2026

Thanks for the thorough write-up. All 10 findings addressed, test count up from 22 → 30, 2,990 total passing, ruff + mypy clean — looks ready for human review.


Generated by Claude Code

@bokelley
Copy link
Copy Markdown
Contributor Author

bokelley commented May 2, 2026

Acknowledged — good to see all 10 findings resolved and test coverage up to 30 in this suite (2,990 total passing). Handing off to human review.


Triaged by Claude Code. Session: https://claude.ai/code/session_012oVqtFW56c8kWh7HvU8SzY


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant