Make the SDK fast for 1-agent use; adapt to multi-agent automatically #34

@aural-psynapse

The SDK was built for two-cluster handoff (A records HTTP, B verifies via proofs). Real usage today — customer-support-sdk-demo, anything Claude-Code-shaped — is one agent doing everything, and that flow pays multi-agent ceremony on every tool call. This proposal removes the ceremony and speeds up multi-agent at the same time. Same code path, no mode flag, no detection logic.


3 tool calls, 3 claims (two of them hit the same intercept row):

NOW  —  every backend round-trip is sequential

  agent  ─► tool 1 ──► [ preprocess poll ]
  agent  ─► tool 2 ──► [ preprocess poll ]
  agent  ─► tool 3 ──► [ preprocess poll ]
  build  ─► QR claim 1
  build  ─► QR claim 2   (duplicates QR1 — same intercept row)
  build  ─► QR claim 3
  eval   ─► fetch claim 1
  eval   ─► fetch claim 2
  eval   ─► fetch claim 3
  eval   ─► verify QR1
  eval   ─► verify QR2
  eval   ─► verify QR3
  ──────────────────────────────────────────────────────────►  time

PROPOSED  —  preprocess coalesced, claims deduped, the rest parallel

  agent  ─► tool 1, tool 2, tool 3                (no preprocess wait)
  worker ─►   [ 1 preprocess, debounced ]
  build  ─►   [ QR1, QR2  in parallel ]           (3 claims → 2 records)
  eval   ─►   [ fetch1, fetch2, fetch3  in parallel ]
  eval   ─►   [ verify1, verify2  in parallel ]
  ──────────────────────────────────────────────────────────►  time

For K claims over M unique intercept rows:

Stage                 Now                    Proposed
Tool-call latency     K × preprocess wait    ~0 (worker is async)
Preprocess runs       K (one per call)       1 (debounced)
Query records made    K (one per claim)      M (deduped by row)
Evaluator fetches     K sequential           K parallel (≤8 at once)
Evaluator verifies    K sequential           M parallel (≤8 at once)

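K=10, M=5 → 10 preprocesses become 1, 10 query records become 5, and 20 sequential evaluator round-trips (10 fetches + 10 verifies, since today every claim has its own query record) become 2 parallel stages: 10 fetches, then 5 verifies, each capped at 8 in flight.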


What's slow today

Walking through the same 3-tool-call, 3-claim run:

  1. Each tool call blocks while preprocess runs end-to-end. _storage.py:170 synchronously kicks preprocess after every intercept and polls until done. 3 calls = 3 sequential preprocess runs blocking the agent.
  2. Each claim's query record is created serially. payload_builder.py:141-162 loops claims doing POST /query → POST /generate_proof → poll, one at a time. K claims = K sequential round-trips, even when claims duplicate.
  3. The evaluator does the same. evaluator.py:113-133 fetches each claim's record serially; :183-209 verifies each unique query_record_id serially.
  4. set_interceptor_context is mandatory and easy to get wrong. Interceptor default "unknown" (interceptor.py:238) doesn't match payload-builder default "fetch_and_claim" (payload_builder.py:30) — forget the wrap and the lookup silently misses with an empty payload.
  5. Bootstrap always runs preprocess. client.py:50 runs it even on padding-only tables.
  6. Polling floors are too high. _preprocess.py:102/117 poll every 0.3s / 0.1s — most preprocesses finish faster than the floor.

The fix

One worker thread coalesces preprocess. Replace the synchronous per-intercept call in _storage.py:170 with a "dirty" flag picked up by a debounced background worker. Worker runs preprocess once per 50 ms window. _build_claims and evaluate_handoff block on a condition variable until the proof catches up to their snapshot (SELECT MAX(id) FROM provably_intercepts).

This is the whole single-vs-multi-agent story in one mechanism:

  • 1 agent, 10 sequential calls → 1 preprocess (was 10)
  • N agents interleaving → still 1 worker; each agent's evaluate waits for its own snapshot
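
A minimal sketch of the dirty-flag worker and snapshot wait described above, not the actual _preprocess.py change. The class name, the run_preprocess callable, and wait_for() are hypothetical; only mark_dirty(), the 50 ms window, and the condition-variable wait come from this proposal. It assumes intercept ids increase monotonically, so "caught up" can be expressed as a high-water mark.

```python
import threading
import time


class DebouncedPreprocessWorker:
    """Coalesces many mark_dirty() calls into few preprocess runs (sketch)."""

    def __init__(self, run_preprocess, debounce_s=0.05):
        # run_preprocess is a hypothetical callable taking the highest
        # intercept id that the run should cover.
        self._run_preprocess = run_preprocess
        self._debounce_s = debounce_s
        self._cond = threading.Condition()
        self._latest_id = 0      # highest intercept id recorded so far
        self._processed_id = 0   # highest id covered by a finished run
        self._stopped = False
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def mark_dirty(self, intercept_id):
        """Called on the intercept path; returns without waiting."""
        with self._cond:
            self._latest_id = max(self._latest_id, intercept_id)
            self._cond.notify_all()

    def wait_for(self, snapshot_id, timeout=None):
        """Block until a finished run covers snapshot_id; False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._processed_id >= snapshot_id, timeout=timeout
            )

    def shutdown(self):
        with self._cond:
            self._stopped = True
            self._cond.notify_all()
        self._thread.join()

    def _loop(self):
        while True:
            with self._cond:
                self._cond.wait_for(
                    lambda: self._stopped
                    or self._latest_id > self._processed_id
                )
                if self._stopped:
                    return
            # Debounce window: let further intercepts pile up first.
            time.sleep(self._debounce_s)
            with self._cond:
                target = self._latest_id
            self._run_preprocess(target)  # slow call runs outside the lock
            with self._cond:
                self._processed_id = target
                self._cond.notify_all()
```

In this sketch a caller such as _build_claims would read its snapshot id (the SELECT MAX(id) above) and call wait_for(snapshot_id). Several agents can wait on different snapshots against the same worker, and a run that covers a later id also releases everyone waiting on an earlier one.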

Dedupe per-intercept query records. Today payload_builder.py:141-162 creates a query record per claim, even when several claims target the same intercept row. Group claims by SQL signature (row_id when the interceptor recorded one, else the fallback WHERE action_name = '...' at _query_records.py:83-88) before creating; share the resulting query_record_id across the group. K claims with M unique signatures → M query records instead of K. evaluator.py:183-209 already dedupes by query_record_id, so this falls out for free downstream.
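
The grouping step could look roughly like the sketch below. The claim shape (a dict with optional row_id and an action_name) and the create_query_record callable are assumptions for illustration; the real claim type and the POST /query plumbing in payload_builder.py are not shown in this issue.

```python
from collections import defaultdict


def claim_signature(claim):
    """Key identifying the intercept row a claim targets (assumed shape)."""
    if claim.get("row_id") is not None:
        return ("row_id", claim["row_id"])
    # Fallback mirrors the WHERE action_name = '...' lookup.
    return ("action_name", claim["action_name"])


def create_deduped_query_records(claims, create_query_record):
    """Create one query record per unique signature.

    Returns a list of query_record_ids aligned with `claims`, so claims
    that share a signature also share an id.
    """
    groups = defaultdict(list)
    for index, claim in enumerate(claims):
        groups[claim_signature(claim)].append(index)

    record_ids = [None] * len(claims)
    for indexes in groups.values():
        # One backend round-trip per group, using its first claim.
        record_id = create_query_record(claims[indexes[0]])
        for index in indexes:
            record_ids[index] = record_id
    return record_ids
```

Keeping the result aligned with the input list means the payload can still carry one entry per claim; only the number of backend calls changes.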

Parallelize the per-claim loops. Three places, all bounded ThreadPoolExecutor(max_workers=8):

  • Query-record creation (payload_builder.py:141) — over the deduped set
  • Evaluator fetch (evaluator.py:113)
  • Evaluator verify (evaluator.py:183)
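
All three loops could share one small helper along these lines. This is a sketch: run_bounded is a hypothetical name, and the per-item function (create, fetch, or verify one record) is passed in by the caller.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # the cap proposed in this issue


def run_bounded(fn, items, max_workers=MAX_WORKERS):
    """Apply fn to every item with at most max_workers calls in flight.

    Results come back in input order. The first exception raised by fn
    propagates to the caller when the results are collected.
    """
    items = list(items)
    if not items:
        return []
    workers = min(max_workers, len(items))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))
```

Because Executor.map preserves input order, the serial loops can be swapped for run_bounded without reordering claims or results.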

Make set_interceptor_context optional. Align interceptor + payload-builder defaults to "_default". Single-agent users skip the wrap entirely. Multi-agent users keep labeling agents the way they always have — no behavior change for them.

Two small wins. Skip startup preprocess on padding-only tables (client.py:50). Drop polling floors to 0.05s with exponential ramp.
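
One possible shape for the adaptive polling, as a sketch only. The 0.05s floor is from this proposal; the doubling factor, the 0.3s ceiling, and the poll_until name are assumptions. The sleep and clock parameters exist so the schedule can be tested without real waiting.

```python
import time


def poll_until(is_done, floor_s=0.05, ceiling_s=0.3, factor=2.0,
               timeout_s=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll is_done() with delays ramping from floor_s up to ceiling_s.

    Returns True once is_done() is truthy, False if timeout_s elapses.
    """
    deadline = clock() + timeout_s
    delay = floor_s
    while True:
        if is_done():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * factor, ceiling_s)
```

With these defaults the waits run 0.05s, 0.1s, 0.2s, then 0.3s repeatedly, so a fast preprocess is noticed quickly while a slow one is polled no more often than the current 0.3s floor.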

One sugar. provably.verify(claims) — a one-call wrapper around build_handoff_payload + evaluate_handoff. Old two-step API stays.


Code: now vs proposed for a single-agent user

# Now
provably.configure_indexing(enable_indexing=True)

set_interceptor_context(agent_id="demo", action_name="get_weather")  # mandatory
requests.get(...)

payload = provably.build_handoff_payload(claims)
verdict = provably.evaluate_handoff(
    payload, provably_base_url=..., postgres_url=..., org_id_fallback=...,
)

# Proposed
provably.configure_indexing(enable_indexing=True)

requests.get(...)  # no wrap

verdict = provably.verify(claims)

Files touched

  • _preprocess.py — worker thread, cond-var sync, adaptive polling
  • _storage.py:170 — mark_dirty() instead of sync preprocess
  • payload_builder.py — snapshot fence; dedupe claims by intercept row; parallel query-record creation; default intercept_agent_id="_default"
  • evaluator.py — parallel Phase 1+2 fetch and Phase 3 verify
  • interceptor.py:238 — default agent_id "_default"
  • client.py:50 — skip bootstrap preprocess on padding-only table
  • __init__.py — export verify()

No deletions. No breaking imports. No new required public surface.


How we verify

  • pytest tests/unit/, tests/e2e/test_interceptor_e2e.py, tests/e2e/test_post_handoff_e2e.py pass unchanged
  • time python examples/openai_agents/agent_run.py before vs after
  • Run customer-support-sdk-demo end-to-end — evaluate should drop substantially on multi-claim runs with no code changes
  • New concurrency test: two threads insert intercepts while a third calls _build_claims; verify the claims reflect the highest committed id

Open for discussion

  • Is the 50 ms debounce the right default, or should it be tunable?
  • Is max_workers=8 safe against Rust BE rate limits?
  • Should verify() accept the same kwargs as evaluate_handoff (timeout, etc.) or stay minimal?
  • Worker thread lifecycle: when does it start (first mark_dirty()? import-time?) and how does it stop (atexit? explicit shutdown()?). Needs to be nailed down in the PR.
  • Polling floor of 0.05s: chosen without measuring the actual preprocess-completion distribution from the Rust BE. If most preprocesses finish in 80–150 ms, 0.05s costs ~3× more polls than 0.3s with little payoff. Worth benchmarking before locking in.
