fix(world-postgres): self-heal LISTEN/NOTIFY interruptions #2

Open

Pom4H wants to merge 3 commits into main from botify/world-postgres-self-healing-v3

Conversation

Pom4H (Owner) commented Apr 27, 2026

Summary

Make stream delivery durable across LISTEN/NOTIFY interruptions in @workflow/world-postgres. Tracks vercel#1855.

Why

The dedicated pg.Client used for LISTEN/NOTIFY is long-lived and eventually gets dropped by the server (idle TCP timeout, pgbouncer rotation, k8s CNI eviction — see brianc/node-postgres#967). Today, a single drop stops all stream delivery until process restart.

Changes

  1. listenChannel reconnects with bounded exponential backoff (250ms → 30s cap). Initial connect must succeed (callers expect a live subscription); subsequent reconnects are best-effort and logged. A sketch of the loop follows this list.
  2. streams.get runs a periodic SELECT ... WHERE chunk_id > lastChunkId as the always-on safety net. NOTIFY remains the fast path; the poll catches anything dropped during reconnect, deduped by the existing enqueue ordering check (also sketched below).
  3. Configurable via PostgresWorldConfig.streamPollIntervalMs (default 5000ms; set to 0 to disable polling — only safe in tests where LISTEN cannot be interrupted).
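
For illustration, here is a minimal sketch of the reconnect loop in change (1). It is not the package's actual code: only `pg.Client`, the 250ms → 30s backoff, and the "initial connect must succeed" contract come from this PR; the function signature and error-handling shape are assumptions.

```ts
import { Client, type ClientConfig } from "pg";

// Sketch of a self-healing LISTEN subscription (illustrative names, not the real
// listenChannel implementation). The initial connect throws to the caller; later
// reconnects retry with bounded exponential backoff and are best-effort.
async function listenChannel(
  config: ClientConfig,
  channel: string, // assumed to be a trusted identifier (LISTEN cannot be parameterized)
  onPayload: (payload: string) => void,
): Promise<() => Promise<void>> {
  let client: Client | undefined;
  let closed = false;

  const connect = async () => {
    const next = new Client(config);
    next.on("notification", (msg) => {
      if (msg.channel === channel && msg.payload !== undefined) onPayload(msg.payload);
    });
    // A server-side drop (idle timeout, pgbouncer rotation, ...) surfaces here.
    next.on("error", () => {
      next.end().catch(() => {});
      if (!closed) void reconnect();
    });
    await next.connect();
    await next.query(`LISTEN ${channel}`);
    client = next;
  };

  const reconnect = async () => {
    let delay = 250; // 250ms, doubling up to a 30s cap
    while (!closed) {
      try {
        await connect();
        return;
      } catch {
        await new Promise((resolve) => setTimeout(resolve, delay));
        delay = Math.min(delay * 2, 30_000);
      }
    }
  };

  await connect(); // initial connect must succeed — callers expect a live subscription

  return async () => {
    closed = true;
    await client?.end();
  };
}
```

And the polling fallback in change (2), equally hypothetical beyond the `SELECT ... WHERE chunk_id > lastChunkId` query and the dedupe-by-ordering idea: the table name, column shapes, and callback signature below are made up for the sketch.

```ts
import type { Pool } from "pg";

// Sketch of the always-on polling safety net behind streams.get (illustrative).
// NOTIFY stays the fast path; the poll re-reads anything newer than the last chunk
// seen, and the ordering check dedupes rows that already arrived in-band.
function startPollingFallback(
  pool: Pool,
  streamId: string,
  push: (data: Buffer) => void,
  pollIntervalMs = 5_000, // mirrors PostgresWorldConfig.streamPollIntervalMs; 0 disables the poll
) {
  let lastChunkId = 0;

  const enqueue = (chunkId: number, data: Buffer) => {
    if (chunkId <= lastChunkId) return; // already delivered (via NOTIFY or an earlier poll)
    lastChunkId = chunkId;
    push(data);
  };

  const poll = async () => {
    const { rows } = await pool.query(
      // "stream_chunks" and its columns are assumptions for this sketch
      "SELECT chunk_id, data FROM stream_chunks WHERE stream_id = $1 AND chunk_id > $2 ORDER BY chunk_id",
      [streamId, lastChunkId],
    );
    for (const row of rows) enqueue(row.chunk_id, row.data);
  };

  const timer = pollIntervalMs > 0 ? setInterval(() => void poll(), pollIntervalMs) : undefined;

  return {
    enqueue, // the NOTIFY handler funnels through the same dedupe
    stop: () => {
      if (timer !== undefined) clearInterval(timer);
    },
  };
}
```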

Tests

packages/world-postgres/test/streamer.test.ts — three integration tests against testcontainers Postgres:

  • polling fallback — chunk inserted via raw SQL (no NOTIFY); only polling can deliver it. Reader receives data + EOF.
  • reader recovers after backend kill — pg_terminate_backend drops the LISTEN client, then streams.write publishes a chunk. NOTIFY fires into the void; polling delivers it (the kill query is sketched after this list).
  • listenChannel reconnects — low-level: kill the dedicated backend, wait past the initial 250ms backoff, fire NOTIFY. The handler receives the payload, proving the reconnect loop works.
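
For reference, the backend-kill step amounts to one query against pg_stat_activity. This is only a sketch of the idea, not the code in streamer.test.ts; the helper name and connection handling are illustrative.

```ts
import { Client } from "pg";

// Sketch: sever the dedicated LISTEN connection from the outside by terminating
// every other backend whose last statement was a LISTEN on our channel.
async function killListenBackends(connectionString: string, channel: string) {
  const admin = new Client({ connectionString });
  await admin.connect();
  try {
    await admin.query(
      `SELECT pg_terminate_backend(pid)
         FROM pg_stat_activity
        WHERE pid <> pg_backend_pid()
          AND query ILIKE $1`,
      [`LISTEN ${channel}%`],
    );
  } finally {
    await admin.end();
  }
}
```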

Test plan

  • pnpm exec tsc --noEmit — clean
  • pnpm exec biome check — clean
  • pnpm exec vitest run test/streamer.test.ts — 3/3 passing in ~6.5s
  • Existing storage.test.ts and spec.test.ts unaffected (no shared state)
  • Once merged into main of this fork, ready to send upstream as a PR to vercel/workflow

🤖 Generated with Claude Code

The dedicated `pg.Client` used for `LISTEN/NOTIFY` is long-lived and
will eventually be dropped by the server (idle TCP timeout, pgbouncer
rotation, k8s CNI eviction). Previously a single drop stopped all
stream delivery until process restart.

Two changes make delivery durable:

1. `listenChannel` now reconnects with bounded exponential backoff
   (250ms → 30s cap). The initial connect must succeed; subsequent
   reconnects are best-effort and logged.

2. `streams.get` runs a periodic `SELECT ... WHERE chunk_id > lastChunkId`
   as a safety net for chunks delivered while the LISTEN socket was
   reconnecting. The poll dedupes against in-band notifications via the
   existing `enqueue` ordering check. Configurable via
   `PostgresWorldConfig.streamPollIntervalMs` (default 5000ms; 0 to
   disable).

Tracks vercel#1855.

Tests cover three failure modes via testcontainers:
- polling fallback delivers chunks inserted with NOTIFY suppressed
- reader still receives chunks after pg_terminate_backend kills LISTEN
- listenChannel itself reconnects and delivers post-reconnect notifies

github-actions Bot commented Apr 27, 2026

🧪 E2E Test Results

Some tests failed

Summary

Passed Failed Skipped Total
❌ 💻 Local Development 900 2 48 950
✅ 📦 Local Production 902 0 48 950
✅ 🐘 Local Postgres 902 0 48 950
✅ 📋 Other 267 0 18 285
Total 2971 2 162 3135

❌ Failed Tests

💻 Local Development (2 failed)

vite-stable (2 failed):

  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack

Details by Category

❌ 💻 Local Development
App Passed Failed Skipped
✅ astro-stable 89 0 6
✅ express-stable 89 0 6
✅ fastify-stable 89 0 6
✅ hono-stable 89 0 6
✅ nextjs-turbopack-stable 95 0 0
✅ nextjs-webpack-stable 95 0 0
✅ nitro-stable 89 0 6
✅ nuxt-stable 89 0 6
✅ sveltekit-stable 89 0 6
❌ vite-stable 87 2 6
✅ 📦 Local Production
App Passed Failed Skipped
✅ astro-stable 89 0 6
✅ express-stable 89 0 6
✅ fastify-stable 89 0 6
✅ hono-stable 89 0 6
✅ nextjs-turbopack-stable 95 0 0
✅ nextjs-webpack-stable 95 0 0
✅ nitro-stable 89 0 6
✅ nuxt-stable 89 0 6
✅ sveltekit-stable 89 0 6
✅ vite-stable 89 0 6
✅ 🐘 Local Postgres
App Passed Failed Skipped
✅ astro-stable 89 0 6
✅ express-stable 89 0 6
✅ fastify-stable 89 0 6
✅ hono-stable 89 0 6
✅ nextjs-turbopack-stable 95 0 0
✅ nextjs-webpack-stable 95 0 0
✅ nitro-stable 89 0 6
✅ nuxt-stable 89 0 6
✅ sveltekit-stable 89 0 6
✅ vite-stable 89 0 6
✅ 📋 Other
App Passed Failed Skipped
✅ e2e-local-dev-nest-stable 89 0 6
✅ e2e-local-postgres-nest-stable 89 0 6
✅ e2e-local-prod-nest-stable 89 0 6

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: failure
  • Local Prod: failure
  • Local Postgres: failure
  • Windows: cancelled

Check the workflow run for details.


github-actions Bot commented Apr 27, 2026

📊 Benchmark Results

workflow with no steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Nitro 0.040s 1.005s 0.965s 10 1.00x
💻 Local Express 0.043s 1.005s 0.962s 10 1.06x
🐘 Postgres Next.js (Turbopack) 0.048s 1.010s 0.962s 10 1.19x
💻 Local Next.js (Turbopack) 0.048s 1.005s 0.957s 10 1.19x
🐘 Postgres Express 0.061s 1.009s 0.948s 10 1.52x
🐘 Postgres Nitro 0.065s 1.010s 0.946s 10 1.61x
workflow with 1 step

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 1.113s 2.009s 0.897s 10 1.00x
💻 Local Next.js (Turbopack) 1.115s 2.006s 0.890s 10 1.00x
💻 Local Nitro 1.123s 2.005s 0.882s 10 1.01x
💻 Local Express 1.124s 2.006s 0.882s 10 1.01x
🐘 Postgres Express 1.137s 2.010s 0.874s 10 1.02x
🐘 Postgres Nitro 1.160s 2.009s 0.850s 10 1.04x
workflow with 10 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 10.716s 11.019s 0.303s 3 1.00x
💻 Local Next.js (Turbopack) 10.810s 11.022s 0.212s 3 1.01x
🐘 Postgres Express 10.876s 11.018s 0.142s 3 1.01x
💻 Local Express 10.917s 11.022s 0.105s 3 1.02x
💻 Local Nitro 10.922s 11.022s 0.100s 3 1.02x
🐘 Postgres Nitro 10.976s 11.361s 0.384s 3 1.02x
workflow with 25 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 14.078s 15.020s 0.942s 4 1.00x
🐘 Postgres Express 14.524s 15.023s 0.498s 4 1.03x
💻 Local Next.js (Turbopack) 14.652s 15.029s 0.377s 4 1.04x
🐘 Postgres Nitro 14.665s 15.025s 0.361s 4 1.04x
💻 Local Nitro 14.955s 15.030s 0.075s 4 1.06x
💻 Local Express 14.973s 15.028s 0.055s 4 1.06x
workflow with 50 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 13.152s 13.599s 0.448s 7 1.00x
🐘 Postgres Express 13.944s 14.162s 0.219s 7 1.06x
🐘 Postgres Nitro 14.197s 15.026s 0.829s 6 1.08x
💻 Local Next.js (Turbopack) 16.240s 16.697s 0.457s 6 1.23x
💻 Local Nitro 16.541s 17.030s 0.490s 6 1.26x
💻 Local Express 16.680s 17.031s 0.351s 6 1.27x
Promise.all with 10 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 1.193s 2.008s 0.815s 15 1.00x
🐘 Postgres Express 1.251s 2.009s 0.759s 15 1.05x
🐘 Postgres Nitro 1.269s 2.011s 0.742s 15 1.06x
💻 Local Express 1.503s 2.005s 0.501s 15 1.26x
💻 Local Nitro 1.552s 2.005s 0.452s 15 1.30x
💻 Local Next.js (Turbopack) 1.625s 2.072s 0.447s 15 1.36x
Promise.all with 25 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 2.329s 3.010s 0.681s 10 1.00x
🐘 Postgres Next.js (Turbopack) 2.350s 3.008s 0.658s 10 1.01x
🐘 Postgres Express 2.399s 3.009s 0.610s 10 1.03x
💻 Local Express 2.920s 3.109s 0.189s 10 1.25x
💻 Local Nitro 3.017s 3.758s 0.741s 8 1.30x
💻 Local Next.js (Turbopack) 3.044s 3.564s 0.520s 9 1.31x
Promise.all with 50 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 3.469s 4.011s 0.542s 8 1.00x
🐘 Postgres Nitro 3.513s 4.012s 0.498s 8 1.01x
🐘 Postgres Next.js (Turbopack) 3.553s 4.011s 0.458s 8 1.02x
💻 Local Nitro 8.079s 8.521s 0.443s 4 2.33x
💻 Local Next.js (Turbopack) 8.232s 8.769s 0.537s 4 2.37x
💻 Local Express 8.236s 9.023s 0.787s 4 2.37x
Promise.race with 10 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 1.171s 2.008s 0.837s 15 1.00x
🐘 Postgres Nitro 1.254s 2.009s 0.755s 15 1.07x
🐘 Postgres Express 1.260s 2.008s 0.748s 15 1.08x
💻 Local Next.js (Turbopack) 1.510s 2.005s 0.495s 15 1.29x
💻 Local Express 1.523s 2.006s 0.483s 15 1.30x
💻 Local Nitro 1.532s 2.006s 0.474s 15 1.31x
Promise.race with 25 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 2.338s 3.011s 0.672s 10 1.00x
🐘 Postgres Next.js (Turbopack) 2.355s 3.009s 0.654s 10 1.01x
🐘 Postgres Nitro 2.360s 3.010s 0.650s 10 1.01x
💻 Local Next.js (Turbopack) 2.923s 3.759s 0.836s 8 1.25x
💻 Local Nitro 3.004s 3.677s 0.673s 9 1.28x
💻 Local Express 3.084s 3.676s 0.592s 9 1.32x
Promise.race with 50 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 3.456s 4.009s 0.554s 8 1.00x
🐘 Postgres Nitro 3.491s 4.012s 0.522s 8 1.01x
🐘 Postgres Next.js (Turbopack) 3.545s 4.010s 0.465s 8 1.03x
💻 Local Next.js (Turbopack) 7.926s 8.515s 0.589s 4 2.29x
💻 Local Nitro 8.632s 9.026s 0.394s 4 2.50x
💻 Local Express 8.921s 9.279s 0.358s 4 2.58x
workflow with 10 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 0.605s 1.006s 0.401s 60 1.00x
🐘 Postgres Express 0.800s 1.006s 0.206s 60 1.32x
🐘 Postgres Nitro 0.843s 1.007s 0.163s 60 1.39x
💻 Local Next.js (Turbopack) 0.846s 1.005s 0.158s 60 1.40x
💻 Local Nitro 0.981s 1.136s 0.156s 53 1.62x
💻 Local Express 0.991s 1.201s 0.210s 51 1.64x
workflow with 25 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 1.450s 2.006s 0.556s 45 1.00x
🐘 Postgres Express 1.898s 2.075s 0.178s 44 1.31x
🐘 Postgres Nitro 2.007s 2.564s 0.557s 36 1.38x
💻 Local Next.js (Turbopack) 2.674s 3.008s 0.334s 30 1.84x
💻 Local Nitro 2.994s 3.416s 0.422s 27 2.06x
💻 Local Express 3.027s 3.547s 0.520s 26 2.09x
workflow with 50 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 3.054s 3.597s 0.543s 34 1.00x
🐘 Postgres Express 3.907s 4.110s 0.202s 30 1.28x
🐘 Postgres Nitro 4.066s 4.704s 0.638s 26 1.33x
💻 Local Next.js (Turbopack) 8.688s 9.017s 0.329s 14 2.84x
💻 Local Nitro 9.138s 9.788s 0.649s 13 2.99x
💻 Local Express 9.298s 10.101s 0.803s 12 3.04x
workflow with 10 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 0.198s 1.007s 0.809s 60 1.00x
🐘 Postgres Express 0.283s 1.007s 0.724s 60 1.43x
🐘 Postgres Nitro 0.291s 1.007s 0.717s 60 1.47x
💻 Local Express 0.577s 1.005s 0.428s 60 2.92x
💻 Local Nitro 0.583s 1.004s 0.421s 60 2.95x
💻 Local Next.js (Turbopack) 0.586s 1.005s 0.418s 60 2.97x
workflow with 25 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 0.404s 1.006s 0.601s 90 1.00x
🐘 Postgres Express 0.497s 1.006s 0.509s 90 1.23x
🐘 Postgres Nitro 0.499s 1.007s 0.507s 90 1.23x
💻 Local Nitro 2.471s 3.009s 0.537s 30 6.11x
💻 Local Express 2.501s 3.010s 0.509s 30 6.18x
💻 Local Next.js (Turbopack) 2.554s 3.009s 0.455s 30 6.31x
workflow with 50 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 0.616s 1.006s 0.390s 120 1.00x
🐘 Postgres Express 0.780s 1.007s 0.227s 120 1.27x
🐘 Postgres Nitro 0.810s 1.018s 0.207s 118 1.32x
💻 Local Next.js (Turbopack) 10.631s 11.299s 0.668s 11 17.26x
💻 Local Nitro 10.841s 11.299s 0.458s 11 17.60x
💻 Local Express 11.102s 11.755s 0.652s 11 18.02x
Stream Benchmarks (includes TTFB metrics)
workflow with stream

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 0.163s 1.001s 0.001s 1.009s 0.846s 10 1.00x
💻 Local Next.js (Turbopack) 0.183s 1.003s 0.013s 1.019s 0.836s 10 1.12x
💻 Local Nitro 0.200s 1.004s 0.011s 1.017s 0.817s 10 1.23x
💻 Local Express 0.201s 1.004s 0.012s 1.018s 0.818s 10 1.23x
🐘 Postgres Express 0.204s 0.996s 0.001s 1.009s 0.805s 10 1.25x
🐘 Postgres Nitro 0.211s 0.997s 0.001s 1.011s 0.799s 10 1.29x
stream pipeline with 5 transform steps (1MB)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 0.538s 1.025s 0.005s 1.039s 0.501s 58 1.00x
🐘 Postgres Express 0.601s 1.003s 0.013s 1.031s 0.430s 59 1.12x
🐘 Postgres Nitro 0.630s 1.005s 0.012s 1.031s 0.401s 59 1.17x
💻 Local Nitro 0.740s 1.012s 0.009s 1.023s 0.283s 59 1.38x
💻 Local Express 0.745s 1.012s 0.009s 1.022s 0.277s 59 1.39x
💻 Local Next.js (Turbopack) 0.755s 1.010s 0.010s 1.116s 0.361s 54 1.40x
10 parallel streams (1MB each)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 0.873s 1.052s 0.000s 1.058s 0.185s 58 1.00x
🐘 Postgres Express 0.951s 1.146s 0.000s 1.162s 0.211s 52 1.09x
🐘 Postgres Nitro 0.958s 1.192s 0.000s 1.209s 0.252s 51 1.10x
💻 Local Nitro 1.220s 2.020s 0.000s 2.022s 0.802s 30 1.40x
💻 Local Express 1.233s 2.021s 0.000s 2.023s 0.789s 30 1.41x
💻 Local Next.js (Turbopack) 1.254s 2.019s 0.000s 2.023s 0.768s 30 1.44x
fan-out fan-in 10 streams (1MB each)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Next.js (Turbopack) 1.719s 2.071s 0.000s 2.092s 0.372s 29 1.00x
🐘 Postgres Express 1.749s 2.066s 0.000s 2.091s 0.341s 29 1.02x
🐘 Postgres Nitro 1.817s 2.179s 0.000s 2.189s 0.372s 28 1.06x
💻 Local Nitro 3.518s 4.166s 0.000s 4.169s 0.651s 15 2.05x
💻 Local Express 3.547s 4.035s 0.001s 4.039s 0.491s 15 2.06x
💻 Local Next.js (Turbopack) 3.630s 4.165s 0.001s 4.169s 0.539s 15 2.11x

Summary

Fastest Framework by World

Winner determined by most benchmark wins

World 🥇 Fastest Framework Wins
💻 Local Next.js (Turbopack) 12/21
🐘 Postgres Next.js (Turbopack) 17/21
Fastest World by Framework

Winner determined by most benchmark wins

Framework 🥇 Fastest World Wins
Express 🐘 Postgres 18/21
Next.js (Turbopack) 🐘 Postgres 21/21
Nitro 🐘 Postgres 17/21
Column Definitions
  • Workflow Time: Runtime reported by workflow (completedAt - createdAt) - primary metric
  • TTFB: Time to First Byte - time from workflow start until first stream byte received (stream benchmarks only)
  • Slurp: Time from first byte to complete stream consumption (stream benchmarks only)
  • Wall Time: Total testbench time (trigger workflow + poll for result)
  • Overhead: Testbench overhead (Wall Time - Workflow Time)
  • Samples: Number of benchmark iterations run
  • vs Fastest: How much slower compared to the fastest configuration for this benchmark

Worlds:

  • 💻 Local: In-memory filesystem world (local development)
  • 🐘 Postgres: PostgreSQL database world (local development)
  • ▲ Vercel: Vercel production/preview deployment
  • 🌐 Turso: Community world (local development)
  • 🌐 MongoDB: Community world (local development)
  • 🌐 Redis: Community world (local development)
  • 🌐 Jazz: Community world (local development)

📋 View full workflow run


Some benchmark jobs failed:

  • Local: success
  • Postgres: success
  • Vercel: failure

Check the workflow run for details.

Pom4H added 2 commits April 28, 2026 00:34
`enqueue` previously decremented `offset` and returned without updating
`lastChunkId`. The new polling fallback re-queries
`chunk_id > lastChunkId` every tick, so chunks intentionally skipped for
`startIndex` would come back on the next poll and be skipped again —
double-decrementing `offset` and eventually mis-delivering them once
`offset` hit zero.

Move the high-water mark update to the top of `enqueue`, before the skip
branch. Add a regression test that pre-seeds two chunks, opens the
reader with `startIndex=2`, lets several poll ticks fire (none should
deliver), then writes a third chunk and asserts only the third reaches
the reader.
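
In sketch form (illustrative shape — the real enqueue carries more state than shown here), the fix is simply to advance the high-water mark before the skip branch:

```ts
// Sketch of the fixed enqueue (not the package's exact code). The high-water mark
// now advances before the startIndex skip branch, so a chunk that was intentionally
// skipped no longer matches `chunk_id > lastChunkId` on the next poll tick and
// cannot decrement `offset` a second time.
function makeEnqueue(startIndex: number, deliver: (data: Buffer) => void) {
  let lastChunkId = 0;
  let offset = startIndex; // chunks still to skip before delivering

  return function enqueue(chunkId: number, data: Buffer) {
    if (chunkId <= lastChunkId) return; // dedupe: NOTIFY vs poll, and re-polled skips
    lastChunkId = chunkId;              // the fix: update the high-water mark first

    if (offset > 0) {
      offset -= 1;                      // skipped for startIndex — but now remembered
      return;
    }
    deliver(data);
  };
}
```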
Two reliability issues surfaced on review:

1. After natural EOF, `streams.get` set `closed = true` and closed the
   controller but never cleared the polling `setInterval` or removed the
   EventEmitter listener. The timer kept ticking (no-op via the `closed`
   guard) and the listener stayed attached for the lifetime of the
   process. Extracted an idempotent `stop()` that clears both, called
   from `cancel()` and from the EOF branch in `enqueue`. As a side
   benefit, the polling timer is no longer started at all if the initial
   chunk batch already delivered EOF.

2. `listenChannel.close()` called during an in-flight `connect()` could
   race: `closed = true` was set while `await next.connect()` /
   `LISTEN` was still resolving, after which the just-connected client
   would attach its notification listener and persist past close. Added
   a `closed` re-check after the awaits — if close raced ahead, end the
   client immediately and bail.
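
Both fixes reduce to small, mechanical patterns. A sketch follows — names, signatures, and surrounding state are illustrative, not the package's actual code:

```ts
import { EventEmitter } from "node:events";
import type { Client } from "pg";

// (1) Idempotent stop(): clears the poll timer and detaches the chunk listener
//     exactly once. Called from cancel() and from the EOF branch in enqueue, so a
//     naturally finished stream no longer leaks a ticking interval or a listener.
function makeStop(timer: NodeJS.Timeout | undefined, emitter: EventEmitter, onChunk: () => void) {
  let stopped = false;
  return function stop() {
    if (stopped) return;
    stopped = true;
    if (timer !== undefined) clearInterval(timer);
    emitter.off("chunk", onChunk);
  };
}

// (2) Close-race guard: re-check `closed` after the awaits in the reconnect path.
//     If close() raced ahead while connect()/LISTEN were resolving, end the freshly
//     connected client immediately instead of letting it persist past close.
async function finishConnect(next: Client, channel: string, state: { closed: boolean }) {
  await next.connect();
  await next.query(`LISTEN ${channel}`);
  if (state.closed) {
    await next.end();
    return undefined;
  }
  return next;
}
```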

Test: a regression test spies on `setInterval`/`clearInterval` and
asserts that every interval the streamer scheduled at the configured
poll cadence is cleared by the time the consumer reads `done: true`,
without the consumer needing to call `cancel()`.
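
A sketch of that assertion using vitest's spies — the helper and cadence constant are hypothetical; the real test lives in streamer.test.ts:

```ts
import { expect, vi } from "vitest";

// Runs the reader to completion while spying on the interval APIs, then asserts
// every interval scheduled at the poll cadence was cleared — without the consumer
// ever calling cancel().
async function expectPollTimersCleared(pollIntervalMs: number, readToEof: () => Promise<void>) {
  const setSpy = vi.spyOn(globalThis, "setInterval");
  const clearSpy = vi.spyOn(globalThis, "clearInterval");
  try {
    await readToEof();
    const pollTimers = setSpy.mock.calls
      .map((args, i) => ({ delay: args[1], timer: setSpy.mock.results[i].value }))
      .filter(({ delay }) => delay === pollIntervalMs)
      .map(({ timer }) => timer);
    for (const timer of pollTimers) {
      expect(clearSpy).toHaveBeenCalledWith(timer);
    }
  } finally {
    setSpy.mockRestore();
    clearSpy.mockRestore();
  }
}
```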