sleep() is not durable across worker crashes in Postgres World #679

@abdullah-hallaq

Description

Using Workflow DevKit with Postgres World, I’m observing non-durable behavior for sleep() when the worker/server is down at the moment the sleep duration expires.

According to the docs:

  • sleep: Suspends a workflow for a specified duration or until an end date without consuming any resources. Once the duration or end date passes, the workflow will resume execution.
  • Postgres World is described as a long‑running, persistent world using PostgreSQL + pg-boss.

Based on this, I expect sleep() to be durable across process crashes, even if the worker is offline at the exact wake-up time.
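
A minimal sketch of the semantics expected here, assuming the wake-up is persisted as an absolute timestamp. The `PendingSleep` shape and both function names are hypothetical illustrations, not Postgres World's actual schema or API:

```typescript
// Hypothetical shape of a persisted sleep; NOT Postgres World's real schema.
interface PendingSleep {
  runId: string;
  wakeAt: number; // absolute wake-up time, epoch milliseconds
}

// Record the sleep as an absolute deadline rather than a relative delay,
// so the deadline still means the same thing after a process restart.
function scheduleSleep(runId: string, startedAt: number, durationMs: number): PendingSleep {
  return { runId, wakeAt: startedAt + durationMs };
}

// On every poll, including the first poll after a restart, every sleep whose
// deadline has passed is due, no matter how long ago it passed.
function dueSleeps(pending: PendingSleep[], now: number): PendingSleep[] {
  return pending.filter((s) => s.wakeAt <= now);
}
```

Under these semantics, a worker that comes back long after `wakeAt` would still find the run in `dueSleeps` and resume it, which is the behavior the docs seem to promise.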

Observed behavior:

  • If the worker comes back before the sleep duration elapses, the workflow resumes as expected.
  • If the worker comes back after the sleep duration has already passed, the workflow never resumes and the run remains stuck in a running state.

Environment

  • World: postgres-world
  • Worker: running with environment variables per docs:
    • WORKFLOW_TARGET_WORLD=postgres-world
    • WORKFLOW_POSTGRES_URL=<same for app and worker>
    • WORKFLOW_POSTGRES_WORKER_CONCURRENCY set (default / reasonable value)
  • The worker and app both point to the same PostgreSQL instance

Reproduction steps

Workflow Test Scenario

  1. Log: "before sleep"
  2. sleep(5000) (5 seconds)
  3. Log: "after sleep"

Scenario A (works as expected)

  • [10:00] Start workflow; it calls sleep(5s)
  • [10:01] Crash/stop the server/worker
  • [10:04] Restart the server/worker (before 5s elapsed)
  • [10:05] Workflow resumes and logs "after sleep"

Scenario B (buggy behavior)

  • [10:00] Start workflow; it calls sleep(5s)
  • [10:01] Crash/stop the server/worker
  • [10:05+] Restart the server/worker (after 5s elapsed)

Result:

  • Workflow does not resume
  • Run remains stuck in running state in the database
  • No further steps execute

The only difference between scenarios A and B is whether the worker is alive at the wake-up time.
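
The two timelines can be encoded as below to check which case a given test run falls into. This is only a planning helper for the reproduction: `Timeline` and `wakeUpMissed` are hypothetical names, times are seconds relative to the `sleep()` call, and the restart time of 6 seconds for Scenario B is an assumed example of "10:05+".

```typescript
// Hypothetical helper for planning the reproduction; all times are in
// seconds relative to the moment the workflow calls sleep().
interface Timeline {
  sleepSeconds: number; // sleep duration, which is also the wake-up time
  workerDownAt: number; // when the worker is stopped
  workerUpAt: number;   // when the worker is restarted
}

// True when the wake-up time falls inside the worker's downtime window.
function wakeUpMissed(t: Timeline): boolean {
  return t.workerDownAt <= t.sleepSeconds && t.sleepSeconds < t.workerUpAt;
}

// Scenario A: worker is back at 4s, before the 5s wake-up.
const scenarioA: Timeline = { sleepSeconds: 5, workerDownAt: 1, workerUpAt: 4 };
// Scenario B: worker is back at 6s (assumed value), after the 5s wake-up.
const scenarioB: Timeline = { sleepSeconds: 5, workerDownAt: 1, workerUpAt: 6 };
```

`wakeUpMissed(scenarioA)` is false and `wakeUpMissed(scenarioB)` is true; the run gets stuck only in the second case.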

Minimal workflow (this uses two 2-second sleeps, whereas the timelines above assume a single 5-second sleep):

import { sleep } from "workflow";

export async function handleUserSignup() {
  "use workflow";

  await logStarted();

  await sleep(2000);
  await log1();

  await sleep(2000);
  await log2();

  return { status: "success" };
}

async function logStarted() {
  "use step";
  console.log("Started");
}

async function log1() {
  "use step";
  console.log("Log 1 - Stop here to test crash recovery");
}

async function log2() {
  "use step";
  console.log("Log 2 - Success!");
}
