Resumed Python workflow task can be leased but not delivered before default poll timeout #4

@rmcdaniel

Summary

On a clean polyglot stack built from published artifacts, a resumed Python-authored workflow task is created and leased after a Python activity completes, but the polling Python worker never receives a usable poll response. The workflow stays in waiting until the client's result timeout expires.

Reproduction

Environment used:

  • durableworkflow/server:0.2.109
  • durable-workflow==0.4.33
  • shared queue polyglot-shared
  • one Python worker registered for both workflows and activities

Minimal failing case:

  1. Start a Python worker with the default DURABLE_WORKFLOW_POLL_TIMEOUT_SECONDS=30.
  2. Start workflow polyglot.python.calls_python with a simple dict payload.
  3. The Python worker completes the initial workflow task and schedules polyglot.activity.python.echo.
  4. The Python worker completes that activity successfully.
  5. The server creates a follow-up workflow task for the same run, but the workflow remains stuck in waiting and handle.result(timeout=240) times out.

Clean-stack evidence from the failing run:

  • workflow run python_calls_python-177b48f7 stayed waiting
  • initial workflow task 01krmce8ndt6yw1n0ryzc2yr4g completed successfully
  • activity execution 01krmce92e36datpbtfdvp2zcn completed successfully
  • follow-up workflow task 01krmcf7xaa4efkd2cj00gys90 was created and then leased to the Python worker
  • the worker never logged a successful workflow-tasks/poll response or workflow re-entry for that follow-up task

Relevant durable state from MySQL for the failing run:

  • run status: waiting
  • follow-up task row: 01krmcf7xaa4efkd2cj00gys90, status=leased, lease_owner=py-worker-4e478e0100d9-13
  • task payload was small metadata only:
    • {"open_wait_id":"activity:01krmce92e36datpbtfdvp2zcn","activity_type":"polyglot.activity.python.echo",...}
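The durable state above amounts to one observable condition: a run is still waiting while its follow-up task is already leased. A minimal sketch of that check over rows already fetched from MySQL is shown here. The dict keys (`task_id`, `run_status`, `status`, `lease_owner`) are hypothetical names chosen for illustration and are not confirmed to match the server's schema; the sample values are the ones from this report.

```python
from typing import Iterable


def find_leased_on_waiting_runs(rows: Iterable[dict]) -> list[dict]:
    """Return task rows that are leased while their run is still waiting.

    The keys read here are hypothetical, not the server's real column names.
    """
    return [
        row
        for row in rows
        if row.get("run_status") == "waiting" and row.get("status") == "leased"
    ]


# Sample rows built from the IDs and statuses quoted in this report.
rows = [
    {
        "task_id": "01krmce8ndt6yw1n0ryzc2yr4g",
        "run_status": "waiting",
        "status": "completed",
    },
    {
        "task_id": "01krmcf7xaa4efkd2cj00gys90",
        "run_status": "waiting",
        "status": "leased",
        "lease_owner": "py-worker-4e478e0100d9-13",
    },
]

stuck = find_leased_on_waiting_runs(rows)
for row in stuck:
    print(row["task_id"], row["lease_owner"])
```

A row flagged this way is only a candidate: a task that was leased a moment ago and is about to be delivered looks the same, so the check is most useful once the run has been waiting longer than one poll interval.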

Expected

Once the Python activity completes, the follow-up workflow task should be returned cleanly through worker poll and the workflow should complete.

Actual

The follow-up task is created and leased, but the worker never gets a usable poll response for it, and the workflow remains stuck.

Workaround

Raising the Python worker poll timeout from 30 to 60 seconds made both a targeted python_calls_python repro and the full four-corner polyglot smoke test pass on the same published artifacts.
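One way to apply the workaround without touching the shell profile is to build the worker's environment explicitly. This is a sketch only: it assumes the worker reads DURABLE_WORKFLOW_POLL_TIMEOUT_SECONDS from its process environment at startup, and `worker_env` is a hypothetical helper, not part of the durable-workflow package. The command that launches the worker is not shown in this report, so it is left as a placeholder comment.

```python
import os

POLL_TIMEOUT_VAR = "DURABLE_WORKFLOW_POLL_TIMEOUT_SECONDS"


def worker_env(poll_timeout_seconds: int = 60, base=None) -> dict:
    """Return a copy of the environment with the poll timeout overridden.

    Hypothetical helper: it only prepares the environment mapping and does
    not start or configure the worker itself.
    """
    env = dict(os.environ if base is None else base)
    env[POLL_TIMEOUT_VAR] = str(poll_timeout_seconds)
    return env


env = worker_env(60)
print(POLL_TIMEOUT_VAR, "=", env[POLL_TIMEOUT_VAR])
# Pass `env` to whatever starts the worker process, for example the `env=`
# argument of subprocess.run(...) with the project's own worker command.
```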

The fact that a longer poll timeout avoids the hang suggests either:

  • the server is leasing the resumed task before the worker poll response is fully deliverable, or
  • the resumed-task poll response path is slow enough that the default Python poll timeout is too tight.
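Both hypotheses share the same observable shape: the server marks the task leased, but the response reaches the client only after the client has stopped waiting. The toy model below makes that timing relationship explicit. It is not a measurement; the 45-second delay is a hypothetical value picked only because it falls between the two timeouts that were tried, and `poll_outcome` is an illustrative name.

```python
def poll_outcome(response_ready_after: float, client_poll_timeout: float) -> str:
    """Classify one long-poll attempt in a toy timing model.

    response_ready_after: seconds from poll start until the server could
        finish sending the response (hypothetical, not measured here).
    client_poll_timeout: seconds the worker waits before abandoning the poll.
    """
    if response_ready_after < client_poll_timeout:
        return "delivered"
    # The server may already have recorded the lease, but the client gave up
    # before the response arrived, so the task is never handed to the worker.
    return "leased_but_undelivered"


HYPOTHETICAL_DELAY_SECONDS = 45.0  # assumption for illustration only

for timeout in (30, 60):
    print(timeout, poll_outcome(HYPOTHETICAL_DELAY_SECONDS, timeout))
```

If this model is roughly right, capturing the server-side lease timestamp and the client-side poll start and abort times for the follow-up task would distinguish the two hypotheses: a lease recorded well before any response bytes are written points at the first, a uniformly slow response path points at the second.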
