Skip to content

Move shutdown_worker RPC to initiate_shutdown#1197

Open
yuandrew wants to merge 17 commits intotemporalio:masterfrom
yuandrew:shutdown-rpc-initiate-shutdown
Open

Move shutdown_worker RPC to initiate_shutdown#1197
yuandrew wants to merge 17 commits intotemporalio:masterfrom
yuandrew:shutdown-rpc-initiate-shutdown

Conversation

@yuandrew
Copy link
Copy Markdown
Contributor

@yuandrew yuandrew commented Apr 3, 2026

What was changed

Moved shutdown_worker RPC to initiate_shutdown, which requires initiate_shutdown to now be an async fn

NOTE: also added a temp 5s timeout in order to avoid the race in temporalio/temporal#9545, will be safe to remove once the fix is fully deployed into cloud

Why?

shutdown_worker RPC needs to be sent at beginning of shutdown in order for server to know to send empty responses to pending polls.

Checklist

  1. Closes

  2. How was this tested:

Added graceful_shutdown_sends_shutdown_worker_rpc_during_initiate to ensure RPC is sent at beginning of shutdown process

  1. Any docs updates needed?

Note

Medium Risk
Changes worker shutdown sequencing and poll-stream termination behavior, which can affect in-flight task polling and shutdown reliability. Concurrency is introduced via a spawned RPC task and new capability plumbing, but it’s covered by added unit/integration tests.

Overview
Fixes graceful shutdown deadlocks by moving the ShutdownWorker RPC to Worker::initiate_shutdown (spawned in the background and awaited early in Worker::shutdown), ensuring the server is notified to flush long polls before the SDK waits for polls to drain.

Refactors server feature flags into a shared NamespaceCapabilities object (exported from sdk-core) and updates pollers to treat empty poll responses after shutdown as a termination signal rather than a timeout to retry. Adds targeted unit/integration tests for the new shutdown/poll behavior, enables the relevant dev-server dynamic config for tests, and bumps the integration-test CLI override version.

Reviewed by Cursor Bugbot for commit f2be7f7. Bugbot is set up for automated code reviews on this repo. Configure here.

@yuandrew yuandrew requested a review from a team as a code owner April 3, 2026 18:28
Comment thread crates/sdk-core/src/worker/mod.rs Outdated
Comment on lines -539 to -540
assert_eq!(workflow_task_slots.current_available_slots, 5);
assert_eq!(workflow_task_slots.current_used_slots, 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why'd these all go away?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new shutdown flow taking a little longer, these checks aren't deterministic. These checks are technically all checked in in_activity_checks, kept the assert_eq!(workflow_task_slots.total_processed_tasks, 2); check, which is unique to the shutdown case

Comment thread crates/sdk-core/src/worker/mod.rs Outdated
Comment thread crates/sdk-core/tests/integ_tests/worker_heartbeat_tests.rs
Comment thread crates/sdk-core/src/worker/mod.rs Outdated
Comment thread crates/sdk-core/src/worker/mod.rs Outdated
Comment thread crates/sdk-core/src/worker/mod.rs Outdated
Comment thread crates/sdk-core/src/worker/workflow/wft_poller.rs Outdated
Comment thread crates/sdk-core/src/worker/mod.rs Outdated
@Sushisource Sushisource force-pushed the shutdown-rpc-initiate-shutdown branch from 20d28e0 to 5dcbf1c Compare April 15, 2026 18:04
Comment thread .cargo/config.toml
[env]
# This temporarily overrides the version of the CLI used for integration tests, locally and in CI
#CLI_VERSION_OVERRIDE = "v1.4.1-cloud-v1-29-0-139-2.0"
CLI_VERSION_OVERRIDE = "v1.6.3-serverless"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary CLI version override left uncommented

Medium Severity

CLI_VERSION_OVERRIDE is uncommented and set to "v1.6.3-serverless", pinning a specific CLI version for all integration tests (locally and in CI). The PR description explicitly states this is temporary ("will be safe to remove once the fix is fully deployed into cloud") and the reviewer confirmed it as "Intentionally temporarily checked in." This override changes test behavior globally and risks being forgotten after the upstream fix lands.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5dcbf1c. Configure here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂️

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit f2be7f7. Configure here.

_ => {}
}
});
*guard = Some(handle);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sync initiate_shutdown now requires tokio runtime context

Medium Severity

initiate_shutdown is a public sync function (pub fn) that now calls tokio::spawn, which panics if no tokio runtime is available. Previously this function was purely synchronous and safe to call from any context (e.g., signal handlers, non-tokio threads). The C bridge correctly adds enter_sync!, but the Rust public API silently changed its contract without documenting the new runtime requirement.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f2be7f7. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants