Dispatch activity cancellation to worker using Nexus #9233

Open
rkannan82 wants to merge 55 commits into main from kannan/activity-cancel/dispatch-logic

Contributor

@rkannan82 rkannan82 commented Feb 5, 2026

What

Dispatches worker commands (starting with activity cancellation) to workers via their Nexus control queue. When the outbound queue processes a WorkerCommandsTask, the dispatcher sends an ExecuteCommands Nexus operation to the worker's control queue via DispatchNexusTask. Retries are capped at 3 attempts since these commands are best-effort (the activity will eventually time out anyway).

Suggested review order: worker_commands_task_dispatcher.go → recordactivitytaskstarted/api.go (clock storage for task token reconstruction).

Why

To support activity cancellation without activity heartbeat. This is the dispatch leg of the flow:

  1. [Store worker attributes needed by server to propagate nexus tasks to worker #9231] Store worker_control_task_queue in ActivityInfo at activity start.
  2. [Add WorkerCommandsTask outbound task to dispatch worker commands via Nexus #9232] On RequestCancelActivityTask, batch commands by control queue into WorkerCommandsTask outbound tasks.
  3. [This PR] Dispatch each task as a Nexus ExecuteCommands operation to the worker, with a 3-attempt retry cap.
  4. [SDK] Worker receives the cancel command and cancels the running activity.

Gated by dynamic config EnableCancelActivityWorkerCommand (default: off).
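
The dispatch step described above (feature gate, 3-attempt cap, best-effort drop) could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `dispatch`, `sendFunc`, `runAttempts`, and the outcome labels other than `max_attempts_exceeded` are hypothetical names, and the attempt counter is assumed to be 1-based.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// maxAttempts mirrors the 3-attempt cap from the PR description.
const maxAttempts = 3

// outcome is a hypothetical metric label; only max_attempts_exceeded is named in the PR.
type outcome string

const (
	outcomeSuccess             outcome = "success"
	outcomeError               outcome = "error"
	outcomeDisabled            outcome = "feature_disabled"
	outcomeMaxAttemptsExceeded outcome = "max_attempts_exceeded"
)

// sendFunc stands in for the Nexus ExecuteCommands call to the worker's control queue.
type sendFunc func(ctx context.Context, controlQueue string, commands []string) error

// dispatch returns a nil error when the task is done (sent or dropped)
// and a non-nil error when the queue should retry it.
func dispatch(ctx context.Context, enabled bool, attempt int, controlQueue string, commands []string, send sendFunc) (outcome, error) {
	if !enabled {
		return outcomeDisabled, nil // feature flag off: drop silently
	}
	if attempt > maxAttempts {
		return outcomeMaxAttemptsExceeded, nil // best-effort: give up, the activity will time out
	}
	if err := send(ctx, controlQueue, commands); err != nil {
		return outcomeError, fmt.Errorf("dispatch to %q: %w", controlQueue, err)
	}
	return outcomeSuccess, nil
}

// alwaysFail simulates a worker that cannot be reached.
func alwaysFail(context.Context, string, []string) error {
	return errors.New("worker unreachable")
}

// runAttempts drives dispatch the way a retrying queue might and records each outcome.
func runAttempts(enabled bool, send sendFunc, limit int) []outcome {
	var seen []outcome
	for attempt := 1; attempt <= limit; attempt++ {
		o, err := dispatch(context.Background(), enabled, attempt, "worker-control-queue", []string{"cancel:token-1"}, send)
		seen = append(seen, o)
		if err == nil {
			break // sent or dropped: no further retries
		}
	}
	return seen
}

func main() {
	fmt.Println(runAttempts(true, alwaysFail, 10))
	fmt.Println(runAttempts(false, alwaysFail, 10))
}
```

In this sketch a dropped task returns a nil error so the queue stops retrying, which is one plausible way to keep a best-effort command from reaching the dead-letter path.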

How did you test it?

Unit tests cover all dispatch outcomes (success, RPC error, timeout, operation error, feature-flag-off, max-attempts-exceeded) and response-to-error conversion paths. Functional test verifies end-to-end: cancel request → Nexus dispatch → correct payload arrives on the control queue, and asserts that the cancel command's task token matches the one from the original activity poll response.

@rkannan82 rkannan82 requested review from a team as code owners February 5, 2026 21:20
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch from a91654f to 6a655f7 Compare February 5, 2026 22:02
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/task-definition branch from d402868 to 54b6d1a Compare February 6, 2026 00:36
@rkannan82 rkannan82 changed the title from "Dispatch CancelActivityNexusTask to worker via Nexus control queue" to "Dispatch CancelActivityNexusTask to worker using Nexus operation" Feb 6, 2026
@rkannan82 rkannan82 changed the title from "Dispatch CancelActivityNexusTask to worker using Nexus operation" to "Dispatch CancelActivityNexusTask to worker" Feb 6, 2026
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch 2 times, most recently from 13d8513 to eb79c81 Compare February 11, 2026 19:39
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/task-definition branch from d103589 to 417606c Compare February 11, 2026 20:10
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch from eb79c81 to d1f2abf Compare February 11, 2026 20:11
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/task-definition branch from 417606c to 6246c4e Compare February 11, 2026 21:04
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch from d1f2abf to 7489808 Compare February 11, 2026 21:04
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/task-definition branch from 6246c4e to fec3c41 Compare February 11, 2026 21:08
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch from 37a51d2 to 11b1049 Compare February 11, 2026 21:08
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/task-definition branch from fec3c41 to 1dd975d Compare February 11, 2026 21:22
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch 2 times, most recently from 45ac313 to e96dbe8 Compare February 11, 2026 21:30
@rkannan82 rkannan82 requested a review from yycptt February 11, 2026 22:25
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/task-definition branch from 11c74b6 to 2cb3108 Compare February 12, 2026 00:43
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch 3 times, most recently from 72fe398 to 37c2a1c Compare February 12, 2026 07:09
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/task-definition branch from ac7c57e to 89d2cf5 Compare February 12, 2026 07:11
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch 2 times, most recently from 5a9c85e to d1572d7 Compare February 12, 2026 07:18
Member

@bergundy bergundy left a comment

Can you confirm whether you're covering cancel requests and pause requests?
Do we also care about canceling activities that we know timed out or are we letting the worker take care of that?
When will you be adding support for standalone activities too?

@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/task-definition branch from 89d2cf5 to 79e4e3e Compare February 15, 2026 03:33
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch 2 times, most recently from 84e5d58 to d5f5067 Compare April 1, 2026 05:24
Matching uses the clock from RecordActivityTaskStarted to build the task
token sent to the worker. Store this clock in ActivityInfo so that history
can later reconstruct the same task token (e.g. for cancel worker commands).

Key changes:
- Add started_clock field to ActivityInfo proto
- Create clock before AddActivityTaskStartedEvent so it's persisted in the
  same write (same pattern as WorkerControlTaskQueue)
- Return stored clock on retry path (same RequestId) so matching always
  gets the clock that the cancel handler will use
- Use binary/protobuf encoding for worker command payloads (SDK Core
  decodes these directly via prost, not through lang-SDK Nexus handlers)
- Cancel handler reconstructs task token using ai.StartedClock
- Functional test asserts cancel token matches the original activity token

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
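
As a rough illustration of the clock handling in the commit above: the clock is created once at activity start, returned unchanged when the same request is retried, and later reused to rebuild the token. All types and functions here (`vectorClock`, `activityInfo`, `recordStarted`, `taskToken`) are simplified, hypothetical stand-ins; the real task token is a serialized proto, not a formatted string.

```go
package main

import "fmt"

// vectorClock is a simplified stand-in for the server's clock message.
type vectorClock struct {
	ShardID int32
	Clock   int64
}

// activityInfo keeps only the fields this sketch needs.
type activityInfo struct {
	StartedRequestID string
	StartedClock     *vectorClock
}

// recordStarted stores a fresh clock on the first start and returns the
// stored clock when the same request ID is retried.
func recordStarted(ai *activityInfo, requestID string, next func() *vectorClock) (*vectorClock, error) {
	if ai.StartedRequestID == "" {
		ai.StartedRequestID = requestID
		ai.StartedClock = next() // persisted together with the started event
		return ai.StartedClock, nil
	}
	if ai.StartedRequestID == requestID {
		return ai.StartedClock, nil // retry path: same clock as before
	}
	return nil, fmt.Errorf("activity already started by request %q", ai.StartedRequestID)
}

// taskToken rebuilds a token from stored state. It reports false when no
// clock was stored, so the caller can skip the cancel command.
func taskToken(activityID string, clock *vectorClock) (string, bool) {
	if clock == nil {
		return "", false
	}
	return fmt.Sprintf("%s/%d/%d", activityID, clock.ShardID, clock.Clock), true
}

func main() {
	ai := &activityInfo{}
	tick := int64(0)
	next := func() *vectorClock { tick++; return &vectorClock{ShardID: 1, Clock: tick} }

	first, _ := recordStarted(ai, "req-1", next)
	retry, _ := recordStarted(ai, "req-1", next)
	fmt.Println("same clock on retry:", first == retry)

	token, ok := taskToken("activity-7", ai.StartedClock)
	fmt.Println(token, ok)
}
```

Because the retry path never mints a second clock, the token handed to the worker and the token rebuilt for the cancel command are derived from the same stored value.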
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch from d5f5067 to dfac6d2 Compare April 1, 2026 05:31
rkannan82 and others added 4 commits April 1, 2026 13:30
Worker commands are best-effort (activity will eventually time out anyway),
so retrying up to 70 times (the global DLQ default) wastes resources. Expose
the in-memory attempt counter from Executable and check it in the dispatcher
to drop tasks after 3 attempts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Records worker_commands_sent{outcome=max_attempts_exceeded} so dropped
tasks are observable in dashboards and alerts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Member

@yycptt yycptt left a comment

I didn't review the actual nexus request dispatch part closely. Nexus crew can do a better job than me I guess :) .

rkannan82 added a commit to temporalio/api that referenced this pull request Apr 7, 2026
## Summary

Defines a Nexus service for server-to-worker communication, starting
with activity cancellation support.

## Design Decision

We chose a **generic command API** (`ExecuteCommandsRequest` with
`oneof` command types) instead of a cancel-specific API. This allows a
future optimization to batch multiple commands (cancel, pause, etc) in a
single request and deliver to a worker in one RPC.
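
To illustrate the design choice, here is a small Go sketch of a generic command envelope handled with a type switch. It is not the generated proto code: `workerCommand`, `cancelActivity`, `pauseActivity`, `executeCommandsRequest`, and `handle` are hypothetical names, and the pause variant is included only to show how a future command type could share the same request.

```go
package main

import "fmt"

// workerCommand is a closed set of variants, mimicking a proto oneof.
type workerCommand interface{ isWorkerCommand() }

// cancelActivity asks the worker to cancel the activity identified by TaskToken.
type cancelActivity struct{ TaskToken string }

// pauseActivity is a hypothetical future variant.
type pauseActivity struct{ TaskToken string }

func (cancelActivity) isWorkerCommand() {}
func (pauseActivity) isWorkerCommand()  {}

// executeCommandsRequest carries a batch of commands in a single request.
type executeCommandsRequest struct{ Commands []workerCommand }

// handle processes each command in order and returns one result per command.
func handle(req executeCommandsRequest) []string {
	results := make([]string, 0, len(req.Commands))
	for _, c := range req.Commands {
		switch cmd := c.(type) {
		case cancelActivity:
			results = append(results, "cancel "+cmd.TaskToken)
		case pauseActivity:
			results = append(results, "pause "+cmd.TaskToken)
		default:
			results = append(results, "unsupported command")
		}
	}
	return results
}

func main() {
	req := executeCommandsRequest{Commands: []workerCommand{
		cancelActivity{TaskToken: "t1"},
		pauseActivity{TaskToken: "t2"},
	}}
	for _, r := range handle(req) {
		fmt.Println(r)
	}
}
```

Adding a command type in this shape means adding a variant and a switch case, while the request and the single round trip stay unchanged.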

## Files

- `temporal/api/nexusservices/workerservice/v1/request_response.proto` -
request and response definitions
- `nexus-rpc/temporal-proto-models-nexusrpc.yaml` - Nexus service
definition

## Related

- [Server PR](temporalio/temporal#9233)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
temporal-cicd bot pushed a commit to temporalio/api-go that referenced this pull request Apr 7, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spkane31 pushed a commit to temporalio/api that referenced this pull request Apr 9, 2026
@rkannan82 rkannan82 requested review from a team as code owners April 10, 2026 21:39
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch 2 times, most recently from 5d17c04 to 5521837 Compare April 10, 2026 21:50
- Add StartedClock nil check in cancel handler (backward compat for
  activities started before deploy)
- Add WorkerCommandsTask case to standby executor (drop task)
- Use shardCtx.GetConfig() instead of passing config param
- Add lock around Executable.Attempt() for thread safety
- Add replication comment on started_clock proto field
- Add tests for cancel command with/without StartedClock

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
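
The "lock around Executable.Attempt()" item can be pictured with the sketch below, which guards an in-memory attempt counter with a mutex so a dispatcher can read it while the queue updates it. The `executable` type, `recordFailure`, and `runConcurrentFailures` are hypothetical, and starting the counter at 1 is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// executable holds an in-memory attempt counter shared between the queue
// (which increments it) and the dispatcher (which reads it).
type executable struct {
	mu      sync.Mutex
	attempt int
}

// newExecutable starts the counter at 1 (an assumption of this sketch).
func newExecutable() *executable { return &executable{attempt: 1} }

// Attempt returns the current attempt number under the lock.
func (e *executable) Attempt() int {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.attempt
}

// recordFailure bumps the attempt counter under the same lock.
func (e *executable) recordFailure() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.attempt++
}

// runConcurrentFailures records n failures from n goroutines while also
// reading the counter, then returns the final attempt number.
func runConcurrentFailures(n int) int {
	e := newExecutable()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			e.recordFailure()
			_ = e.Attempt() // concurrent reader
		}()
	}
	wg.Wait()
	return e.Attempt()
}

func main() {
	fmt.Println(runConcurrentFailures(50))
}
```

Without the mutex, the concurrent increments and reads would be a data race that `go run -race` reports.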
@rkannan82 rkannan82 force-pushed the kannan/activity-cancel/dispatch-logic branch from 5521837 to 3e4267f Compare April 10, 2026 21:59
rkannan82 added a commit that referenced this pull request Apr 11, 2026
…worker (#9231)

## What changed?
As part of RecordActivityTaskStarted flow, store
worker_control_task_queue for an activity in the mutable state
(ActivityInfo).

Main changes:
- executions.proto: Added the new worker_control_task_queue field.
- mutable_state_impl.go: Update mutable state.
- matching/forwarder.go: Propagate worker_control_task_queue when polls
get forwarded. Otherwise, RecordActivityTaskStarted request will not
have it set when invoked from a forwarded poll.

## Why?
To support activity cancellation without activity heartbeat.

Overall flow:
- [This PR] Store worker attributes in ActivityInfo as part of
RecordActivityTaskStarted call.
- [#9232] When user cancels a workflow, create 1 or more tasks. Group
all activities belonging to a worker into the task (for efficiency).
- [#9233] Look up the Nexus task queue for each worker, and send a Nexus
operation for each transfer task.
- [SDK] Worker will receive this cancel task and cancel the running
activities.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
rkannan82 and others added 3 commits April 10, 2026 19:17
The replace directive pointed to a pre-release commit. The released
v1.62.8 includes all needed protos (WorkerCommand, etc).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The else-if branch already implies StartedEventId != EmptyEventID
since the prior if-branch handles the == case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rkannan82 added a commit that referenced this pull request Apr 11, 2026
…Nexus (#9232)

## What changed?

New outbound task type (`WorkerCommandsTask`) that carries worker
commands to be dispatched to workers via Nexus. Uses the generic
`WorkerCommand` proto (not cancel-activity-specific), so this task type
can carry any future command types.

Suggested review order: proto changes → `worker_commands_task.go` →
`task_generator.go` → `workflow_task_completed_handler.go`

Key pieces:
- **Proto**: `TASK_TYPE_WORKER_COMMANDS` enum, `WorkerCommandsTask` in
`OutboundTaskInfo` with `repeated WorkerCommand`.
- **Task definition**: `worker_commands_task.go` — implements outbound
`Task` and `HasDestination` interfaces.
- **Task creation** (`workflow_task_completed_handler.go`,
`task_generator.go`): When `RequestCancelActivityTask` is processed for
a started activity whose worker has a control queue, collects a
`CancelActivityCommand` with the activity's task token. Commands are
batched by destination control queue and flushed as one
`WorkerCommandsTask` per queue at the end of WFT processing.
- **Serialization**: `task_serializers.go` for persistence
round-tripping.

Dispatch is a no-op here — handled in #9233. Gated by dynamic config
`EnableCancelActivityWorkerCommand` (default: off).
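
The batching step described under "Task creation" might look roughly like this: collect cancel commands during workflow task processing, then flush one task per destination control queue. This is a sketch with hypothetical types (`cancelCommand`, `workerCommandsTask`) and a hypothetical `batchByQueue` helper. Sorting the queue names is only there to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// cancelCommand pairs an activity's task token with its worker's control queue.
type cancelCommand struct {
	ControlQueue string
	TaskToken    string
}

// workerCommandsTask is one outbound task per destination control queue.
type workerCommandsTask struct {
	Destination string
	TaskTokens  []string
}

// batchByQueue groups commands by control queue and emits one task per queue.
// Commands without a control queue are skipped, since there is nowhere to send them.
func batchByQueue(cmds []cancelCommand) []workerCommandsTask {
	byQueue := make(map[string][]string)
	for _, c := range cmds {
		if c.ControlQueue == "" {
			continue
		}
		byQueue[c.ControlQueue] = append(byQueue[c.ControlQueue], c.TaskToken)
	}

	queues := make([]string, 0, len(byQueue))
	for q := range byQueue {
		queues = append(queues, q)
	}
	sort.Strings(queues) // deterministic order for this example

	tasks := make([]workerCommandsTask, 0, len(queues))
	for _, q := range queues {
		tasks = append(tasks, workerCommandsTask{Destination: q, TaskTokens: byQueue[q]})
	}
	return tasks
}

func main() {
	tasks := batchByQueue([]cancelCommand{
		{ControlQueue: "q1", TaskToken: "a"},
		{ControlQueue: "q2", TaskToken: "b"},
		{ControlQueue: "q1", TaskToken: "c"},
		{ControlQueue: "", TaskToken: "d"}, // worker without a control queue
	})
	for _, t := range tasks {
		fmt.Println(t.Destination, t.TaskTokens)
	}
}
```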

## Why?

To support proactive activity cancellation without waiting for
heartbeat. This is the task creation leg of the flow.

1. [#9231] Store `worker_control_task_queue` in `ActivityInfo` at
activity start.
2. **[This PR]** On `RequestCancelActivityTask`, batch commands by
control queue into `WorkerCommandsTask` outbound tasks.
3. [#9233] Dispatch each task as a Nexus `ExecuteCommands` operation to
the worker, with a 3-attempt retry cap.
4. [SDK] Worker receives the cancel command and cancels the running
activity.

Gated by dynamic config `EnableCancelActivityWorkerCommand` (default:
off).


## How did you test it?

**Unit tests** cover task generation, command batching (including
multi-queue batching), task serialization round-tripping, and the
feature-flag-off path.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
rkannan82 and others added 6 commits April 10, 2026 20:53
Resolves conflicts after #9924 (Executable.Attempt) and #9232
(task-definition) merged to main. Regenerated proto files for
started_clock field addition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Restore tab alignment for TaskDiscarded in metric_defs.go
- Remove extra blank line in handler test
- Replace assert.Equal/Contains/Empty with require equivalents
  in dispatcher, handler, and dispatch response tests
- Remove unused assert imports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Nexus service descriptor (WorkerService.ServiceName,
WorkerService.ExecuteCommands.Name()) is not yet published in
go.temporal.io/api v1.62.8. Use hardcoded constants until it is.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>