Azure Functions Host Dispatches Work Before Language Workers Are Initialized #1338

@andrey-malkov

Problem Description

We are running Durable Functions orchestrations on Azure Functions Isolated Worker (.NET 10, Premium P1V3, Linux). Out of millions of executions, sometimes activities fail with two classes of exceptions:

  1. System.InvalidOperationException: Did not find any initialized language workers
  2. System.IO.FileNotFoundException: Could not load file or assembly (various assemblies: System.IO.Hashing, Microsoft.ApplicationInsights, Polly, Microsoft.DurableTask.Grpc)

During investigation, we found correlated host-level log records: "Worker channel is shutting down. Aborting function." and FunctionTimeoutException: Timeout value of 00:10:00 was exceeded, which revealed a cascading failure chain.

The cascade

Our functionTimeout is set to 00:10:00. When a single activity exceeds this limit (e.g. due to a slow upstream API), the host throws FunctionTimeoutException. However, the timeout does not affect only the timed-out activity: all other activities running concurrently on the same worker instance are interrupted as well. The co-located in-flight activities receive WorkerProcessExitException: dotnet exited with code 135 (0x87) as the host kills the worker process. Our telemetry confirms this: multiple unrelated activities fail within 2 seconds of each other on the same AppRoleInstance.

Regarding this cascading shutdown: it seems this behavior does not cause fatal failures on its own — activities interrupted by the channel shutdown are retried by the Durable Task framework (the failed activity queue messages are returned to the control queue and picked up by the next available worker). Please confirm that such cascading shutdown of co-located activities is by design when one activity hits functionTimeout. We are working on optimizing our implementation to avoid causing timeouts in the first place.

However, what happens next is the real problem. After the worker process is killed, the host begins restarting it. During this restart window, work continues to be dequeued from the Durable Task control queue and dispatched:

  • Activities dispatched before the new worker initializes fail immediately (in 2ms) with InvalidOperationException: Did not find any initialized language workers. RpcFunctionInvocationDispatcherLoadBalancer.GetLanguageWorkerChannel() finds zero workers.

  • Activities dispatched while the new worker is starting but hasn't finished loading assemblies fail with FileNotFoundException: Could not load file or assembly — the host considers the worker "initialized" (gRPC channel is up) but the CLR hasn't loaded all required assemblies yet.

The channel shutdown and restart should not cause this kind of fatal damage to unrelated activities. The host should not dispatch work to a worker that isn't ready.
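To make the expected behavior concrete, here is a small self-contained simulation of the difference between ungated and readiness-gated dispatch. It is only a sketch: the `Worker`, `dispatch_ungated`, and `dispatch_gated` names are hypothetical and do not correspond to the actual host implementation, and `NoReadyWorkerError` merely stands in for the InvalidOperationException we observe.

```python
import asyncio


class NoReadyWorkerError(RuntimeError):
    """Stand-in for 'Did not find any initialized language workers'."""


class Worker:
    """Hypothetical worker: 'ready' is set only once startup has fully completed."""

    def __init__(self):
        self.ready = asyncio.Event()

    async def start(self, startup_seconds):
        # Simulated process start plus assembly loading.
        await asyncio.sleep(startup_seconds)
        self.ready.set()

    async def invoke(self, name):
        if not self.ready.is_set():
            raise NoReadyWorkerError(name)
        return f"{name}: ok"


async def dispatch_ungated(worker, name):
    # Observed behavior: invoke immediately and fail if the worker is not up.
    return await worker.invoke(name)


async def dispatch_gated(worker, name, wait_seconds):
    # Expected behavior: hold the invocation until the worker reports ready,
    # with a bounded wait so a worker that never starts still surfaces an error.
    await asyncio.wait_for(worker.ready.wait(), timeout=wait_seconds)
    return await worker.invoke(name)


async def simulate():
    worker = Worker()
    startup = asyncio.create_task(worker.start(0.05))  # worker is restarting
    try:
        ungated = await dispatch_ungated(worker, "query-backlog-activity")
    except NoReadyWorkerError:
        ungated = "failed"
    gated = await dispatch_gated(worker, "query-backlog-activity", wait_seconds=1.0)
    await startup
    return ungated, gated


if __name__ == "__main__":
    print(asyncio.run(simulate()))
```

In this model the ungated dispatch fails during the restart window, while the gated dispatch simply waits for the worker and then succeeds. The bounded wait is a design choice of the sketch, not a claim about how the host should size such a timeout.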

Observed timeline (real production incident)

Time (UTC) | Event
10:13:04 | Multiple activities start on worker instance 04aab4...
10:33:22 | 4 activities (finalize-resource-activity, get-resource x3) exceed 10-min functionTimeout
10:33:24 | Host logs: "Worker channel is shutting down. Aborting function."
10:33:24 | All pending activities on the instance aborted
10:33:24 | query-backlog-activity dispatched to dead worker; fails in 2ms with InvalidOperationException
10:33:27 | Orchestrator replays, hits unhandled TaskFailedException, marks workflow as Failed

Key observations

  1. One timeout kills many. A single FunctionTimeoutException triggers a worker channel shutdown that aborts ALL co-located activities — not just the timed-out one. With maxConcurrentActivityFunctions: 5, a single timeout can cascade into up to 5 activity failures.

  2. The host dispatches work before workers are ready. After the channel shutdown, the Durable Task storage provider continues polling control queues and dispatching messages. The host does not gate invocation on worker readiness — RpcFunctionInvocationDispatcher.InvokeAsync proceeds immediately and fails.

  3. "Initialized" doesn't mean "ready." Even after a new worker starts and the gRPC channel connects, assembly loading is not complete. The host begins dispatching work during this window, causing FileNotFoundException.
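
For reference, a minimal host.json sketch carrying the two settings named above. The values are the ones quoted in this issue; the layout is the standard host.json 2.0 schema, assumed here rather than copied from the deployed file, and all other settings are omitted.

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 5
    }
  }
}
```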


Product: Azure Functions (Isolated Worker, .NET 10)
Plan: Premium P1V3
Runtime SDK: azurefunctions: 4.1046.100.25616
Durable Task Extension: Microsoft.Azure.WebJobs.Extensions.DurableTask 3.0.0
Worker Extension: Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.16.3
OS: Linux (Ubuntu 24.04 containers)
Regions affected: UK South, Canada Central, West Europe

