Problem Description
We are running Durable Functions orchestrations on Azure Functions Isolated Worker (.NET 10, Premium P1V3, Linux). Across millions of executions, activities occasionally fail with two classes of exceptions:

- System.InvalidOperationException: Did not find any initialized language workers
- System.IO.FileNotFoundException: Could not load file or assembly (various assemblies: System.IO.Hashing, Microsoft.ApplicationInsights, Polly, Microsoft.DurableTask.Grpc)
During investigation, we found correlated host-level log records: "Worker channel is shutting down. Aborting function." and FunctionTimeoutException: Timeout value of 00:10:00 was exceeded, which revealed a cascading failure chain.
The cascade
Our functionTimeout is set to 00:10:00. When a single activity exceeds this limit (e.g. due to an upstream API being slow), the host fires FunctionTimeoutException. However, the timeout does not only affect the timed-out activity — all other activities concurrently running on the same worker instance are also interrupted. The co-located activities that were in-flight receive WorkerProcessExitException: dotnet exited with code 135 (0x87) as the host kills the worker process. This is confirmed by our telemetry showing multiple unrelated activities failing at the exact same timestamp (within 2 seconds) on the same AppRoleInstance.
Regarding this cascading shutdown: it seems this behavior does not cause fatal failures on its own — activities interrupted by the channel shutdown are retried by the Durable Task framework (the failed activity queue messages are returned to the control queue and picked up by the next available worker). Please confirm that such cascading shutdown of co-located activities is by design when one activity hits functionTimeout. We are working on optimizing our implementation to avoid causing timeouts in the first place.
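For reference, the two settings involved in the cascade are configured in our host.json roughly as follows (a minimal sketch; unrelated settings omitted, values match those described above):

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 5
    }
  }
}
```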
However, what happens next is the real problem. After the worker process is killed, the host begins restarting it. During this restart window, work continues to be dequeued from the Durable Task control queue and dispatched:
- Activities dispatched before the new worker initializes fail immediately (in 2 ms) with InvalidOperationException: Did not find any initialized language workers — RpcFunctionInvocationDispatcherLoadBalancer.GetLanguageWorkerChannel() finds zero workers.
- Activities dispatched while the new worker is starting but hasn't finished loading assemblies fail with FileNotFoundException: Could not load file or assembly — the host considers the worker "initialized" (the gRPC channel is up) but the CLR hasn't loaded all required assemblies yet.
The channel shutdown and restart should not cause this kind of fatal damage to unrelated activities. The host should not dispatch work to a worker that isn't ready.
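As an interim mitigation on our side (it does not address the host behavior), we are wrapping activity calls in a Durable Task retry policy so that transient failures during a worker restart are retried instead of failing the whole orchestration. A sketch against the isolated-worker Microsoft.DurableTask API; the class name and retry values are illustrative:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public class BacklogOrchestration
{
    // Illustrative policy: 4 attempts with exponential backoff gives a
    // restarting worker time to come back before the orchestration gives up.
    private static readonly TaskOptions Retry = TaskOptions.FromRetryPolicy(
        new RetryPolicy(
            maxNumberOfAttempts: 4,
            firstRetryInterval: TimeSpan.FromSeconds(10),
            backoffCoefficient: 2.0));

    [Function(nameof(RunOrchestrator))]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // query-backlog-activity is the activity that failed in the incident
        // timeline below; with the retry policy, a TaskFailedException caused by
        // InvalidOperationException / FileNotFoundException on a dead or
        // half-initialized worker is retried rather than surfacing as a
        // workflow failure.
        await context.CallActivityAsync("query-backlog-activity", options: Retry);
    }
}
```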
Observed timeline (real production incident)
| Time (UTC) | Event |
| --- | --- |
| 10:13:04 | Multiple activities start on worker instance 04aab4... |
| 10:33:22 | 4 activities (finalize-resource-activity, get-resource x3) exceed the 10-min functionTimeout |
| 10:33:24 | Host logs: "Worker channel is shutting down. Aborting function." |
| 10:33:24 | All pending activities on the instance aborted |
| 10:33:24 | query-backlog-activity dispatched to the dead worker — fails in 2 ms with InvalidOperationException |
| 10:33:27 | Orchestrator replays, hits an unhandled TaskFailedException, marks the workflow as Failed |
Key observations
- One timeout kills many. A single FunctionTimeoutException triggers a worker channel shutdown that aborts ALL co-located activities — not just the timed-out one. With maxConcurrentActivityFunctions: 5, a single timeout can cascade into up to 5 activity failures.
- The host dispatches work before workers are ready. After the channel shutdown, the Durable Task storage provider continues polling control queues and dispatching messages. The host does not gate invocation on worker readiness — RpcFunctionInvocationDispatcher.InvokeAsync proceeds immediately and fails.
- "Initialized" doesn't mean "ready." Even after a new worker starts and the gRPC channel connects, assembly loading is not complete. The host begins dispatching work during this window, causing FileNotFoundException.
Product: Azure Functions (Isolated Worker, .NET 10)
Plan: Premium P1V3
Runtime SDK: azurefunctions: 4.1046.100.25616
Durable Task Extension: Microsoft.Azure.WebJobs.Extensions.DurableTask 3.0.0
Worker Extension: Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.16.3
OS: Linux (Ubuntu 24.04 containers)
Regions affected: UK South, Canada Central, West Europe