Azure Functions Host Dispatches Work Before Language Workers Are Initialized

### Problem Description

We run Durable Functions orchestrations on the Azure Functions isolated worker model (.NET 10, Premium P1V3, Linux). Across millions of executions, activities occasionally fail with one of two classes of exception:

1. **`System.InvalidOperationException: Did not find any initialized language workers`**
2. **`System.IO.FileNotFoundException: Could not load file or assembly`** (various assemblies: `System.IO.Hashing`, `Microsoft.ApplicationInsights`, `Polly`, `Microsoft.DurableTask.Grpc`)

During investigation, we found correlated host-level log records: **`"Worker channel is shutting down. Aborting function."`** and **`FunctionTimeoutException: Timeout value of 00:10:00 was exceeded`**, which revealed a cascading failure chain.

#### The cascade

Our `functionTimeout` is set to `00:10:00`. When a single activity exceeds this limit (e.g. because an upstream API is slow), the host raises `FunctionTimeoutException`. The timeout does not affect only the timed-out activity, however: **all other activities running concurrently on the same worker instance are also interrupted**. The co-located in-flight activities receive **`WorkerProcessExitException: dotnet exited with code 135 (0x87)`** as the host kills the worker process. Our telemetry confirms this: multiple unrelated activities fail within 2 seconds of each other on the same `AppRoleInstance`.
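
To illustrate how the correlation in our telemetry can be checked, here is a minimal Python sketch that groups failure records by instance and reports bursts of failures falling within 2 seconds of each other. The record shape, the instance names, and the date are hypothetical placeholders, not our actual telemetry schema; only the activity names and the 2-second window come from this report.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_failure_bursts(failures, window=timedelta(seconds=2), min_size=2):
    """Return (instance, [activities]) for each burst of co-located failures.

    `failures` is an iterable of (instance, timestamp, activity) tuples.
    A burst is a run of failures on one instance whose timestamps all fall
    within `window` of the first failure in the run.
    """
    by_instance = defaultdict(list)
    for instance, timestamp, activity in failures:
        by_instance[instance].append((timestamp, activity))

    bursts = []
    for instance, events in by_instance.items():
        events.sort()
        current = [events[0]]
        for event in events[1:]:
            if event[0] - current[0][0] <= window:
                current.append(event)
            else:
                if len(current) >= min_size:
                    bursts.append((instance, [name for _, name in current]))
                current = [event]
        if len(current) >= min_size:
            bursts.append((instance, [name for _, name in current]))
    return bursts


# Hypothetical sample records (instance names and date are made up).
SAMPLE_FAILURES = [
    ("instance-a", datetime(2024, 1, 1, 10, 33, 22), "finalize-resource-activity"),
    ("instance-a", datetime(2024, 1, 1, 10, 33, 23), "get-resource"),
    ("instance-a", datetime(2024, 1, 1, 10, 33, 24), "query-backlog-activity"),
    ("instance-b", datetime(2024, 1, 1, 10, 40, 0), "get-resource"),
]

if __name__ == "__main__":
    for instance, activities in find_failure_bursts(SAMPLE_FAILURES):
        print(instance, activities)
```

A burst containing several unrelated activity names on one instance is the signature we treat as a cascade; an isolated failure on another instance is not reported.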

**Regarding this cascading shutdown:** this behavior does not appear to cause fatal failures on its own. Activities interrupted by the channel shutdown are retried by the Durable Task framework: the failed activity messages are returned to the work-item queue and picked up by the next available worker. **Please confirm that this cascading shutdown of co-located activities is by design when one activity hits `functionTimeout`.** We are working on optimizing our implementation to avoid timeouts in the first place.

However, **what happens next is the real problem.** After the worker process is killed, the host begins restarting it. During this restart window, the host keeps dequeuing work from the Durable Task queues and dispatching it:

- **Activities dispatched before the new worker initializes** fail immediately (within 2 ms) with **`InvalidOperationException: Did not find any initialized language workers`**, because `RpcFunctionInvocationDispatcherLoadBalancer.GetLanguageWorkerChannel()` finds zero workers.

- **Activities dispatched while the new worker is starting but hasn't finished loading assemblies** fail with **`FileNotFoundException: Could not load file or assembly`** — the host considers the worker "initialized" (gRPC channel is up) but the CLR hasn't loaded all required assemblies yet.

The channel shutdown and restart should not cause this kind of fatal damage to unrelated activities. The host should not dispatch work to a worker that isn't ready.
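
The behavior we are asking for can be sketched as a toy model. The Python below is not host code: `WorkerState`, `GatedDispatcher`, and their methods are hypothetical names invented for illustration. It only shows the idea of holding work back until the worker reports that it is ready, rather than failing it while the worker is stopped or only has its channel connected.

```python
from collections import deque
from enum import Enum, auto


class WorkerState(Enum):
    STOPPED = auto()     # worker process killed; nothing can run
    CHANNEL_UP = auto()  # gRPC channel connected, worker may still be loading
    READY = auto()       # worker has signalled it can execute functions


class GatedDispatcher:
    """Toy dispatcher that defers work until the worker is READY."""

    def __init__(self):
        self.state = WorkerState.STOPPED
        self.pending = deque()  # work held back while the worker is not ready
        self.executed = []      # work handed to a ready worker, in order

    def dispatch(self, activity):
        if self.state is WorkerState.READY:
            self.executed.append(activity)
        else:
            # Defer instead of failing the invocation immediately.
            self.pending.append(activity)

    def set_state(self, state):
        self.state = state
        if state is WorkerState.READY:
            # Drain deferred work in arrival order once the worker is ready.
            while self.pending:
                self.executed.append(self.pending.popleft())


def demo():
    dispatcher = GatedDispatcher()
    dispatcher.dispatch("query-backlog-activity")  # arrives while STOPPED
    dispatcher.set_state(WorkerState.CHANNEL_UP)
    dispatcher.dispatch("get-resource")            # arrives during startup
    held_during_startup = len(dispatcher.pending)
    dispatcher.set_state(WorkerState.READY)
    return held_during_startup, dispatcher.executed


if __name__ == "__main__":
    print(demo())
```

In this model neither activity fails: both are held during the restart window and run once the worker is ready. Whether the host should queue, delay dequeuing, or abandon the message for redelivery is a design choice we leave to the maintainers.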

#### Observed timeline (real production incident)

| Time (UTC) | Event |
|---|---|
| 10:13:04 | Multiple activities start on worker instance `04aab4...` |
| 10:33:22 | 4 activities (`finalize-resource-activity`, `get-resource` x3) exceed 10-min `functionTimeout` |
| 10:33:24 | Host logs: _"Worker channel is shutting down. Aborting function."_ |
| 10:33:24 | All pending activities on the instance aborted |
| 10:33:24 | `query-backlog-activity` dispatched to dead worker — fails in **2ms** with `InvalidOperationException` |
| 10:33:27 | Orchestrator replays, hits unhandled `TaskFailedException`, marks workflow as `Failed` |

#### Key observations

1. **One timeout kills many.** A single `FunctionTimeoutException` triggers a worker channel shutdown that aborts ALL co-located activities — not just the timed-out one. With `maxConcurrentActivityFunctions: 5`, a single timeout can cascade into up to 5 activity failures.

2. **The host dispatches work before workers are ready.** After the channel shutdown, the Durable Task storage provider continues polling control queues and dispatching messages. The host does not gate invocation on worker readiness — `RpcFunctionInvocationDispatcher.InvokeAsync` proceeds immediately and fails.

3. **"Initialized" doesn't mean "ready."** Even after a new worker starts and the gRPC channel connects, assembly loading is not complete. The host begins dispatching work during this window, causing `FileNotFoundException`.
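
For reference, the two settings mentioned in this report would sit in `host.json` roughly as shown below. This is a trimmed sketch that assumes the standard `host.json` layout; the values are the ones quoted above, and all other settings in our file are omitted.

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 5
    }
  }
}
```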
---

**Product**: Azure Functions (Isolated Worker, .NET 10)  
**Plan**: Premium P1V3  
**Runtime SDK**: `azurefunctions: 4.1046.100.25616`  
**Durable Task Extension**: `Microsoft.Azure.WebJobs.Extensions.DurableTask 3.0.0`  
**Worker Extension**: `Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.16.3`  
**OS**: Linux (Ubuntu 24.04 containers)  
**Regions affected**: UK South, Canada Central, West Europe  

---

