[BUG] readMany suffers from head-of-line blocking due to Flux.mergeSequential in ParallelDocumentQueryExecutionContext #48441

@zoltan-toth-mw

Description

Describe the bug

CosmosAsyncContainer.readMany uses ParallelDocumentQueryExecutionContext for partition ranges with 2+ items. This execution context merges per-partition query results using Flux.mergeSequential(obs, fluxConcurrency, prefetch).

Flux.mergeSequential preserves upstream ordering by requesting a new item from the source iterator only when the head (earliest-in-order) inner publisher completes (FluxMergeSequential.drain() → s.request(1)). If a non-head inner publisher completes, no new subscription is triggered — the freed concurrency slot sits idle.

For readMany, this causes severe head-of-line blocking: a single slow partition response stalls all remaining partition fetches, even when concurrency slots are available.

My use case, based on Cosmos DB diagnostics logs

readMany with 207 items spanning 55 partition key ranges, fluxSequentialMergeConcurrency = 6.

One partition (pkRangeId=92) had a transitTime of 982ms (backend latency was only 0.69ms — the delay was network/IO, not Cosmos server-side). The other 5 concurrent partition queries completed in 1-13ms each.

Expected: the 5 freed slots immediately dispatch the remaining 27 partition queries. Total wall clock ≈ 982ms + 27×3ms/5 ≈ 1,000ms.

Actual: no new partition queries were dispatched for 981ms, until pkRangeId=92 completed; only then did the remaining 27 queries start. Total wall clock = 1,020ms, of which 981ms was spent with 5 of the 6 concurrency slots idle.
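A quick back-of-envelope check of the expected estimate (the ~3 ms per-query cost, the wave-based draining, and the helper class below are illustrative assumptions, not SDK code):

```java
public class ReadManyEstimate {
    // Hypothetical model: one slow partition query pins the head for slowMs,
    // while the remaining fast queries could drain across the free slots in
    // ceil(remaining / freeSlots) waves of roughly fastMs each.
    public static double expectedWallClockMs(double slowMs, int remaining,
                                             double fastMs, int freeSlots) {
        double waves = Math.ceil((double) remaining / freeSlots);
        return slowMs + waves * fastMs;
    }

    public static void main(String[] args) {
        // 982 ms outlier + ceil(27/5) * 3 ms = 1000.0, vs. the observed 1,020 ms
        System.out.println(expectedWallClockMs(982.0, 27, 3.0, 5));
    }
}
```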

Timeline (55 cross-partition queries, 6 point reads omitted):

Time (ms)      In-flight    Event
─────────────────────────────────────────────────────────────────────
0.0            6            Initial 6 partition queries dispatched
0.5 – 17.0     6            Queries complete, head advances, slots refilled — working correctly
17.0           6            pk=92 starts as part of a new batch of 6 (pk=91, 92, 79, 80, 69, 70)
17.5 – 18.5    6→5          pk=91 (head before 92) completes; pk=95 dispatched to refill the slot
19.0 – 31.0    5→4→3→2→1    pk=70, 80, 79, 95, 69 complete — slots NOT refilled (pk=92 is now head)
31.0 – 999.0   1            Only pk=92 in-flight; 5 slots idle; 27 queries waiting
999.0          6            pk=92 completes; head advances through 5 already-done inners;
                            s.request() called 6 times; 6 new queries dispatched
999.5 – 1020   6→…          Remaining 27 queries drain normally

Root cause

readMany does not require ordered results — it returns items as a collection with no ordering guarantee. However, it reuses ParallelDocumentQueryExecutionContext, which uses Flux.mergeSequential to support deterministic ordering for general cross-partition queries (important for continuation tokens / pagination).

readMany pays the head-of-line blocking cost of mergeSequential for ordering semantics it does not need.
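To quantify that cost, here is a minimal, self-contained simulation (plain Java, not SDK or Reactor code) of the two drain policies under assumed latencies mirroring the diagnostics above: 55 partition queries at ~3 ms each, one 982 ms outlier placed mid-stream, concurrency 6. simulateMergeSequential models the mergeSequential rule (a slot is refilled only as the head advances); simulateMerge models an unordered merge where any freed slot immediately takes the next query.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class MergeSequentialSim {
    // mergeSequential semantics: a freed slot is refilled only when the head
    // (earliest-subscribed) inner completes and the drain loop advances past it.
    public static double simulateMergeSequential(double[] latencies, int concurrency) {
        int n = latencies.length;
        double[] doneAt = new double[n];
        Arrays.fill(doneAt, Double.NaN);                  // NaN = not yet complete
        PriorityQueue<double[]> completions =             // entries: [completionTime, index]
                new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
        int next = 0, head = 0;
        for (; next < Math.min(concurrency, n); next++) { // initial window subscribed at t=0
            completions.add(new double[]{latencies[next], next});
        }
        double wallClock = 0.0;
        while (!completions.isEmpty()) {
            double[] e = completions.poll();
            double t = e[0];
            doneAt[(int) e[1]] = t;
            wallClock = Math.max(wallClock, t);
            // drain: advance past completed inners; each advance requests one more source
            while (head < n && !Double.isNaN(doneAt[head])) {
                head++;
                if (next < n) {
                    completions.add(new double[]{t + latencies[next], next});
                    next++;
                }
            }
        }
        return wallClock;
    }

    // Unordered merge semantics: any freed slot immediately takes the next query.
    public static double simulateMerge(double[] latencies, int concurrency) {
        PriorityQueue<Double> slotFreeAt = new PriorityQueue<>();
        for (int i = 0; i < concurrency; i++) slotFreeAt.add(0.0);
        double wallClock = 0.0;
        for (double latency : latencies) {
            double done = slotFreeAt.poll() + latency;    // next query runs on earliest-free slot
            wallClock = Math.max(wallClock, done);
            slotFreeAt.add(done);
        }
        return wallClock;
    }

    public static void main(String[] args) {
        double[] latencies = new double[55];
        Arrays.fill(latencies, 3.0);                      // assumed fast-partition latency
        latencies[15] = 982.0;                            // one slow partition, assumed position
        System.out.println("mergeSequential: " + simulateMergeSequential(latencies, 6) + " ms");
        System.out.println("merge:           " + simulateMerge(latencies, 6) + " ms");
    }
}
```

Under these assumed latencies the head-blocking policy finishes at 1,006 ms while the unordered policy finishes at 988 ms, matching the shape of the observed 1,020 ms actual vs. the ~1,000 ms estimate.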

Labels

customer-reported: Issues that are reported by GitHub users external to the Azure organization.
needs-triage: Workflow: This is a new issue that needs to be triaged to the appropriate team.
question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
