Description
Describe the bug
CosmosAsyncContainer.readMany uses ParallelDocumentQueryExecutionContext for partition key ranges that cover 2+ of the requested items. This execution context merges the per-partition query results using Flux.mergeSequential(obs, fluxConcurrency, prefetch).
Flux.mergeSequential preserves upstream ordering by requesting a new item from the source iterator only when the head (earliest-in-order) inner publisher completes (FluxMergeSequential.drain() → s.request(1)). When a non-head inner publisher completes, no new subscription is triggered — the freed concurrency slot sits idle.
For readMany, this causes severe head-of-line blocking: a single slow partition response stalls all remaining partition fetches, even while concurrency slots are available.
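The effect of that refill policy can be illustrated with a small discrete-event sketch in plain Java (no Reactor dependency). This models only the slot-refill rule — refill on head completion vs. refill on any completion — not the actual FluxMergeSequential internals; the class and method names are hypothetical. With 55 queries of ~3ms each and one 982ms outlier, the head-only policy leaves slots idle behind the outlier:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.PriorityQueue;

// Discrete-event sketch (not the real Reactor internals) of two slot-refill
// policies. Each duration is one inner publisher's latency in milliseconds.
public class MergeRefillSim {

    // mergeSequential-style: a freed slot is reused only when the HEAD
    // (earliest-subscribed) inner completes; non-head inners that finish
    // early are drained — and their slots refilled — only after the head.
    static double headOnlyRefill(double[] durations, int concurrency) {
        Deque<Double> inflight = new ArrayDeque<>(); // finish times, in subscription order
        int next = 0;
        double now = 0;
        while (next < durations.length && inflight.size() < concurrency)
            inflight.addLast(now + durations[next++]);
        while (!inflight.isEmpty()) {
            now = Math.max(now, inflight.peekFirst()); // wait for the head to finish
            // drain the head plus any inners that completed while it was pending;
            // each drained inner triggers one request(1) for a new source
            while (!inflight.isEmpty() && inflight.peekFirst() <= now) {
                inflight.removeFirst();
                if (next < durations.length)
                    inflight.addLast(now + durations[next++]);
            }
        }
        return now;
    }

    // merge-style (unordered): a freed slot is reused as soon as ANY inner completes.
    static double anyCompletionRefill(double[] durations, int concurrency) {
        PriorityQueue<Double> inflight = new PriorityQueue<>();
        int next = 0;
        double now = 0;
        while (next < durations.length && inflight.size() < concurrency)
            inflight.add(now + durations[next++]);
        while (!inflight.isEmpty()) {
            now = inflight.poll();
            if (next < durations.length)
                inflight.add(now + durations[next++]);
        }
        return now;
    }

    public static void main(String[] args) {
        double[] d = new double[55];   // 55 partition queries
        Arrays.fill(d, 3.0);           // typical ~3ms each
        d[20] = 982.0;                 // one slow partition in the fourth batch
        System.out.println(headOnlyRefill(d, 6));      // 1006.0
        System.out.println(anyCompletionRefill(d, 6)); // 991.0
    }
}
```

The wall-clock gap in this toy input is modest because only cheap queries trail the outlier — which matches the report, where the damage shows up mainly as 981ms of idle concurrency slots rather than a proportionally larger total.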
My use case, based on the Cosmos DB diagnostics log
readMany with 207 items spanning 55 partition key ranges, fluxSequentialMergeConcurrency = 6.
One partition (pkRangeId=92) had a transitTime of 982ms (backend latency was only 0.69ms — the delay was network/IO, not Cosmos server-side). The other 5 concurrent partition queries completed in 1-13ms each.
Expected: the 5 freed slots immediately dispatch the remaining 27 partition queries. Total wall clock ≈ 982ms + 27 × 3ms / 5 ≈ 1,000ms.
Actual: no new partition queries were dispatched for 981ms, until pkRangeId=92 completed; only then did the remaining 27 queries start. Total wall clock = 1,020ms, of which 981ms was spent with 5 concurrency slots idle.
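The expected figure is just the slow-partition latency plus the time for the freed slots to drain the rest in waves. A back-of-envelope helper (hypothetical names; assumes the remaining queries drain in ceil(remaining / freeSlots) waves of roughly equal latency):

```java
public class WallClockEstimate {
    // slow-partition latency plus ceil(remaining / freeSlots) waves of perQueryMs
    static double estimate(double slowMs, int remaining, int freeSlots, double perQueryMs) {
        int waves = (remaining + freeSlots - 1) / freeSlots;
        return slowMs + waves * perQueryMs;
    }

    public static void main(String[] args) {
        // 982ms outlier, 27 queries left, 5 free slots, ~3ms each
        System.out.println(estimate(982.0, 27, 5, 3.0)); // 1000.0
    }
}
```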
Timeline (55 cross-partition queries, 6 point reads omitted):
Time (ms)       In-flight    Event
─────────────────────────────────────────────────────────────────────────────
0.0             6            Initial 6 partition queries dispatched
0.5 – 17.0      6            Queries complete, head advances, slots refilled — working correctly
17.0            6            pk=92 starts as part of a new batch of 6 (pk=91,92,79,80,69,70)
17.5 – 18.5     6→5          pk=91 (head, before 92) completes; pk=95 dispatched to refill the slot
19.0 – 31.0     5→4→3→2→1    pk=70,80,79,95,69 complete — slots NOT refilled (pk=92 is now head)
31.0 – 999.0    1            Only pk=92 in-flight. 5 slots idle. 27 queries waiting.
999.0           6            pk=92 completes; head advances through 5 already-done inners;
                             s.request() called 6 times; 6 new queries dispatched
999.5 – 1020    6→…          Remaining 27 queries drain normally
Root cause
readMany does not require ordered results — it returns items as a collection with no ordering guarantee. However, it reuses ParallelDocumentQueryExecutionContext, which relies on Flux.mergeSequential to provide the deterministic ordering that general cross-partition queries need (important for continuation tokens / pagination).
readMany therefore pays the head-of-line blocking cost of mergeSequential for ordering semantics it does not need.
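One possible direction, sketched below — not a vetted patch. `obs`, `fluxConcurrency`, and `prefetch` are the existing names quoted above; `orderingRequired` is a hypothetical flag. Reactor's unordered Flux.merge overload takes the same concurrency/prefetch parameters and refills a slot as soon as any inner completes:

```java
// Hypothetical sketch: pick the merge operator based on whether the caller
// actually needs deterministic cross-partition ordering.
Flux<FeedResponse<T>> merged = orderingRequired
        ? Flux.mergeSequential(obs, fluxConcurrency, prefetch) // ordered; head-of-line blocking
        : Flux.merge(obs, fluxConcurrency, prefetch);          // unordered; refills on any completion
```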