Description
Describe the bug
CosmosAsyncContainer.readMany uses ParallelDocumentQueryExecutionContext for partition key ranges that cover 2+ of the requested items. This execution context merges the per-partition query results using Flux.mergeSequential(obs, fluxConcurrency, prefetch).
Flux.mergeSequential preserves upstream ordering by requesting a new item from the source iterator only when the head (earliest-in-order) inner publisher completes (FluxMergeSequential.drain() → s.request(1)). When a non-head inner publisher completes, no new subscription is triggered — the freed concurrency slot sits idle.
For readMany, this causes severe head-of-line blocking: a single slow partition response stalls all remaining partition fetches, even while concurrency slots are available.
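The effect of that refill policy can be illustrated with a small discrete-event sketch in plain Java (no Reactor dependency). This models only the slot-refill rule — refill on head completion vs. refill on any completion — not the actual FluxMergeSequential internals; the class and method names are hypothetical. With 55 queries of ~3ms each and one 982ms outlier, the head-only policy leaves slots idle behind the outlier:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.PriorityQueue;

// Discrete-event sketch (not the real Reactor internals) of two slot-refill
// policies. Each duration is one inner publisher's latency in milliseconds.
public class MergeRefillSim {

    // mergeSequential-style: a freed slot is reused only when the HEAD
    // (earliest-subscribed) inner completes; non-head inners that finish
    // early are drained — and their slots refilled — only after the head.
    static double headOnlyRefill(double[] durations, int concurrency) {
        Deque<Double> inflight = new ArrayDeque<>(); // finish times, in subscription order
        int next = 0;
        double now = 0;
        while (next < durations.length && inflight.size() < concurrency)
            inflight.addLast(now + durations[next++]);
        while (!inflight.isEmpty()) {
            now = Math.max(now, inflight.peekFirst()); // wait for the head to finish
            // drain the head plus any inners that completed while it was pending;
            // each drained inner triggers one request(1) for a new source
            while (!inflight.isEmpty() && inflight.peekFirst() <= now) {
                inflight.removeFirst();
                if (next < durations.length)
                    inflight.addLast(now + durations[next++]);
            }
        }
        return now;
    }

    // merge-style (unordered): a freed slot is reused as soon as ANY inner completes.
    static double anyCompletionRefill(double[] durations, int concurrency) {
        PriorityQueue<Double> inflight = new PriorityQueue<>();
        int next = 0;
        double now = 0;
        while (next < durations.length && inflight.size() < concurrency)
            inflight.add(now + durations[next++]);
        while (!inflight.isEmpty()) {
            now = inflight.poll();
            if (next < durations.length)
                inflight.add(now + durations[next++]);
        }
        return now;
    }

    public static void main(String[] args) {
        double[] d = new double[55];   // 55 partition queries
        Arrays.fill(d, 3.0);           // typical ~3ms each
        d[20] = 982.0;                 // one slow partition in the fourth batch
        System.out.println(headOnlyRefill(d, 6));      // 1006.0
        System.out.println(anyCompletionRefill(d, 6)); // 991.0
    }
}
```

The wall-clock gap in this toy input is modest because only cheap queries trail the outlier — which matches the report, where the damage shows up mainly as 981ms of idle concurrency slots rather than a proportionally larger total.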
My use case, based on the Cosmos DB diagnostics log
readMany with 207 items spanning 55 partition key ranges, fluxSequentialMergeConcurrency = 6.
One partition (pkRangeId=92) had a transitTime of 982ms (backend latency was only 0.69ms — the delay was network/IO, not Cosmos server-side). The other 5 concurrent partition queries completed in 1-13ms each.
Expected: the 5 freed slots immediately dispatch the remaining 27 partition queries. Total wall clock ≈ 982ms + 27 × 3ms / 5 ≈ 1,000ms.
Actual: no new partition queries were dispatched for 981ms, until pkRangeId=92 completed; only then did the remaining 27 queries start. Total wall clock = 1,020ms, of which 981ms was spent with 5 concurrency slots idle.
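The expected figure is just the slow-partition latency plus the time for the freed slots to drain the rest in waves. A back-of-envelope helper (hypothetical names; assumes the remaining queries drain in ceil(remaining / freeSlots) waves of roughly equal latency):

```java
public class WallClockEstimate {
    // slow-partition latency plus ceil(remaining / freeSlots) waves of perQueryMs
    static double estimate(double slowMs, int remaining, int freeSlots, double perQueryMs) {
        int waves = (remaining + freeSlots - 1) / freeSlots;
        return slowMs + waves * perQueryMs;
    }

    public static void main(String[] args) {
        // 982ms outlier, 27 queries left, 5 free slots, ~3ms each
        System.out.println(estimate(982.0, 27, 5, 3.0)); // 1000.0
    }
}
```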
Timeline (55 cross-partition queries, 6 point reads omitted):
Time (ms)       In-flight    Event
─────────────────────────────────────────────────────────────────────────────
0.0             6            Initial 6 partition queries dispatched
0.5 – 17.0      6            Queries complete, head advances, slots refilled — working correctly
17.0            6            pk=92 starts as part of a new batch of 6 (pk=91,92,79,80,69,70)
17.5 – 18.5     6→5          pk=91 (head, before 92) completes; pk=95 dispatched to refill the slot
19.0 – 31.0     5→4→3→2→1    pk=70,80,79,95,69 complete — slots NOT refilled (pk=92 is now head)
31.0 – 999.0    1            Only pk=92 in-flight. 5 slots idle. 27 queries waiting.
999.0           6            pk=92 completes; head advances through 5 already-done inners;
                             s.request() called 6 times; 6 new queries dispatched
999.5 – 1020    6→…          Remaining 27 queries drain normally
Root cause
readMany does not require ordered results — it returns items as a collection with no ordering guarantee. However, it reuses ParallelDocumentQueryExecutionContext, which relies on Flux.mergeSequential to provide the deterministic ordering that general cross-partition queries need (important for continuation tokens / pagination).
readMany therefore pays the head-of-line blocking cost of mergeSequential for ordering semantics it does not need.
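One possible direction, sketched below — not a vetted patch. `obs`, `fluxConcurrency`, and `prefetch` are the existing names quoted above; `orderingRequired` is a hypothetical flag. Reactor's unordered Flux.merge overload takes the same concurrency/prefetch parameters and refills a slot as soon as any inner completes:

```java
// Hypothetical sketch: pick the merge operator based on whether the caller
// actually needs deterministic cross-partition ordering.
Flux<FeedResponse<T>> merged = orderingRequired
        ? Flux.mergeSequential(obs, fluxConcurrency, prefetch) // ordered; head-of-line blocking
        : Flux.merge(obs, fluxConcurrency, prefetch);          // unordered; refills on any completion
```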