diff --git a/pip/pip-474.md b/pip/pip-474.md
new file mode 100644
index 0000000000000..1de9e0d7a08a0
--- /dev/null
+++ b/pip/pip-474.md
@@ -0,0 +1,397 @@

# PIP-474: Key_Shared Hot Key Overflow Mechanism

# Background Knowledge

## Key_Shared Subscription

Key_Shared is a subscription type provided by Pulsar. It distributes messages in a Topic to multiple Consumers based on the Key dimension: messages with the same Key are always routed to the same Consumer, ensuring per-key ordering; messages with different Keys can be distributed to different Consumers, enabling parallel consumption.

Specifically, the Broker computes a Murmur3 hash over each message's sticky key, maps it to the range `[0, 65535]`, and then uses `StickyKeyConsumerSelector` to bind each hash value to a Consumer. All Consumers under a Key_Shared subscription share a single `ManagedCursor`, which maintains the global consumption progress (the `mark-delete` position).

## What Problem It Solves

In traditional sequential consumption scenarios, achieving per-key ordering typically requires routing messages of the same Key to the same partition, which is then processed by that partition's sole consumer in order. This means **consumption parallelism is limited by the number of partitions**—to add consumers you must add partitions, but the partition count is a fundamental Topic configuration that cannot be adjusted freely or frequently.

Key_Shared subscriptions break this constraint: within the same partition, messages with different Keys can be distributed to different Consumers. Consumption parallelism depends only on the number of Keys and the number of Consumers, and is independent of the partition count.

## What It Achieves

1. **Multiple Key-Consumer Binding Strategies**: Provides three Selector implementations—`HashRangeAutoSplit` (evenly split hash ranges automatically), `HashRangeExclusive` (manually specify ranges), and `ConsistentHashing` (consistent hashing)—to suit different scenarios.

2. **Ordering Guarantee During Consumer Scaling**: In AUTO_SPLIT mode (and when out-of-order delivery is not enabled), the Draining Hashes mechanism ([PIP-379](https://github.com/apache/pulsar/issues/21199)) ensures ordering when Consumers join or leave and hash ranges are reassigned. Affected hashes enter a "draining" state—the new Consumer must wait until the old Consumer has finished processing all pending messages for that hash before it starts receiving, ensuring no reordering during scaling.

## But It Has a Significant Problem

**A hot Key (or a subset of Keys stuck in processing) will block consumption of *all other Keys*.**

This problem stems from the interaction of three mechanisms within the Key_Shared Dispatcher:

**Replay Queue**: When a message cannot be dispatched (Consumer has no permits, hash is blocked, etc.), its position is placed into the Replay queue (`MessageRedeliveryController`). Replay reads typically take priority over Normal reads—as long as the Replay queue is non-empty and has dispatchable messages, no new messages are read from the Topic. However, there is a look-ahead exception: when all messages in a replay round are filtered out (e.g., due to consumers lacking permits), the dispatcher skips the next replay and proceeds directly to a Normal Read to fetch new messages forward (bounded by `effectiveLookAheadLimit`).

**`containsStickyKeyHash` Ordering Check**: If a hash has unprocessed messages in the Replay queue, new messages for that hash read via Normal Read are also intercepted and redirected into Replay—new messages cannot be dispatched before old ones are processed. This is the core mechanism for guaranteeing per-key ordering.

**Look-Ahead Limit**: When the Replay queue size reaches `effectiveLookAheadLimit` (default: `min(perConsumer × consumerCount, perSubscription)`), Normal Read is completely disabled and the Cursor stops reading new messages forward.
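To make the interaction of these three mechanisms concrete, here is a minimal, self-contained model. The class and field names are illustrative only, not the broker's actual code; it captures just the three rules above: replay-size-dependent batch sizing, the normal-read cutoff, and the redirect of same-hash messages into replay.

```java
// Simplified model of the three mechanisms above. NOT the broker's actual code:
// DispatcherModel and its fields are illustrative names; the real dispatcher has
// many more inputs (permits, draining hashes, batching, etc.).
import java.util.Set;

public class DispatcherModel {

    // Mirrors the effective look-ahead limit derived from the
    // keySharedLookAheadMsgInReplayThreshold* settings.
    private final int effectiveLookAheadLimit;
    // Sticky key hashes that currently have pending entries in the replay queue.
    private final Set<Integer> hashesInReplay;
    private int replaySize;

    public DispatcherModel(int effectiveLookAheadLimit, Set<Integer> hashesInReplay, int replaySize) {
        this.effectiveLookAheadLimit = effectiveLookAheadLimit;
        this.hashesInReplay = hashesInReplay;
        this.replaySize = replaySize;
    }

    // Look-Ahead Limit: once replay reaches the limit, Normal Read is disabled.
    boolean isNormalReadAllowed() {
        return replaySize < effectiveLookAheadLimit;
    }

    // Normal Read batches shrink as the replay queue grows.
    int getMaxEntriesReadLimit() {
        return Math.max(effectiveLookAheadLimit - replaySize, 1);
    }

    // containsStickyKeyHash check: a newly read message whose hash already has
    // pending replay entries is itself redirected into replay to preserve order.
    boolean mustRedirectToReplay(int stickyKeyHash) {
        boolean blocked = hashesInReplay.contains(stickyKeyHash);
        if (blocked) {
            replaySize++; // blocked messages enlarge replay further
        }
        return blocked;
    }

    public static void main(String[] args) {
        // Hash 7 is stuck in replay; the replay queue is close to the limit.
        DispatcherModel model = new DispatcherModel(20_000, Set.of(7), 19_000);
        System.out.println("normal read allowed: " + model.isNormalReadAllowed()); // true
        System.out.println("normal read batch:   " + model.getMaxEntriesReadLimit()); // 1000
        System.out.println("hash 7 redirected:   " + model.mustRedirectToReplay(7)); // true
    }
}
```

Note how `mustRedirectToReplay` both blocks the message and grows the replay queue; that is exactly the feedback loop analyzed in the Motivation section below.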
## A Concrete Scenario

Consider a typical consumption scenario:

**Configuration**:

- Partitioned Topic with 200 partitions, 100 new Keys per second, each Key sends at 50 TPS for 100 seconds then stops, with a maximum of `100 × 100 = 10,000` active Keys concurrently
- Steady-state total throughput: `10,000 × 50 = 500,000 msg/s`
- Keys are evenly distributed by hash across 200 partitions, ~50 active Keys per partition, `50 × 50 = 2,500 msg/s` per partition
- 20 Consumers, each Consumer connects to all 200 partitions (under Key_Shared subscription, the client automatically creates an internal consumer per partition)
- Each partition's Dispatcher sees 20 Consumers, each responsible for ~2–3 Keys in that partition
- Each message takes 15ms to process (normal business processing speed)
- Default configuration: `receiverQueueSize = 1000`, `keySharedLookAheadMsgInReplayThresholdPerConsumer = 2000`, `keySharedLookAheadMsgInReplayThresholdPerSubscription = 20000`
- Per-partition `effectiveLookAheadLimit = min(2000 × 20, 20000) = 20,000`

### Normal Operation

Per-Key consumption capacity covers the per-Key production rate (`1000 ms / 15 ms ≈ 66.7 msg/s > 50 TPS`). Each partition has 20 Consumers processing ~50 Keys in parallel, yielding per-partition throughput of `50 × 66.7 ≈ 3,335 msg/s > 2,500 msg/s`.

- Replay queues across all 200 partitions remain essentially empty; Normal Read advances normally
- **All 10,000 active Keys are consumed in real time; the system is healthy**

### Single-Key Hot Spot / Partial Key Processing Stall

Assume Consumer-1 hangs due to a downstream service timeout (it cannot ack any messages). Consumer-1 connects to all 200 partitions and is responsible for ~2–3 Keys in each partition.

**Within each partition**:

**T = 0s onwards**: After Consumer-1's permits are exhausted, messages for its 2–3 Keys can no longer be dispatched and start flooding the partition's Replay queue. Each partition injects messages into Replay at ~125 msg/s (2.5 keys × 50 TPS)—messages that will never be acked.

**T = 0–3 minutes**: Replay queues across all 200 partitions grow monotonically, **simultaneously** and **independently**, at ~125 msg/s. The `containsStickyKeyHash` check ensures that new messages for these hashes read via Normal Read are also intercepted into Replay. As Replay grows, `getMaxEntriesReadLimit()` returns progressively smaller values, Normal Read batches are compressed, and throughput for other Keys gradually degrades.

**T ≈ 2.7 minutes — the fatal tipping point**: Replay queues across all 200 partitions fill to the 20,000-entry limit almost simultaneously (`20000 / 125 ≈ 160s ≈ 2.7min`). `isNormalReadAllowed()` returns false. Normal Read is permanently disabled.

**T > 3 minutes — global blackout**: Dispatchers across all 200 partitions enter a pure Replay loop—but Replay is dominated by messages destined for Consumer-1, which is already stuck. **The remaining 19 Consumers have abundant free permits but cannot receive new messages on any partition. The vast majority of the 10,000 active Keys are starved.**

```
A single Consumer's failure spreads to all partitions in under 3 minutes,
turning "500K msg/s real-time consumption" into "global blackout."
```

This is the core problem this PIP aims to solve.
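The tipping point above follows directly from the scenario parameters; the arithmetic can be checked with a small stand-alone calculation:

```java
// Back-of-envelope check of the scenario above: how long until each partition's
// replay queue hits the effective look-ahead limit once Consumer-1 stops acking.
public class BlackoutEstimate {
    public static void main(String[] args) {
        int partitions = 200;
        double stuckKeysPerPartition = 10_000.0 / partitions / 20; // ~2.5 keys owned by Consumer-1
        double perKeyRate = 50;                                    // msg/s per key
        double replayInflowPerPartition = stuckKeysPerPartition * perKeyRate; // ~125 msg/s

        int perConsumerThreshold = 2_000;
        int perSubscriptionThreshold = 20_000;
        int consumers = 20;
        int effectiveLookAheadLimit =
                Math.min(perConsumerThreshold * consumers, perSubscriptionThreshold); // 20,000

        double secondsToBlackout = effectiveLookAheadLimit / replayInflowPerPartition; // ~160 s
        System.out.printf("replay inflow per partition: %.0f msg/s%n", replayInflowPerPartition);
        System.out.printf("normal read disabled after:  %.0f s (~%.1f min) on all %d partitions%n",
                secondsToBlackout, secondsToBlackout / 60, partitions);
    }
}
```

The printed 160 s matches the ≈2.7-minute tipping point in the timeline.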
# Motivation

As illustrated above, the `containsStickyKeyHash` ordering check forms a **self-reinforcing positive feedback loop**:

```
hash in replay → Normal Read messages for same hash are blocked → enter replay → replay grows
→ Normal Read batch shrinks → replay reaches limit faster → Normal Read stops → global blackout
```

The original design intent of this check is entirely correct (M11 must not be dispatched before M10 is consumed). But under traffic skew or when some Consumers stall, this ordering mechanism becomes an avalanche accelerator. **A few Keys' problem escalates into a catastrophe for all Keys.**

## Why Not Simply Increase the Replay Limit

Increasing `keySharedLookAheadMsgInReplayThresholdPerConsumer` and `keySharedLookAheadMsgInReplayThresholdPerSubscription` may seem like the most straightforward mitigation, but there are three structural issues:

1. **Amplified I/O waste**: A larger replay queue means more stuck-Key messages are read from BK each Replay Read cycle, yet all of them are returned unchanged because the target Consumer has no permits—pure I/O waste.
2. **Normal Read batch compression**: `getMaxEntriesReadLimit()` returns `max(effectiveLimit - replaySize, 1)`—larger replay means smaller Normal Read, merely delaying the blackout rather than solving it.
3. **Mark-delete gap inflation**: Stuck-Key messages remain unacked for extended periods, preventing mark-delete from advancing. `individualDeletedMessages` grows continuously and may exceed `managedLedgerMaxUnackedRangesToPersist`, causing confirmed ranges to be discarded and messages to be redelivered; on broker restart, the system must rescan all gaps from mark-delete.

**We need a mechanism to completely remove stuck-Key messages from the main dispatch path so that neither Normal Read nor mark-delete is affected by them.**

# Goals

## In Scope

- Completely remove stuck-Key messages from the main dispatch path so that Normal Read is no longer blocked and other Consumers are unaffected
- Allow the original cursor's mark-delete to advance normally, eliminating `individualDeletedMessages` gaps caused by stuck Keys
- Maintain at-least-once delivery guarantee
- Maintain per-key message ordering guarantee
- Completely transparent to producers and consumers—no client-side changes required
- Zero overhead during normal operation; detection is only triggered when replay queue pressure exceeds the watermark

## Out of Scope

- Improving single-Key consumption throughput (limited by single-Consumer processing speed; breaking through requires physical topic splitting, etc.)
- Parallel processing of the same Key by multiple Consumers (would break per-key ordering)
- Changes to other subscription types (Exclusive, Failover, Shared)
+ +``` + ┌──────────────────────────────────────────┐ + │ PersistentStickyKeyDispatcher │ + │ │ + │ Original Cursor ───► filterAndGroupEntriesForDispatching()│ + │ (main topic ML) ├─ normal key ──► dispatch normally │ + │ └─ hot key ──► OverflowManager │ + │ │ │ + │ ┌───────────────▼───────────────┐ │ + │ │ Overflow ML │ │ + │ │ (per-subscription) │ │ + │ │ cursor ──► on-demand dispatch │ │ + │ └───────────────────────────────┘ │ + └──────────────────────────────────────────┘ +``` + +## Data Flow + +1. **Read messages from the original cursor** (Normal Read or Replay Read), yielding a batch of entries mixing normal keys and hot keys. + +2. **Divert in `filterAndGroupEntriesForDispatching`**: normal key messages are dispatched to their corresponding Consumers as usual; messages for Keys already marked as hot are written to the Overflow ML (BookKeeper). + +3. **Overflow write succeeds → ack original cursor**: the corresponding positions on the original cursor are individually acked, allowing mark-delete to advance past these positions, and messages for this hash are removed from the replay queue. + +4. **On-demand dispatch from Overflow**: when the target Consumer has permits, read from the Overflow cursor and dispatch. Other Consumers read normally from the original cursor, completely unaffected. + +## Core Guarantees + +**At-least-once**: Write to Overflow BK first → ack original cursor only after success. Any intermediate failure (overflow write failure, broker crash) means the original cursor has not advanced, so messages will be re-read. At worst, duplicates may appear in Overflow—which is permitted under at-least-once semantics. + +**Per-key ordering**: Messages for the same hash in the Overflow ML are written in the original read order, and the Overflow cursor reads them in order. The Draining Hashes mechanism works normally during Consumer changes. + +**Mark-delete advancement**: Hot key messages are acked on the original cursor and no longer block mark-delete. + +## Hot Key Detection Strategy + +A three-level "watermark + Top-N + dominance" detection strategy to avoid false positives in normal scenarios: + +1. **Watermark threshold**: Detection is only triggered when `replaySize ≥ watermark × effectiveLookAheadLimit` (default 80%)—zero overhead during normal operation. +2. **Top-N selection**: From `MessageRedeliveryController.hashesRefCount`, select the N hashes with the highest refCount (default 10). +3. **Dominance check**: The total message count of Top-N hashes must account for ≥50% of the replay queue—this excludes false positives from scenarios like Consumer restarts or uniformly slow consumption. + +## Overflow ML Granularity + +Each subscription has its own independent Overflow ML, named `{topic}__overflow__{subscription}`. + +Reason: Different subscriptions have different cursor positions. If they shared an Overflow ML, messages for the same Key written from different starting points would interleave in the append-only ML, breaking per-key ordering. Per-subscription write amplification is acceptable—this is a fallback behavior for extreme hot-key scenarios with limited data volume. 
## Overflow ML Granularity

Each subscription has its own independent Overflow ML, named `{topic}__overflow__{subscription}`.

Reason: Different subscriptions have different cursor positions. If they shared an Overflow ML, messages for the same Key written from different starting points would interleave in the append-only ML, breaking per-key ordering. Per-subscription write amplification is acceptable—this is a fallback behavior for extreme hot-key scenarios with limited data volume.

# Detailed Design

## Overflow Entry Format

Each entry written to the Overflow ML contains a fixed header plus the complete payload of the original entry:

```
[4B magic][4B stickyKeyHash][8B originalLedgerId][8B originalEntryId][payload]
```

- `magic`: fixed value `0x4F564652` ("OVFR"), used for validation
- `stickyKeyHash`: the original message's sticky key hash, used for routing to the correct Consumer during dispatch
- `originalLedgerId` + `originalEntryId`: original position, used for log tracing and deduplication
- `payload`: complete content of the original entry (including metadata and body)

Uses **single-entry writes** (each message is an independent BK `addEntry` call) rather than batch writes. Reasons:

- After each write succeeds, the corresponding position on the original cursor must be immediately acked; batch writes would require waiting for all to succeed before acking, increasing latency and complexity.
- Hot-Key message arrival is inherently a continuous stream, naturally suited for streaming single-entry processing.
- BK `addEntry` supports pipelining—single-entry writes do not imply synchronous blocking.
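To pin down the byte layout, here is a self-contained sketch of encoding and decoding the 24-byte overflow header with `java.nio.ByteBuffer`. It is a sketch only; the actual implementation would more likely operate on Netty buffers taken from the original entry.

```java
// Sketch of the overflow entry framing described above: big-endian, fixed 24-byte
// header followed by the original entry's serialized bytes. Illustrative only.
import java.nio.ByteBuffer;

public final class OverflowEntryCodec {
    static final int MAGIC = 0x4F564652; // "OVFR"
    static final int HEADER_LEN = 4 + 4 + 8 + 8;

    static ByteBuffer encode(int stickyKeyHash, long originalLedgerId,
                             long originalEntryId, byte[] originalEntryPayload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LEN + originalEntryPayload.length);
        buf.putInt(MAGIC);
        buf.putInt(stickyKeyHash);
        buf.putLong(originalLedgerId);
        buf.putLong(originalEntryId);
        buf.put(originalEntryPayload);
        buf.flip();
        return buf;
    }

    record Decoded(int stickyKeyHash, long originalLedgerId, long originalEntryId, byte[] payload) {}

    static Decoded decode(ByteBuffer buf) {
        // The magic value doubles as a cheap validity check during dispatch.
        if (buf.remaining() < HEADER_LEN || buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("not an overflow entry");
        }
        int hash = buf.getInt();
        long ledgerId = buf.getLong();
        long entryId = buf.getLong();
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return new Decoded(hash, ledgerId, entryId, payload);
    }
}
```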
## Diversion and Ack Protocol (Write-then-Ack)

```
for each hot key entry read from original cursor:
  1. asyncAddEntry(overflowML, buildOverflowEntry(entry))
  2. on success:
     a. originalCursor.asyncDelete(entry.position)      // individual ack
     b. replayController.removeMessage(entry.position)  // remove from replay
  3. on failure:
     // do not ack original cursor; message stays in replay for next retry
     log.warn("Overflow write failed, will retry")
```

Key invariant: **acking the original cursor strictly follows successful Overflow write**. This guarantees:

- Write failure → message remains in the original cursor's replay, no loss
- Broker crash recovery → messages not advanced on the original cursor will be re-read, potentially creating duplicates in Overflow (permitted by at-least-once)
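The same protocol expressed as compilable Java, purely for illustration: the `OverflowLedger`, `AckableCursor`, and `ReplayQueue` interfaces below are simplified stand-ins for the broker's `ManagedLedger`, `ManagedCursor`, and `MessageRedeliveryController`, and the overflow header encoding is omitted.

```java
// Illustrative write-then-ack flow. The interfaces are placeholders for the real
// broker types; the point is only the ordering: the original cursor is acked
// strictly after the overflow write has succeeded.
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

public class OverflowDiverter {

    interface OverflowLedger {
        CompletableFuture<Void> asyncAddEntry(ByteBuffer entry);
    }

    interface AckableCursor {
        void asyncDelete(long ledgerId, long entryId); // individual ack
    }

    interface ReplayQueue {
        void removeMessage(long ledgerId, long entryId);
    }

    private final OverflowLedger overflowLedger;
    private final AckableCursor originalCursor;
    private final ReplayQueue replayQueue;

    OverflowDiverter(OverflowLedger overflowLedger, AckableCursor originalCursor, ReplayQueue replayQueue) {
        this.overflowLedger = overflowLedger;
        this.originalCursor = originalCursor;
        this.replayQueue = replayQueue;
    }

    CompletableFuture<Void> divert(long ledgerId, long entryId, byte[] overflowEntryBytes) {
        // Header framing omitted for brevity; see the codec sketch above.
        return overflowLedger.asyncAddEntry(ByteBuffer.wrap(overflowEntryBytes)).thenRun(() -> {
            // Only reached on success: now it is safe to release the original position.
            originalCursor.asyncDelete(ledgerId, entryId);
            replayQueue.removeMessage(ledgerId, entryId);
        });
        // On failure the returned future completes exceptionally, nothing is acked,
        // and the entry stays in the replay queue to be retried on a later cycle.
    }
}
```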
## Dispatching from Overflow

The Overflow ML has its own ManagedCursor. In each `readMoreEntries()` cycle, the Dispatcher checks Overflow in addition to Normal Read and Replay Read:

```
if (overflowManager.hasActiveOverflow()) {
    for each hot hash with overflow entries:
        Consumer consumer = selector.select(hash);
        if (consumer != null && consumer.getAvailablePermits() > 0) {
            overflowCursor.asyncReadEntries(batchSize, ...)
            // after reading, parse header and dispatch to consumer
            // after consumer acks, overflowCursor.asyncDelete(position)
        }
}
```

Overflow dispatch is an **independent path** from Normal Read—it does not consume Normal Read batch quotas and does not affect throughput for other Keys.

## Hot Key Resolution and Overflow Cleanup

A hot key mark is cleared when all of the following conditions are met:

1. The hash's refCount in the replay queue drops to 0 (Consumer has recovered)
2. A cooldown period has elapsed (default 60s, to prevent flapping)
3. All pending messages for this hash in the Overflow cursor have been fully dispatched

After resolution:

- New messages for this Key return to the normal dispatch path
- When the Overflow ML has no remaining messages to consume, it is closed and deleted

## Interaction with Draining Hashes

During Consumer changes, the Draining Hashes mechanism requires affected hashes to be fully consumed by the old Consumer before switching to the new one. For hashes already diverted to Overflow:

- Messages for this hash in the Overflow cursor must also be fully consumed by the old Consumer
- `DrainingHashesTracker.shouldBlockStickyKeyHash()` must consider pending status in both replay and overflow

## Public-facing Changes

### Public API

No new REST APIs. The Overflow ML is automatically managed by the broker and is transparent externally.

### Binary Protocol

No changes. All logic is on the broker side; no client protocol changes are involved.

### Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `keySharedHotKeyOverflowEnabled` | boolean | `false` | Master switch, disabled by default |
| `keySharedHotKeyDetectionWatermark` | double | `0.80` | Replay queue watermark to trigger detection (ratio of `effectiveLookAheadLimit`) |
| `keySharedHotKeyTopN` | int | `10` | Number of candidate hot hashes selected per detection |
| `keySharedHotKeyDominanceThreshold` | double | `0.50` | Minimum ratio of Top-N message count to replay queue; below this, not classified as hot |
| `keySharedHotKeyOverflowDispatchBatch` | int | `100` | Maximum messages per batch dispatched from Overflow |
| `keySharedHotKeyReleaseCooldownMs` | long | `60000` | Cooldown period (ms) after hot key resolution, to prevent flapping |

All parameters are `dynamic = true`, adjustable at runtime without broker restart.
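Since the settings are intended to be dynamic, enabling the feature on a running cluster could look like the sketch below. It uses the existing `PulsarAdmin` dynamic-configuration API; the configuration keys themselves are the new names proposed by this PIP and do not exist in any released Pulsar version.

```java
// Sketch: enabling the proposed feature at runtime through the standard
// dynamic-configuration mechanism. The keys are the ones proposed in the table
// above and will only exist once this PIP is implemented.
import org.apache.pulsar.client.admin.PulsarAdmin;

public class EnableHotKeyOverflow {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Optionally tune detection sensitivity before flipping the master switch.
            admin.brokers().updateDynamicConfiguration("keySharedHotKeyDetectionWatermark", "0.80");
            admin.brokers().updateDynamicConfiguration("keySharedHotKeyTopN", "10");
            admin.brokers().updateDynamicConfiguration("keySharedHotKeyOverflowEnabled", "true");
        }
    }
}
```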
### CLI

No new CLI commands.

### Metrics

| Metric | Type | Tags | Description |
|--------|------|------|-------------|
| `pulsar_subscription_hot_key_overflow_total` | Counter | topic, subscription | Cumulative number of messages diverted to Overflow |
| `pulsar_subscription_hot_key_detected_total` | Counter | topic, subscription | Cumulative number of hot key detection triggers |
| `pulsar_subscription_hot_key_overflow_active` | Gauge | topic, subscription | Whether Overflow is active (0/1) |
| `pulsar_subscription_hot_key_overflow_pending` | Gauge | topic, subscription | Number of unconsumed messages in Overflow ML |

# Monitoring

- **`pulsar_subscription_hot_key_overflow_active == 1`**: A subscription is undergoing hot-key diversion; investigate stuck Consumers or abnormally high-frequency Keys on the business side.
- **`pulsar_subscription_hot_key_overflow_pending` growing continuously**: Messages in Overflow are not being consumed in time; the Consumer may have severely insufficient processing capacity or may be offline.
- **`pulsar_subscription_hot_key_overflow_total` rate spike**: A new hot-key event has occurred.

# Security Considerations

This PIP does not introduce new HTTP endpoints or protocol commands, and does not involve changes to the permission model. The Overflow ML is a ManagedLedger managed internally by the broker, following the same BK cluster, replication configuration, and authentication mechanisms as the original topic. There is no cross-tenant data access risk.

# Backward & Forward Compatibility

## Upgrade

- Disabled by default (`keySharedHotKeyOverflowEnabled=false`); behavior is unchanged after upgrade.
- Can be enabled via dynamic configuration without broker restart.
- No client protocol changes; clients require no upgrade.

## Downgrade / Rollback

- Setting `keySharedHotKeyOverflowEnabled` to `false` stops new diversions.
- Messages already written to the Overflow ML: due to the write-then-ack protocol, the original cursor positions have already been acked, so messages will not be re-consumed on the original path. Unconsumed messages in the Overflow ML will no longer be dispatched after downgrade—equivalent to message loss (downgrade is a lossy operation).
- Residual Overflow MLs can be manually cleaned up via `pulsar-admin` ManagedLedger management interfaces.
- If not cleaned up, they do not affect broker operation and only consume a small amount of BK storage.

## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations

The Overflow ML is a broker-local scheduling optimization mechanism and does not participate in Geo-Replication. Brokers in each region perform hot-key detection and diversion independently, with no cross-region consistency impact.

# Alternatives

## A. Increase Replay Queue Limit + Extend individualDeletedMessages

Increase `keySharedLookAheadMsgInReplayThresholdPerConsumer` and `keySharedLookAheadMsgInReplayThresholdPerSubscription` to delay the point at which `isNormalReadAllowed()` returns false, combined with enabling the roaring bitmap representation (`managedLedgerPersistIndividualAckAsLongArray=true` + `managedLedgerUnackedRangesOpenCacheSetEnabled=true`) and LZ4 compression to extend the persistence capacity of `individualDeletedMessages`, so that the system does not lose confirmed ranges to truncation at larger ack-gap scales.

This approach does not change any dispatch logic—it only enlarges capacity parameters and optimizes the persistence format. When stuck-key messages keep flooding in, the replay queue absorbs them with a larger capacity, while the `individualDeletedMessages` gaps produced by healthy-key acks are efficiently persisted via roaring bitmap.

## B. Auxiliary Cursor

Create an auxiliary ManagedCursor for the subscription, covering the same position range as the original cursor. When a hot key is detected, ack the hot key's messages on the original cursor (allowing mark-delete to advance), while not acking those messages on the auxiliary cursor (preserving their unconsumed state). When the target Consumer recovers, read and deliver from the auxiliary cursor.

**Operation Protocol**:

- Original cursor: ack all messages (including hot key messages), allowing mark-delete to advance normally
- Auxiliary cursor: only ack non-hot-key messages; hot key messages remain in an unacknowledged state

Through this dual-cursor collaboration, the original cursor's mark-delete can advance normally (eliminating replay pollution and gaps on the main path), while hot key messages retain their unconsumed state via the auxiliary cursor, to be delivered in order after the Consumer recovers.

## C. Client-Side Routing

Route hot key messages to a separate topic for processing, with client or Broker cooperation. Possible implementations include:

- After the Broker detects a hot key, it notifies the client via protocol that certain hashes need to "pause sending", and the client forwards those messages to a dedicated topic
- Application-level routing logic: the Consumer determines whether a received message is for a hot key; if so, it produces the message to an independent topic and acks the original message
- Future extension: the Broker reassigns hot hashes to less-loaded Consumers

This approach pushes hot key handling logic up to the client/application layer; the Broker side only needs to provide hot key notification capability.

## D. Pure In-Memory Parking Zone

Move hot key message positions from the replay queue to an independent in-memory Parking Zone, only reading from BK and dispatching when the Consumer has permits. Hot key messages no longer occupy replay queue capacity, and Normal Read resumes. However, messages remain in an "unacknowledged" state on the original cursor, so mark-delete does not advance.

## E. External Metadata Storage for Positions + Ack Original Cursor

An alternative implementation of the Auxiliary Cursor approach: hot key message positions are recorded in an external metadata store while the corresponding positions are acked on the original cursor; when the Consumer recovers, the messages are re-read from BK by position and dispatched. Because mark-delete advances, the original ledgers become eligible for deletion, so this approach additionally requires a mechanism to prevent ledger GC from removing data that is still referenced (see the comparison below).
## Comparative Analysis

### Problem Classification

Before comparing alternatives, we first clarify: among the cascading problems produced by Key_Shared when a Consumer is stuck, which ones **must** be solved by this PIP, and which can be mitigated through other means.

| # | Problem | Mechanism | Must This PIP Solve It? |
|---|---------|-----------|------------------------|
| P1 | **Healthy key starvation** | The remaining 19 consumers have abundant permits but cannot receive new messages | **Must** — This is the core business impact; consumption of all other keys is completely stalled |
| P2 | **individualDeletedMessages gap inflation, high persistence and broker restart cost** | Ack gaps grow continuously, may exceed `maxUnackedRangesToPersist`, causing confirmed ranges to be truncated and discarded, leading to message redelivery after broker restart | **Worth solving but not mandatory** — Can be mitigated later by supporting larger `maxUnackedRangesToPersist` values (combined with roaring bitmap + compression) |
| P3 | **mark-delete cannot advance, data cannot be deleted** | Stuck key messages remain unacked for extended periods, pinning the cursor | **Worth solving but not mandatory** — Can be mitigated by increasing disk capacity; hot keys typically do not stay the same forever—different keys take turns being hot, and mark-delete will naturally advance after the hot key subsides |
| P4 | **Replay Read I/O waste** | Each cycle reads stuck-key messages from BK, finds no permits, puts them back—O(replaySize) per cycle | **Not mandatory** — Can be optimized by other PIPs improving replay read efficiency (e.g., filtering by hash, skipping messages for consumers known to have no permits) |
| P5 | **Backlog/consumption position inaccuracy** (operational and user-friendliness) | Backlog includes non-consumable stuck-key messages; users see the consumption position not advancing and backlog growing continuously | **Not mandatory** — More of a user mental model issue. Other MQs typically use an offset-advancing + retry-queue/dead-letter-queue model where users are accustomed to seeing the offset move forward; under Pulsar's individual ack model, backlog semantics are inherently different |

**This PIP's baseline is solving P1 (healthy key starvation) and P2 (large-scale re-consumption after the ack gap exceeds the limit)**—these are the only two problems that directly cause business harm. If other problems can be solved along the way, that's even better, but a scheme should not be rejected simply because it doesn't solve secondary problems.

### Core Problem Resolution Capability

| Problem | Overflow ML (PIP-474) | A. Enlarge Replay | B. Auxiliary Cursor | C. Client-Side Routing | D. Pure In-Memory Parking Zone | E. Metadata Storage for Positions |
|---------|:---:|:---:|:---:|:---:|:---:|:---:|
| **P1 Healthy key starvation** | ✅ Hot hashes removed from replay, Normal Read resumes | ❌ Only delays; sustained hot key eventually still causes blackout | ✅ Original cursor advances, replay cleared | ❌ Stuck consumer cannot self-rescue | ✅ Hot messages moved out of replay, Normal Read resumes | ❌ Ledger GC causes data loss |
| P2 Ack gap inflation | ✅ mark-delete advances, gaps eliminated | ⚠️ Extends persistence capacity via roaring bitmap | ⚠️ Original cursor solved, but auxiliary cursor produces new gaps | ❌ | ❌ mark-delete does not advance, gaps keep growing | ⚠️ Conflicts with Ledger GC |
| P3 mark-delete stuck | ✅ Hot messages acked on original cursor | ❌ | ✅ Hot messages acked on original cursor | ❌ | ❌ Messages remain unacknowledged | ⚠️ Can advance after ack, but GC causes message loss |
| P4 Replay I/O waste | ✅ No stuck messages in replay | ❌ Worse—larger replay means more waste per cycle | ✅ No waste on main path | ❌ | ✅ Hot messages not in replay | ❌ |
| P5 Backlog accuracy | ✅ | ❌ | ⚠️ Original subscription accurate, but auxiliary cursor has its own backlog | ❌ | ❌ mark-delete does not advance | ❌ |

### Cost Introduced by Each Alternative

| Dimension | Overflow ML (PIP-474) | A. Enlarge Replay | B. Auxiliary Cursor | C. Client-Side Routing | D. Pure In-Memory Parking Zone | E. Metadata Storage for Positions |
|-----------|---|---|---|---|---|---|
| Extra storage | Hot key messages copied to overflow storage | Original ledger storage duration increases | Original ledger storage duration increases | Hot key messages copied to new topic storage | Original ledger storage duration increases | External metadata service stores position info; original ledger storage duration increases |
| Extra I/O | Hot messages written to Overflow BK (one-time write) | Wasted replay I/O grows linearly with replay size | Each message requires ack on two cursors | Messages re-produced to new topic | On-demand reads from BK, no wasted writes | Metadata writes + positional BK reads |
| Sustained hot key viability | ✅ Stable operation (overflow dispatches on demand) | ❌ Eventually fills up | ⚠️ No problem if different keys take turns being hot; if the same key stays hot, operators can intervene within disk capacity to push business investigation | ❌ Stuck consumer cannot perform routing | ⚠️ Memory grows continuously, Broker OOM risk | ❌ Data lost after Ledger GC |
| Client transparency | ✅ Fully transparent | ✅ Fully transparent | ✅ Fully transparent | ❌ Requires client changes | ✅ Fully transparent | ✅ Fully transparent |
### Summary

Among all alternatives, only **Overflow ML** and **Auxiliary Cursor** can solve the core P1 and P2 problems. The main differences between the two are:

- **Overflow ML**: Writes the **complete content** of hot key messages to an independent ML; after acking on the original cursor, mark-delete advances. Extra storage is only the hot key message volume (~5%), and ledgers can be GC'd normally.
- **Auxiliary Cursor**: Creates a new cursor on the original ML to preserve the unconsumed state of hot key messages. No extra data writes are produced, but the auxiliary cursor's mark-delete is pinned, preventing the entire ledger (100%) from being GC'd, and the auxiliary cursor itself faces ack gap inflation.

**Alternative A (Enlarge Replay)** cannot solve P1—for sustained hot keys, no matter how large the replay limit is set, it will eventually fill up. **Alternative C (Client-Side Routing)** faces the fundamental difficulty that a stuck consumer cannot self-rescue. **Alternative D (Pure In-Memory Parking Zone)** is essentially the same approach as Alternative A, and will ultimately be constrained by the ack hole limit. **Alternative E (Metadata Storage for Positions)** is similar in concept to Alternative B, but requires additional mechanisms to prevent ledger deletion, and is a lower priority than Alternative B (Auxiliary Cursor).

Overall, Overflow ML is the solution that solves P1 and P2 while maintaining all constraints (at-least-once + per-key ordering + client transparency), and additionally resolves P3, P4, and P5. Its cost (additional BK writes and temporary double storage) is proportional to hot-key throughput, and is automatically cleaned up after the hot key subsides.

# General Notes

- This PIP only applies to `PersistentStickyKeyDispatcherMultipleConsumers` (including the Draining Hashes enhancement from PIP-379); it does not affect the Classic dispatcher or non-persistent dispatchers.
- The Overflow ML uses the same BK cluster and replication configuration as the original topic.
- The additional overhead during normal operation (no hot keys) is a single integer comparison (`replaySize >= watermark × limit`), approximately 1ns.

# Links

* Mailing List discussion thread:
* Mailing List voting thread: