Description
Not sure if an issue is the right place to discuss this — happy to move the conversation elsewhere if needed.
Context
At Whatnot we use Phoenix.PubSub (backed by Registry) for livestreams. In #14654 we added {:duplicate, :key} partitioning which helped when we had many small livestreams, but it doesn't help with large ones (as discussed in the PR) — all subscribers for a key still end up in one duplicate_bag partition.
We hit a concrete problem: at a certain leave rate from a large livestream, the PID partition GenServer's message queue runs away because each :ets.match_delete/2 on the duplicate_bag has to scan all entries for that key to find the ones belonging to the leaving process. The partition can't keep up.
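To illustrate the scan cost (this is a standalone sketch, not Registry's internal code; the table name, key, and entry shapes are made up for the demo): in a duplicate_bag, every entry for a hot key lives under the same key, so deleting one pid's entries has to walk all objects stored under that key.

```elixir
# Hypothetical illustration of the duplicate_bag cleanup cost.
table = :ets.new(:demo_bag, [:duplicate_bag, :public])

key = "livestream:123"
pids = for _ <- 1..1_000, do: spawn(fn -> Process.sleep(:infinity) end)
for pid <- pids, do: :ets.insert(table, {key, {pid, :meta}})

leaving = hd(pids)
# match_delete has to match against all 1_000 objects stored under `key`
# just to find and remove this one process's row
:ets.match_delete(table, {key, {leaving, :_}})
:ets.info(table, :size)
```

At a high leave rate, each departing process pays this full scan inside the partition GenServer, which is where the queue backlog builds up.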
Idea
Use an ordered_set ETS table with composite keys {key, pid, counter} instead of a duplicate_bag with {key, {pid, value}} entries. Since ordered_set keeps entries sorted, all entries sharing a {key, pid} prefix are adjacent in the tree, so unregistering touches only that process's entries instead of scanning everything stored under the key.
This is combined with key-based partitioning, so a lookup hits one partition instead of all of them.
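A minimal sketch of the proposed layout (assumed shape for illustration, not necessarily the PoC's exact schema): with the composite key, cleanup can target just the {key, pid, _} range.

```elixir
# ordered_set keyed by {key, pid, counter}: one process's entries are adjacent.
table = :ets.new(:demo_ordered, [:ordered_set, :public])

key = "livestream:123"
me = self()
other = spawn(fn -> Process.sleep(:infinity) end)

:ets.insert(table, {{key, me, 1}, :meta_a})
:ets.insert(table, {{key, me, 2}, :meta_b})
:ets.insert(table, {{key, other, 1}, :meta_c})

# Unregistering `me` deletes only its own {key, me, _} range; the other
# subscribers under the same key are never scanned.
2 = :ets.select_delete(table, [{{{key, me, :_}, :_}, [], [true]}])
:ets.info(table, :size)
```

Because the match spec's key pattern has a bound prefix, ETS can bound the traversal in the ordered_set rather than scanning the whole table.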
A PoC implementation and benchmarks are available at: https://github.com/studzien/elixir/tree/registry/duplicate-ordered-set
In our simulated load tests the lookup-time increase was not noticeable, and we no longer saw the queue-runaway issue.
Benchmark results
Hot key workload — all N entries under a single key, 16 partitions, duplicate = pid-partitioned duplicate_bag, ordered = key-partitioned ordered_set:
Lookup
| N | duplicate | ordered | notes |
|---|---|---|---|
| 100 | 4.0 μs | 6.4 μs | ordered 1.6x slower |
| 1,000 | 29.5 μs | 67.3 μs | ordered 2.3x slower |
| 10,000 | 315 μs | 404 μs | ordered 1.3x slower |
| 1,000,000 | 149 ms | 94.9 ms | ordered 1.6x faster |
Ordered is slower at small N because it uses :ets.select/2 instead of :ets.lookup_element/3. The gap closes as N grows, and at 1M entries ordered wins because the lookup hits one partition instead of sixteen.
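For concreteness, the two lookup shapes being compared look roughly like this (an illustrative single-entry sketch, not the benchmark code):

```elixir
# duplicate_bag vs ordered_set lookup, side by side.
bag = :ets.new(:demo_bag2, [:duplicate_bag, :public])
ord = :ets.new(:demo_ord2, [:ordered_set, :public])

key = "livestream:123"
:ets.insert(bag, {key, {self(), :meta}})
:ets.insert(ord, {{key, self(), 1}, :meta})

# duplicate_bag: one hash lookup returns every value stored under `key`
[{_pid, :meta}] = :ets.lookup_element(bag, key, 2)

# ordered_set: a select over the {key, _, _} key-prefix range
[:meta] = :ets.select(ord, [{{{key, :_, :_}, :"$1"}, [], [:"$1"]}])
```

The select carries match-spec overhead per call, which is why ordered loses at small N before the single-partition advantage takes over.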
Unregister
| N | duplicate | ordered | notes |
|---|---|---|---|
| 100 | 2.9 μs | 3.3 μs | ~tied |
| 1,000 | 4.1 μs | 3.6 μs | ordered 1.1x faster |
| 10,000 | 14.7 μs | 4.0 μs | ordered 3.7x faster |
| 1,000,000 | 5.1 ms | 5.4 μs | ordered 940x faster |
This is the main win: at 1M entries, unregister drops from ~5 ms to ~5 μs. That per-unregister cost is what drives the queue runaway in the partition GenServer with duplicate_bag.
Tradeoffs
Lookup is ~1.5-2x slower at small to medium scale. Unregistration and crash cleanup are dramatically faster at scale. For our use case (large channels, high join/leave rate) the tradeoff is clearly worth it.
Next steps
We're happy to work on this and prepare a production-ready version of the code, but would appreciate some guidance on structuring the solution. The current Registry implementation already has a fair amount of branching for unique, duplicate partitioned by pid, and duplicate partitioned by key; adding another backend on top of that makes the code harder to follow. We'd welcome thoughts on whether the current approach (more case branches) is acceptable, or if there's a preferred way to organize this (e.g., separate modules per backend, a behaviour, etc.).
Related: #14654