
Proposal: ordered_set backend for duplicate Registry #15128

@studzien

Description

Not sure if an issue is the right place to discuss this — happy to move the conversation elsewhere if needed.

Context

At Whatnot we use Phoenix.PubSub (backed by Registry) for livestreams. In #14654 we added {:duplicate, :key} partitioning which helped when we had many small livestreams, but it doesn't help with large ones (as discussed in the PR) — all subscribers for a key still end up in one duplicate_bag partition.

We hit a concrete problem: at a certain leave rate from a large livestream, the PID partition GenServer's message queue runs away because each :ets.match_delete/2 on the duplicate_bag has to scan all entries for that key to find the ones belonging to the leaving process. The partition can't keep up.
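To make the failure mode concrete, here is an illustrative sketch (not the actual Registry internals) of why `:ets.match_delete/2` on a `duplicate_bag` is linear in the number of entries under the key — every object stored under the hot key must be matched just to find the leaving process's rows. Table name and value shape are made up for illustration:

```elixir
table = :ets.new(:partition, [:duplicate_bag, :public])
key = "livestream:123"

# 1_000 subscribers registered under one key, stored as {key, {pid, value}}.
pids = for _ <- 1..1_000, do: spawn(fn -> :ok end)
for pid <- pids, do: :ets.insert(table, {key, {pid, nil}})

# One subscriber leaves: the match pattern has to be checked against all
# 1_000 entries for the key, even though only one belongs to the leaving pid.
leaving = hd(pids)
:ets.match_delete(table, {key, {leaving, :_}})
:ets.info(table, :size)
# => 999
```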

Idea

Use an ordered_set ETS table with composite keys {key, pid, counter} instead of a duplicate_bag with {key, {pid, value}} entries. Since ordered_set keeps entries sorted, all entries sharing a {key, pid} prefix are adjacent in the tree, so unregistering touches only that process's entries instead of scanning everything under the key.

Combined with key-based partitioning, a lookup then hits a single partition instead of all of them.
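A minimal sketch of that layout (the table name, value shape, and counter source are illustrative, not the PoC's exact schema):

```elixir
table = :ets.new(:partition, [:ordered_set, :public])
key = "livestream:123"

pids = for _ <- 1..1_000, do: spawn(fn -> :ok end)

# Composite key {key, pid, counter}: all entries for one {key, pid}
# pair sit next to each other in the tree.
for {pid, counter} <- Enum.with_index(pids) do
  :ets.insert(table, {{key, pid, counter}, nil})
end

# Unregistering walks only the leaving process's contiguous range:
leaving = hd(pids)
1 = :ets.select_delete(table, [{{{key, leaving, :_}, :_}, [], [true]}])

# Lookup for the key walks only that key's prefix range:
subscribers =
  :ets.select(table, [{{{key, :"$1", :_}, :"$2"}, [], [{{:"$1", :"$2"}}]}])

length(subscribers)
# => 999
```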

A PoC implementation and benchmarks are available at: https://github.com/studzien/elixir/tree/registry/duplicate-ordered-set

In our simulated load tests, the lookup-time increase was not noticeable for our workload, and the queue-runaway issue disappeared.

Benchmark results

Hot key workload — all N entries under a single key, 16 partitions, duplicate = pid-partitioned duplicate_bag, ordered = key-partitioned ordered_set:

Lookup

| N | duplicate | ordered | relative |
| --- | --- | --- | --- |
| 100 | 4.0 μs | 6.4 μs | ordered 1.6x slower |
| 1,000 | 29.5 μs | 67.3 μs | ordered 2.3x slower |
| 10,000 | 315 μs | 404 μs | ordered 1.3x slower |
| 1,000,000 | 149 ms | 94.9 ms | ordered 1.6x faster |

Ordered is slower at small N due to :ets.select/2 vs :ets.lookup_element/3. The gap closes as N grows, and at 1M ordered wins because it hits one partition instead of sixteen.
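For context, these are the two lookup paths being compared, in hypothetical minimal form (table names and values are made up):

```elixir
bag = :ets.new(:bag_partition, [:duplicate_bag, :public])
ord = :ets.new(:ord_partition, [:ordered_set, :public])
key = "livestream:123"
me = self()

:ets.insert(bag, {key, {me, :value}})
:ets.insert(ord, {{key, me, 0}, :value})

# duplicate_bag: direct hash lookup on the key; for a bag table,
# lookup_element returns a list of the requested element of every object.
[{^me, :value}] = :ets.lookup_element(bag, key, 2)

# ordered_set: tree traversal via a match spec, bounded to the key prefix.
[{^me, :value}] =
  :ets.select(ord, [{{{key, :"$1", :_}, :"$2"}, [], [{{:"$1", :"$2"}}]}])
```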

Unregister

| N | duplicate | ordered | relative |
| --- | --- | --- | --- |
| 100 | 2.9 μs | 3.3 μs | ~tied |
| 1,000 | 4.1 μs | 3.6 μs | ordered 1.1x faster |
| 10,000 | 14.7 μs | 4.0 μs | ordered 3.7x faster |
| 1,000,000 | 5.1 ms | 5.4 μs | ordered 940x faster |

This is the main win. At 1M entries, unregister drops from ~5 ms to ~5 μs — and the slow duplicate_bag unregister is exactly what causes the message-queue runaway in the partition GenServer.

Tradeoffs

Lookup is roughly 1.3-2.3x slower at small to medium scale, while unregistration and crash cleanup are dramatically faster at scale. For our use case (large channels, high join/leave rate) the tradeoff is clearly worth it.

Next steps

We're happy to work on this and prepare a production-ready version of the code, but would appreciate some guidance on structuring the solution. The current Registry implementation already has a fair amount of branching for unique, duplicate partitioned by pid, and duplicate partitioned by key — adding another backend on top of that makes the code harder to follow. Would appreciate thoughts on whether the current approach (more case branches) is acceptable, or if there's a preferred way to organize this (e.g., separate modules per backend, behaviour, etc.).
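To make the "behaviour" option concrete, one possible shape — purely a sketch; the module name, callback names, and signatures are made up here, not a proposed API:

```elixir
defmodule Registry.Backend do
  @moduledoc false
  # Hypothetical private behaviour: each storage strategy (duplicate_bag
  # partitioned by pid, duplicate_bag partitioned by key, ordered_set)
  # would live in its own module implementing these callbacks, instead of
  # the dispatch happening through case branches in Registry itself.

  @callback new(partition :: non_neg_integer()) :: :ets.table()
  @callback register(:ets.table(), key :: term(), pid(), value :: term()) :: :ok
  @callback unregister(:ets.table(), key :: term(), pid()) :: :ok
  @callback lookup(:ets.table(), key :: term()) :: [{pid(), term()}]
end
```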

Related: #14654
