Bug report: stale namespace persists in gossip mesh after publisher disconnect #1148

Open
erikherz wants to merge 1 commit into moq-dev:dev from erikherz:gossip-stale-namespace-bug

Conversation

@erikherz erikherz commented Mar 21, 2026

Bug: Stale namespace persists in gossip mesh after publisher disconnect

Summary

In a full-mesh cluster (dev branch), a namespace announced via gossip survives publisher disconnects and relay restarts. Subscribers receive SUBSCRIBE_OK followed by PUBLISH_DONE status=500 because the relay routes to a gossip peer that has no actual data.

Environment

  • Branch: dev
  • Cluster: 6 relays in full mesh ([cluster] connect lists all peers)
  • Publisher: moq-lite-02 client connecting to one relay (the "origin")
  • Subscriber: browser via moq-transport-14 / WebTransport

Steps to Reproduce

  1. Start all 6 relays with full-mesh config
  2. Start publisher → connects to relay A, announces namespace test
  3. Verify: subscriber connects to any relay, plays successfully
  4. Stop the publisher
  5. Stop relay A (the origin)
  6. Restart relay A
  7. Subscriber connects to relay A → gets SUBSCRIBE_OK then PUBLISH_DONE status=500

The namespace test is stuck. Restarting relay A alone does not fix it. Only restarting ALL relays within ~10 seconds clears the stale entry.

Root Cause Analysis

How gossip propagates the namespace

When relay A's publisher announces test:

  1. Relay A stores it locally via publish_broadcast() (rs/moq-lite/src/model/origin.rs:430)
  2. Relay A's gossip connections publish it to peers B-F via run_remote_once() (rs/moq-relay/src/cluster.rs:108): .with_publish(self.origin.consume())
  3. Peers B-F each store test in their own origin and re-publish to their peers
  4. All 6 relays now have test in their origin
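
The propagation steps above can be modelled as a flood over a full mesh. The following is a toy sketch, not moq-relay code: `Mesh`, `announce`, and `holders` are hypothetical names, and each relay's origin is reduced to a plain set of namespace strings.

```rust
use std::collections::BTreeSet;

/// Toy model of a full-mesh cluster: one set of known namespaces per relay.
/// Hypothetical type for illustration; not the moq-relay data structure.
pub struct Mesh {
    origins: Vec<BTreeSet<String>>,
}

impl Mesh {
    pub fn new(relays: usize) -> Self {
        Mesh { origins: vec![BTreeSet::new(); relays] }
    }

    /// A publisher announces `ns` on relay `at`. Every relay that learns a
    /// namespace for the first time re-publishes it to all of its peers.
    pub fn announce(&mut self, at: usize, ns: &str) {
        let mut pending = vec![at];
        while let Some(relay) = pending.pop() {
            // `insert` returns false when the relay already knew `ns`,
            // which is what stops the flood from looping forever.
            if self.origins[relay].insert(ns.to_string()) {
                for peer in 0..self.origins.len() {
                    if peer != relay {
                        pending.push(peer);
                    }
                }
            }
        }
    }

    /// Number of relays that currently hold `ns` in their origin.
    pub fn holders(&self, ns: &str) -> usize {
        self.origins.iter().filter(|o| o.contains(ns)).count()
    }
}
```

With `Mesh::new(6)` and `announce(0, "test")`, `holders("test")` is 6: every relay ends up with the namespace even though only relay 0 has a real publisher.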

How cleanup should work

When the publisher disconnects:

  1. The BroadcastConsumer on relay A closes
  2. The spawned task in publish_broadcast() (origin.rs:444) awaits broadcast.closed(), then calls remove(), which triggers the unannounce
  3. The unannounce should propagate to peers
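
The intended cleanup path can be sketched with the standard library alone. This is a model under stated assumptions, not the code in origin.rs: `publish_broadcast_model`, `Publisher`, and `Origin` are hypothetical names, a dropped channel sender stands in for `broadcast.closed()`, and a thread stands in for the spawned task.

```rust
use std::collections::HashSet;
use std::sync::{mpsc, Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Shared set of announced namespaces (stand-in for a relay's origin).
pub type Origin = Arc<Mutex<HashSet<String>>>;

pub fn new_origin() -> Origin {
    Arc::new(Mutex::new(HashSet::new()))
}

/// Dropping this handle models the publisher disconnecting.
pub struct Publisher {
    _alive: mpsc::Sender<()>,
}

/// Store `ns`, then spawn a watcher that removes it once the publisher is gone.
pub fn publish_broadcast_model(origin: &Origin, ns: &str) -> (Publisher, JoinHandle<()>) {
    origin.lock().unwrap().insert(ns.to_string());
    let (alive, closed) = mpsc::channel::<()>();
    let origin = Arc::clone(origin);
    let ns = ns.to_string();
    let watcher = thread::spawn(move || {
        // `recv` fails once every sender is dropped: the "closed" signal.
        while closed.recv().is_ok() {}
        // In the real relay, this removal is where the unannounce is triggered.
        origin.lock().unwrap().remove(&ns);
    });
    (Publisher { _alive: alive }, watcher)
}

pub fn is_announced(origin: &Origin, ns: &str) -> bool {
    origin.lock().unwrap().contains(ns)
}
```

Dropping the `Publisher` and joining the watcher leaves the origin without the namespace. The point of the model is that removal only happens when the watcher actually observes the close.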

Why it fails

When relay A restarts:

  1. Relay A's local broadcast is gone (process died)
  2. Peers B-F still have test in their origins (received via gossip)
  3. Relay A reconnects to peers via run_remote_once() (cluster.rs:108)
  4. Each peer publishes test back to relay A via .with_publish(self.origin.consume())
  5. Relay A now has test again — but it came from gossip, not a real publisher
  6. Subscriber connects, relay A finds test via gossip, tries to fetch from a peer, peer has no data → PUBLISH_DONE 500

The broadcast.closed() cleanup on peers should fire when the gossip connection to relay A drops. But the reconnection happens almost immediately (the backoff starts at 1s in cluster.rs:94), and the new connection re-publishes before the old consumer's close task runs.

The REANNOUNCE_HOLD_DOWN of 250ms (origin.rs:15) is designed for cascading closures, but the reconnect+re-publish race is faster.

Key code paths

| File | Line | Description |
| --- | --- | --- |
| rs/moq-lite/src/model/origin.rs | 430 | publish_broadcast(): stores broadcast, spawns close watcher |
| rs/moq-lite/src/model/origin.rs | 444 | broadcast.closed().await: cleanup on consumer close |
| rs/moq-lite/src/model/origin.rs | 445 | root.lock().remove() → triggers unannounce |
| rs/moq-lite/src/model/origin.rs | 15 | REANNOUNCE_HOLD_DOWN = 250ms |
| rs/moq-relay/src/cluster.rs | 108-122 | run_remote_once(): connects to peer, publishes+consumes full origin |
| rs/moq-relay/src/cluster.rs | 94 | Backoff starts at 1s, doubles on failure |
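
For reference, a backoff that starts at 1s and doubles on failure could look like the sketch below. The function name `reconnect_backoff` and the `cap` parameter are assumptions for illustration; the report does not say whether cluster.rs applies an upper bound.

```rust
use std::time::Duration;

/// Delay before the `attempt`-th consecutive retry (0-based): 1s, 2s, 4s, ...
/// `cap` is an assumed upper bound; the report does not state one.
pub fn reconnect_backoff(attempt: u32, cap: Duration) -> Duration {
    let initial = Duration::from_secs(1);
    // checked_pow and saturating_mul avoid overflow for large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    initial.saturating_mul(factor).min(cap)
}
```

The first retry fires after 1s, which is the reconnect delay that matters for the race described in this report.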

The race condition

Time 0:     Relay A dies. Connection to peer B drops.
Time 0+ε:   Peer B's BroadcastConsumer for "test" should start closing
Time 1s:    Peer B reconnects to relay A (backoff=1s)
Time 1s+ε:  Peer B publishes "test" to relay A via .with_publish(self.origin.consume())
Time ???:   Peer B's old consumer close task finally runs → unannounce
            But "test" was already re-published via the new connection

The gossip consumer on peer B is a different BroadcastConsumer than the original publisher's. When peer B's connection to relay A drops, peer B's consumer for the relay A session closes. But origin.rs has backup logic — the same namespace from other gossip paths (via peers C-F) keeps the broadcast alive on peer B.
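
The effect of the backup logic can be illustrated with a small model in which each relay records, per namespace, the set of sources that announced it. This is a sketch under assumptions, not the origin.rs implementation: `Cluster`, `Source`, `kill`, and `rejoin` are hypothetical, and the steady state assumes every relay has heard the namespace from every other relay.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Who announced a namespace to a relay: its own publisher or a peer relay.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub enum Source {
    LocalPublisher,
    Peer(usize),
}

/// One relay's view: namespace -> sources that currently back it.
type View = BTreeMap<String, BTreeSet<Source>>;

pub struct Cluster {
    relays: Vec<View>,
}

impl Cluster {
    /// Full mesh in steady state: relay `origin` has the real publisher,
    /// and every relay has also heard `ns` from every other relay.
    pub fn steady_state(n: usize, origin: usize, ns: &str) -> Self {
        let mut relays: Vec<View> = vec![View::new(); n];
        for (i, view) in relays.iter_mut().enumerate() {
            let sources = view.entry(ns.to_string()).or_default();
            if i == origin {
                sources.insert(Source::LocalPublisher);
            }
            sources.extend((0..n).filter(|&p| p != i).map(Source::Peer));
        }
        Cluster { relays }
    }

    /// Relay `dead` stops: it loses all state and every peer drops it as a
    /// source. A namespace is removed only when its last source is gone.
    pub fn kill(&mut self, dead: usize) {
        self.relays[dead].clear();
        for view in self.relays.iter_mut() {
            for sources in view.values_mut() {
                sources.remove(&Source::Peer(dead));
            }
            view.retain(|_, sources| !sources.is_empty());
        }
    }

    /// Relay `back` restarts and reconnects: each peer publishes what it
    /// still holds, and `back` re-publishes it in return.
    pub fn rejoin(&mut self, back: usize) {
        for peer in 0..self.relays.len() {
            if peer == back {
                continue;
            }
            let names: Vec<String> = self.relays[peer].keys().cloned().collect();
            for ns in names {
                self.relays[back].entry(ns.clone()).or_default().insert(Source::Peer(peer));
                self.relays[peer].entry(ns).or_default().insert(Source::Peer(back));
            }
        }
    }

    pub fn has(&self, relay: usize, ns: &str) -> bool {
        self.relays[relay].contains_key(ns)
    }

    /// True if any relay still has a real publisher behind `ns`.
    pub fn has_real_publisher(&self, ns: &str) -> bool {
        self.relays
            .iter()
            .any(|v| v.get(ns).map_or(false, |s| s.contains(&Source::LocalPublisher)))
    }
}
```

In this model, `kill(0)` on a 6-relay cluster leaves relays 1-5 holding `test`, because each still counts the other four as sources. A subsequent `rejoin(0)` hands the namespace back to relay 0 even though `has_real_publisher("test")` is false. The peers vouch for each other in a cycle, so no relay ever reaches an empty source set.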

Possible Fixes

  1. Reconnection delay: Wait for REANNOUNCE_HOLD_DOWN (or longer) after a peer disconnects before re-publishing its namespaces on a new connection. This gives the close task time to propagate.

  2. Hop-aware cleanup: When a gossip connection drops, immediately unannounce all namespaces that were learned exclusively through that connection (no other path with equal or fewer hops).

  3. Publisher heartbeat: Require the original publisher to periodically refresh its announce. If no refresh within N seconds, the namespace is unannounced globally.

  4. Admin purge: Add a relay CLI command or HTTP endpoint to purge a namespace from the mesh.
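
As a rough illustration of option 2 (hop-aware cleanup), the sketch below tracks a hop count per peer for each namespace and unannounces a namespace when the dropped peer held the strictly shortest path. It assumes announcements carry a hop count, which the report implies but does not show. `Routes`, `learn`, and `peer_dropped` are hypothetical names, not moq-relay APIs.

```rust
use std::collections::HashMap;

/// Paths by which one relay knows each namespace: peer id -> hops from publisher.
#[derive(Default)]
pub struct Routes {
    paths: HashMap<String, HashMap<String, u32>>,
}

impl Routes {
    /// Record that `peer` announced `ns` at the given hop count.
    pub fn learn(&mut self, ns: &str, peer: &str, hops: u32) {
        self.paths
            .entry(ns.to_string())
            .or_default()
            .insert(peer.to_string(), hops);
    }

    /// Handle a dropped gossip connection. Returns the namespaces to
    /// unannounce: those with no remaining path of equal or fewer hops.
    pub fn peer_dropped(&mut self, peer: &str) -> Vec<String> {
        let mut stale = Vec::new();
        for (ns, by_peer) in self.paths.iter_mut() {
            // Namespaces never learned from this peer are unaffected.
            let Some(lost) = by_peer.remove(peer) else { continue };
            let best_left = by_peer.values().copied().min();
            // Longer remaining paths are treated as echoes of the lost one.
            if best_left.map_or(true, |hops| hops > lost) {
                stale.push(ns.clone());
            }
        }
        for ns in &stale {
            self.paths.remove(ns);
        }
        stale.sort();
        stale
    }

    pub fn knows(&self, ns: &str) -> bool {
        self.paths.contains_key(ns)
    }
}
```

On peer B, `test` would be known from A at 1 hop and from C-F at 2 hops. When the connection to A drops, the remaining paths are all longer, so `test` is unannounced instead of being kept alive by the backup paths. The trade-off is that a legitimately longer alternative route is also discarded and must be re-learned.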

Logs

Subscriber sees repeated PUBLISH_NAMESPACE from every gossip peer, then 500:

[MoQT] SUBSCRIBE id=0 track="catalog"
[MoQT] PUBLISH_NAMESPACE id=1 ns="test" — sending OK
[MoQT] SUBSCRIBE_OK id=0 alias=0 track="catalog"
[MoQT] PUBLISH_DONE id=0 status=500
[MoQT] PUBLISH_NAMESPACE id=3 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=5 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=7 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=9 ns="test" — sending OK
...

Edge relay during origin restart — disconnect, reconnect, timeout cycle:

WARN conn{id=56}: transport error err=session error: connection error: closed
INFO remote{remote=origin}: connecting to remote url=https://origin/
WARN remote{remote=origin}: QUIC connection failed err=timed out
WARN remote{remote=peer}: transport error err=session error: connection error: closed
INFO remote{remote=peer}: connecting to remote url=https://peer/
WARN remote{remote=peer}: QUIC connection failed err=timed out

coderabbitai bot commented Mar 21, 2026

Walkthrough

A new documentation file GOSSIP_STALE_NAMESPACE_BUG.md has been added to describe a bug in a full-mesh relay cluster. The file documents an issue where a namespace announced via gossip becomes stuck after the original publisher disconnects, and partial relay restarts fail to clear it. The documentation includes the affected environment configuration, step-by-step reproduction instructions, observed subscriber behavior (SUBSCRIBE_OK followed by PUBLISH_DONE status=500), expected cleanup paths, contrasting failing behavior after relay restart, a race condition timeline, and multiple potential fix approaches with representative log snippets demonstrating the issue.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | No description was provided by the author. This is a lenient check and the lack of description alone does not cause a failure, though documentation is minimal. | Consider adding a brief description explaining the bug report's purpose, reproduction steps, or expected outcomes to improve clarity for reviewers. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: documenting a bug where stale namespaces persist in the gossip mesh after a publisher disconnects. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Important

Merge conflicts detected (Beta)

  • Resolve merge conflict in branch gossip-stale-namespace-bug

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@GOSSIP_STALE_NAMESPACE_BUG.md`:
- Around line 95-105: The markdown contains untyped code fences for log output
(the fence that starts with "[MoQT] SUBSCRIBE id=0 track=\"catalog\"" and the
fence that starts with "WARN conn{id=56}: transport error..."); update those
fences to use a language tag of text (i.e., change ``` to ```text) for both the
block around lines ~95 and the block around lines ~109-116 so the logs are
treated as plain text and satisfy MD040.
- Around line 70-77: The fenced code block in the GOSSIP_STALE_NAMESPACE_BUG.md
snippet is missing a language tag (triggers markdownlint MD040); update the
triple-backtick fence that contains the timeline lines starting with "Time 0:"
to include a language identifier (e.g., change ``` to ```text) so the block is
explicitly typed and linting passes—locate the timeline block in the file (the
one with "Time 0: Relay A dies..." through "But "test" was already re-published
via the new connection") and add the language tag to the opening fence.
- Around line 11-24: Update the reproduction steps to explicitly state the
effective broadcast path the browser subscriber sees (e.g., use "anon/test" vs
"test") wherever the namespace `test` is referenced; clarify in the Steps to
Reproduce and the Publisher/Subscriber bullets that moq-relay's WebTransport
prepends URL paths (e.g., /anon) while raw QUIC/iroh publishers must include
that prefix themselves, so readers know to check the actual broadcast name the
browser sees and avoid false-negative repros.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6046b5e8-a0cb-4127-9db6-97bf0315db08

📥 Commits

Reviewing files that changed from the base of the PR and between 64cbba8 and c678431.

📒 Files selected for processing (1)
  • GOSSIP_STALE_NAMESPACE_BUG.md

Comment on lines +11 to +24
- Publisher: moq-lite-02 client connecting to one relay (the "origin")
- Subscriber: browser via moq-transport-14 / WebTransport

## Steps to Reproduce

1. Start all 6 relays with full-mesh config
2. Start publisher → connects to relay A, announces namespace `test`
3. Verify: subscriber connects to any relay, plays successfully
4. Stop the publisher
5. Stop relay A (the origin)
6. Restart relay A
7. Subscriber connects to relay A → gets SUBSCRIBE_OK then PUBLISH_DONE status=500

The namespace `test` is stuck. Restarting relay A alone does not fix it. Only restarting ALL relays within ~10 seconds clears the stale entry.

⚠️ Potential issue | 🟡 Minor

Clarify effective namespace/path to avoid false-negative reproduction

Please explicitly state the effective broadcast name seen by the browser subscriber (e.g., anon/test vs test). Without this, reproduction can be inconsistent when publisher transport differs from WebTransport.

✍️ Suggested doc patch
 - Publisher: moq-lite-02 client connecting to one relay (the "origin")
 - Subscriber: browser via moq-transport-14 / WebTransport
+ - Namespace note: WebTransport sessions prepend URL path prefixes (for example `/anon`),
+   so browser-visible namespace may be `anon/test` even if publisher announces `test`.
+   If publishing via raw QUIC/iroh, include the prefix manually when needed.

Based on learnings: In moq-relay, WebTransport automatically prepends URL paths (e.g., /anon), while raw QUIC/iroh publishers must manually add that prefix to broadcast names.
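
To make the naming difference concrete, here is a minimal sketch of how a URL path prefix and an announced name might combine into the broadcast name a browser sees. `effective_broadcast_name` is a hypothetical helper written for this comment; the actual joining logic in moq-relay may differ.

```rust
/// Join a URL path prefix (e.g. "/anon") with an announced broadcast name.
/// Hypothetical helper; not taken from moq-relay.
pub fn effective_broadcast_name(url_path: &str, name: &str) -> String {
    // Ignore leading/trailing slashes so "/anon/" and "anon" behave the same.
    let prefix = url_path.trim_matches('/');
    let name = name.trim_matches('/');
    match (prefix.is_empty(), name.is_empty()) {
        (true, _) => name.to_string(),
        (false, true) => prefix.to_string(),
        (false, false) => format!("{prefix}/{name}"),
    }
}
```

Under this model, a publisher that connects over WebTransport at `/anon` and announces `test` is seen as `anon/test`, while a raw QUIC publisher with no path is seen as `test` unless it adds the prefix itself.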

Comment on lines +70 to +77
```
Time 0: Relay A dies. Connection to peer B drops.
Time 0+ε: Peer B's BroadcastConsumer for "test" should start closing
Time 1s: Peer B reconnects to relay A (backoff=1s)
Time 1s+ε: Peer B publishes "test" to relay A via .with_publish(self.origin.consume())
Time ???: Peer B's old consumer close task finally runs → unannounce
But "test" was already re-published via the new connection
```

⚠️ Potential issue | 🟡 Minor

Add a language tag to this fenced code block

Line 70 uses an untyped code fence, which triggers markdownlint MD040.

✅ Suggested fix
-```
+```text
 Time 0:     Relay A dies. Connection to peer B drops.
 Time 0+ε:   Peer B's BroadcastConsumer for "test" should start closing
 Time 1s:    Peer B reconnects to relay A (backoff=1s)
 Time 1s+ε:  Peer B publishes "test" to relay A via .with_publish(self.origin.consume())
 Time ???:   Peer B's old consumer close task finally runs → unannounce
             But "test" was already re-published via the new connection

Comment on lines +95 to +105
```
[MoQT] SUBSCRIBE id=0 track="catalog"
[MoQT] PUBLISH_NAMESPACE id=1 ns="test" — sending OK
[MoQT] SUBSCRIBE_OK id=0 alias=0 track="catalog"
[MoQT] PUBLISH_DONE id=0 status=500
[MoQT] PUBLISH_NAMESPACE id=3 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=5 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=7 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=9 ns="test" — sending OK
...
```

⚠️ Potential issue | 🟡 Minor

Add language tags to log code fences

Line 95 and Line 109 also use untyped fences (MD040). Use text for logs.

✅ Suggested fix
-```
+```text
 [MoQT] SUBSCRIBE id=0 track="catalog"
 [MoQT] PUBLISH_NAMESPACE id=1 ns="test" — sending OK
 [MoQT] SUBSCRIBE_OK id=0 alias=0 track="catalog"
 [MoQT] PUBLISH_DONE id=0 status=500
 ...

-```
+```text
 WARN conn{id=56}: transport error err=session error: connection error: closed
 INFO remote{remote=origin}: connecting to remote url=https://origin/
 WARN remote{remote=origin}: QUIC connection failed err=timed out
 ...

Also applies to: 109-116

