Bug report: stale namespace persists in gossip mesh after publisher disconnect#1148
Bug report: stale namespace persists in gossip mesh after publisher disconnect#1148erikherz wants to merge 1 commit intomoq-dev:devfrom
Conversation
WalkthroughA new documentation file 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Important Merge conflicts detected (Beta)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can get early access to new features in CodeRabbit.Enable the |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@GOSSIP_STALE_NAMESPACE_BUG.md`:
- Around line 95-105: The markdown contains untyped code fences for log output
(the fence that starts with "[MoQT] SUBSCRIBE id=0 track=\"catalog\"" and the
fence that starts with "WARN conn{id=56}: transport error..."); update those
fences to use a language tag of text (i.e., change ``` to ```text) for both the
block around lines ~95 and the block around lines ~109-116 so the logs are
treated as plain text and satisfy MD040.
- Around line 70-77: The fenced code block in the GOSSIP_STALE_NAMESPACE_BUG.md
snippet is missing a language tag (triggers markdownlint MD040); update the
triple-backtick fence that contains the timeline lines starting with "Time 0:"
to include a language identifier (e.g., change ``` to ```text) so the block is
explicitly typed and linting passes—locate the timeline block in the file (the
one with "Time 0: Relay A dies..." through "But "test" was already re-published
via the new connection") and add the language tag to the opening fence.
- Around line 11-24: Update the reproduction steps to explicitly state the
effective broadcast path the browser subscriber sees (e.g., use "anon/test" vs
"test") wherever the namespace `test` is referenced; clarify in the Steps to
Reproduce and the Publisher/Subscriber bullets that moq-relay's WebTransport
prepends URL paths (e.g., /anon) while raw QUIC/iroh publishers must include
that prefix themselves, so readers know to check the actual broadcast name the
browser sees and avoid false-negative repros.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6046b5e8-a0cb-4127-9db6-97bf0315db08
📒 Files selected for processing (1)
GOSSIP_STALE_NAMESPACE_BUG.md
| - Publisher: moq-lite-02 client connecting to one relay (the "origin") | ||
| - Subscriber: browser via moq-transport-14 / WebTransport | ||
|
|
||
| ## Steps to Reproduce | ||
|
|
||
| 1. Start all 6 relays with full-mesh config | ||
| 2. Start publisher → connects to relay A, announces namespace `test` | ||
| 3. Verify: subscriber connects to any relay, plays successfully | ||
| 4. Stop the publisher | ||
| 5. Stop relay A (the origin) | ||
| 6. Restart relay A | ||
| 7. Subscriber connects to relay A → gets SUBSCRIBE_OK then PUBLISH_DONE status=500 | ||
|
|
||
| The namespace `test` is stuck. Restarting relay A alone does not fix it. Only restarting ALL relays within ~10 seconds clears the stale entry. |
There was a problem hiding this comment.
Clarify effective namespace/path to avoid false-negative reproduction
Please explicitly state the effective broadcast name seen by the browser subscriber (e.g., anon/test vs test). Without this, reproduction can be inconsistent when publisher transport differs from WebTransport.
✍️ Suggested doc patch
- Publisher: moq-lite-02 client connecting to one relay (the "origin")
- Subscriber: browser via moq-transport-14 / WebTransport
+ - Namespace note: WebTransport sessions prepend URL path prefixes (for example `/anon`),
+ so browser-visible namespace may be `anon/test` even if publisher announces `test`.
+ If publishing via raw QUIC/iroh, include the prefix manually when needed.Based on learnings: In moq-relay, WebTransport automatically prepends URL paths (e.g., /anon), while raw QUIC/iroh publishers must manually add that prefix to broadcast names.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - Publisher: moq-lite-02 client connecting to one relay (the "origin") | |
| - Subscriber: browser via moq-transport-14 / WebTransport | |
| ## Steps to Reproduce | |
| 1. Start all 6 relays with full-mesh config | |
| 2. Start publisher → connects to relay A, announces namespace `test` | |
| 3. Verify: subscriber connects to any relay, plays successfully | |
| 4. Stop the publisher | |
| 5. Stop relay A (the origin) | |
| 6. Restart relay A | |
| 7. Subscriber connects to relay A → gets SUBSCRIBE_OK then PUBLISH_DONE status=500 | |
| The namespace `test` is stuck. Restarting relay A alone does not fix it. Only restarting ALL relays within ~10 seconds clears the stale entry. | |
| - Publisher: moq-lite-02 client connecting to one relay (the "origin") | |
| - Subscriber: browser via moq-transport-14 / WebTransport | |
| - Namespace note: WebTransport sessions prepend URL path prefixes (for example `/anon`), | |
| so browser-visible namespace may be `anon/test` even if publisher announces `test`. | |
| If publishing via raw QUIC/iroh, include the prefix manually when needed. | |
| ## Steps to Reproduce | |
| 1. Start all 6 relays with full-mesh config | |
| 2. Start publisher → connects to relay A, announces namespace `test` | |
| 3. Verify: subscriber connects to any relay, plays successfully | |
| 4. Stop the publisher | |
| 5. Stop relay A (the origin) | |
| 6. Restart relay A | |
| 7. Subscriber connects to relay A → gets SUBSCRIBE_OK then PUBLISH_DONE status=500 | |
| The namespace `test` is stuck. Restarting relay A alone does not fix it. Only restarting ALL relays within ~10 seconds clears the stale entry. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@GOSSIP_STALE_NAMESPACE_BUG.md` around lines 11 - 24, Update the reproduction
steps to explicitly state the effective broadcast path the browser subscriber
sees (e.g., use "anon/test" vs "test") wherever the namespace `test` is
referenced; clarify in the Steps to Reproduce and the Publisher/Subscriber
bullets that moq-relay's WebTransport prepends URL paths (e.g., /anon) while raw
QUIC/iroh publishers must include that prefix themselves, so readers know to
check the actual broadcast name the browser sees and avoid false-negative
repros.
| ``` | ||
| Time 0: Relay A dies. Connection to peer B drops. | ||
| Time 0+ε: Peer B's BroadcastConsumer for "test" should start closing | ||
| Time 1s: Peer B reconnects to relay A (backoff=1s) | ||
| Time 1s+ε: Peer B publishes "test" to relay A via .with_publish(self.origin.consume()) | ||
| Time ???: Peer B's old consumer close task finally runs → unannounce | ||
| But "test" was already re-published via the new connection | ||
| ``` |
There was a problem hiding this comment.
Add a language tag to this fenced code block
Line 70 uses an untyped code fence, which triggers markdownlint MD040.
✅ Suggested fix
-```
+```text
Time 0: Relay A dies. Connection to peer B drops.
Time 0+ε: Peer B's BroadcastConsumer for "test" should start closing
Time 1s: Peer B reconnects to relay A (backoff=1s)
Time 1s+ε: Peer B publishes "test" to relay A via .with_publish(self.origin.consume())
Time ???: Peer B's old consumer close task finally runs → unannounce
But "test" was already re-published via the new connection</details>
<details>
<summary>🧰 Tools</summary>
<details>
<summary>🪛 markdownlint-cli2 (0.21.0)</summary>
[warning] 70-70: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
</details>
</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.
In @GOSSIP_STALE_NAMESPACE_BUG.md around lines 70 - 77, The fenced code block in
the GOSSIP_STALE_NAMESPACE_BUG.md snippet is missing a language tag (triggers
markdownlint MD040); update the triple-backtick fence that contains the timeline
lines starting with "Time 0:" to include a language identifier (e.g., change totext) so the block is explicitly typed and linting passes—locate the
timeline block in the file (the one with "Time 0: Relay A dies..." through "But
"test" was already re-published via the new connection") and add the language
tag to the opening fence.
</details>
<!-- fingerprinting:phantom:triton:hawk -->
<!-- This is an auto-generated comment by CodeRabbit -->
| ``` | ||
| [MoQT] SUBSCRIBE id=0 track="catalog" | ||
| [MoQT] PUBLISH_NAMESPACE id=1 ns="test" — sending OK | ||
| [MoQT] SUBSCRIBE_OK id=0 alias=0 track="catalog" | ||
| [MoQT] PUBLISH_DONE id=0 status=500 | ||
| [MoQT] PUBLISH_NAMESPACE id=3 ns="test" — sending OK | ||
| [MoQT] PUBLISH_NAMESPACE id=5 ns="test" — sending OK | ||
| [MoQT] PUBLISH_NAMESPACE id=7 ns="test" — sending OK | ||
| [MoQT] PUBLISH_NAMESPACE id=9 ns="test" — sending OK | ||
| ... | ||
| ``` |
There was a problem hiding this comment.
Add language tags to log code fences
Line 95 and Line 109 also use untyped fences (MD040). Use text for logs.
✅ Suggested fix
-```
+```text
[MoQT] SUBSCRIBE id=0 track="catalog"
[MoQT] PUBLISH_NAMESPACE id=1 ns="test" — sending OK
[MoQT] SUBSCRIBE_OK id=0 alias=0 track="catalog"
[MoQT] PUBLISH_DONE id=0 status=500
...
```diff
-```
+```text
WARN conn{id=56}: transport error err=session error: connection error: closed
INFO remote{remote=origin}: connecting to remote url=https://origin/
WARN remote{remote=origin}: QUIC connection failed err=timed out
...
</details>
Also applies to: 109-116
<details>
<summary>🧰 Tools</summary>
<details>
<summary>🪛 markdownlint-cli2 (0.21.0)</summary>
[warning] 95-95: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
</details>
</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.
In @GOSSIP_STALE_NAMESPACE_BUG.md around lines 95 - 105, The markdown contains
untyped code fences for log output (the fence that starts with "[MoQT] SUBSCRIBE
id=0 track="catalog"" and the fence that starts with "WARN conn{id=56}:
transport error..."); update those fences to use a language tag of text (i.e.,
change totext) for both the block around lines ~95 and the block around
lines ~109-116 so the logs are treated as plain text and satisfy MD040.
</details>
<!-- fingerprinting:phantom:triton:hawk -->
<!-- This is an auto-generated comment by CodeRabbit -->
Bug: Stale namespace persists in gossip mesh after publisher disconnect
Summary
In a full-mesh cluster (dev branch), a namespace announced via gossip survives publisher disconnects and relay restarts. Subscribers receive SUBSCRIBE_OK followed by PUBLISH_DONE status=500 because the relay routes to a gossip peer that has no actual data.
Environment
dev[cluster] connectlists all peers)Steps to Reproduce
testThe namespace
testis stuck. Restarting relay A alone does not fix it. Only restarting ALL relays within ~10 seconds clears the stale entry.Root Cause Analysis
How gossip propagates the namespace
When relay A's publisher announces
test:publish_broadcast()(rs/moq-lite/src/model/origin.rs:430)run_remote_once()(rs/moq-relay/src/cluster.rs:108):.with_publish(self.origin.consume())testin their own origin and re-publish to their peerstestin their originHow cleanup should work
When the publisher disconnects:
BroadcastConsumeron relay A closespublish_broadcast()(origin.rs:444) awaitsbroadcast.closed(), then callsremove()→unannounce()Why it fails
When relay A restarts:
testin their origins (received via gossip)run_remote_once()(cluster.rs:108)testback to relay A via.with_publish(self.origin.consume())testagain — but it came from gossip, not a real publishertestvia gossip, tries to fetch from a peer, peer has no data → PUBLISH_DONE 500The
broadcast.closed()cleanup on peers should fire when the gossip connection to relay A drops. But the reconnection happens immediately (backoff starts at 1s incluster.rs:94), and the new connection re-publishes before the old consumer's close task runs.The
REANNOUNCE_HOLD_DOWNof 250ms (origin.rs:15) is designed for cascading closures, but the reconnect+re-publish race is faster.Key code paths
rs/moq-lite/src/model/origin.rspublish_broadcast()— stores broadcast, spawns close watcherrs/moq-lite/src/model/origin.rsbroadcast.closed().await— cleanup on consumer closers/moq-lite/src/model/origin.rsroot.lock().remove()→ triggers unannouncers/moq-lite/src/model/origin.rsREANNOUNCE_HOLD_DOWN = 250msrs/moq-relay/src/cluster.rsrun_remote_once()— connects to peer, publishes+consumes full originrs/moq-relay/src/cluster.rsThe race condition
The gossip consumer on peer B is a different
BroadcastConsumerthan the original publisher's. When peer B's connection to relay A drops, peer B's consumer for the relay A session closes. Butorigin.rshas backup logic — the same namespace from other gossip paths (via peers C-F) keeps the broadcast alive on peer B.Possible Fixes
Reconnection delay: Wait for
REANNOUNCE_HOLD_DOWN(or longer) after a peer disconnects before re-publishing its namespaces on a new connection. This gives the close task time to propagate.Hop-aware cleanup: When a gossip connection drops, immediately unannounce all namespaces that were learned exclusively through that connection (no other path with equal or fewer hops).
Publisher heartbeat: Require the original publisher to periodically refresh its announce. If no refresh within N seconds, the namespace is unannounced globally.
Admin purge: Add a relay CLI command or HTTP endpoint to purge a namespace from the mesh.
Logs
Subscriber sees repeated PUBLISH_NAMESPACE from every gossip peer, then 500:
Edge relay during origin restart — disconnect, reconnect, timeout cycle: