fuzz: cover deferred writing in chanmon_consistency#4465
Conversation
👋 I see @wpaulino was un-assigned.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #4465      +/-   ##
==========================================
- Coverage   86.12%   86.11%    -0.02%
==========================================
  Files         157      157
  Lines      108824   108871       +47
==========================================
+ Hits        93727    93755       +28
- Misses      12480    12501       +21
+ Partials     2617     2615        -2
==========================================
Force-pushed 51afc25 to 0aebb10

Rebased

Force-pushed 0aebb10 to 6274ba0

Prerequisites are in, rebased.

This LGTM, why is it draft?
I first wanted to do a serious local run, but then it turned out there are so many pre-existing fuzz failures that it is hard to see what's new. I've bisected the failures to the various PRs that introduced them.

Although for this PR I could just see if anything pops up that doesn't repro on main. Will do that.

Several newly introduced fuzz failures to address:
Zoomed in on one of those sequences, and it seems it is reproducible with another string without deferred mode too.

Force-pushed c6da16e to d4bf3e0

Dependency: #4520

Force-pushed d4bf3e0 to a798d74
Interestingly, the non-deferred mode reproducer no longer triggers. It seems the increased headroom made the bug unobservable to the fuzzer? I ran the fuzzer on this branch overnight, and no failures were found.

I'll try to add a new invariant to the fuzzer so it catches the problem in a more robust way.

Invariant addition: #4601

Will rebase this after #4571
/// This simulates the pattern of snapshotting the pending count, persisting the
/// `ChannelManager`, then flushing the queued monitor writes.
Nit: The docstring says this "simulates the pattern of snapshotting the pending count, persisting the ChannelManager, then flushing the queued monitor writes." However, in the actual fuzz loop, the flush happens during event processing (called from release_pending_monitor_events), while the ChannelManager is serialized at the end of each loop iteration (lines 3023-3031) — i.e., after the flush, not before.
This means the fuzzer always tests the scenario where the ChannelManager snapshot captures post-flush state. The more interesting crash scenario — serializing the ChannelManager while monitor writes are still queued, then crashing before the flush — is not directly exercised by this ordering. (The fuzzer does partially cover stale-monitor restarts through the use_old_mons selection in reload_node, but that's a different axis.)
Consider either updating the docstring to reflect what actually happens, or restructuring so the ChannelManager is serialized at the flush call-site to match the documented pattern.
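The two orderings being contrasted can be shown with a toy model. Everything below (the Harness struct, checkpoint, flush, the u64 version counters) is a hypothetical stand-in for the real ChannelManager/monitor persistence, not the fuzzer's actual code; it only illustrates why the documented "checkpoint, then flush" order is what exposes the stale-monitor crash window.

```rust
// Toy model of the documented "checkpoint, then flush" ordering; all names
// and types are illustrative, not the chanmon_consistency harness.
#[derive(Default)]
struct Harness {
    manager_version: u64,            // stand-in for live ChannelManager state
    queued_monitor_writes: Vec<u64>, // deferred monitor writes, by update id
    persisted_manager_version: u64,
    persisted_monitor_writes: Vec<u64>,
}

impl Harness {
    // Serialize the ChannelManager stand-in.
    fn checkpoint(&mut self) {
        self.persisted_manager_version = self.manager_version;
    }

    // Flush the queued monitor writes to "disk".
    fn flush(&mut self) {
        let queued: Vec<u64> = self.queued_monitor_writes.drain(..).collect();
        self.persisted_monitor_writes.extend(queued);
    }
}

fn main() {
    let mut h = Harness {
        manager_version: 7,
        queued_monitor_writes: vec![1, 2],
        ..Default::default()
    };

    // Documented pattern: checkpoint *before* flushing, so a crash between
    // the two steps restarts with a current manager but stale (unflushed)
    // monitors -- the scenario the review says the current loop order skips.
    h.checkpoint();
    // Simulated crash point: manager snapshot exists, monitors not flushed.
    assert_eq!(h.persisted_manager_version, 7);
    assert!(h.persisted_monitor_writes.is_empty());

    h.flush();
    assert_eq!(h.persisted_monitor_writes, vec![1, 2]);
}
```

Reversing the two calls (flush, then checkpoint) makes the crash window uninteresting, since the snapshot always captures post-flush state.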
	if persister_res == chain::ChannelMonitorUpdateStatus::Completed {
		for (_channel_id, state) in self.latest_monitors.lock().unwrap().iter_mut() {
			if let Some((id, data)) = state.pending_monitors.drain(..).last() {
				state.persisted_monitor_id = id;
				state.persisted_monitor = data;
			}
		}
	}
When persister_res == Completed, this drains pending_monitors for all channels, not just those whose operations were included in the current flush(count) call. This is correct today because update_ret is constant for the lifetime of a node (set at creation/reload, never mutated by the mon_style fuzz commands at 0x00-0x06). That invariant ensures:

- If update_ret == Completed: every prior flush also used Completed, so there are no "orphaned" pending entries from an earlier InProgress flush.
- If update_ret == InProgress: this branch is never taken.

But this is a subtle invariant: if a future change allows update_ret to be toggled mid-run (e.g., fuzz commands updating the live persister), this would silently promote monitors whose corresponding channel_monitor_updated calls were never made by the flush. Consider adding a comment documenting this assumption, or tightening the logic to only promote monitors that correspond to channels with operations in the flushed batch.
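The tightened variant could look roughly like the sketch below. The MonitorShadow type, promote_flushed helper, and u32 channel ids are all hypothetical simplifications of the harness state, used only to show "promote only channels in the flushed batch" versus "drain every channel on Completed".

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for the harness' per-channel shadow state.
#[derive(Default)]
struct MonitorShadow {
    persisted_monitor_id: u64,
    persisted_monitor: Vec<u8>,
    pending_monitors: Vec<(u64, Vec<u8>)>,
}

// Promote pending monitors only for channels whose operations were part of
// the flushed batch, instead of draining every channel's queue on Completed.
fn promote_flushed(shadows: &mut HashMap<u32, MonitorShadow>, flushed_channels: &[u32]) {
    for channel_id in flushed_channels {
        if let Some(state) = shadows.get_mut(channel_id) {
            if let Some((id, data)) = state.pending_monitors.drain(..).last() {
                state.persisted_monitor_id = id;
                state.persisted_monitor = data;
            }
        }
    }
}

fn main() {
    let mut shadows = HashMap::new();
    shadows.insert(1, MonitorShadow { pending_monitors: vec![(4, vec![4])], ..Default::default() });
    shadows.insert(2, MonitorShadow { pending_monitors: vec![(9, vec![9])], ..Default::default() });

    // Only channel 1 was part of this flush; channel 2's queue must survive
    // so a later flush (possibly with a different update_ret) can handle it.
    promote_flushed(&mut shadows, &[1]);
    assert_eq!(shadows[&1].persisted_monitor_id, 4);
    assert!(shadows[&1].pending_monitors.is_empty());
    assert_eq!(shadows[&2].pending_monitors, vec![(9, vec![9u8])]);
}
```

With this shape, toggling update_ret mid-run would no longer silently promote monitors whose channel_monitor_updated calls were never made.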
I've completed a thorough re-review of the entire PR diff.

No new issues found beyond what was already noted in the prior review pass. Prior review status: no issues found.
Force-pushed a798d74 to 20275f6
	fn mark_update_completed(
		&self, channel_id: ChannelId, monitor_id: u64, serialized_monitor: Vec<u8>,
	) {
		if let Some(state) = self.latest_monitors.lock().unwrap().get_mut(&channel_id) {
			// Once LDK acknowledges update N as completed, any older pending monitor blob is fully
			// superseded and must not be offered back on restart.
			state.mark_persisted(monitor_id, serialized_monitor);
		}
	}
Bug: mark_persisted removes all pending entries with id <= monitor_id from the shadow's pending_monitors (via retain), but only the single selected entry was acknowledged with ChainMonitor::channel_monitor_updated. The older entries that are removed will never be acknowledged.

ChainMonitor::channel_monitor_updated removes only the specific completed_update_id from its internal pending_monitor_updates list, and only emits MonitorEvent::Completed when all pending updates have been individually acknowledged.

Regression from old code: The old complete_monitor_update only removed the selected entry (via remove(0), remove(1), or pop()) and left other entries intact for future completion. The new code's reuse of mark_persisted here also cleans up older entries that still need ChainMonitor acknowledgment.

Trigger scenario:

1. persistence_style = InProgress, two updates accumulate: pending_monitors = [(3, data3), (5, data5)]
2. Fuzz command 0xf2/0xf6/0xfa/0xfe calls complete_monitor_update(Last)
3. take_pending(Last) removes (5, data5) → pending_monitors = [(3, data3)]
4. channel_monitor_updated(5): ChainMonitor removes 5, still has 3 pending
5. mark_persisted(5, data5): retain removes (3, data3) → pending_monitors = []
6. The shadow has no record of update 3, and ChainMonitor has 3 pending forever. MonitorEvent::Completed is never emitted → channel operations blocked → process_all_events panics at 100 iterations.

Fix: mark_update_completed should only update persisted_monitor_id/persisted_monitor without removing other pending entries. The retain cleanup in mark_persisted is correct for track_monitor_update (persister callback context) but wrong here (explicit completion context). Consider splitting the logic or adding a dedicated method that doesn't retain-prune.
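The suggested split can be sketched on a simplified stand-in for the shadow state. The MonitorState type and both method bodies below follow the review discussion but are hypothetical, not the harness' actual code; the selected pending entry is assumed to have already been removed by take_pending before either method runs.

```rust
// Simplified stand-in for the harness' per-channel shadow state.
#[derive(Default)]
struct MonitorState {
    persisted_monitor_id: u64,
    persisted_monitor: Vec<u8>,
    pending_monitors: Vec<(u64, Vec<u8>)>,
}

impl MonitorState {
    // Persister-callback context: update `id` is durably persisted, so every
    // older pending blob is superseded; retain-pruning them is correct here.
    fn mark_persisted(&mut self, id: u64, data: Vec<u8>) {
        self.persisted_monitor_id = id;
        self.persisted_monitor = data;
        self.pending_monitors.retain(|(pending_id, _)| *pending_id > id);
    }

    // Explicit-completion context: only record the new durable baseline and
    // leave older entries queued so channel_monitor_updated can still
    // acknowledge them later -- no retain-prune.
    fn mark_update_completed(&mut self, id: u64, data: Vec<u8>) {
        self.persisted_monitor_id = id;
        self.persisted_monitor = data;
    }
}

fn main() {
    // Trigger scenario from above: update 3 is still pending after
    // take_pending(Last) removed (5, data5).
    let mut buggy = MonitorState { pending_monitors: vec![(3, vec![3])], ..Default::default() };
    buggy.mark_persisted(5, vec![5]);
    // Buggy path: the record of update 3 is gone; it can never be acked.
    assert!(buggy.pending_monitors.is_empty());

    let mut fixed = MonitorState { pending_monitors: vec![(3, vec![3])], ..Default::default() };
    fixed.mark_update_completed(5, vec![5]);
    // Fixed path: update 3 stays queued for a later channel_monitor_updated.
    assert_eq!(fixed.pending_monitors, vec![(3, vec![3u8])]);
    assert_eq!(fixed.persisted_monitor_id, 5);
}
```

Keeping the two contexts as separate methods makes the divergent pruning semantics explicit at each call-site.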
Force-pushed af7e8ee to 53cd7c9

Force-pushed 53cd7c9 to e72ec58
Replace the chanmon consistency harness' Watch wrapper with a Persist implementation backed by HarnessPersister. Monitor writes now flow through the real ChainMonitor persistence hooks. Track restart candidates separately from monitor completion callbacks. A monitor can stop being a valid reload candidate once a newer baseline is durable, while its callback may still be needed to unblock the live ChainMonitor. On reload, choose the durable baseline, first pending snapshot, or last pending snapshot. Startup monitor registration completes immediately before the configured persistence style is restored.
Enable deferred ChainMonitor writes in chanmon_consistency. Checkpoint the ChannelManager before flushing captured monitor writes. Treat checkpoint-only progress as progress during settle_all.
Force-pushed e72ec58 to 5eb96ab
I reworked this PR and based it on a preparatory commit that makes the fuzzer's persistence model much more understandable, in my opinion. The prep work was taken from the force-close fuzzing PR #4381 that is still WIP, so it should be useful there as well.
Adds fuzz coverage for #4351