Add CallEdgeTracker trait for distributed deadlock detection #4735
Open
cloutiertyler wants to merge 28 commits into jdetter/tpcc
Conversation
Before making a cross-database reducer call, register an edge A -> B with a CallEdgeTracker. If adding the edge would create a cycle (distributed deadlock), the tracker returns an error. The caller retries with exponential backoff (5 attempts), then fails with a deadlock error.

New trait CallEdgeTracker in core::host::call_edge_tracker with:
- register_edge(call_id, caller, callee) -> Result<()>
- unregister_edge(call_id) -> Result<()>
- unregister_all_edges() -> Result<()> (crash cleanup on startup)

NoopCallEdgeTracker for standalone (always allows calls). The cloud implementation will call control DB reducers for cycle detection. Also added register/unregister_reducer_call_edge methods to the ControlStateWriteAccess trait (no-op in standalone).
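The trait and retry loop described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `CycleDetected` error struct, the string-typed database names, and the backoff constants are assumptions standing in for the real `NodesError::CycleDetected` and identity types.

```rust
use std::time::Duration;

// Hypothetical error type standing in for NodesError::CycleDetected.
#[derive(Debug)]
struct CycleDetected;

// Sketch of the CallEdgeTracker trait; real signatures may differ.
trait CallEdgeTracker: Send + Sync {
    fn register_edge(&self, call_id: u64, caller: &str, callee: &str) -> Result<(), CycleDetected>;
    fn unregister_edge(&self, call_id: u64) -> Result<(), CycleDetected>;
    fn unregister_all_edges(&self) -> Result<(), CycleDetected>;
}

// Standalone no-op implementation: every call is allowed.
struct NoopCallEdgeTracker;

impl CallEdgeTracker for NoopCallEdgeTracker {
    fn register_edge(&self, _: u64, _: &str, _: &str) -> Result<(), CycleDetected> { Ok(()) }
    fn unregister_edge(&self, _: u64) -> Result<(), CycleDetected> { Ok(()) }
    fn unregister_all_edges(&self) -> Result<(), CycleDetected> { Ok(()) }
}

// Caller-side retry: 5 attempts with exponential backoff, then give up.
fn register_with_backoff(
    tracker: &dyn CallEdgeTracker,
    call_id: u64,
    caller: &str,
    callee: &str,
) -> Result<(), CycleDetected> {
    let mut delay = Duration::from_millis(10); // illustrative initial delay
    for attempt in 0..5 {
        match tracker.register_edge(call_id, caller, callee) {
            Ok(()) => return Ok(()),
            Err(e) if attempt == 4 => return Err(e),
            Err(_) => {
                std::thread::sleep(delay);
                delay *= 2;
            }
        }
    }
    unreachable!("loop always returns on the final attempt")
}
```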
Edge tracking uses the CallEdgeTracker trait (in core) instead of ControlStateWriteAccess (in client-api) due to a circular dependency. Added a TODO to consolidate once the trait is moved to a shared crate.
Convert call_reducer_on_db and call_reducer_on_db_2pc from async to synchronous blocking HTTP. This avoids async runtime conflicts on the WASM executor thread.
- Add reqwest::blocking::Client to ReplicaContext
- Add execute_blocking_http helper (runs on a fresh OS thread)
- Add resolve_base_url_blocking to ReducerCallRouter
- Make CallEdgeTracker methods synchronous
- Enable the reqwest "blocking" feature
Add call_edge_tracker field to HostController and ModuleLauncher so the tracker flows from the top-level Node/StandaloneEnv down to each ReplicaContext. Added set_call_edge_tracker method for runtime configuration.
- ReducerCallRouter::resolve_base_url now returns Result<String> directly (blocking) instead of a BoxFuture. All implementations are synchronous.
- HostController uses a OnceLock for the router (set once at startup, lock-free reads afterward). Falls back to the LocalReducerRouter default.
- Removed the async BoxFuture and the resolve_base_url_blocking variant.
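The set-once router pattern can be sketched like this. The `Router` trait, error type, and fallback URL below are hypothetical stand-ins for the real `ReducerCallRouter` and `LocalReducerRouter`; the point is the `OnceLock` usage: one write at startup, lock-free reads afterward, with a default when unset.

```rust
use std::sync::OnceLock;

// Hypothetical minimal trait standing in for ReducerCallRouter.
trait Router: Send + Sync {
    fn resolve_base_url(&self, database: &str) -> Result<String, String>;
}

// Default used when no router has been installed (like LocalReducerRouter).
struct LocalReducerRouter;

impl Router for LocalReducerRouter {
    fn resolve_base_url(&self, _database: &str) -> Result<String, String> {
        Ok("http://127.0.0.1:3000".to_string()) // illustrative local address
    }
}

static ROUTER: OnceLock<Box<dyn Router>> = OnceLock::new();

// Set once at startup; subsequent calls are ignored by OnceLock::set.
fn set_router(router: Box<dyn Router>) {
    let _ = ROUTER.set(router);
}

// Lock-free read after initialization, falling back to the local default.
fn resolve(database: &str) -> Result<String, String> {
    match ROUTER.get() {
        Some(router) => router.resolve_base_url(database),
        None => LocalReducerRouter.resolve_base_url(database),
    }
}
```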
After the reducer commit releases the lock, modify the first pending TxData in the barrier queue to include the st_2pc_state deletion. When the barrier clears, a single commitlog entry contains both the reducer's row changes and the COMMIT marker (st_2pc_state delete). The st_2pc_state row never enters committed_state during normal operation -- it only exists in the commitlog for crash recovery.
Replace NoopCallEdgeTracker with InMemoryCallEdgeTracker that maintains an in-memory adjacency list of active call edges and runs DFS cycle detection on each registration. Works for standalone where all databases share the same process.
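The in-memory tracker can be sketched as an edge map plus a DFS reachability check: adding caller -> callee creates a cycle exactly when caller is already reachable from callee. Database names as strings and the `CycleDetected` error are assumptions; the real types in the PR may differ.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

// Hypothetical error standing in for NodesError::CycleDetected.
#[derive(Debug)]
struct CycleDetected;

#[derive(Default)]
struct InMemoryCallEdgeTracker {
    // call_id -> (caller, callee); Mutex allows use across threads.
    edges: Mutex<HashMap<u64, (String, String)>>,
}

impl InMemoryCallEdgeTracker {
    /// Registers caller -> callee, or fails if that edge would close a cycle.
    fn register_edge(&self, call_id: u64, caller: &str, callee: &str) -> Result<(), CycleDetected> {
        let mut edges = self.edges.lock().unwrap();
        let would_cycle = {
            // Adjacency list of the currently active edges.
            let mut adj: HashMap<&str, Vec<&str>> = HashMap::new();
            for (from, to) in edges.values() {
                adj.entry(from).or_default().push(to);
            }
            // DFS from callee: reaching caller means the new edge forms a cycle.
            let mut stack = vec![callee];
            let mut seen = HashSet::new();
            let mut found = false;
            while let Some(node) = stack.pop() {
                if node == caller {
                    found = true; // distributed deadlock would form
                    break;
                }
                if seen.insert(node) {
                    if let Some(nexts) = adj.get(node) {
                        stack.extend(nexts.iter().copied());
                    }
                }
            }
            found
        };
        if would_cycle {
            return Err(CycleDetected);
        }
        edges.insert(call_id, (caller.to_string(), callee.to_string()));
        Ok(())
    }

    fn unregister_edge(&self, call_id: u64) {
        self.edges.lock().unwrap().remove(&call_id);
    }
}
```

The check runs under the same lock as the insertion, so two concurrent registrations cannot both slip past the cycle test.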
execute_blocking_http now takes a RequestBuilder instead of a built Request. Both build() and execute() happen inside the scoped OS thread, which has no tokio context. In debug builds, reqwest 0.12 panics if blocking I/O operations run inside a tokio block_on context.
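The thread-escape pattern reduces to something like the helper below: run the whole blocking closure (standing in for building and sending the reqwest request) on a fresh OS thread via `std::thread::scope`, so none of it executes inside a tokio context. This is a sketch of the pattern only, not the real `execute_blocking_http` signature.

```rust
// Runs `f` on a new OS thread and blocks until it finishes. Because the
// closure executes on its own thread, any blocking I/O inside it (such as a
// reqwest::blocking send) runs outside the caller's tokio runtime context.
fn execute_blocking<T: Send>(f: impl FnOnce() -> T + Send) -> T {
    std::thread::scope(|s| s.spawn(f).join().expect("blocking task panicked"))
}
```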
modify_first_barrier_pending used Arc::get_mut which always failed because the committing code still held a reference to the same Arc. This meant the st_2pc_state DELETE was silently dropped, causing "Delete for non-existent row" crashes on commit log replay after a restart. Fix: derive Clone for TxData and use Arc::make_mut, which does copy-on-write when the Arc is shared.
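The `Arc::get_mut` vs `Arc::make_mut` difference can be demonstrated in isolation. The `TxData` struct here is a stand-in for the real one; only the `Vec` field and the `Clone` derive matter for the illustration.

```rust
use std::sync::Arc;

// Stand-in for the real TxData; the fix derives Clone so that
// Arc::make_mut can copy-on-write when the Arc is shared.
#[derive(Clone, Debug, PartialEq)]
struct TxData {
    rows: Vec<&'static str>,
}

// Returns (modified copy's row count, other holder's row count).
fn demo_make_mut() -> (usize, usize) {
    let mut pending = Arc::new(TxData { rows: vec!["reducer insert"] });
    let other = Arc::clone(&pending); // committing code still holds a reference

    // get_mut fails whenever another Arc points at the same allocation,
    // which is why the st_2pc_state DELETE was silently dropped.
    assert!(Arc::get_mut(&mut pending).is_none());

    // make_mut clones the inner TxData and mutates the fresh copy.
    Arc::make_mut(&mut pending).rows.push("st_2pc_state delete");
    (pending.rows.len(), other.rows.len())
}
```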
The DELETE entry for st_2pc_state was constructed with empty placeholder fields (only prepare_id set). During transaction replay, delete_equal_row uses whole-row equality via eq_row_in_page, so the empty-field DELETE never matched the full-field INSERT, causing "Delete for non-existent row" errors that bricked the database on restart. Build the St2pcStateRow once and reuse it for both the INSERT marker and the DELETE entry so they match exactly during replay.
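A toy version of the mismatch and the fix, assuming whole-row equality as the commit message describes. The field names below are illustrative, not the actual st_2pc_state schema.

```rust
// Illustrative row type; replay matches DELETEs against INSERTs by
// whole-row equality, so every field must agree.
#[derive(Clone, PartialEq, Debug)]
struct St2pcStateRow {
    prepare_id: u64,
    coordinator_url: String,
}

// The old bug: a DELETE built with empty placeholder fields never equals
// the full-field INSERT under whole-row equality.
fn placeholder_delete(prepare_id: u64) -> St2pcStateRow {
    St2pcStateRow { prepare_id, coordinator_url: String::new() }
}

// The fix: build the row once and reuse it for both the INSERT marker and
// the DELETE entry, so they match exactly during replay.
fn marker_rows(prepare_id: u64, coordinator_url: &str) -> (St2pcStateRow, St2pcStateRow) {
    let row = St2pcStateRow {
        prepare_id,
        coordinator_url: coordinator_url.to_string(),
    };
    (row.clone(), row)
}
```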
…module maybe_create_schedule was calling the blocking connect_metrics_module while holding the parking_lot::Mutex, inside an async Axum handler. If the connection timed out or failed, .unwrap() panicked with the lock held, leaving the schedule unset and returning a 500 to the last driver to register — causing it to exhaust its retry attempts and fail. Switch to connect_metrics_module_async and release the mutex before the network call, re-acquiring it only to write the final schedule. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When multiple concurrent 2PC transactions share a participant database, each sets its own durability barrier. If one aborts, the old code only dropped pending transactions when ALL barriers were gone. Otherwise, tainted TxData from the aborted 2PC stayed in the pending list and got flushed when a later barrier cleared, writing corrupted data to the commitlog. Since abort is always followed by a module restart that rebuilds committed state from disk, unconditionally drop all barriers and all pending transactions. Other in-flight 2PC async tasks will find the barrier already gone and no-op, which is correct since the module is about to restart.
With concurrent 2PC transactions on the same participant, modify_first_barrier_pending would add the st_2pc_state DELETE to the wrong TxData entry. The pending queue could contain entries from multiple 2PC transactions, and first() picked the oldest one rather than the current transaction's reducer commit. This caused the DELETE to be written to the commitlog in a different transaction than the INSERT, leading to "Delete for non-existent row" errors during replay. Replace modify_first_barrier_pending with modify_barrier_pending_at, which finds the entry by its tx_offset (barrier_offset + 1, the reducer commit offset assigned while the write lock was held).
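The offset-based lookup can be sketched with hypothetical minimal types; the real pending queue and TxData are richer, but the change is the same: select the entry by tx_offset instead of taking the first one.

```rust
// Hypothetical stand-in for a pending commitlog entry behind a barrier.
struct PendingTx {
    tx_offset: u64,
    rows: Vec<String>,
}

// Finds the pending entry with the given tx_offset (barrier_offset + 1)
// and appends the st_2pc_state DELETE to it. Returns false if no entry
// with that offset is pending.
fn modify_barrier_pending_at(pending: &mut [PendingTx], tx_offset: u64, delete_row: String) -> bool {
    match pending.iter_mut().find(|p| p.tx_offset == tx_offset) {
        Some(entry) => {
            entry.rows.push(delete_row);
            true
        }
        None => false,
    }
}
```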
Summary
- CallEdgeTracker trait for tracking cross-database call edges (A -> B)
- NoopCallEdgeTracker for standalone (no-op, always succeeds)
- Edge registration around call_reducer_on_db and call_reducer_on_db_2pc
- NodesError::CycleDetected and errno::CYCLE_DETECTED (22) for the wasm ABI

Design
When database A calls a reducer on database B, the edge A -> B is registered before the HTTP request is sent. If register_edge detects a cycle, it returns CycleDetected and the call is aborted before it can deadlock. After the call completes, the edge is unregistered.

The CallEdgeTracker is stored on ReplicaContext so both the actor code and HTTP handlers can access it.

Test plan