DistArray::lazy_deleter: skip lazy_sync when invoked from fence's do_cleanup#551
Merged
evaleev merged 2 commits intoMay 21, 2026
Merged
Conversation
…cleanup Use the new MADNESS `WorldGopInterface::is_in_do_cleanup()` flag to short-circuit the cross-rank `lazy_sync` handshake when `lazy_deleter` is called from inside `fence_impl`'s deferred-cleanup phase: `delete pimpl` directly, decrement `cleanup_counter_`, return. Why it is safe: - `fence_impl` runs the global-termination protocol before calling `deferred_->do_cleanup()`, so all ranks are at the same point with no AM in flight. - `defer_deleter_to_next_fence` is, by contract, used collectively, so every rank's deferred list holds the same set of pimpls at this point and every rank performs the matching delete in lockstep. - The `lazy_sync` handshake exists to guarantee that no peer is still about to send AM addressed to this object before we delete it; the fence already establishes that. Why it matters: the original `lazy_sync` path enqueues a `lazy_sync_children` task on this world's taskq *after* the fence's drain loop has exited. Such tasks survive the fence and are picked up later by some other fence that drives the global ThreadPool. If the world is destroyed in the meantime (e.g. einsum's per-Hadamard sub-Worlds torn down at function exit or during exception unwind), the stranded task runs `delete pimpl` against a world whose taskq / gop are already freed; `~WorldObject` then trips its `World::exists(&world)` assertion and aborts, masking any real error. The fast path avoids ever scheduling that task. The general (non-deferred) path is unchanged: `lazy_deleter` invoked outside `do_cleanup` still goes through `lazy_sync` because we cannot rely on synchronization with peers in that case.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Use MADNESS's new
WorldGopInterface::is_in_do_cleanup()flag to short-circuit the cross-ranklazy_synchandshake whenlazy_deleteris called from insidefence_impl's deferred-cleanup phase:delete pimpldirectly, decrementcleanup_counter_, return.Also bumps the MADNESS pin to m-a-d-n-e-s-s/madness#695 to pick up the flag.
Why
lazy_sync's purpose inlazy_deleteris to keep a peer from sending AM addressed to aWorldObjectafter the local rank has deleted it. Whenlazy_deleterruns insidefence_impl'sdo_cleanup:fence_impl's global-termination protocol (drain loop + Dykstra-style nsent/nrecv tree handshake) has already established that no AM is in flight and all ranks are at the same syntactic point.defer_deleter_to_next_fenceis, by contract, used collectively, so every rank's deferred list holds the same pimpls at this point and every rank reaches the samedeletein lockstep.The handshake is therefore redundant — and actively harmful. The
lazy_sync_childrentask it schedules lives on this world's taskq after the drain loop has exited. That task survived the fence and waited for matchinglazy_synccalls that could come arbitrarily later — or not at all, if the world was torn down first. The most visible symptom: ephemeral worlds (e.g. TA-einsum's per-Hadamard sub-Worlds) being torn down at function exit or during exception unwind, leaving the task to be picked up by some unrelated later fence which runsdelete pimplagainst a freed world.~WorldObjectthen trips itsWorld::exists(&world)assertion and aborts.The general (non-deferred) path is unchanged:
lazy_deleterinvoked outsidedo_cleanupstill routes throughlazy_syncbecause we cannot rely on synchronization with peers in that case.What
external/versions.cmake: bump pinned MADNESS to666765ca6.src/TiledArray/array_impl.h: inlazy_deleter, after the existingworld.await(num_live_ds == 0), checkworld.gop.is_in_do_cleanup(); if set, delete directly.Test plan
~WorldObjectno longer needs any einsum-side workaround to run end-to-end (verified by toggling).Follow-up
ValeevGroup/tiledarray#550 builds on this with a smaller
einsumRAII guard and the(outer Contraction, inner Hadamard)view-cell case incont_engine.