[IR Container] Phase 2.7 Segmenter Container Sharing#5983

Open
mdavis36 wants to merge 5 commits into md/phase2-thread-safety from md/segmenter-container-sharing

Conversation

@mdavis36
Collaborator

@mdavis36 mdavis36 commented Feb 19, 2026

Summary

Wire the segmenter into Phase 2's shared `IrContainer` infrastructure: segment Fusions now share the complete Fusion's container instead of each creating a new one. This is the integration step that puts real cross-thread lock contention on the `shared_ptr` + `shared_mutex` infrastructure built in Phases 2.1–2.6.

What Changed

`fusion_segmenter.cpp` — `SegmentedFusion::makeFusion()` creates segment Fusions with the shared-container constructor instead of `std::make_unique<Fusion>()`, so all segment Fusions are registered in `completeFusion`'s `IrContainer`.

`fusion.h` / `fusion.cpp` — Add a `protected` constructor `Fusion(shared_ptr<IrContainer>)` that creates an empty Fusion registered with a pre-existing container. Used exclusively by `makeFusion()`.

`ir/container.cpp` — Pre-allocate `per_fusion_vals_[fusion]` and `per_fusion_exprs_[fusion]` entries inside `addFusion()` (under the write lock) to prevent unguarded `unordered_map` rehash races when multiple segment Fusions register concurrently during parallel compilation.
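
As a rough illustration of the pre-insertion idea, the sketch below registers an owner and creates its empty per-owner set while holding the exclusive lock. `SharedRegistry` and its member names are hypothetical stand-ins, not nvFuser types, and the integer keys replace the `Fusion*` keys used by the real container.

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for a container shared by several owners.
class SharedRegistry {
 public:
  // Register an owner and pre-insert its empty per-owner set while the
  // exclusive lock is held, so the outer map already has the key before
  // any item is registered for this owner.
  void addOwner(int owner) {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    per_owner_items_[owner]; // operator[] default-constructs the entry
  }

  // Register an item for an owner that was added earlier.
  void registerItem(int owner, int item) {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    per_owner_items_.at(owner).insert(item); // throws if owner is unknown
  }

  // Readers take the shared lock and never insert into the outer map.
  std::size_t countFor(int owner) const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    auto it = per_owner_items_.find(owner);
    return it == per_owner_items_.end() ? 0 : it->second.size();
  }

  bool hasOwner(int owner) const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return per_owner_items_.count(owner) > 0;
  }

 private:
  mutable std::shared_mutex mutex_;
  std::unordered_map<int, std::unordered_set<int>> per_owner_items_;
};
```

In this sketch the only place the outer map gains a key is `addOwner()`, so later registrations and reads look up an entry that already exists.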

`ContainerMutator::removeStatementsCreatedAfter()` (bugfix) — The previous implementation assumed a LIFO pop-back invariant on the global deques that only holds when a single Fusion owns the container. Under shared containers, other Fusions' statements can be interleaved at the tail. Two paths are now implemented:

  • Fast path (single owner): original LIFO pop-back, O(statements added in scope).
  • Slow path (shared container): `std::erase_if` scan — skips another Fusion's statements, preserves this Fusion's pre-guard statements, and removes only this Fusion's statements added during the guard scope. O(total statements in container).
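
The slow path can be sketched outside nvFuser roughly as follows. `Stmt`, `removeOwnedCreatedAfter`, and the plain pointer set used for ownership are hypothetical names chosen for illustration. The sketch uses the C++17 erase/remove idiom, which is what C++20 `std::erase_if` is specified to be equivalent to for sequence containers, and it assumes the predicate is applied front to back, as the description above also does.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <memory>
#include <unordered_set>

// Hypothetical statement type; the real container stores Val and Expr.
struct Stmt {
  int id;
};

// Remove the statements in `owned` that come after the first `num_before`
// owned statements, leaving statements of other owners untouched.
// Returns the number of statements removed.
inline std::size_t removeOwnedCreatedAfter(
    std::deque<std::unique_ptr<Stmt>>& all,
    std::unordered_set<Stmt*>& owned,
    std::int64_t num_before) {
  std::int64_t kept = 0;
  const std::size_t size_before = all.size();
  auto new_end = std::remove_if(
      all.begin(), all.end(), [&](const std::unique_ptr<Stmt>& up) {
        Stmt* s = up.get();
        if (owned.count(s) == 0) {
          return false; // belongs to another owner: keep
        }
        if (kept < num_before) {
          ++kept;
          return false; // owned, but existed before the scope: keep
        }
        owned.erase(s); // drop the bookkeeping entry before the object dies
        return true; // owned and created inside the scope: remove
      });
  all.erase(new_end, all.end());
  return size_before - all.size();
}
```

With six statements where ids 0, 1, 3, and 5 are owned and `num_before` is 2, the call removes ids 3 and 5 and leaves 0, 1, 2, 4 in their original order.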

`statement_guard.cpp` — `StatementGuard` switches from `numValsExcludingShortcuts()` to `numVals()`. The old exclusion was a workaround for the LIFO pop-back assuming shortcut vals were at the front; the new ownership-filtered scan preserves pre-guard vals naturally.
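
The snapshot-and-rollback pattern behind `StatementGuard` can be illustrated with a much smaller RAII guard. `SizeGuard` below is a hypothetical analogue operating on a `std::vector<int>`; it is not the nvFuser class and ignores ownership filtering entirely.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical guard: records the element count at construction and
// truncates the vector back to that count on destruction.
class SizeGuard {
 public:
  explicit SizeGuard(std::vector<int>& items)
      : items_(items), prev_size_(items.size()) {}

  ~SizeGuard() {
    if (items_.size() > prev_size_) {
      items_.resize(prev_size_); // drop everything added inside the scope
    }
  }

  SizeGuard(const SizeGuard&) = delete;
  SizeGuard& operator=(const SizeGuard&) = delete;

 private:
  std::vector<int>& items_;
  std::size_t prev_size_;
};
```

Because the snapshot is the full element count, anything present before the guard was constructed survives the rollback without a special exclusion rule.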

`tests/cpp/test_segmentation.cpp` — Two new stress tests exercising the shared container under genuine multi-thread contention:

  • `SharedContainerStress8Segments`: linear chain with 7 `segment_set` boundaries (8 parallel segments)
  • `SharedContainerStress12ParallelBranches`: 4 inputs × 3 independent reductions each (≥6 parallel segments)

Design Note

The fast/slow split in `removeStatementsCreatedAfter` is intentional. The slow path's `std::erase_if` is O(N) in total container size, but it is only reached when the container is shared (i.e., during segmented execution) and only for statement rollback on error paths. The common success path does not call it.

@mdavis36
Collaborator Author

!test

@github-actions

github-actions bot commented Feb 19, 2026

Review updated until commit b1873c8

Description

  • Add protected Fusion constructor accepting existing shared_ptr for container sharing between segments

  • Fix statement cleanup in shared containers: use scan-based removal instead of LIFO pop when multiple fusions share container

  • Pre-allocate per_fusion maps in addFusion() to prevent rehash races during concurrent segment compilation

  • Add stress tests with 8 and 12 parallel segments to validate shared container under high lock contention

Changes walkthrough

Relevant files

Enhancement

fusion.h: Add shared container Fusion constructor
csrc/fusion.h (+4/-7)

  • Add protected Fusion(std::shared_ptr<IrContainer>) constructor declaration for container sharing
  • Remove numValsExcludingShortcuts() method declaration (no longer needed)

container.cpp: Pre-allocate per-fusion maps to prevent rehash races
csrc/ir/container.cpp (+2/-0)

  • Pre-allocate per_fusion_vals_ and per_fusion_exprs_ entries in addFusion() to prevent rehashing during concurrent access

fusion_segmenter.cpp: Use shared container in segmenter makeFusion
csrc/fusion_segmenter.cpp (+2/-1)

  • Use shared container constructor in makeFusion() so segments share completeFusion's IrContainer instead of creating new ones

Bug fix

fusion.cpp: Implement shared container constructor and fix statement cleanup
csrc/fusion.cpp (+94/-50)

  • Implement shared-container Fusion constructor
  • Rewrite removeStatementsCreatedAfter to handle shared containers: fast path for single fusion (LIFO), slow path for shared (scan-based removal)
  • Add nullOutShortcutIfNeeded helper for cleaning up shortcut val caches
  • Remove numValsExcludingShortcuts() implementation

statement_guard.cpp: Use numVals instead of numValsExcludingShortcuts
csrc/statement_guard.cpp (+1/-1)

  • Change from numValsExcludingShortcuts() to numVals() since shortcut handling is now in removeStatementsCreatedAfter

Tests

test_segmentation.cpp: Add stress tests for shared container with 8 and 12 segments
tests/cpp/test_segmentation.cpp (+84/-0)

  • Add SharedContainerStress8Segments: linear chain with 8 segments compiling in parallel
  • Add SharedContainerStress12ParallelBranches: 4 inputs x 3 independent reductions, 12 parallel segments

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Shortcut val handling change

    The StatementGuard now uses numVals() instead of numValsExcludingShortcuts().
    This is a significant behavioral change - shortcut vals (zero_val_, one_val_, etc.)
    are now included in the count. The nullOutShortcutIfNeeded function handles nulling
    these pointers during removal, but this logic should be verified to ensure shortcut
    vals are correctly tracked in per_fusion_vals_ and vals_up_ and that removal works
    as expected across both fast and slow paths.

    prev_num_exprs_(fusion_->numExprs()),
    prev_num_vals_(fusion_->numVals()) {}
    Slow path performance concern

    The slow path in removeStatementsCreatedAfter uses std::erase_if which is O(total
    statements in container). For shared containers with many statements, this could
    become a performance bottleneck during StatementGuard destruction. Consider if there's
    a more efficient approach or if this is acceptable for the intended use cases.

    } else {
      // Slow path: shared container — other Fusions' statements may be
      // interleaved at the tail of the global deques. Use std::erase_if
      // (C++20) to scan forward: skip the first num_before of self's
      // statements (old, to keep), then erase the remainder (added during
      // the guard scope). Entered whenever the container is shared,
      // regardless of success or failure; if no new statements were added
      // the scan completes trivially. O(total statements in container).
      int64_t exprs_kept = 0;
      std::erase_if(c->exprs_up_, [&](const std::unique_ptr<Expr>& e_up) {
        Expr* e = e_up.get();
        if (c->per_fusion_exprs_[self].count(e) == 0) {
          return false; // belongs to another Fusion — keep
        }
        if (exprs_kept < num_exprs_before) {
          ++exprs_kept;
          return false; // self's old expr — keep
        }
        // self's new expr — remove (clean up uses and index maps first)
        for (Val* out : e->outputs()) {
          out->setDefinition(nullptr);
        }
        for (Val* in : e->inputs()) {
          in->removeUse(e);
        }
        c->per_fusion_exprs_[self].erase(e);
        c->exprs_.erase(e);
        return true;
      });
    
      int64_t vals_kept = 0;
      std::erase_if(c->vals_up_, [&](const std::unique_ptr<Val>& v_up) {
        Val* v = v_up.get();
        if (c->per_fusion_vals_[self].count(v) == 0) {
          return false; // belongs to another Fusion — keep
        }
        if (vals_kept < num_vals_before) {
          ++vals_kept;
          return false; // self's old val — keep
        }
        // self's new val — remove (null shortcut cache pointer if applicable)
        nullOutShortcutIfNeeded(self, v);
        c->per_fusion_vals_[self].erase(v);
        c->vals_.erase(v);
        return true;
      });
    }
    Thread safety verification

    The PR introduces shared container support for parallel compilation. While the code
    uses mutex locks (unique_lock/shared_lock), the interaction between the fast path
    (single fusion) and slow path (multiple fusions) should be verified for race conditions,
    especially around the size check of sharing_fusions_ and the actual removal operations.

    if (c->sharing_fusions_.size() <= 1) {
      // Fast path: single Fusion owns this container, so the LIFO invariant
      // holds — self's newest statements are always at the global deque tail.
      // Remove expressions before values because we need to change Val::uses_.
      while (std::ssize(c->per_fusion_exprs_[self]) > num_exprs_before) {
        Expr* e = c->exprs_up_.back().get();
        NVF_ERROR(
            c->per_fusion_exprs_[self].count(e) > 0,
            "removeStatementsCreatedAfter: tail expr belongs to another Fusion");
        for (Val* out : e->outputs()) {
          out->setDefinition(nullptr);
        }
        for (Val* in : e->inputs()) {
          in->removeUse(e);
        }
        c->per_fusion_exprs_[self].erase(e);
        c->exprs_.erase(e);
        c->exprs_up_.pop_back();
      }
      while (std::ssize(c->per_fusion_vals_[self]) > num_vals_before) {
        Val* v = c->vals_up_.back().get();
        NVF_ERROR(
            c->per_fusion_vals_[self].count(v) > 0,
            "removeStatementsCreatedAfter: tail val belongs to another Fusion");
        nullOutShortcutIfNeeded(self, v);
        c->per_fusion_vals_[self].erase(v);
        c->vals_.erase(v);
        c->vals_up_.pop_back();
      }
    } else {
      // Slow path: shared container — other Fusions' statements may be
      // interleaved at the tail of the global deques. Use std::erase_if
      // (C++20) to scan forward: skip the first num_before of self's
      // statements (old, to keep), then erase the remainder (added during
      // the guard scope). Entered whenever the container is shared,
      // regardless of success or failure; if no new statements were added
      // the scan completes trivially. O(total statements in container).
      int64_t exprs_kept = 0;
      std::erase_if(c->exprs_up_, [&](const std::unique_ptr<Expr>& e_up) {
        Expr* e = e_up.get();
        if (c->per_fusion_exprs_[self].count(e) == 0) {
          return false; // belongs to another Fusion — keep
        }
        if (exprs_kept < num_exprs_before) {
          ++exprs_kept;
          return false; // self's old expr — keep
        }
        // self's new expr — remove (clean up uses and index maps first)
        for (Val* out : e->outputs()) {
          out->setDefinition(nullptr);
        }
        for (Val* in : e->inputs()) {
          in->removeUse(e);
        }
        c->per_fusion_exprs_[self].erase(e);
        c->exprs_.erase(e);
        return true;
      });
    
      int64_t vals_kept = 0;
      std::erase_if(c->vals_up_, [&](const std::unique_ptr<Val>& v_up) {
        Val* v = v_up.get();
        if (c->per_fusion_vals_[self].count(v) == 0) {
          return false; // belongs to another Fusion — keep
        }
        if (vals_kept < num_vals_before) {
          ++vals_kept;
          return false; // self's old val — keep
        }
        // self's new val — remove (null shortcut cache pointer if applicable)
        nullOutShortcutIfNeeded(self, v);
        c->per_fusion_vals_[self].erase(v);
        c->vals_.erase(v);
        return true;
      });
    }

    @mdavis36 mdavis36 force-pushed the md/phase2-thread-safety branch from 8fb976b to 31bccb9 Compare February 26, 2026 00:29
    @mdavis36 mdavis36 closed this Feb 26, 2026
    @mdavis36 mdavis36 force-pushed the md/segmenter-container-sharing branch from 504bde0 to 31bccb9 Compare February 26, 2026 00:29
    @mdavis36 mdavis36 reopened this Mar 2, 2026
    @github-actions

    github-actions bot commented Mar 2, 2026

    Description

    • Add protected Fusion constructor accepting existing shared_ptr for container sharing

    • Modify SegmentedFusion::makeFusion() to share completeFusion's container instead of creating new ones

    • Pre-allocate per_fusion_vals_/per_fusion_exprs_ in IrContainer::addFusion() to prevent rehash races

    • Add stress tests with 8 and 12 parallel segments to validate shared container under multi-thread contention

    Changes walkthrough

    Relevant files

    Enhancement

    fusion.h: Add protected shared-container Fusion constructor
    csrc/fusion.h (+4/-0)

    • Add protected Fusion constructor declaration accepting shared_ptr
    • Document constructor purpose for makeFusion container sharing

    fusion.cpp: Implement shared-container Fusion constructor
    csrc/fusion.cpp (+6/-0)

    • Implement shared-container constructor that registers Fusion with existing container

    fusion_segmenter.cpp: Use shared container in makeFusion
    csrc/fusion_segmenter.cpp (+2/-1)

    • Use shared container constructor in makeFusion() to share completeFusion's IrContainer

    Bug fix

    container.cpp: Pre-allocate fusion maps to prevent rehash races
    csrc/ir/container.cpp (+2/-0)

    • Pre-allocate per_fusion_vals_ and per_fusion_exprs_ in addFusion() to prevent rehash races

    Tests

    test_segmentation.cpp: Add stress tests for shared container sharing
    tests/cpp/test_segmentation.cpp (+84/-0)

    • Add SharedContainerStress8Segments test with 8 parallel segments in linear chain
    • Add SharedContainerStress12ParallelBranches test with 12 parallel branches

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ No major issues detected

    @mdavis36 mdavis36 force-pushed the md/phase2-thread-safety branch from 31bccb9 to 0a32c16 Compare March 3, 2026 22:18
    @mdavis36 mdavis36 force-pushed the md/segmenter-container-sharing branch from 7489fca to 9a5298f Compare March 3, 2026 22:19
    @mdavis36
    Collaborator Author

    mdavis36 commented Mar 3, 2026

    !test

    @mdavis36 mdavis36 force-pushed the md/phase2-thread-safety branch from 0a32c16 to b62d5ff Compare March 4, 2026 01:13
    @mdavis36 mdavis36 force-pushed the md/segmenter-container-sharing branch from 9a5298f to 8bd6641 Compare March 4, 2026 01:13
    @mdavis36
    Collaborator Author

    mdavis36 commented Mar 4, 2026

    !test

    @mdavis36 mdavis36 force-pushed the md/phase2-thread-safety branch from b62d5ff to 6eda820 Compare March 4, 2026 01:32
    @mdavis36 mdavis36 force-pushed the md/segmenter-container-sharing branch from 8bd6641 to f54045f Compare March 4, 2026 01:32
    @mdavis36
    Collaborator Author

    mdavis36 commented Mar 4, 2026

    !test

    @mdavis36 mdavis36 force-pushed the md/segmenter-container-sharing branch from f54045f to 53d4ac2 Compare March 4, 2026 01:48
    @mdavis36 mdavis36 force-pushed the md/phase2-thread-safety branch from 6eda820 to 23bddaf Compare March 4, 2026 01:48
    @mdavis36
    Collaborator Author

    mdavis36 commented Mar 4, 2026

    !test

    @mdavis36 mdavis36 force-pushed the md/phase2-thread-safety branch from 23bddaf to d4478b1 Compare March 5, 2026 00:08
    @mdavis36 mdavis36 force-pushed the md/segmenter-container-sharing branch from b1873c8 to fd3cb8b Compare March 5, 2026 00:09
    @mdavis36 mdavis36 marked this pull request as ready for review March 5, 2026 00:10
    @greptile-apps
    Contributor

    greptile-apps bot commented Mar 5, 2026

    Greptile Summary

    This PR wires the segmenter into Phase 2's shared IrContainer infrastructure by making makeFusion() create segment Fusion objects with the existing complete-fusion container rather than fresh containers. The key additions are a protected shared-container constructor on Fusion, a fast/slow path split in ContainerMutator::removeStatementsCreatedAfter to handle LIFO-breaking interleaving under shared ownership, a nullOutShortcutIfNeeded helper replacing the deleted numValsExcludingShortcuts() method, and pre-allocation of per-fusion map entries in addFusion() to prevent rehash races at first registration.

    Key changes:

    • fusion.cpp / fusion.h: Protected Fusion(shared_ptr<IrContainer>) constructor; fast path preserves original LIFO pop-back; slow path uses std::erase_if with ownership filtering for shared containers.
    • fusion_segmenter.cpp: makeFusion() uses the shared-container constructor; raw new is required because make_unique cannot access protected constructors even through friendship, though a comment explaining this is missing.
    • ir/container.cpp: addFusion() pre-inserts per_fusion_vals_/per_fusion_exprs_ keys under the write lock; inline comment exceeds typical line-length limits.
    • statement_guard.cpp: Correctly switches to numVals() since the slow-path scan makes the shortcut exclusion workaround unnecessary.
    • tests/cpp/test_segmentation.cpp: Two new stress tests; SharedContainerStress8Segments uses explicit segment_set boundaries (reliable assertion), while SharedContainerStress12ParallelBranches's EXPECT_GE(6) may be fragile if the segmenter merges compatible reductions across inputs.
    • Subtle logic issue: Both slow-path erase_if predicates access inner sets via operator[] on the outer unordered_map. If the fusion key was erased by a prior clear() and no statements were re-registered before the guard's destructor fires, operator[] silently inserts a spurious empty entry, re-populating a key removeStatementsOwnedBy intentionally removed. Using .find() instead would avoid this side-effect.

    Confidence Score: 3/5

    • PR is generally sound but carries a subtle operator[] side-effect in the slow path and missing documentation for a non-obvious raw-new usage; should be addressed before merging.
    • The overall design is well-reasoned and the threading model (unique_lock wrapping ContainerMutator) is correct. However, the slow-path erase_if predicates use operator[] rather than find(), which can silently re-insert a fusion key that removeStatementsOwnedBy deliberately erased — a semantics mismatch that could mask bugs in future refactors. The missing comment on the raw-new pattern in makeFusion() is a maintainability risk. The EXPECT_GE(6) assertion in the 12-parallel-branches test may produce a false failure if the segmenter is improved. None of these are crash-level issues on the success path, but the operator[] concern is a correctness edge case on the rollback path.
    • csrc/fusion.cpp (slow-path erase_if operator[] side-effect at lines 248 and 270); csrc/fusion_segmenter.cpp (unexplained raw-new at line 1804)

    Important Files Changed

    | Filename | Overview |
    |---|---|
    | csrc/fusion.cpp | Adds shared-container constructor, fast/slow path split in removeStatementsCreatedAfter, and nullOutShortcutIfNeeded helper. The slow-path erase_if predicates access per_fusion_exprs_/per_fusion_vals_ via operator[], which can insert spurious empty entries when the fusion key has been erased by a prior clear(). |
    | csrc/fusion.h | Removes numValsExcludingShortcuts() and adds the protected shared-container constructor declaration. Changes are minimal and correct. |
    | csrc/fusion_segmenter.cpp | makeFusion() now creates segment Fusions with the shared-container constructor instead of a fresh Fusion. Raw new is used correctly to bypass the protected-constructor / make_unique limitation, but lacks an explanatory comment. |
    | csrc/ir/container.cpp | addFusion() pre-inserts per_fusion_vals_ and per_fusion_exprs_ keys under the write lock. The inline comment exceeds typical line-length limits and should be reformatted. |
    | csrc/statement_guard.cpp | Switches from numValsExcludingShortcuts() to numVals(). The new ownership-filtered scan in the slow path makes the shortcut-exclusion workaround unnecessary; change is correct. |
    | tests/cpp/test_segmentation.cpp | Adds two stress tests. SharedContainerStress8Segments uses 7 explicit segment_set boundaries, so SizeIs(8) is reliable. SharedContainerStress12ParallelBranches uses EXPECT_GE(6) which may be fragile if the segmenter merges compatible reductions across inputs. |

    Sequence Diagram

    sequenceDiagram
        participant SF as SegmentedFusion
        participant F as Fusion (segment)
        participant IC as IrContainer (shared)
        participant CG as StatementGuard
    
        SF->>F: new Fusion(completeFusion()->ir_container_ptr())
        F->>IC: addFusion(this) [unique_lock]
        IC-->>IC: sharing_fusions_.insert(fusion)
        IC-->>IC: per_fusion_vals_[fusion] pre-insert
        IC-->>IC: per_fusion_exprs_[fusion] pre-insert
    
        SF->>F: Fusion::copy(completeFusion(), fusion_segment)
        F->>F: clear() → removeStatementsOwnedBy(self)
        Note over IC: pre-allocated keys erased here
    
        loop clone each Val/Expr
            F->>IC: registerVal / registerExpr [unique_lock]
            IC-->>IC: per_fusion_vals_[fusion].insert(v)
        end
    
        Note over SF,F: Parallel compilation per segment
    
        opt Error rollback path only
            CG->>F: ~StatementGuard()
            F->>IC: removeStatementsCreatedAfter [unique_lock]
            alt sharing_fusions_.size() <= 1 (fast path)
                IC-->>IC: LIFO pop_back on deques
            else sharing_fusions_.size() > 1 (slow path)
                IC-->>IC: std::erase_if scan — skip other Fusions' stmts
                IC-->>IC: remove only self's new stmts (O(N) total)
            end
        end
    
        F->>IC: removeFusion(this) [unique_lock] on ~Fusion()
    

    Last reviewed commit: deb0a38

    Comment on lines +156 to +157
    per_fusion_vals_[fusion]; // Pre-insert key so no outer-map rehash occurs during concurrent val/expr registration
    per_fusion_exprs_[fusion];

    Pre-allocation immediately negated by Fusion::copy -> clear()

    The comment claims this pre-allocation prevents outer-map rehash during concurrent val/expr registration. However, the call chain in makeFusion immediately undoes this:

    new Fusion(container)  →  addFusion()  →  per_fusion_vals_[fusion_segment] = {}  // pre-allocated ✓
    Fusion::copy(completeFusion(), fusion_segment.get())
      →  fusion_segment->clear()
         →  ir_container_->removeStatementsOwnedBy(fusion_segment)
            →  per_fusion_vals_.erase(vals_it)   // key REMOVED! pre-allocation lost
    

    After removeStatementsOwnedBy erases the pre-allocated key, the very first registerVal call inside Fusion::copy executes per_fusion_vals_[fusion_segment].insert(val) with operator[] on a missing key, causing a new key insertion into the outer unordered_map. This new insertion can trigger a rehash, which is exactly what the pre-allocation was supposed to prevent.

    The race window is: valsOwnedBy() acquires shared_lock, returns a const& into per_fusion_vals_[X], releases the lock, and the caller then calls std::ssize on that reference. Between the lock release and std::ssize, another thread can insert the fusion_segment key (causing rehash), invalidating the reference — UB.

    A minimal fix that preserves the pre-allocation invariant is to clear the inner set in removeStatementsOwnedBy instead of erasing the outer key:

    // In removeStatementsOwnedBy, instead of:
    per_fusion_vals_.erase(vals_it);
    // use:
    vals_it->second.clear(); // keep the key; prevents re-insertion + rehash

    This does require a separate cleanup in removeFusion (which already holds unique_lock) to erase the now-empty key after the fusion is fully removed from sharing_fusions_.
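
The difference between the two cleanup strategies can be shown on a plain map of sets. The alias `PerOwner` and both helper functions are hypothetical, and integer keys stand in for `Fusion*`; this is a sketch of the suggestion, not the nvFuser implementation.

```cpp
#include <unordered_map>
#include <unordered_set>

using PerOwner = std::unordered_map<int, std::unordered_set<int>>;

// Variant 1: erases the outer key. The next operator[] for this owner has
// to insert a new element into the outer map.
inline void dropOwnerEntry(PerOwner& m, int owner) {
  m.erase(owner);
}

// Variant 2: empties the inner set but keeps the outer key, so the outer
// map's element count does not change and no re-insertion is needed later.
inline void clearOwnerEntry(PerOwner& m, int owner) {
  auto it = m.find(owner);
  if (it != m.end()) {
    it->second.clear();
  }
}
```

With the second variant the key has to be erased somewhere else once the owner is gone for good, which matches the note above about doing that in `removeFusion`.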

    Comment on lines +1107 to +1109
    }

    FusionExecutorCache executor_cache(std::move(fusion));

    EXPECT_GE(..., 6) assertion may be too strong and cause a false test failure

    The comment itself acknowledges uncertainty: "the segmenter may merge some compatible reductions." With 4 inputs × 3 axes, the segmenter can legally merge all axis-0 reductions (from all 4 inputs) into one segment, all axis-1 reductions into another, and all axis-2 reductions into a third — producing only 3 segments, not 6.

    If the segmenter is ever improved to merge reductions across inputs (a valid and correct optimisation), the assertion will evaluate as EXPECT_GE(3, 6) and fail, even though the shared-container infrastructure is working perfectly. The stress test's correctness goal is already validated by testValidate; the segment-count assertion adds fragility without adding safety.

    Consider removing the segment-count assertion entirely, or replacing it with just EXPECT_GT(runtime->fusionSegments()->groups().size(), 1) to verify at least some segmentation occurred (which is what the test is designed to exercise):

    Suggested change
    }
    FusionExecutorCache executor_cache(std::move(fusion));
    EXPECT_GT(runtime->fusionSegments()->groups().size(), 1u);

    …rContainer
    
    Add protected Fusion constructor that accepts an existing
    shared_ptr<IrContainer>, allowing SegmentedFusion::makeFusion() to share
    the complete Fusion's container instead of creating a new one per segment.

    Changes:
    - fusion.h: Add protected Fusion(shared_ptr<IrContainer>) constructor declaration
    - fusion.cpp: Implement the shared-container constructor
    - ir/container.cpp: Pre-allocate per_fusion_vals_/per_fusion_exprs_ in addFusion()
      to prevent rehash races during concurrent segment compilation
    - fusion_segmenter.cpp: Use shared container constructor in makeFusion()

    This enables real cross-thread lock contention on the IrContainer mutex during
    parallel segment compilation, validating Phase 2's shared_ptr + mutex
    infrastructure under realistic multi-Fusion sharing conditions.
    mdavis36 and others added 4 commits March 4, 2026 17:29
    Two new SegmentationTest cases that exercise the shared IrContainer
    under higher concurrent load than existing tests (max 3 segments):
    
    - SharedContainerStress8Segments: linear chain with 7 segment_set
      boundaries, 8 segments compiling in parallel
    - SharedContainerStress12ParallelBranches: 4 inputs x 3 independent
      reductions each, at least 6 segments compiling in parallel
    
    Both pass with correct results, validating the shared_ptr + mutex
    infrastructure under real multi-thread contention.
    Statements cleaned up by statement guard need to be popped from the
    specific fusion only, not the entire IrContainer.
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    @mdavis36 mdavis36 force-pushed the md/segmenter-container-sharing branch from 13d90fa to deb0a38 Compare March 5, 2026 01:36
    Comment on lines +248 to +271
        if (c->per_fusion_exprs_[self].count(e) == 0) {
          return false; // belongs to another Fusion — keep
        }
        if (exprs_kept < num_exprs_before) {
          ++exprs_kept;
          return false; // self's old expr — keep
        }
        // self's new expr — remove (clean up uses and index maps first)
        for (Val* out : e->outputs()) {
          out->setDefinition(nullptr);
        }
        for (Val* in : e->inputs()) {
          in->removeUse(e);
        }
        c->per_fusion_exprs_[self].erase(e);
        c->exprs_.erase(e);
        return true;
      });

      int64_t vals_kept = 0;
      std::erase_if(c->vals_up_, [&](const std::unique_ptr<Val>& v_up) {
        Val* v = v_up.get();
        if (c->per_fusion_vals_[self].count(v) == 0) {
          return false; // belongs to another Fusion — keep

    operator[] on outer map can insert spurious empty entries

    Both slow-path predicates access the inner per-fusion sets via operator[] on the outer unordered_map:

    if (c->per_fusion_exprs_[self].count(e) == 0) { … }
    // and later:
    if (c->per_fusion_vals_[self].count(v) == 0) { … }

    operator[] on an unordered_map performs a key insertion (with a default-constructed value) when the key is absent. The key for self can legitimately be absent here because Fusion::copy() → to->clear() → removeStatementsOwnedBy(to) erases it, and it is only re-inserted by the first registerExpr/registerVal call. If a StatementGuard is destroyed in the zero-statements case (guard scope added nothing and the fusion was just cleared), the slow path will be entered (c->sharing_fusions_.size() > 1) and the predicate will call operator[] on a missing key, inserting a spurious empty set.

    The behavior is still logically correct (count returns 0, predicate returns false, nothing is erased), but it silently re-populates a key that removeStatementsOwnedBy purposely removed. Use .find() to avoid the side-effect:

    auto exprs_it = c->per_fusion_exprs_.find(self);
    std::erase_if(c->exprs_up_, [&](const std::unique_ptr<Expr>& e_up) {
        Expr* e = e_up.get();
        if (exprs_it == c->per_fusion_exprs_.end() ||
            exprs_it->second.count(e) == 0) {
            return false; // belongs to another Fusion — keep
        }
        …
    });

    And similarly for the vals_up_ pass.
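
The side effect is easy to reproduce in isolation. In the sketch below, `OwnerIndex`, `ownsItem`, and `ownsItemInserting` are hypothetical names, and integer keys stand in for the `Fusion*` and statement pointers of the real maps.

```cpp
#include <unordered_map>
#include <unordered_set>

using OwnerIndex = std::unordered_map<int, std::unordered_set<int>>;

// Membership test that never modifies the outer map.
inline bool ownsItem(const OwnerIndex& index, int owner, int item) {
  auto it = index.find(owner);
  return it != index.end() && it->second.count(item) > 0;
}

// Same answer, but operator[] inserts an empty set when `owner` is absent.
inline bool ownsItemInserting(OwnerIndex& index, int owner, int item) {
  return index[owner].count(item) > 0;
}
```

Both functions return `false` for an unknown owner, but only the second leaves a new empty entry behind in the outer map. The `find()` version also works on a `const` map, which makes the absence of side effects visible in the signature.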

    Comment on lines +156 to +157
    per_fusion_vals_[fusion]; // Pre-insert key so no outer-map rehash occurs during concurrent val/expr registration
    per_fusion_exprs_[fusion];

    Overly long inline comment

    The inline comment on line 156 is very long (well past 80–100 characters), which will likely fail the project's clang-format or line-length lint checks:

    per_fusion_vals_[fusion];   // Pre-insert key so no outer-map rehash occurs during concurrent val/expr registration

    Consider breaking it into a preceding block comment:

    Suggested change
    - per_fusion_vals_[fusion]; // Pre-insert key so no outer-map rehash occurs during concurrent val/expr registration
    - per_fusion_exprs_[fusion];
    + // Pre-insert keys so no outer-map rehash occurs during concurrent
    + // val/expr registration by segment Fusions sharing this container.
    + per_fusion_vals_[fusion];
    + per_fusion_exprs_[fusion];

    Comment on lines +1804 to +1805
    auto fusion_segment =
    std::unique_ptr<Fusion>(new Fusion(completeFusion()->ir_container_ptr()));

    Raw new without explanatory comment

    Using new Fusion(...) directly instead of std::make_unique<Fusion>(...) is the correct approach here — make_unique cannot access protected constructors even through friendship — but nothing in the code explains this to future maintainers. A one-line comment would prevent a well-intentioned "fix" that breaks compilation:

    Suggested change
    - auto fusion_segment =
    -     std::unique_ptr<Fusion>(new Fusion(completeFusion()->ir_container_ptr()));
    + // The shared-container constructor is protected; use raw new via friendship
    + // rather than std::make_unique (which cannot access protected ctors).
    + auto fusion_segment =
    +     std::unique_ptr<Fusion>(new Fusion(completeFusion()->ir_container_ptr()));
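
The language rule behind this can be shown with a small stand-alone example. `Pool`, `Widget`, and `WidgetFactory` are hypothetical classes that mirror the shape of the situation (a protected constructor taking a shared resource, and a friend that builds instances); they are not nvFuser types.

```cpp
#include <memory>
#include <utility>

class Pool {}; // hypothetical shared resource

class Widget {
 public:
  Widget() : pool_(std::make_shared<Pool>()) {}
  const std::shared_ptr<Pool>& pool() const {
    return pool_;
  }

 protected:
  // Only derived classes and friends may construct a Widget that shares
  // an existing Pool.
  explicit Widget(std::shared_ptr<Pool> pool) : pool_(std::move(pool)) {}

 private:
  friend class WidgetFactory;
  std::shared_ptr<Pool> pool_;
};

class WidgetFactory {
 public:
  static std::unique_ptr<Widget> makeSharing(const Widget& parent) {
    // std::make_unique<Widget>(parent.pool()) would not compile here: the
    // constructor call happens inside std::make_unique, which is not a
    // friend of Widget. The new-expression below is evaluated in the
    // friend's own scope, so access is granted.
    return std::unique_ptr<Widget>(new Widget(parent.pool()));
  }
};
```

Wrapping the pointer in `std::unique_ptr` in the same expression keeps ownership as safe as the `make_unique` form would have been.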
