Skip to content

Skip direct-dominating live-in optimization for loop headers#281

Merged
guosran merged 5 commits intomainfrom
fix/nested-loop-live-in-bug
Mar 10, 2026
Merged

Skip direct-dominating live-in optimization for loop headers#281
guosran merged 5 commits intomainfrom
fix/nested-loop-live-in-bug

Conversation

@guosran
Copy link
Collaborator

@guosran guosran commented Mar 8, 2026

Summary

This pr addresses issue #270.

Problem

In nested loop kernels, values defined in outer loop blocks and used in inner loop bodies were incorrectly classified as "direct dominating live-ins" by CanonicalizeLiveInPass. This prevented them from being promoted to block arguments, which in turn caused TransformCtrlToDataFlowPass to not create inner-rate PHI_START operations for them.

Fix

Before applying the direct-dominating optimization, check whether the using block is a loop header. If so, skip the optimization and let the value be promoted to a block argument.

Result

Before the fix, %14 (the outer-rate GEP) has no reserve/phi_start for the pointer. It is fed directly into the inner-loop body via a plain data_mov, with no rate-matching:

// Outer-rate GEP — runs once per outer iteration (index_per_ii=3)
%14 = "neura.gep"(%13) <{operandSegmentSizes = array<i32: 0, 1>}> {dfg_id = 42 : i32,
    lhs_value = "%arg0",
    mapping_locs = [{id = 0 : i32, index_per_ii = 3 : i32, ...}]}
    : (!neura.data<i64, i1>) -> !neura.data<!llvm.ptr, i1>

// No reserve + phi_start for the pointer — the value is forwarded as-is
%30 = "neura.data_mov"(%14) {dfg_id = 46 : i32,
    mapping_locs = [{id = 0 : i32, index_per_ii = 3 : i32, ...}]}
    : (!neura.data<!llvm.ptr, i1>) -> !neura.data<!llvm.ptr, i1>

// Inner-loop GEP directly consumes the outer-rate data_mov (index_per_ii=4)
%32 = "neura.gep"(%30, %31) <{operandSegmentSizes = array<i32: 1, 1>}> {dfg_id = 49 : i32,
    mapping_locs = [{id = 1 : i32, index_per_ii = 4 : i32, ...}]}
    : (!neura.data<!llvm.ptr, i1>, !neura.data<i64, i1>) -> !neura.data<!llvm.ptr, i1>

After the fix, %14 is promoted to a block argument. A reserve + phi_start pair is generated, giving the pointer an inner-rate PHI_START

// Outer-rate GEP — same as before (index_per_ii=3)
%14 = "neura.gep"(%13) <{operandSegmentSizes = array<i32: 0, 1>}> {dfg_id = 43 : i32,
    lhs_value = "%arg0",
    mapping_locs = [{id = 0 : i32, index_per_ii = 3 : i32, ...}]}
    : (!neura.data<i64, i1>) -> !neura.data<!llvm.ptr, i1>

// reserve + phi_start — inner-rate slot for the pointer (NEW)
%24 = neura.reserve {dfg_id = 8 : i32} : !neura.data<!llvm.ptr, i1>
%25 = "neura.data_mov"(%14) {dfg_id = 47 : i32,
    mapping_locs = [{id = 1 : i32, index_per_ii = 3 : i32, ...}]}
    : (!neura.data<!llvm.ptr, i1>) -> !neura.data<!llvm.ptr, i1>
%26 = neura.phi_start %25, %24 {dfg_id = 50 : i32,
    mapping_locs = [{id = 4 : i32, index_per_ii = 4 : i32, ...}]}
    : !neura.data<!llvm.ptr, i1>, !neura.data<!llvm.ptr, i1> -> !neura.data<!llvm.ptr, i1>

// Inner-loop GEP now consumes the rate-matched phi_start output (index_per_ii=6)
%35 = "neura.gep"(%33, %34) <{operandSegmentSizes = array<i32: 1, 1>}> {dfg_id = 63 : i32,
    mapping_locs = [{id = 5 : i32, index_per_ii = 6 : i32, ...}]}
    : (!neura.data<!llvm.ptr, i1>, !neura.data<i64, i1>) -> !neura.data<!llvm.ptr, i1>

Copilot AI review requested due to automatic review settings March 8, 2026 05:01
@guosran
Copy link
Collaborator Author

guosran commented Mar 8, 2026

Commits from the previous unmerged pr is shown here. This pr should start from 1f60123 (the first failed commit).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an SSA live-in canonicalization bug impacting nested loops by preventing the “direct-dominating live-in” optimization from firing on loop headers, enabling correct promotion to block arguments and subsequent inner-rate PHI_START creation during ctrl-to-data-flow lowering. It also renames/rewires mapping and task placement passes and updates affected tests/docs accordingly.

Changes:

  • Update CanonicalizeLiveInPass to skip the direct-dominating-live-in optimization for loop headers (fix for #270).
  • Rename MapToAcceleratorPassMapOperationOnTilePass and update pass registration, build files, docs, and tests.
  • Replace MapTaskOnCgraPass with AllocateCgraToTaskPass and invoke global placement from ResourceAwareTaskOptimizationPass, updating taskflow tests.

Reviewed changes

Copilot reviewed 43 out of 46 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
test/neura/steer_ctrl/loop_with_return_value.mlir Updates disabled mapping pass invocation name in test pipeline comments.
test/neura/fusion/test.mlir Updates mapping pass name and adjusts FileCheck IDs after mapping changes.
test/neura/for_loop/relu_test.mlir Updates mapping pass flag name in RUN line.
test/neura/ctrl/branch_for.mlir Updates mapping pass flag name in RUN lines.
test/multi-cgra/taskflow/resource-heavy/resource-heavy.mlir Splits long RESOPT attribute checks for readability/stability.
test/multi-cgra/taskflow/resnet/simple_resnet_tosa.mlir Splits long RESOPT attribute checks for readability/stability.
test/multi-cgra/taskflow/parallel-nested/parallel-nested.mlir Switches to --allocate-cgra-to-task and updates RESOPT/placement checks.
test/multi-cgra/taskflow/multi-nested/multi-nested.mlir Switches to --allocate-cgra-to-task and updates placement + RESOPT expectations.
test/multi-cgra/taskflow/irregular-loop/irregular-loop.mlir Switches to --allocate-cgra-to-task and splits RESOPT checks.
test/multi-cgra/kernel_mapping/relu/relu.mlir Updates mapping pass flag name in RUN line.
test/multi-cgra/kernel_mapping/loop-in-kernel/loop-in-kernel.mlir Updates mapping pass flag name in RUN line.
test/multi-cgra/kernel_mapping/fir/fir.mlir Updates mapping pass flag name in RUN line.
test/mapping_quality/tiny_loop.mlir Updates mapping pass flag name in RUN lines.
test/mapping_quality/branch_for.mlir Updates mapping pass flag name in RUN lines.
test/honor_arch/fir_removed_tiles_test.mlir Updates mapping pass flag name in RUN line.
test/e2e/relu/relu_kernel.mlir Updates mapping pass flag name in RUN line.
test/e2e/histogram/histogram_kernel.mlir Updates mapping pass flag name in RUN line.
test/e2e/gemv/gemv_kernel.mlir Updates mapping pass flag name and refreshes extensive mapping/YAML/ASM FileCheck expectations.
test/e2e/fir/fir_kernel_vec.mlir Updates mapping pass flag name in RUN line.
test/e2e/fir/fir_kernel.mlir Updates mapping pass flag name in RUN line.
test/e2e/axpy/axpy_kernel.mlir Updates mapping pass flag name in RUN line.
test/controflow_fuse/simple_loop_reduction/simple_loop_reduction.mlir Updates mapping pass flag name in RUN line.
test/controflow_fuse/simple_loop/simple_loop.mlir Updates mapping pass flag name in RUN line.
test/controflow_fuse/perfect_nested/perfect_nested.mlir Updates mapping pass flag name in RUN line.
test/code_gen/test_code_generate.mlir Updates mapping pass flag name in RUN line.
test/c2llvm2mlir/simple_loop/test.mlir Updates mapping pass flag name in RUN line.
test/c2llvm2mlir/nested_loop/test.mlir Updates mapping pass flag name in RUN line.
test/arch_spec/README.md Updates documentation to use --map-operation-on-tile.
lib/TaskflowDialect/Transforms/Optimizations/ResourceAwareTaskOptimizationPass.cpp Renames mapper references and invokes AllocateCgraToTask-based placement at convergence.
lib/TaskflowDialect/Transforms/CMakeLists.txt Swaps in AllocateCgraToTaskPass.cpp and adds Optimizations subdirectory.
lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp Implements new task+SRAM placement pass with multi-CGRA placement search.
lib/TaskflowDialect/CMakeLists.txt Adjusts subdirectory structure so Optimizations is included via Transforms.
lib/NeuraDialect/Transforms/MapOperationOnTilePass.cpp Renames mapper pass (class/argument/logs) and updates factory function.
lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp Skips direct-dominating-live-in optimization when the using block is a loop header.
lib/NeuraDialect/Transforms/CMakeLists.txt Replaces MapToAcceleratorPass.cpp with MapOperationOnTilePass.cpp.
lib/NeuraDialect/NeuraPasses.cpp Updates conversion pipeline to run createMapOperationOnTilePass().
include/TaskflowDialect/TaskflowPasses.td Renames pass definition to allocate-cgra-to-task.
include/TaskflowDialect/TaskflowPasses.h Updates pass factory name and adds runAllocateCgraToTask() helper declaration.
include/NeuraDialect/NeuraPasses.td Renames pass definition to map-operation-on-tile and updates constructor name.
include/NeuraDialect/NeuraPasses.h Renames pass factory API to createMapOperationOnTilePass.
include/NeuraDialect/Architecture/Architecture.h Updates reference in comment to new mapper pass name.
Comments suppressed due to low confidence (4)

lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp:285

  • placement.primary() returns {-1,-1} when placement.cgra_positions is empty, but the code still pushes it into task_node->placement. That can result in invalid task_mapping_info (row/col = -1) and can skew the SRAM centroid computation in later iterations. Handle the no-placement case explicitly (e.g., signal pass failure / emit an error, or keep the task unannotated and skip SRAM assignment) before committing any placement.
    lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp:281
  • When a placement can’t be found for cgra_count, the pass falls back to cgra_count-1 but does not update the task’s cgra_count (or tile_shape) attribute accordingly. That can leave IR in an inconsistent state where cgra_count disagrees with the number of entries in task_mapping_info.cgra_positions, which downstream passes may rely on. Consider either failing the pass, or updating the attributes to match the chosen fallback placement (and possibly iterating fallback down to 1).
    lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp:432
  • parseTileShapeOffsets() is introduced but never used, and the placement search currently ignores each task’s tile_shape attribute (even though other code paths/comments imply placement should respect it). If tile_shape is intended to constrain the physical shape, wire this helper into findBestPlacement() (read the task’s tile_shape StringAttr, validate it matches cgra_count, then call tryPlaceShape() with those offsets). Otherwise, remove the dead helper / update comments to reflect that the allocator chooses shapes autonomously.
    lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp:553
  • The non-rectangular shape search uses a 64-bit bitmask (1ULL << (nr * grid_cols_ + nc)). If grid_rows * grid_cols >= 64, this shift becomes undefined and the mask cannot represent all cells. Add a guard/assert limiting the supported grid size (e.g., 4x4) or switch to a wider/dynamic bitset representation for visited states.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@guosran guosran force-pushed the fix/nested-loop-live-in-bug branch from 1f60123 to fa5fb36 Compare March 8, 2026 05:31
Copy link
Collaborator

@Jackcuii Jackcuii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

work perfectly on simulator, good job bro!

Values crossing from outer blocks to inner loop headers were marked as
direct-dominating live-ins, preventing block-arg promotion. This caused
missing inner-rate PHI_STARTs in the dataflow IR, starving inner-loop
operations of valid data each cycle.

Added back-edge check: if the using block has a predecessor it dominates
(i.e. it is a loop header), the value is promoted to a block argument.
@guosran guosran force-pushed the fix/nested-loop-live-in-bug branch from fa5fb36 to 19f5088 Compare March 9, 2026 18:09
@tancheng
Copy link
Contributor

tancheng commented Mar 9, 2026

Summary

This pr addresses issue #270.

Problem

In nested loop kernels, values defined in outer loop blocks and used in inner loop bodies were incorrectly classified as "direct dominating live-ins" by CanonicalizeLiveInPass. This prevented them from being promoted to block arguments, which in turn caused TransformCtrlToDataFlowPass to not create inner-rate PHI_START operations for them.

Fix

Before applying the direct-dominating optimization, check whether the using block is a loop header. If so, skip the optimization and let the value be promoted to a block argument.

Result

After the fix, the GEP pointer is correctly promoted to a block argument in ^bb4:

^bb1:
  %6 = neura.gep(%arg0, shl(i, 4))
  neura.br ..., %6 : !llvm.ptr, ... to ^bb4

^bb4(%15: i64, %16: i32, %17: !llvm.ptr, ...):  // %17 is now a block arg
  gep(%17, %15)

And TransformCtrlToDataFlowPass creates an inner-rate PHI_START for it:

%9  = neura.gep(%arg0, shl(i, 4))          // outer-rate GEP
%17 = neura.phi_start %9, %16              // inner-rate PHI_START for the pointer
%22 = neura.gep(%17, %21)                  // inner-loop body uses inner-rate pointer

TODO

Have not updated all failed checks yet, will do so once the changes are reviewed and approved. gemv has been updated.

Can we also provide the IRs about what it would look like w/o this fix (in the Problem section of the PR description)?

@guosran
Copy link
Collaborator Author

guosran commented Mar 9, 2026

Can we also provide the IRs about what it would look like w/o this fix (in the Problem section of the PR description)?

Updated.

@guosran guosran merged commit dc75783 into main Mar 10, 2026
1 check passed
@guosran guosran deleted the fix/nested-loop-live-in-bug branch March 10, 2026 01:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants