Emit magic-number division for dynamic kernels #1222

Open
adedespirlet wants to merge 7 commits into iree-org:main from adedespirlet:magic_number

adedespirlet commented Apr 1, 2026

This PR introduces the magic-number trick to replace costly arith.divui operations whose divisors are dynamic in our kernels.
Integer division (divui) is significantly slower than multiplication and bitwise shifts. For a static divisor, the LLVM backend can replace the division with a simple bit shift (when the divisor is a power of 2) or with the magic-number trick itself. When the divisor is dynamic, LLVM cannot apply this optimization, so this PR does the backend's job and emits the magic-number sequence for dynamic divisors directly.
The optimization is applied only when a dynamic divisor is used two or more times. Because precomputing the magic number itself requires one division, this threshold ensures a net performance gain.
The magic-number computation is hoisted to the kernel entry block so it can be reused throughout the entire kernel.
Subsequent divisions by the same dynamic divisor are replaced with a mulhi (multiply-high), a shift, and a correction step.
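
To make the sequence concrete, here is a minimal plain-Python sketch of the arithmetic, not the emitter code from this PR. It assumes the magic number is ceil(2^32 / d), a 32-bit unsigned dividend, and a divisor in the range 2 <= d <= 2^31; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wrap-around


def precompute_magic(d: int) -> int:
    """One-time cost: ceil(2^32 / d). This is the only real division."""
    assert 2 <= d <= (1 << 31), "sketch assumes 2 <= d <= 2^31"
    return ((1 << 32) + d - 1) // d


def mulhi_u32(a: int, b: int) -> int:
    """High 32 bits of the 64-bit product (widen, multiply, shift right by 32)."""
    return (a * b) >> 32


def magic_divmod(n: int, d: int, magic: int) -> tuple:
    """Return (n // d, n % d) using multiply-high plus one correction step."""
    assert 0 <= n <= MASK32
    q = mulhi_u32(n, magic)          # estimate, may be one too large
    r = (n - q * d) & MASK32         # wraps to a huge value on overestimate
    if r >= d:                       # unsigned compare detects the wrap
        q -= 1
        r = (r + d) & MASK32
    return q, r
```

The magic number is computed once per divisor, and every later division by that divisor reuses it, which is why the approach only pays off when a divisor appears at least twice.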

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>
@adedespirlet
Contributor Author

4 lit tests fail because they use dynamic symbols but their FileCheck lines don't account for the magic-number IR. Two options:

  1. Add magic_number_div=False to those tests to preserve their existing checks.
  2. Update the FileCheck lines to match the new IR.

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>
adedespirlet requested a review from panditsa on April 9, 2026 18:42
    i32 = IntegerType.get_signless(32)
    return arith_d.trunci(i32, hi_i64)

def _precompute_magic_number(divisor_index: Value):

Contributor

For d = 1, this equals 2^32, which doesn't fit in i32.

For d = 1: magic = (2^32 + 0) / 1 = 2^32, which truncates to 0 in i32. Then _mulhi_u32(n, 0) = 0 for all n, so n // 1 would return 0 instead of n.
For d = 2: magic = (2^32 + 1) / 2 = 2147483648, which fits in i32, so this is fine.
For d = 3: magic = (2^32 + 2) / 3 = 1431655766, which is also fine, and so on.

Dynamic divisors derived from kernel dimensions (BLOCK_M, BLOCK_N, etc.) may always be > 1, but it is better to have a guard here.
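
The truncation can be reproduced with plain Python integers. This is only an illustration of the arithmetic above, assuming magic = (2^32 + d - 1) / d stored in 32 bits; `magic_i32` and `mulhi_u32` here are stand-in helpers, not the functions from the diff.

```python
MASK32 = 0xFFFFFFFF


def magic_i32(d: int) -> int:
    """(2^32 + d - 1) // d, truncated to 32 bits as an i32 value would be."""
    return (((1 << 32) + d - 1) // d) & MASK32


def mulhi_u32(a: int, b: int) -> int:
    """High 32 bits of the 64-bit product of two 32-bit unsigned values."""
    return (a * b) >> 32


magics = {d: magic_i32(d) for d in (1, 2, 3)}

# With d = 1 the magic wraps to 0, so the estimated quotient is 0
# for every dividend, instead of the dividend itself.
quotient_estimate_d1 = mulhi_u32(12345, magics[1])
```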

Contributor Author

Right. I added cheap IR that checks at runtime whether the divisor is 1; if it is, the dividend is used as the quotient and the remainder is set to 0.
This adds only 3 instructions, so it should be cheap.
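
A branch-free guard of this kind could look roughly like the sketch below: one compare and two selects. This is an assumption about the shape of the guard, not the IR actually added in the PR; `select` and `guarded_divmod` are hypothetical names, and the magic-path results are passed in as plain integers.

```python
def select(cond: bool, if_true: int, if_false: int) -> int:
    """Stand-in for a select op: pick one of two already-computed values."""
    return if_true if cond else if_false


def guarded_divmod(n: int, d: int, q_magic: int, r_magic: int) -> tuple:
    """Override the magic-path results when the divisor is 1."""
    is_one = d == 1                  # compare
    q = select(is_one, n, q_magic)   # quotient of n / 1 is n
    r = select(is_one, 0, r_magic)   # remainder of n / 1 is 0
    return q, r
```

For d = 1 the values coming out of the magic path are meaningless, but they are simply discarded by the selects, so no branch is needed.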

@@ -261,7 +261,13 @@ def get_static_dim(s: Optional[IndexExpr]) -> int:
        return func_op

    def emit(self, graph: Optional[fx.Graph] = None) -> Operation:
        global _magic_number_enabled, _magic_number_cache, _magic_entry_block, _magic_divisor_first_seen

Contributor

Module-level globals are set at R642 and then reset here.
This is not thread-safe: the caches are read-write, per-compilation state.
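
One possible way to avoid the shared globals is to keep the bookkeeping on an object owned by each emitter. The sketch below is only an illustration of that idea; `MagicNumberState` and this simplified `Emitter` are hypothetical and do not reflect the actual class layout in the PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Set, Tuple


@dataclass
class MagicNumberState:
    """Hypothetical per-compilation container for magic-number bookkeeping."""
    enabled: bool = True
    cache: Dict[Tuple, Any] = field(default_factory=dict)         # divisor key -> magic value
    entry_block: Optional[Any] = None                             # where precomputation is hoisted
    divisor_first_seen: Set[Tuple] = field(default_factory=set)   # divisors seen so far


class Emitter:
    """Hypothetical emitter that owns its state instead of using module globals."""

    def __init__(self, magic_number_div: bool = True):
        self.magic_number_div = magic_number_div
        self.magic = MagicNumberState(enabled=magic_number_div)

    def emit(self) -> None:
        # Start every compilation with fresh state; nothing is shared
        # between emitters running on different threads.
        self.magic = MagicNumberState(enabled=self.magic_number_div)
```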

c = get_const_val(val)
if c is not None:
    return ("const", c)
return ("val", id(val))

Contributor

Please double-check this: id(val) is the memory address of the Python object. Two structurally identical OpResult values that happen to be different Python wrapper objects would get different keys, which defeats caching.
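
The effect can be shown with a toy wrapper class. `ValueWrapper` below is a hypothetical stand-in for a Python wrapper around an IR value, not the MLIR binding type; the sketch only contrasts an identity-based key with a key that relies on `__eq__`/`__hash__`.

```python
class ValueWrapper:
    """Hypothetical wrapper; equality is defined by the underlying handle."""

    def __init__(self, ir_handle: int):
        self.ir_handle = ir_handle

    def __eq__(self, other) -> bool:
        return isinstance(other, ValueWrapper) and self.ir_handle == other.ir_handle

    def __hash__(self) -> int:
        return hash(self.ir_handle)


# Two distinct Python objects that wrap the same underlying value.
a = ValueWrapper(0x1000)
b = ValueWrapper(0x1000)

key_by_id_a, key_by_id_b = ("val", id(a)), ("val", id(b))
key_by_value_a, key_by_value_b = ("val", a), ("val", b)

cache = {key_by_value_a: "magic for this divisor"}
hit = cache.get(key_by_value_b)    # found: keys compare equal
miss = {key_by_id_a: "magic"}.get(key_by_id_b)  # not found: ids differ
```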

arg_keys = []
for a in val.args:
    c = get_const_val(a) if isinstance(a, OpResult) else None
    arg_keys.append(("const", c) if c is not None else ("val", id(a)))

Contributor

Same issue here: ("val", id(a)) uses the Python object's memory address as the cache key for non-constant args.

return ("val", id(val))
return ("other", id(val))

def _should_use_magic(rhs_expr) -> bool:

Contributor

For discussion:
On the first encounter, the divisor is recorded and the magic path is declined.
On the second encounter, the magic path is approved and the precomputation is done. This means the first division uses the slow arith.divui path, while the second and subsequent divisions use the fast magic path.

This creates an asymmetry. If a divisor appears in exactly one floordiv followed by one mod (a common pattern for delinearization), the floordiv gets the slow path and only the mod gets the fast path.
The floordiv result is computed via affine apply (which lowers to arith.divsi), and the mod is then computed via magic. The two results will be inconsistent.

Do you think a two-pass design would be a cleaner approach?

  1. scan all expressions first to count divisor occurrences,
  2. then emit with magic for any divisor that appears 2+ times.
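
The difference between the two strategies can be sketched on a toy expression list. Everything here is hypothetical: expressions are modeled as (op, lhs, divisor_key) tuples, and the functions only decide which divisions would take the magic path.

```python
from collections import Counter


def one_pass_plan(exprs):
    """First-seen strategy: decline on first encounter, approve afterwards."""
    seen, plan = set(), []
    for _op, _lhs, divisor in exprs:
        plan.append(divisor in seen)
        seen.add(divisor)
    return plan


def two_pass_plan(exprs, min_uses: int = 2):
    """Pass 1 counts divisor uses; pass 2 approves every use of a frequent divisor."""
    counts = Counter(divisor for _op, _lhs, divisor in exprs)
    return [counts[divisor] >= min_uses for _op, _lhs, divisor in exprs]


# Delinearization pattern: floordiv then mod by the same dynamic divisor N,
# plus a single division by an unrelated divisor K.
exprs = [("floordiv", "i", "N"), ("mod", "i", "N"), ("floordiv", "j", "K")]
```

With the two-pass plan, both the floordiv and the mod by N take the same path, so the asymmetry described above does not arise.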

r_i32 = arith_d.subi(n_i32, qd_i32)
# Correction: ceil(2^32/d) can overestimate quotient by 1.
# Detect via unsigned remainder >= divisor (wraps on overestimate).
too_big = arith_d.cmpi(arith_d.CmpIPredicate.uge, r_i32, d_i32)

Contributor

The correction subtracts 1 from the quotient and adds d to the remainder when r >= d. This is correct for the ceil magic formula when d > 0. If d = 0 (division by zero), the divui in the precomputation has undefined behavior. That possibility already exists without this change, so adding a note about it would be enough.
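
A short derivation shows why a single correction step suffices. It is a sketch under stated assumptions: the magic is $m = \lceil 2^{32}/d \rceil$ as in the code comment, the dividend satisfies $0 \le n < 2^{32}$, and $2 \le d \le 2^{31}$; the symbols $m$ and $e$ are introduced here for illustration.

```latex
\begin{align*}
m &= \left\lceil \frac{2^{32}}{d} \right\rceil = \frac{2^{32} + e}{d},
  \qquad 0 \le e < d \\
\frac{n\,m}{2^{32}} &= \frac{n}{d} + \frac{n\,e}{d\,2^{32}},
  \qquad 0 \le \frac{n\,e}{d\,2^{32}} < 1 \\
\left\lfloor \frac{n}{d} \right\rfloor
  &\le \left\lfloor \frac{n\,m}{2^{32}} \right\rfloor
  \le \left\lfloor \frac{n}{d} \right\rfloor + 1
\end{align*}
```

When the estimate is one too large, the product of the estimate and d exceeds n by at most d, so the 32-bit difference wraps to at least $2^{32} - d \ge d$, which is exactly what the unsigned r >= d comparison detects.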

"""Compute (quotient, remainder) of lhs_val // rhs via mulhi.

Uses unsigned 32-bit arithmetic (extui, divui, shrui, uge).
Requires both operands to be non-negative and fit in 32 bits.

Contributor

Nit: the docstring says "requires both operands to be non-negative" but doesn't explain why the code uses unsigned arithmetic when the rest of the emitter uses signed ops (arith.divsi / arith.remsi).

xintin commented Apr 10, 2026

The algorithm looks correct. Before we enable this by default, can you share perf numbers across a range of tile sizes and shapes? Maybe:

- Small M/N, where index computation is a larger fraction of runtime
- Large M/N, where compute dominates
- Varying BLOCK_M/BLOCK_N (64, 128, 256), to see how the number of dynamic divisions changes
- With and without workgroup reordering (GROUP_SIZE_N), since that's the primary source of dynamic divisions

A table like (M, N, K, BLOCK_M, BLOCK_N) | baseline TFLOPS vs magic TFLOPS would help.

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>