Emit magic-number division for dynamic kernels by adedespirlet · Pull Request #1222 · iree-org/wave

adedespirlet · 2026-04-01T02:03:28Z

This PR introduces the magic-number trick to replace costly arith.divui operations on dynamic divisors in our kernels.
Integer division (divui) is significantly slower than multiplication and bitwise shifts. For a static divisor, the LLVM backend already replaces the division with either a simple bit shift (when the divisor is a power of 2) or the magic-number trick itself. When the divisor is dynamic, LLVM cannot apply this optimization, so this PR does the backend's job and emits the magic-number sequence for dynamic divisors directly.
The optimization is applied only when a dynamic divisor is used two or more times: precomputing the magic number itself costs one division, so this threshold ensures a net performance gain.
The magic-number computation is also hoisted to the kernel entry block so that it can be reused throughout the entire kernel.
Subsequent divisions by the same dynamic divisor are replaced with a mulhi (multiply-high), a shift, and a correction step.
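
The sequence described above can be modelled in plain Python. This is a minimal sketch, not the emitter code from this PR: the helper names (`precompute_magic`, `mulhi_u32`, `magic_divmod`) are illustrative, integers are masked to 32 bits to imitate i32 wraparound, and it assumes 0 <= n < 2^32 and 2 <= d <= 2^31 (the d == 1 case needs separate handling).

```python
MASK32 = 0xFFFFFFFF  # imitate 32-bit wraparound with Python's unbounded ints


def precompute_magic(d: int) -> int:
    """One-time setup, done once per divisor: ceil(2**32 / d) kept in 32 bits."""
    return (((1 << 32) + d - 1) // d) & MASK32


def mulhi_u32(a: int, b: int) -> int:
    """Upper 32 bits of the 64-bit product (the multiply plus the shift)."""
    return ((a & MASK32) * (b & MASK32)) >> 32


def magic_divmod(n: int, d: int, magic: int) -> tuple[int, int]:
    """Division-free (quotient, remainder) using a precomputed magic number."""
    q = mulhi_u32(n, magic)          # estimate: exact or one too high
    r = (n - q * d) & MASK32         # wraps around when q is one too high
    if r >= d:                       # unsigned compare detects the wrap
        q -= 1
        r = (r + d) & MASK32
    return q, r


magic_for_7 = precompute_magic(7)    # hoisted: computed once, reused below
results = [magic_divmod(n, 7, magic_for_7) for n in (0, 6, 7, 100, 2**32 - 1)]
```

Each call to `magic_divmod` uses only a multiply, a shift, a subtract, and a compare, which is where the saving over repeated divui comes from once the one-time division in `precompute_magic` has been paid.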

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet · 2026-04-09T00:15:04Z

Four lit tests fail because they use dynamic symbols but their FileCheck lines don't account for the magic-number IR. Two options:
1) Add magic_number_div=False to those tests to preserve their existing checks.
2) Update the FileCheck lines to match the new IR.

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

xintin · 2026-04-10T18:54:00Z

+        i32 = IntegerType.get_signless(32)
+        return arith_d.trunci(i32, hi_i64)
+
+    def _precompute_magic_number(divisor_index: Value):


For d = 1, this equals 2^32, which doesn't fit in i32.

For d = 1: magic = (2^32 + 0) / 1 = 2^32, which truncates to 0 in i32. Then _mulhi_u32(n, 0) = 0 for all n, so n // 1 would return 0 instead of n.
For d = 2: magic = (2^32 + 1) / 2 = 2147483648, which fits in 32 bits as an unsigned value, so this is fine.
For d = 3: magic = (2^32 + 2) / 3 = 1431655766, which is fine, and so on.

Dynamic divisors derived from kernel dimensions (BLOCK_M, BLOCK_N, etc.) may always be > 1, but it is better to have a guard here.

Right, I added cheap IR that checks at runtime whether the divisor is 1; if it is, it picks the dividend as the quotient and sets the remainder to 0.
This adds only 3 instructions, so it should be cheap.
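
To make the d = 1 failure and the guard concrete, here is a small Python model. It is a sketch under assumptions: the function names are illustrative, and the guard is modelled as one compare plus two selects, which is a guess at what the "3 instructions" are rather than a copy of the emitted IR.

```python
MASK32 = 0xFFFFFFFF  # imitate i32 truncation


def magic_u32(d: int) -> int:
    """ceil(2**32 / d), truncated to 32 bits as the trunci in the PR would do."""
    return (((1 << 32) + d - 1) // d) & MASK32


def unguarded_divmod(n: int, d: int) -> tuple[int, int]:
    """Magic-number divmod without any special case for d == 1."""
    q = (n * magic_u32(d)) >> 32
    r = (n - q * d) & MASK32
    if r >= d:
        q = (q - 1) & MASK32
        r = (r + d) & MASK32
    return q, r


def guarded_divmod(n: int, d: int) -> tuple[int, int]:
    """Same computation, followed by a compare and two selects for d == 1."""
    q, r = unguarded_divmod(n, d)    # meaningless when d == 1
    is_one = d == 1                  # models a single integer compare
    q = n if is_one else q           # models a select on the quotient
    r = 0 if is_one else r           # models a select on the remainder
    return q, r


truncated_magic = magic_u32(1)       # 2**32 does not survive truncation
fixed = guarded_divmod(12345, 1)
```

Because the guard is expressed as selects rather than a branch, the magic path still executes for d == 1 and its result is simply discarded.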

xintin · 2026-04-10T18:56:58Z

@@ -261,7 +261,13 @@ def get_static_dim(s: Optional[IndexExpr]) -> int:
        return func_op

    def emit(self, graph: Optional[fx.Graph] = None) -> Operation:
+        global _magic_number_enabled, _magic_number_cache, _magic_entry_block, _magic_divisor_first_seen


This sets module-level globals at R642 and then resets them here.
That is not thread-safe: the caches are read-write, per-compilation state.
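
One possible shape for this, sketched with hypothetical names (`MagicDivState`, `Emitter`, `_emit_body` are stand-ins, not the real emitter classes): bundle the four globals into a state object that each emit() call creates and passes down, so concurrent compilations never share it.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class MagicDivState:
    """Hypothetical per-compilation state replacing the module-level globals."""
    enabled: bool = False
    cache: dict = field(default_factory=dict)               # divisor key -> magic value
    entry_block: Optional[Any] = None                       # where precomputation is hoisted
    divisor_first_seen: dict = field(default_factory=dict)  # divisor key -> first use


class Emitter:
    """Stand-in for the real emitter; only the state handling is illustrated."""

    def __init__(self, magic_number_div: bool = False):
        self._magic_number_div = magic_number_div

    def emit(self) -> MagicDivState:
        # Fresh state per call: nothing to reset afterwards, nothing shared.
        state = MagicDivState(enabled=self._magic_number_div)
        self._emit_body(state)
        return state

    def _emit_body(self, state: MagicDivState) -> None:
        # Placeholder for real codegen; helpers would take `state` explicitly.
        if state.enabled:
            state.cache["d0"] = "magic(d0)"


first = Emitter(magic_number_div=True).emit()
second = Emitter(magic_number_div=False).emit()
```

Passing the state explicitly also removes the need for the `global` statement and the reset step, since the object is dropped when emit() returns.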

xintin · 2026-04-10T18:59:10Z

+            c = get_const_val(val)
+            if c is not None:
+                return ("const", c)
+            return ("val", id(val))


Check this once: id(val) is the memory address of the Python object. Two structurally identical OpResult values that happen to be different Python wrapper objects would get different keys, which defeats caching.
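
The effect can be reproduced without MLIR. In the sketch below, `FakeValue` is a hypothetical stand-in for a binding wrapper, and `key_by_id` / `key_by_value` are illustrative helpers; whether the real Value type can be used directly as a dictionary key depends on it supporting equality and hashing, which is assumed here rather than verified.

```python
class FakeValue:
    """Hypothetical wrapper: several Python objects may wrap one IR value."""

    def __init__(self, handle: int):
        self.handle = handle  # stands in for the underlying IR value identity

    def __eq__(self, other: object) -> bool:
        return isinstance(other, FakeValue) and self.handle == other.handle

    def __hash__(self) -> int:
        return hash(self.handle)


def key_by_id(val: FakeValue) -> tuple:
    """Key used in the diff: depends on which wrapper object is passed in."""
    return ("val", id(val))


def key_by_value(val: FakeValue) -> tuple:
    """Alternative: key on the value itself, relying on __eq__ and __hash__."""
    return ("val", val)


a, b = FakeValue(7), FakeValue(7)   # two wrappers around the same IR value

cache = {key_by_value(a): "magic for value 7"}
hit = cache.get(key_by_value(b))    # found: the keys compare equal
miss = {key_by_id(a): "magic"}.get(key_by_id(b))  # not found: ids differ
```

A second hazard with id() is that CPython can reuse an id once the original object has been garbage-collected, so an id-based key could in principle also produce a false hit.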

xintin · 2026-04-10T19:03:15Z

+            arg_keys = []
+            for a in val.args:
+                c = get_const_val(a) if isinstance(a, OpResult) else None
+                arg_keys.append(("const", c) if c is not None else ("val", id(a)))


Similar issue here:
("val", id(a)) uses the Python object's memory address as the cache key for non-constant args.

xintin · 2026-04-10T19:06:39Z

+            return ("val", id(val))
+        return ("other", id(val))
+
+    def _should_use_magic(rhs_expr) -> bool:


For discussion:
On the first encounter, the divisor is recorded and the magic path is declined.
On the second, the magic path is approved and the precomputation is done. This means the first division uses the slow arith.divui path, while the second and subsequent divisions use the fast magic path.

This creates an asymmetry. If a divisor appears in exactly one floordiv followed by one mod (a common pattern for delinearization), the floordiv gets the slow path and only the mod gets the fast path.
The floordiv result is computed via affine apply (which lowers to arith.divsi), and then the mod is computed via magic. The two results will be inconsistent.

Do you think a cleaner approach would be a two-pass design?

1) Scan all expressions first to count divisor occurrences.
2) Then emit with magic for any divisor that appears 2+ times.
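
A rough sketch of what the two passes could look like, using plain tuples in place of real index expressions. Everything here is hypothetical: the `(kind, lhs, divisor)` representation, the function names, and the string labels for the chosen lowering.

```python
from collections import Counter


def plan_magic_divisors(ops: list[tuple[str, str, str]], min_uses: int = 2) -> set[str]:
    """Pass 1: count how often each divisor key appears across all ops."""
    counts = Counter(divisor for _kind, _lhs, divisor in ops)
    return {divisor for divisor, uses in counts.items() if uses >= min_uses}


def emit_all(ops: list[tuple[str, str, str]]) -> list[tuple[str, str, str, str]]:
    """Pass 2: every use of a qualifying divisor takes the same (magic) path."""
    use_magic = plan_magic_divisors(ops)
    return [
        (kind, lhs, divisor, "magic" if divisor in use_magic else "divui")
        for kind, lhs, divisor in ops
    ]


# Delinearization pattern: one floordiv and one mod by the same divisor N,
# plus a single unrelated division by K.
ops = [("floordiv", "i", "N"), ("mod", "i", "N"), ("floordiv", "j", "K")]
plan = emit_all(ops)
```

With the decision made before emission, the floordiv and the mod by N both take the magic path, so the first-use asymmetry disappears.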

xintin · 2026-04-10T19:11:22Z

+        r_i32 = arith_d.subi(n_i32, qd_i32)
+        # Correction: ceil(2^32/d) can overestimate quotient by 1.
+        # Detect via unsigned remainder >= divisor (wraps on overestimate).
+        too_big = arith_d.cmpi(arith_d.CmpIPredicate.uge, r_i32, d_i32)


The correction subtracts 1 from the quotient and adds d to the remainder when r >= d. This is correct for the ceil magic formula when d > 0. If d = 0 (division by zero), the divui in the precomputation would produce undefined behavior. That possibility already exists without this change, so we can just add a note about it.
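
A short derivation of why a single correction step is enough, assuming 0 <= n < 2^32 and 1 <= d <= 2^31. The symbols m (the untruncated magic number), e (its rounding excess), and \hat{q} (the mulhi estimate) are introduced here for illustration only.

```latex
\begin{aligned}
m &= \left\lceil \frac{2^{32}}{d} \right\rceil
  \;\Longrightarrow\; m\,d = 2^{32} + e, \qquad 0 \le e < d, \\[4pt]
\frac{n\,m}{2^{32}} &= \frac{n}{d} + \frac{n\,e}{d \cdot 2^{32}},
  \qquad 0 \le \frac{n\,e}{d \cdot 2^{32}} < 1, \\[4pt]
\hat{q} &= \left\lfloor \frac{n\,m}{2^{32}} \right\rfloor
  \in \left\{ \left\lfloor \frac{n}{d} \right\rfloor,\;
              \left\lfloor \frac{n}{d} \right\rfloor + 1 \right\}.
\end{aligned}
```

When the estimate is one too high, n - \hat{q} d equals (n mod d) - d, which is negative and wraps to 2^32 + (n mod d) - d >= 2^32 - d. That is at least d whenever d <= 2^31, so the unsigned r >= d test fires. When the estimate is exact, the remainder is n mod d < d and the test does not fire.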

xintin · 2026-04-10T19:13:39Z

+        """Compute (quotient, remainder) of lhs_val // rhs via mulhi.
+
+        Uses unsigned 32-bit arithmetic (extui, divui, shrui, uge).
+        Requires both operands to be non-negative and fit in 32 bits.


Nit: the docstring says "requires both operands to be non-negative" but doesn't explain why the code uses unsigned operations when the rest of the emitter uses signed ones (arith.divsi / arith.remsi).
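
For context on why the non-negativity requirement matters, the sketch below models unsigned and signed 32-bit division on raw bit patterns. The helper names are illustrative; the signed model rounds toward zero, which is how arith.divsi is documented to behave.

```python
MASK32 = 0xFFFFFFFF


def as_signed32(bits: int) -> int:
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits


def divui_model(a_bits: int, b_bits: int) -> int:
    """Unsigned division on the raw bit patterns."""
    return (a_bits & MASK32) // (b_bits & MASK32)


def divsi_model(a_bits: int, b_bits: int) -> int:
    """Signed division rounding toward zero, returned as a 32-bit pattern."""
    a, b = as_signed32(a_bits), as_signed32(b_bits)
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q & MASK32


same_for_non_negative = divui_model(100, 7) == divsi_model(100, 7)
neg_100 = (-100) & MASK32  # bit pattern of -100 as an i32
differ_for_negative = divui_model(neg_100, 7) != divsi_model(neg_100, 7)
```

For non-negative operands the two agree, so swapping signed for unsigned ops is safe under the docstring's precondition; for a negative dividend the bit pattern is read as a large unsigned number and the results diverge.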

xintin · 2026-04-10T19:17:23Z

The algorithm looks correct. Before we enable this by default, can you share perf numbers across a range of tile sizes and shapes? Maybe:

- Small M/N, where index computation is a larger fraction of runtime
- Large M/N, where compute dominates
- Varying BLOCK_M/BLOCK_N (64, 128, 256), to see how the number of dynamic divisions changes
- With and without workgroup reordering (GROUP_SIZE_N), since that's the primary source of dynamic divisions

A table like (M, N, K, BLOCK_M, BLOCK_N) | baseline TFLOPS vs magic TFLOPS would help.

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet added 4 commits April 1, 2026 02:01

add magic number trick for dynamic kernels

fc2fbe4

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

use compile options instead

64d1bfd

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

make sure to hoist the magic number computation before loop

e0160c6

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

expect 2 divisors in test instead of 4 due to caching

a0ac1e0

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet requested review from Hardcode84, harsh-nod and xintin April 9, 2026 00:15

adedespirlet force-pushed the magic_number branch from 455d5f5 to b546c19 Compare April 9, 2026 00:17

automatically detect if magic number trick is worth it

07a6063

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet force-pushed the magic_number branch from b546c19 to 07a6063 Compare April 9, 2026 00:19

set magic pass back to default off

421c49e

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet requested a review from panditsa April 9, 2026 18:42

xintin requested changes Apr 10, 2026


add guard when d==1

78cf1e8

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet wants to merge 7 commits into iree-org:main from adedespirlet:magic_number

2 participants
