Emit magic-number division for dynamic kernels #1222
Conversation
Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>
4 lit tests fail because they use dynamic symbols but their FileCheck lines don't account for the magic-number IR. Two options: 1) add magic_number_div=False to those tests to preserve their existing checks, or 2) update the FileCheck lines to match the new IR.
    i32 = IntegerType.get_signless(32)
    return arith_d.trunci(i32, hi_i64)


def _precompute_magic_number(divisor_index: Value):
For d = 1, this equals 2^32, which doesn't fit in i32.
For d = 1: magic = (2^32 + 0) / 1 = 2^32, truncated to i32 = 0. Then _mulhi_u32(n, 0) = 0 for all n, so n // 1 would return 0 instead of n.
For d = 2: magic = (2^32 + 1) / 2 = 2147483648, which fits in i32 and is fine.
For d = 3: magic = (2^32 + 2) / 3 = 1431655766, also fine, and so on.
Maybe dynamic divisors derived from kernel dimensions (BLOCK_M, BLOCK_N, etc.) are always > 1, but it is better to have a guard here.
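The values above can be checked with a minimal sketch of the ceil-variant precomputation (the helper name `magic_u32` is illustrative, not from the PR; the masking mirrors the trunci to i32):

```python
def magic_u32(d: int) -> int:
    # ceil(2^32 / d), then truncated to 32 bits as arith.trunci would do.
    return (((1 << 32) + d - 1) // d) & 0xFFFFFFFF

assert magic_u32(1) == 0            # 2^32 truncates to 0: the broken case
assert magic_u32(2) == 2147483648   # fits in 32 bits
assert magic_u32(3) == 1431655766   # fits in 32 bits
```

The d = 1 line makes the failure mode concrete: any mulhi against a magic of 0 yields 0 for every dividend.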
Right, I added cheap IR that checks at runtime whether the divisor == 1; if it is, the dividend is picked as the quotient and the remainder is set to 0.
This adds only 3 instructions, so it should be cheap.
@@ -261,7 +261,13 @@ def get_static_dim(s: Optional[IndexExpr]) -> int:
        return func_op

    def emit(self, graph: Optional[fx.Graph] = None) -> Operation:
        global _magic_number_enabled, _magic_number_cache, _magic_entry_block, _magic_divisor_first_seen
These module-level globals are set at R642 and then reset here.
This is not thread-safe: the caches are read-write per-compilation state, so concurrent compilations would race on them.
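One way to address this, sketched below, is to carry the state per compilation instead of at module level. The field names follow the globals listed in the diff, but the `MagicDivState` class itself is an assumption, not code from the PR:

```python
from dataclasses import dataclass, field

@dataclass
class MagicDivState:
    """Per-compilation carrier for what are currently module globals (sketch)."""
    enabled: bool = True
    cache: dict = field(default_factory=dict)          # divisor key -> magic value
    divisor_first_seen: set = field(default_factory=set)
    entry_block: object = None

# Each emit() call would construct its own state instead of mutating globals,
# so concurrent compilations cannot observe each other's caches.
state = MagicDivState()
state.divisor_first_seen.add(("val", 1))
assert ("val", 1) in state.divisor_first_seen
```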
    c = get_const_val(val)
    if c is not None:
        return ("const", c)
    return ("val", id(val))
Check this once: id(val) is the memory address of the Python object. Two structurally identical OpResult values that happen to be different Python wrapper objects would get different keys, which defeats caching.
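The pitfall is easy to demonstrate with a stand-in wrapper class (this `OpResult` is a mock for illustration, not the MLIR binding type):

```python
class OpResult:
    """Stand-in for a binding wrapper around an underlying SSA value."""
    def __init__(self, handle):
        self.handle = handle

# Two distinct wrapper objects for the same underlying value:
a = OpResult(0xDEAD)
b = OpResult(0xDEAD)

key_a = ("val", id(a))
key_b = ("val", id(b))
assert key_a != key_b  # cache miss even though the values are identical
```

Keying on a stable identity (e.g. the underlying handle) rather than the wrapper's `id()` would make the cache hit.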
    arg_keys = []
    for a in val.args:
        c = get_const_val(a) if isinstance(a, OpResult) else None
        arg_keys.append(("const", c) if c is not None else ("val", id(a)))
Similar to the above: ("val", id(a)) uses the Python object's memory address as the cache key for non-constant args.
        return ("val", id(val))
    return ("other", id(val))


def _should_use_magic(rhs_expr) -> bool:
For discussion:
On the first encounter, the divisor is recorded and magic apply is declined. On the second, magic is approved and the precomputation is done. This means the first division uses the slow arith.divui path while the second and subsequent ones use the fast magic path.
This creates an asymmetry. If a divisor appears in exactly one floordiv followed by one mod (a common pattern for delinearization), the floordiv gets the slow path and only the mod gets the fast path.
The floordiv result is computed via affine apply (which lowers to arith.divsi), then the mod is computed via magic. The two results will be inconsistent.
Do you think a cleaner approach would be a two-pass design?
- scan all expressions first to count divisor occurrences,
- then emit with magic for any divisor that appears 2+ times.
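The first pass of that design could be as simple as the sketch below (the expression tuples and `collect_divisors` helper are illustrative assumptions, not the PR's actual IR walk):

```python
from collections import Counter

def collect_divisors(exprs):
    """First pass (sketch): count how often each dynamic divisor appears
    across floordiv/mod expressions, so the emitter can decide up front
    which divisors earn the magic path on their very first use."""
    counts = Counter()
    for op, _dividend, divisor in exprs:
        if op in ("floordiv", "mod"):
            counts[divisor] += 1
    return {d for d, n in counts.items() if n >= 2}

# A divisor shared by a floordiv/mod pair qualifies before anything is emitted:
exprs = [("floordiv", "tid", "BLOCK_M"),
         ("mod", "tid", "BLOCK_M"),
         ("floordiv", "tid", "BLOCK_N")]
assert collect_divisors(exprs) == {"BLOCK_M"}
```

With the set computed up front, both halves of a floordiv/mod pair take the same path, which also removes the divsi-vs-magic inconsistency noted above.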
    r_i32 = arith_d.subi(n_i32, qd_i32)
    # Correction: ceil(2^32/d) can overestimate quotient by 1.
    # Detect via unsigned remainder >= divisor (wraps on overestimate).
    too_big = arith_d.cmpi(arith_d.CmpIPredicate.uge, r_i32, d_i32)
The correction subtracts 1 from the quotient and adds d to the remainder when r >= d. This is correct for the ceil magic formula when d > 0. If d = 0 (division by zero), the divui in the precomputation would produce undefined behavior; that is an existing undefined-behavior possibility, but we could add a note about it.
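A small Python model of the scheme under review, including the correction step and the d == 1 guard from the earlier thread (the function name and structure are illustrative, not the PR's code):

```python
MASK32 = (1 << 32) - 1

def magic_divmod_u32(n: int, d: int):
    """Model of (n // d, n % d) via the ceil magic-number method."""
    if d == 1:                                    # runtime guard: magic would truncate to 0
        return n, 0
    magic = (((1 << 32) + d - 1) // d) & MASK32   # ceil(2^32 / d); fits in u32 for d >= 2
    q = (n * magic) >> 32                         # mulhi_u32(n, magic); may overshoot by 1
    r = (n - q * d) & MASK32                      # i32 subi: wraps when q overshot
    if r >= d:                                    # unsigned uge detects the wrap
        q -= 1                                    # subtract 1 from the quotient
        r = (r + d) & MASK32                      # add d back: recovers the true remainder
    return q, r

# Exhaustive spot check against Python's exact // and %:
for n in (0, 1, 5, 255, 1 << 20, (1 << 32) - 1):
    for d in (1, 2, 3, 7, 640, 65537):
        assert magic_divmod_u32(n, d) == (n // d, n % d)
```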
    """Compute (quotient, remainder) of lhs_val // rhs via mulhi.

    Uses unsigned 32-bit arithmetic (extui, divui, shrui, uge).
    Requires both operands to be non-negative and fit in 32 bits.
Nit: the docstring says "requires both operands to be non-negative" but doesn't explain why unsigned arithmetic is used here when the rest of the emitter uses signed ops (arith.divsi / arith.remsi).
|
The algorithm looks correct. Before we enable this by default, can you share perf numbers across a range of tile sizes and shapes? For example, small M/N where index computation is a larger fraction of runtime.
This PR introduces the magic number trick to replace costly arith.divui operations when dealing with dynamic divisors in our kernels.
Standard integer division (divui) is significantly slower than multiplication and bitwise shifts. The LLVM backend can replace a division by a static divisor with either a simple bit shift (when the divisor is a power of 2) or the magic number trick itself. However, when the divisor is dynamic, LLVM can't optimize this. So the purpose of this PR is to do the backend's job and apply the magic number trick to dynamic divisors directly.
The optimization is applied only when a dynamic divisor is used two or more times. Since the precomputation of the magic number itself requires one division, this ensures a net performance gain.
We also make sure the magic number computation is hoisted to the kernel entry block so it can be reused throughout the entire kernel.
Subsequent divisions using the same dynamic divisor are replaced with a mulhi (multiply-high), a shift, and a correction step.
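The multiply-high step named above can be modeled in a few lines of Python; the 64-bit widen/shift/truncate sequence mirrors the extui/shrui/trunci ops shown in the diff (the helper name is illustrative):

```python
def mulhi_u32(a: int, b: int) -> int:
    # Widen both operands to 64 bits, multiply, keep the high 32 bits.
    return ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)) >> 32

# With the magic for d = 3 (ceil(2^32 / 3) = 1431655766), mulhi alone
# already yields the quotient for these dividends:
assert mulhi_u32(10, 1431655766) == 3   # 10 // 3
assert mulhi_u32(9, 1431655766) == 3    # 9 // 3
```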