Fix int32 overflow in matvec row offset and gather-MM batch stride #3609

Open
aicayzer wants to merge 1 commit into ml-explore:main from aicayzer:fix/3591-matmul-int32-overflow

Conversation

@aicayzer

Fixes #3591 (the size-overflow half — see notes below on the second reported issue).

Problem

gemv.metal's matvec kernel computes the matrix row offset as out_row * matrix_ld in int32, then advances the mat pointer. For large matrices the product exceeds the int32 range (2^31 − 1) and wraps, so the kernel reads the wrong rows and silently returns corrupt results (no crash, no throw). The reporter's case is a 12347 × 174000 fp32 matrix * vector:

  • 12347 × 174000 = 2,148,378,000 (≈ 2.148e9) > 2^31 = 2,147,483,648 (≈ 2.147e9)
  • Relative error 0.06–0.25, intermittent run-to-run (the high bit's value depends on the buffer's allocated address modulo the wrap)
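The arithmetic behind the reporter's shape can be checked with a short Python sketch. It is illustrative only: `wrap_int32` is a hypothetical helper that emulates two's-complement wraparound, and the sketch assumes `matrix_ld` equals the column count (a contiguous row-major matrix) and that `out_row` can reach the last row index. Kernel tiling may change which rows are actually affected.

```python
# Illustrative only: emulates 32-bit signed wraparound with Python integers.
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Return the value a two's-complement int32 would hold after wrapping."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


ROWS, COLS = 12347, 174000  # reporter's shape; COLS stands in for matrix_ld

total_elements = ROWS * COLS
# Smallest row index whose offset (row * COLS) no longer fits in int32.
first_bad_row = INT32_MAX // COLS + 1
last_row_offset = (ROWS - 1) * COLS
wrapped_offset = wrap_int32(last_row_offset)

print(total_elements)                    # 2148378000
print(first_bad_row)                     # 12342
print(last_row_offset, wrapped_offset)   # 2148204000 -2146763296
```

Under these assumptions only the last few rows (12342 through 12346) produce an offset that wraps. The wrapped value is negative, so it points well outside the intended row.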

Affected sites at HEAD 2e6632e:

  • mlx/backend/metal/kernels/gemv.metal:151: GEMVKernel::run matrix advance (the reporter's actual repro path)
  • mlx/backend/metal/kernels/gemv.metal:359, 375: GEMVTKernel::run transpose variant
  • mlx/backend/metal/matmul.cpp:2360, 2590: gather_mm and segmented_mm pass M * N (int32) as the int64_t batch_stride_d parameter, so the product is truncated before it is widened

The N>1 plain-GEMM path is already 64-bit-safe via the steel kernel work in #1087, so this PR doesn't touch it.

Fix

Widen the row-offset multiplications to size_t before the multiply (gemv) and compute M * N as int64_t at the two matmul.cpp dispatch sites. Same pattern as the existing c_row_long * params->ldd cast in the steel kernels.
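Why the order of widening and multiplying matters can be sketched in Python. This is not the PR's code: `wrap_signed`, `multiply_then_widen`, and `widen_then_multiply` are hypothetical names, and the sketch uses a signed 64-bit type for both cases, whereas the gemv change uses size_t.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Emulate storing value in a two's-complement signed integer of `bits` width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def multiply_then_widen(m: int, n: int) -> int:
    """Pre-fix pattern: the product is formed in 32 bits, then converted to 64."""
    product32 = wrap_signed(m * n, 32)   # the wrap happens here
    return wrap_signed(product32, 64)    # widening preserves the wrong value


def widen_then_multiply(m: int, n: int) -> int:
    """Post-fix pattern: operands are widened first, so the product fits."""
    return wrap_signed(wrap_signed(m, 64) * wrap_signed(n, 64), 64)


print(multiply_then_widen(12347, 174000))  # -2146589296
print(widen_then_multiply(12347, 174000))  # 2148378000
```

For small shapes the two functions agree, which is presumably why the bug only shows up once M * N crosses 2^31.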

Test

python/tests/test_blas.py::TestBlas::test_matvec_large_matrix_int32_offset reproduces the reporter's case and compares against a chunked reference. Pre-fix the relative error is 0.06–0.25; post-fix it sits in fp32 noise (~1e-6).

The matrix is ~8.6 GB and the chunked reference adds another ~10 GB, so the test is gated on mx.device_info()["memory_size"] >= 24 GB and skips otherwise. CI will skip; high-RAM Apple Silicon (the reporter has 128 GB) will exercise it.
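The shape of such a test (a memory gate plus a chunked reference) can be sketched with the standard library alone. This is not the PR's test: `memory_gate`, `chunked_matvec`, and `max_relative_error` are hypothetical helpers, the MLX calls are left out, and reading "24 GB" as 24 × 1024^3 bytes is an assumption.

```python
import unittest

GIB = 1024**3


def memory_gate(available_bytes: int, required_bytes: int = 24 * GIB):
    """Return a decorator that skips the test when memory is below the threshold."""
    return unittest.skipIf(
        available_bytes < required_bytes,
        f"needs at least {required_bytes // GIB} GiB of memory",
    )


def chunked_matvec(matrix, vec, chunk_rows):
    """Reference matvec computed a block of rows at a time.

    Each block restarts its row index at zero, so per-block offsets stay small.
    """
    out = []
    for start in range(0, len(matrix), chunk_rows):
        block = matrix[start:start + chunk_rows]
        out.extend(sum(a * b for a, b in zip(row, vec)) for row in block)
    return out


def max_relative_error(actual, expected):
    """Largest elementwise error, scaled by the largest reference magnitude."""
    scale = max(abs(e) for e in expected) or 1.0
    return max(abs(a - e) for a, e in zip(actual, expected)) / scale


reference = chunked_matvec([[1, 2], [3, 4], [5, 6]], [1, 1], chunk_rows=2)
print(reference)  # [3, 7, 11]
```

In a real test the decorator would receive the device's reported memory size, and the result under test would be compared against the chunked reference with a tolerance near fp32 noise.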

Notes for maintainers

A couple of things worth flagging before review:

The matvec path (`gemv.metal`) computes the matrix row offset as
`out_row * matrix_ld` in int32 and adds it to `mat`. For large
matrices it truncates above 2^31 and silently returns wrong results.
Reporter's repro at ml-explore#3591: 12347 x 174000 fp32 matrix * vector
(product 2.15e9 > 2^31) gives relative error 0.06-0.25, intermittent
run-to-run. Same pattern in the GEMVTKernel transpose variant.

The gather-MM and segmented-MM dispatchers in `matmul.cpp` pass
`M * N` (int32) as the int64 `batch_stride_d` parameter — same
truncation before the widen.

Fix:
- gemv.metal: widen the row-offset multiplications to size_t before
  the multiply (matches the steel kernel pattern hardened in ml-explore#1087).
- matmul.cpp: compute `M * N` as int64_t at the two dispatch sites.

Adds a memory-gated regression test reproducing the reporter's case
when ~24 GB of unified memory is available; CI machines skip.

Development

Successfully merging this pull request may close these issues.

[BUG] matmul with size greater than 2^31 gives undermined results - appears to be memory leak
