Fix int32 overflow in matvec row offset and gather-MM batch stride by aicayzer · Pull Request #3609 · ml-explore/mlx

aicayzer · 2026-05-30T22:21:39Z

Fixes #3591 (the size-overflow half — see notes below on the second reported issue).

Problem

gemv.metal's matvec kernel computes the matrix row offset as out_row * matrix_ld in int32, then advances the mat pointer. For large matrices the product wraps above 2^31 → wrong rows → silent corruption (no crash, no throw). Reporter's case is a 12347 × 174000 fp32 matrix * vector:

12347 × 174000 = 2,148,378,000 ≈ 2.148e9 > 2^31 = 2,147,483,648 ≈ 2.147e9
Relative error 0.06–0.25, intermittent run-to-run (the high bit's value depends on the buffer's allocated address modulo the wrap)
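The arithmetic behind the wrap can be checked with a short Python sketch. It emulates a 32-bit signed multiply with two's-complement wraparound, which is an assumption about how the overflow manifests on the GPU. `wrap_int32` is a hypothetical helper, and the leading dimension is assumed to equal the column count (174000).

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Return the value a 32-bit two's-complement register would hold."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


rows, cols = 12347, 174000             # reporter's matrix shape
total = rows * cols                    # exact element count (Python ints do not overflow)
last_row_offset = (rows - 1) * cols    # exact offset of the final row, assuming ld == cols

print("elements:", total, "exceeds int32:", total > INT32_MAX)
print("last row offset:", last_row_offset, "wrapped:", wrap_int32(last_row_offset))
```

Under these assumptions the wrapped offset is negative, so the pointer advance lands far from the intended row instead of raising an error.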

Affected sites at HEAD 2e6632e:

mlx/backend/metal/kernels/gemv.metal:151 — GEMVKernel::run matrix advance (the reporter's actual repro path)
mlx/backend/metal/kernels/gemv.metal:359, 375 — GEMVTKernel::run transpose variant
mlx/backend/metal/matmul.cpp:2360, 2590 — gather_mm and segmented_mm pass M * N (int32) as the int64_t batch_stride_d parameter, truncating before the widen

The N>1 plain-GEMM path is already 64-bit-safe via the steel kernel work in #1087, so this PR doesn't touch it.

Fix

Widen the row-offset multiplications to size_t before the multiply (gemv) and compute M * N as int64_t at the two matmul.cpp dispatch sites. Same pattern as the existing c_row_long * params->ldd cast in the steel kernels.
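The difference between widening after the multiply and widening before it can be sketched in Python with `ctypes`, which truncates to the fixed width without raising. The function names and the values of `M` and `N` are hypothetical and chosen only so that the product exceeds 2^31; this is not the code in `matmul.cpp`.

```python
import ctypes


def stride_multiply_then_widen(m: int, n: int) -> int:
    """Multiply in 32 bits, then widen: the lost high bits are not recovered."""
    product32 = ctypes.c_int32(m * n).value   # wrapped 32-bit product
    return ctypes.c_int64(product32).value    # widening keeps the wrapped value


def stride_widen_then_multiply(m: int, n: int) -> int:
    """Widen both operands to 64 bits first, then multiply."""
    wide_m = ctypes.c_int64(m).value
    wide_n = ctypes.c_int64(n).value
    return ctypes.c_int64(wide_m * wide_n).value


M, N = 50_000, 50_000  # hypothetical sizes with M * N > 2^31

print("multiply then widen:", stride_multiply_then_widen(M, N))
print("widen then multiply:", stride_widen_then_multiply(M, N))
```

Both functions return a 64-bit value; only the second one holds the true product, which is why the cast has to happen before the multiplication.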

Test

python/tests/test_blas.py::TestBlas::test_matvec_large_matrix_int32_offset reproduces the reporter's case and compares against a chunked reference. Pre-fix the relative error is 0.06–0.25; post-fix it sits in fp32 noise (~1e-6).
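The idea of a chunked reference can be illustrated on a tiny pure-Python example. The PR does not say how the chunks are formed, so splitting by rows is an assumption, as are the helper names and the relative-error definition (maximum absolute difference divided by the maximum absolute reference value). The actual test uses MLX arrays, not lists.

```python
def matvec(matrix, vec):
    """Plain matrix-vector product over lists of rows."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def chunked_matvec(matrix, vec, chunk_rows):
    """Compute the product one block of rows at a time.

    Each block is small enough that offsets inside it stay far below 2^31,
    so the result does not depend on large-offset arithmetic.
    """
    out = []
    for start in range(0, len(matrix), chunk_rows):
        block = matrix[start:start + chunk_rows]
        out.extend(matvec(block, vec))
    return out


def max_relative_error(result, reference):
    """Largest absolute difference, scaled by the largest reference magnitude."""
    scale = max(abs(r) for r in reference)
    return max(abs(a - b) for a, b in zip(result, reference)) / scale


matrix = [[float(i * 3 + j) for j in range(3)] for i in range(7)]
vec = [1.0, -2.0, 0.5]

reference = chunked_matvec(matrix, vec, chunk_rows=2)
print(max_relative_error(matvec(matrix, vec), reference))
```

In the real test the full-size product plays the role of `matvec` here, and a large relative error against the chunked result signals the overflow.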

The matrix is ~8.6 GB and the chunked reference adds another ~10 GB, so the test is gated on mx.device_info()["memory_size"] >= 24 GB and skips otherwise. CI will skip; high-RAM Apple Silicon (the reporter has 128 GB) will exercise it.
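The memory figures and the gate can be sketched with the standard library alone. The helper names are hypothetical, and reading "24 GB" as 24 GiB is an assumption; the real test reads the device memory from `mx.device_info()["memory_size"]`, as described above, while this sketch takes the value as a parameter.

```python
REQUIRED_BYTES = 24 * 1024**3  # assumption: "24 GB" interpreted as 24 GiB


def matrix_bytes(rows: int, cols: int, itemsize: int = 4) -> int:
    """Bytes needed to store a dense rows x cols matrix (fp32 by default)."""
    return rows * cols * itemsize


def should_run(memory_size: int, required: int = REQUIRED_BYTES) -> bool:
    """True when the device reports enough memory for the large test."""
    return memory_size >= required


size = matrix_bytes(12347, 174000)
print(f"matrix: {size / 1e9:.2f} GB")

# Hypothetical device sizes: a 16 GiB machine skips, a 128 GiB machine runs.
for gib in (16, 128):
    print(gib, "GiB ->", "run" if should_run(gib * 1024**3) else "skip")
```

In a `unittest` test case the same predicate could feed `self.skipTest(...)` at the top of the test method.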

Notes for maintainers

A couple of things worth flagging before review:

Shape: the analogous conv-offset point-fixes #3294 ("Fix int32 overflow in Metal conv_general output offset for large tensors") and #3320 ("Fix conv_general output offset overflow in Metal writeback") were closed unmerged, and #3327 ("[BUG] Arrays with >= 2^31 elements fail materialization and some indexing on Metal") framed this as ShapeElem int32 → int64. Happy to redo as a throw-on-overflow guard (the pattern in #3524, "Detect int32 shape-product overflow at MLX compute-shape boundaries", and #3425, "Clearer error when shape dimension overflows int32") or fold into a broader refactor if you'd prefer; please say.
Second issue in #3591 ("[BUG] matmul with size greater than 2^31 gives undermined results - appears to be memory leak"): the reporter also mentioned that strided/sliced inputs into a large parent give wrong results unless materialised (slice + 0). That's almost certainly a separate stride/offset bug rather than the gemv kernels, and isn't addressed here. Happy to file as a separate issue if useful.
Memory-gated test: CONTRIBUTING.md rule 2 asks for tests; the most direct one needs ~24 GB and won't run in CI. Open to a synthetic boundary-throw test or alternative validation if that's preferred.
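For comparison, a throw-on-overflow guard of the kind mentioned above might look like the following Python sketch. It is not the code from #3524 or #3425; `check_int32_product` is a hypothetical name, and the choice of `OverflowError` is an assumption made for illustration.

```python
import math

INT32_MAX = 2**31 - 1


def check_int32_product(*dims: int) -> int:
    """Return the product of dims, raising if it cannot be held in an int32."""
    product = math.prod(dims)  # exact, since Python ints are arbitrary precision
    if product > INT32_MAX:
        raise OverflowError(
            f"shape product {product} exceeds int32 max {INT32_MAX} for dims {dims}"
        )
    return product


print(check_int32_product(1000, 1000))

try:
    check_int32_product(12347, 174000)  # reporter's matrix shape
except OverflowError as exc:
    print("rejected:", exc)
```

A guard like this trades silent corruption for an explicit error, whereas the widening fix in this PR makes the large case compute correctly.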

The matvec path (`gemv.metal`) computes the matrix row offset as `out_row * matrix_ld` in int32 and adds it to `mat`. For large matrices it truncates above 2^31 and silently returns wrong results. Reporter's repro at ml-explore#3591: 12347 x 174000 fp32 matrix * vector (product 2.15e9 > 2^31) gives relative error 0.06-0.25, intermittent run-to-run. Same pattern in the GEMVTKernel transpose variant.

The gather-MM and segmented-MM dispatchers in `matmul.cpp` pass `M * N` (int32) as the int64 `batch_stride_d` parameter: same truncation before the widen.

Fix:
- gemv.metal: widen the row-offset multiplications to size_t before the multiply (matches the steel kernel pattern hardened in ml-explore#1087).
- matmul.cpp: compute `M * N` as int64_t at the two dispatch sites.

Adds a memory-gated regression test reproducing the reporter's case when ~24 GB of unified memory is available; CI machines skip.

aicayzer wants to merge 1 commit into ml-explore:main from aicayzer:fix/3591-matmul-int32-overflow
