macOS arm64: JIT kernels run with garbage args because runFunction dispatches via variadic typedef #810

@slitvinov

On macOS arm64 (Apple Silicon), JIT-compiled OCCA kernels are called with garbage register state, which produces silently wrong output (and segfaults in larger callers).

occa::sys::runFunction dispatches kernels through a function pointer typed as variadic:

https://github.com/libocca/occa/blob/main/src/occa/internal/utils/sys.hpp#L15

typedef void (*functionPtr_t)(...);

The generated runFunction.cpp_codegen calls f directly:

case 4:
    f(args[0], args[1], args[2], args[3]);
    break;

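Apple's arm64 ABI passes all variadic args on the stack, while the standard AArch64 PCS used on Linux passes the first 8 integer/pointer args in registers x0-x7 even for variadic calls. A JIT-compiled OCCA kernel is an ordinary non-variadic function, so it expects its args in x0-x7 on both platforms. Because functionPtr_t has no named parameters, every arg in the call above is variadic: on Apple Silicon the caller places them all on the stack, and the kernel then reads its args from registers that were never set by the caller and runs on uninitialized values.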

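Apple documents this difference in "Writing ARM64 code for Apple platforms" (see the part on passing arguments to variadic functions).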
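One possible direction for a fix, sketched here under assumptions rather than taken from the OCCA source: cast the variadic pointer back to a fixed-arity, non-variadic type before the call, so the compiler emits the register-based convention the kernel expects. The names kernel4_t, runKernel4 and addVectorsStub are hypothetical, and the sketch assumes every argument reaches the kernel as a pointer-sized value, as the args[i] calls above suggest.

```cpp
#include <cassert>

// Same variadic typedef as in sys.hpp.
typedef void (*functionPtr_t)(...);

// Hypothetical fixed-arity pointer type for a four-argument kernel.
typedef void (*kernel4_t)(void *, void *, void *, void *);

// Stand-in for a JIT-compiled kernel (not OCCA output): an ordinary
// non-variadic function that receives every argument as a pointer.
extern "C" void addVectorsStub(void *entries, void *a, void *b, void *ab) {
  const int n = *static_cast<const int *>(entries);
  const float *pa = static_cast<const float *>(a);
  const float *pb = static_cast<const float *>(b);
  float *pab = static_cast<float *>(ab);
  for (int i = 0; i < n; ++i) {
    pab[i] = pa[i] + pb[i];
  }
}

// Hypothetical replacement for the `case 4:` branch. The cast restores a
// non-variadic type, so the four arguments are passed as named parameters
// (in x0-x3 on arm64) instead of as variadic arguments.
inline void runKernel4(functionPtr_t f, int argc, void **args) {
  assert(argc == 4);
  reinterpret_cast<kernel4_t>(f)(args[0], args[1], args[2], args[3]);
}
```

A generated dispatcher would need one such typedef per arity (or a template over the argument count). The cast is only strictly well-defined when the target type matches the kernel's real signature; here the stub is declared with exactly that signature, while real kernels with reference or typed-pointer parameters would be relying on ABI compatibility.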

Reproducer

repro.cpp:

#include <iostream>
#include <vector>

#include <occa.hpp>

static const char *kernel_source = R"OKL(
@kernel void addVectors(const int entries,
                        const float *a,
                        const float *b,
                        float *ab) {
  for (int i = 0; i < entries; ++i; @tile(4, @outer, @inner)) {
    ab[i] = a[i] + b[i];
  }
}
)OKL";

int main() {
  const int entries = 8;

  std::vector<float> a(entries, 1.0f), b(entries, 2.0f), ab(entries, 0.0f);

  occa::device device({{"mode", "Serial"}});

  occa::memory o_a  = device.malloc<float>(entries, a.data());
  occa::memory o_b  = device.malloc<float>(entries, b.data());
  occa::memory o_ab = device.malloc<float>(entries);

  occa::kernel addVectors = device.buildKernelFromString(kernel_source,
                                                         "addVectors");

  addVectors(entries, o_a, o_b, o_ab);

  o_ab.copyTo(ab.data());
  for (int i = 0; i < entries; ++i) {
    std::cout << ab[i] << ' ';
  }
  std::cout << '\n';

  return 0;
}

Build:

c++ -std=c++17 repro.cpp -I${OCCA_HOME}/include -L${OCCA_HOME}/lib -locca \
    -Wl,-rpath,${OCCA_HOME}/lib -o repro

Run on macOS arm64:

$ ./repro
0 0 0 0 0 0 0 0

The kernel ran, but entries was read from x0 (uninitialized, happened to be 0), so the loop body never executed. With a different register state at call time we have observed segfaults dereferencing garbage a/b/ab pointers.

Expected:

3 3 3 3 3 3 3 3

Tested on macOS 26 / Apple M-series, Apple clang 21, OCCA built with -O2 -g.
