[VL] Fix gflags dual-registration abort on macOS arm64 #12100

Open
jackylee-ch wants to merge 1 commit into apache:main from jackylee-ch:fix-macos-gflags-dual-registration

@jackylee-ch
Contributor

What changes are proposed in this pull request?

When loading libgluten.dylib on macOS arm64, the JVM aborts during the
System.loadLibrary call with:

ERROR: flag 'flagfile' was defined more than once
       (in files '.../gflags.cc' and '.../gflags.cc')
       ... is being linked both statically and dynamically

The root cause is dyld weak-symbol coalescing across two dylibs that each
contain their own copy of gflags:

| Dylib | gflags origin |
| --- | --- |
| libvelox.dylib | static libgflags.a baked in via Folly (Velox builds Folly with -DGFLAGS_SHARED=FALSE) |
| libgluten.dylib | dynamic libgflags.dylib pulled transitively through glog::glog / Folly::folly INTERFACE_LINK_LIBRARIES |

On macOS, dyld coalesces the weak C++ function-local-static guard inside
FlagRegistry::GlobalRegistry() between the two dylibs. Both copies then
register --flagfile against the same registry and gflags' duplicate-flag
check aborts the process before any user code runs.
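The failure mode can be illustrated with a small self-contained model. This is only a sketch: `ToyRegistry`, `registryOfCopyA`, `registryOfCopyB`, `coalescedRegistry`, and `loadBothCopies` are hypothetical names, not gflags APIs, and the sketch throws where gflags would abort. It models the effect described above (two copies registering against one registry), not the dyld mechanics themselves.

```cpp
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for gflags' FlagRegistry; not the real class.
class ToyRegistry {
 public:
  // Mirrors the duplicate-flag check: registering the same name twice is an
  // error (gflags aborts the process; this sketch throws instead).
  void registerFlag(const std::string& name) {
    if (!names_.insert(name).second) {
      throw std::runtime_error("flag '" + name + "' was defined more than once");
    }
  }
  std::size_t size() const { return names_.size(); }

 private:
  std::set<std::string> names_;
};

// One registry per gflags copy: the situation when each copy's symbols stay
// private to its own dylib.
ToyRegistry& registryOfCopyA() { static ToyRegistry r; return r; }
ToyRegistry& registryOfCopyB() { static ToyRegistry r; return r; }

// A single shared registry: what both copies effectively see after coalescing.
ToyRegistry& coalescedRegistry() { static ToyRegistry r; return r; }

// Each gflags copy registers its own built-in flags at load time.
void registerBuiltins(ToyRegistry& registry) { registry.registerFlag("flagfile"); }

// Returns true if both copies can load against the given registries.
bool loadBothCopies(ToyRegistry& forCopyA, ToyRegistry& forCopyB) {
  try {
    registerBuiltins(forCopyA);
    registerBuiltins(forCopyB);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

With separate registries both copies load; with the shared one the second registration of `flagfile` is rejected, which is the condition the fix is designed to avoid.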

Linux is unaffected because (a) ELF does not coalesce weak symbols across
shared objects by default, and (b) Gluten already uses symbols.map to
control the export surface of libgluten.so. macOS has no version-script
equivalent, so this PR uses a different mechanism. All Darwin-specific
logic is gated on APPLE / CMAKE_SYSTEM_NAME STREQUAL "Darwin"; Linux
and Windows build and link semantics are untouched.

The fix has five parts that all need to be in place to fully eliminate the
abort across the production load path and the test executables:

  1. cpp/CMake/Findglog.cmake — On Darwin, prefer the static
    libglog.a and force gflags_component=static. When both archives
    are available we replace the imported google::glog target with an
    INTERFACE IMPORTED target whose INTERFACE_LINK_OPTIONS carry
    LINKER:-load_hidden,<libglog.a> and
    LINKER:-load_hidden,<libgflags.a>. -load_hidden is the Apple ld64
    flag that gives every symbol pulled from the archive hidden
    visibility, which prevents dyld from coalescing them across dylibs.
    We resolve the static gflags archive path by inspecting
    IMPORTED_LOCATION_RELEASE / _NOCONFIG / * on
    gflags::gflags_static.

  2. cpp/core/utils/GflagsStubDarwin.cc (new) — Exports a no-op
    google::HandleCommandLineHelpFlags with default visibility. Velox's
    archive of gflags pulls gflags.cc.o but never references
    gflags_reporting.cc.o, so once -load_hidden makes the real copy
    invisible, the dynamic linker would fail to resolve this symbol at
    dlopen time. The stub resolves it from libgluten.dylib instead.

  3. cpp/core/CMakeLists.txt — Conditionally adds the stub to the
    gluten target on APPLE.

  4. cpp/velox/CMakeLists.txt — On Darwin, links google::glog as
    PUBLIC on the velox target so its INTERFACE_LINK_OPTIONS
    propagate through libvelox.dylib to test binaries and benchmarks.
    The default PRIVATE linkage on gluten is intentional for Linux
    (symbols.map handles it), but on Darwin Folly::folly's
    INTERFACE_LINK_LIBRARIES pulls libgflags.a into libvelox.dylib
    and any test executables with default visibility, reviving the same
    dual-registration abort at test startup.

  5. cpp/velox/compute/VeloxBackend.cc — Guards
    google::InitGoogleLogging with IsGoogleLoggingInitialized() and
    makes VeloxBackend::create() idempotent. Multi-suite gtest binaries
    on macOS re-enter VeloxBackend::init from each SetUpTestSuite;
    without these guards the second entry trips glog's "You called
    InitGoogleLogging() twice!" check and Gluten's Registry "Required
    object already registered" check.
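Parts 2 and 5 are the C++ changes in the list above; their general shape can be sketched as follows. This is an illustration under assumptions, not the code in the PR: everything in the `sketch` namespace (`loggingInitialized`, `initLoggingOnce`, `Backend`) is a hypothetical stand-in for glog and VeloxBackend, and the visibility attribute assumes a GCC- or Clang-compatible compiler.

```cpp
#include <memory>
#include <mutex>

// Part 2: a no-op definition exported with default visibility, so the symbol
// can still be resolved at dlopen time once the archive's copy is hidden.
namespace google {
__attribute__((visibility("default"))) void HandleCommandLineHelpFlags() {}
}  // namespace google

// Part 5: hypothetical stand-ins; the real code calls glog and VeloxBackend.
namespace sketch {

bool loggingInitialized = false;  // stands in for IsGoogleLoggingInitialized()
int loggingInitCalls = 0;         // counts how often logging was really set up

// Guarded initialization: a second call is a no-op instead of a fatal check.
void initLoggingOnce() {
  if (!loggingInitialized) {
    loggingInitialized = true;
    ++loggingInitCalls;
  }
}

class Backend {
 public:
  // Idempotent factory: repeated calls return the first instance.
  static Backend& create() {
    std::lock_guard<std::mutex> lock(mutex());
    if (!instance()) {
      initLoggingOnce();
      instance().reset(new Backend());
    }
    return *instance();
  }

 private:
  Backend() = default;
  static std::unique_ptr<Backend>& instance() {
    static std::unique_ptr<Backend> ptr;
    return ptr;
  }
  static std::mutex& mutex() {
    static std::mutex m;
    return m;
  }
};

}  // namespace sketch
```

Calling `sketch::Backend::create()` from several test-suite setup hooks yields the same object each time and runs the logging setup once, which is the behavior the guards are meant to provide.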

How was this patch tested?

Built on macOS 14 arm64 with Apple Clang 17 and the Homebrew toolchain.

Symbol audit (after the fix):

$ nm -g libvelox.dylib | grep "google.*ParseCommandLine"
(empty)

$ nm libvelox.dylib | awk '/FlagRegistry/ {print $2}' | sort | uniq -c
   3 b
  21 t

Every FlagRegistry symbol has a lowercase nm type letter (t = local
text, b = local bss); none are exported across the dylib boundary, so
dyld has nothing to coalesce.
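The same audit could be automated in a test helper. A minimal sketch, assuming the default `nm` line layout of `<address> <type> <name>`; the function name `allMatchesLocal` is hypothetical and not part of this PR:

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Returns true when every line of `nmOutput` that mentions `needle` carries a
// lowercase symbol-type letter (a local symbol). Lines that do not mention
// `needle` are ignored. A matching line that lacks an address field, such as
// an undefined symbol, is treated as not local.
bool allMatchesLocal(const std::string& nmOutput, const std::string& needle) {
  std::istringstream lines(nmOutput);
  std::string line;
  while (std::getline(lines, line)) {
    if (line.find(needle) == std::string::npos) {
      continue;
    }
    std::istringstream fields(line);
    std::string address;
    std::string type;
    if (!(fields >> address >> type) || type.size() != 1) {
      return false;
    }
    if (!std::islower(static_cast<unsigned char>(type[0]))) {
      return false;
    }
  }
  return true;
}
```

Feeding it the captured output of `nm libvelox.dylib` with the needle `FlagRegistry` gives a single pass/fail answer in place of reading the `uniq -c` counts by eye.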

Behavioral validation:

  • Before the fix, dlopen("libgluten.dylib") aborts before any test
    reaches main().
  • After the fix, cpp/build/velox/tests/velox_shuffle_writer_test runs
    5436 / 5436 cases cleanly on macOS 14 arm64.
  • Spark 3.5 + Velox backend Java JUnit canaries (the JNI-only suites
    that exercise native load without query execution) all pass on macOS
    arm64:
    • org.apache.gluten.utils.VeloxBloomFilterTest
    • org.apache.gluten.columnarbatch.ColumnarBatchTest
    • org.apache.gluten.backendsapi.VeloxListenerApiTest
    • org.apache.gluten.fs.OnHeapFileSystemTest
    • org.apache.gluten.vectorized.ArrowColumnVectorTest
  • Full ctest of cpp/build reports 5574 / 5585 pass; the 11 failures
    are unrelated upstream Velox issues exposed by the recent
    dft-2026_05_13 bump (HYPERLOGLOG cast registration tightening,
    Type::equivalent() regression on identically-printed ROW types) —
    not caused by this PR.

Linux:

  • Linux x86_64 build green; all changes are gated behind APPLE /
    Darwin checks, so no behavioral change on Linux is expected. Local
    Ubuntu build verified clean.

Was this patch authored or co-authored using generative AI tooling?

co-auth: Claude (Sonnet/Opus) via Claude Code 1.x

@github-actions (bot) added the VELOX label on May 17, 2026

On macOS arm64, libvelox.dylib has static gflags baked in via Folly
(Velox builds Folly with -DGFLAGS_SHARED=FALSE). Without special
handling, libgluten.dylib transitively pulls dynamic gflags via the
INTERFACE_LINK_LIBRARIES of glog::glog and Folly::folly. At JVM
load time, dyld coalesces the weak C++ function-local-static guard
inside FlagRegistry::GlobalRegistry() across the two dylibs. Both
copies then register "flagfile" against the same registry and gflags'
duplicate-flag check aborts the process before any user code runs:

  ERROR: flag 'flagfile' was defined more than once
         (in files '.../gflags.cc' and '.../gflags.cc')
         ... is being linked both statically and dynamically.

Linux is unaffected because (a) ELF does not coalesce weak symbols
across .so boundaries by default, and (b) Gluten already uses
symbols.map to control libgluten.so's export surface. macOS has no
version-script equivalent, so a different mechanism is required.

This change fixes the abort end-to-end for macOS while leaving Linux
and Windows build/link semantics untouched.

1. cpp/CMake/Findglog.cmake: on Darwin, prefer the static libglog.a
   and force gflags_component=static. When both archives are present
   we replace the imported google::glog target with an INTERFACE
   IMPORTED target whose INTERFACE_LINK_OPTIONS carry
   `LINKER:-load_hidden,<libglog.a>` and
   `LINKER:-load_hidden,<libgflags.a>`. -load_hidden is the Apple
   ld64 flag that gives every symbol pulled from the archive hidden
   visibility, preventing dyld from coalescing them across dylibs.
   The static gflags archive path is resolved by inspecting
   IMPORTED_LOCATION_RELEASE / _NOCONFIG / * on
   gflags::gflags_static.

2. cpp/core/utils/GflagsStubDarwin.cc (new): exports a no-op
   google::HandleCommandLineHelpFlags with default visibility.
   Velox's archive of gflags pulls gflags.cc.o but never references
   gflags_reporting.cc.o, so once -load_hidden makes the real copy
   invisible, the dynamic linker would fail to resolve this symbol
   at dlopen time. The stub resolves it from libgluten.dylib instead.

3. cpp/core/CMakeLists.txt: conditionally adds the stub to the
   gluten target on APPLE.

4. cpp/velox/CMakeLists.txt: on Darwin, links google::glog as PUBLIC
   on the velox target so its INTERFACE_LINK_OPTIONS propagate
   through libvelox.dylib to test binaries and benchmarks. PRIVATE
   linkage on the gluten target is intentional for Linux (symbols.map
   handles it), but on Darwin Folly::folly's INTERFACE_LINK_LIBRARIES
   pulls libgflags.a into libvelox.dylib and any test executables
   with default visibility, reviving the same dual-registration
   abort at test startup.

5. cpp/velox/compute/VeloxBackend.cc: guards
   google::InitGoogleLogging with IsGoogleLoggingInitialized() and
   makes VeloxBackend::create() idempotent. Multi-suite gtest
   binaries on macOS re-enter VeloxBackend::init from each
   SetUpTestSuite; without these guards the second entry trips
   glog's "You called InitGoogleLogging() twice!" check and Gluten's
   Registry "Required object already registered" check.

Verification:
- nm -g libvelox.dylib | grep "google.*ParseCommandLine" -> empty
  (gflags symbols are not exported across the dylib boundary)
- nm libvelox.dylib | grep FlagRegistry -> all lowercase t / b
  (every FlagRegistry symbol is local to libvelox.dylib)
- velox_shuffle_writer_test runs 5436/5436 cases cleanly on
  macOS 14 arm64 with Apple Clang 17.
- Linux x86_64 build green, no link/load behavior change.

@jackylee-ch force-pushed the fix-macos-gflags-dual-registration branch from 4c19348 to defe11e on May 17, 2026 18:18