fix(nix): drop tcl from sqlite on musl cross hosts to unblock fh-cache#3450
Two bugs in tcl 8.6.16 (pinned via nixpkgs 09061f74...) were cascading
into failures of mls-validation-service, bindings-node-js-napi, and the
devour-output aggregates on the Cache all Nix Outputs workflow:
1. compat/mkstemp.c calls strlen() without including <string.h>, which
gcc 15 promotes from warning to error under the musl cross toolchain
(-Wimplicit-function-declaration). Hit on x86_64-unknown-linux-musl
from any build host.
2. unix/tcl.m4's SC_CONFIG_SYSTEM reads `uname -s` on the *build* host
to set tcl_cv_sys_version. On the warp-macos-26-arm64-12x runner
this becomes Darwin-*, which selects the MAC_OSX_TCL / MAC_OSX_OBJS
code path and tries to compile tclMacOSXFCmd.c / include
mach/mach_time.h against a linux-musl sysroot. Hit on
aarch64-unknown-linux-musl from the macOS runner.
Add an overlay in nix/lib/default.nix (applied to both native and
mkCrossPkgs imports) that:
- appends a postPatch to inject `#include <string.h>` into
compat/mkstemp.c via substituteInPlace --replace-fail (fail-loud
if the marker line ever changes upstream);
- appends a preConfigure that exports tcl_cv_sys_version=Linux when
stdenv.hostPlatform.isLinux, so cross-builds to linux targets
never consult the build host's uname;
- rebinds tcl to the patched tcl-8_6 so the alias and explicit
version resolve to the same derivation.
Verified locally:
- x86_64-unknown-linux-musl tcl cross-build succeeds from cold cache
(Bug 1 regression).
- aarch64-unknown-linux-musl tcl cross-build succeeds with a stacked
overlay that simulates the Darwin host by forcing
tcl_cv_sys_version=Darwin-24.0.0 before configure. Without the fix
this reproduces the macOS header errors; with the fix applied on
top, the export is overridden and the build completes (Bug 2
regression).
- Native x86_64-linux tcl still builds and runs (tclsh returns
8.6.16).
- Native aarch64-darwin tcl derivation's preConfigure is unchanged
(still just 'cd unix'), so native Darwin builds are untouched.
Remove this override when the nixpkgs pin is bumped past a rev that
adds the missing include and honors the autoconf host triple in
SC_CONFIG_SYSTEM.
Resolves xmtp#3444
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
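
For orientation, the three overlay steps this commit message lists could be sketched roughly as follows. This is an illustration under assumptions, not the code from nix/lib/default.nix: the `#include <stdlib.h>` anchor passed to substituteInPlace is a hypothetical stand-in for the marker line (the message does not name it), and `final`/`prev` are just the conventional overlay argument names.

```nix
# Sketch only: assumes nixpkgs exposes tcl 8.6 as `tcl-8_6`, as the message implies.
final: prev:
let
  inherit (prev) lib stdenv;
in
{
  tcl-8_6 = prev.tcl-8_6.overrideAttrs (old: {
    # Append to any existing postPatch. --replace-fail aborts the build if the
    # anchor line is not found, so an upstream change is noticed immediately.
    # The anchor below is a hypothetical placeholder for the real marker line.
    postPatch = (old.postPatch or "") + ''

      substituteInPlace compat/mkstemp.c \
        --replace-fail '#include <stdlib.h>' $'#include <stdlib.h>\n#include <string.h>'
    '';

    # Append to the existing preConfigure (which does `cd unix`) so that linux
    # targets do not derive tcl_cv_sys_version from the build host's uname.
    preConfigure =
      (old.preConfigure or "")
      + lib.optionalString stdenv.hostPlatform.isLinux "\nexport tcl_cv_sys_version=Linux\n";
  });

  # Point the unversioned alias at the patched derivation.
  tcl = final.tcl-8_6;
}
```

Appending to `old.postPatch` and `old.preConfigure` rather than replacing them keeps whatever nixpkgs already does in those phases.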
The previous commit (4d17631) tried to patch tcl 8.6.16 in an overlay by exporting tcl_cv_sys_version=Linux in preConfigure, redirecting the SC_CONFIG_SYSTEM autoconf macro. That was incomplete: tcl's unix configure script has a *separate* `uname -s` check at configure.in:557 that unconditionally defines TCL_WIDE_CLICKS + MAC_OSX_TCL on Darwin build hosts, which causes tclUnixTime.c to include <mach/mach_time.h> against a linux-musl sysroot regardless of the autoconf cache variable. A correct tcl patch would require either regenerating configure with autoreconf or patching both the generated configure script and the MAC_OSX_SRCS makefile variable, which is fragile across nixpkgs revisions.

Also, neekolas pointed out on the issue that `nix build .#validation-service-image` works on his Apple Silicon Mac. Reason: cache.nixos.org hosts the x86_64-linux-host cross tcl outputs (Hydra builds them) but not the aarch64-darwin-host cross outputs. So the tcl bugs only materialize on cold darwin-host cross builds, which is exactly what the fh-cache workflow started doing after PR xmtp#3408 activated "build while caching" on the new warp-macos-26-arm64-12x runner with a fresh flake.lock that invalidated the WarpBuilds cache.

Rather than fight tcl, sidestep it: sqlite only depends on tcl for its tclsqlite3 extension and its test harness. libxmtp consumes libsqlite3 directly, so --disable-tcl is safe. sqlite's autosetup uses the bundled autosetup/jimsh0.c for its own code generation when tcl is disabled (see sqlite autosetup/sqlite-config.tcl). This also means no cross-tcl gets built for any libxmtp target, since sqlite is the only consumer.

The overlay is gated on stdenv.hostPlatform.isMusl so native sqlite on linux and darwin keeps substituting from cache.nixos.org unchanged. Verified: native sqlite drv hash is identical with and without the overlay on both x86_64-linux (53j5kr3aq86...) and aarch64-darwin (9lkza91vfmh...), so no cache-miss regression for non-musl consumers.

Validation (local):
- x86_64-unknown-linux-musl sqlite cross-builds cleanly, closure has zero tcl entries (verified via `nix derivation show`).
- aarch64-unknown-linux-musl sqlite cross-builds cleanly.
- End-to-end: mls-validation-service-x86_64-unknown-linux-musl builds successfully from a cold store, proving the full failure chain is broken.

See xmtp#3444 for full analysis.
# aarch64-darwin), so builds from a darwin host hit the bugs cold.
# See https://github.com/xmtp/libxmtp/issues/3444.
#
# Symptoms seen in CI on warp-macos-26-arm64-12x:
in reality we should probably replace all sqlite references with sqlcipher, since that's what we actually use, but this fix is ok for now
Ack — and agreed that a sqlcipher migration would be a cleaner long-term story. For this PR I traced the actual closure to make sure the override lands in the right place:
$ nix-store --query --requisites \
$(nix eval --raw .#packages.x86_64-linux.mls-validation-service-x86_64-unknown-linux-musl.drvPath) \
| grep -E '(sqlite|sqlcipher|tcl)'
Findings:
- `sqlite-x86_64-unknown-linux-musl-3.51.2.drv` is in the closure (pulled via `libsqlite3-sys` → `rusqlite`). This is the derivation the overlay targets, and `nix derivation show` on it confirms zero tcl in `inputs.drvs`, `--disable-tcl` in `configureFlags`, and no tcl in `nativeBuildInputs`.
- `sqlcipher` is not in the closure at all for the mls-validation-service nix package. It's only referenced in `nix/shells/{local,rust}.nix` (dev shell) and `nix/package/xnet-gui.nix`. The `bindings-node-js-napi` and `mls-validation-service` musl builds link against plain `libsqlite3` via `libsqlite3-sys`, not sqlcipher.
- Native `tcl-8.6.16.drv` (the build-host tcl, not cross) remains in the closure, pulled by native sqlite as a build-time tool. That's fine because native tcl on darwin builds against darwin headers and the cross-compile bug is only tripped when `--host=*-linux-musl` meets `tcl_cv_sys_version=Darwin`.
So for the failing chain (tcl-*-unknown-linux-musl → sqlite-*-unknown-linux-musl → bindings-node-js-napi / mls-validation-service), dropping tcl from the musl cross sqlite is the minimum that breaks it. A full sqlcipher migration would be a bigger surface change and wouldn't move this issue forward — sqlcipher also has a tcl dep, it'd just be pulling native tcl (same as sqlite does today) rather than cross-tcl. Happy to open a follow-up issue for the sqlcipher unification if you'd like.
…3472)

Resolves #3470

## Summary

- Pre-seeds `kyua_cv_getopt_plus=yes` in the atf package override when cross-compiling (`buildPlatform != hostPlatform`), fixing the `AC_RUN_IFELSE` failure that aborts `atf` configure with "cannot run test program while cross compiling"
- Unblocks the `aarch64-apple-darwin` build chain: `atf -> libiconv -> apple-sdk-14.4 -> bindings-node-js-napi-*`
- Gated on cross-compilation only, so native builds continue pulling from `cache.nixos.org` unchanged

## Context

The `atf-0.23` configure script (`m4/module-application.m4`) uses `AC_RUN_IFELSE` to check whether `getopt(3)` accepts a leading `+` for POSIX behavior. During cross-compilation the compiled test binary cannot execute on the build host, causing configure to abort. All target platforms in this flake (Darwin, glibc Linux, musl Linux) support `+` in getopt, so pre-seeding `yes` is correct.

This follows the same pattern as the sqlite/tcl cross-compilation fix from #3450.

## Test plan

- [ ] CI "Cache all Nix Outputs" workflow passes on `warp-macos-26-arm64-12x` runner
- [ ] Native Darwin builds still substitute from cache (no derivation hash change for native atf)

> [!NOTE]
> ### Pre-seed `kyua_cv_getopt_plus` configure flag in `atf` to fix cross-compilation
> During cross-compilation, autoconf cache variables are not automatically populated, causing the `atf` build to fail. The overlay in [default.nix](https://github.com/xmtp/libxmtp/pull/3472/files#diff-6fd175d36064b2e9b9371596932bf201514494fc02c317f3f4526243d72991f1) now appends `kyua_cv_getopt_plus=yes` to `atf`'s `configureFlags` when `buildPlatform != hostPlatform`. Native builds are unaffected.
>
> Macroscope summarized 4127f19.

Co-authored-by: xmtp-coder-agent <xmtp-coder-agent@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
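
As a rough illustration of the atf change described in this commit, the pre-seeding could be written along these lines. It is a sketch under assumptions rather than the code that landed: the `isCross` helper name and the `final`/`prev` argument names are illustrative, while the cache variable and the `buildPlatform != hostPlatform` condition are taken from the description.

```nix
# Sketch only: pre-seed the autoconf cache variable when cross-compiling.
final: prev:
let
  inherit (prev) lib stdenv;
  # Hypothetical helper name; the condition itself is the one the summary states.
  isCross = stdenv.buildPlatform != stdenv.hostPlatform;
in
{
  atf = prev.atf.overrideAttrs (old: {
    # On native builds lib.optionals yields an empty list, so the flags are
    # left exactly as nixpkgs defines them.
    configureFlags =
      (old.configureFlags or [ ])
      ++ lib.optionals isCross [ "kyua_cv_getopt_plus=yes" ];
  });
}
```

Passing `name=value` to configure sets the cache variable up front, so the `AC_RUN_IFELSE` probe has no need to execute a test binary on the build host.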
Resolves #3444
Summary
The `Cache all Nix Outputs` workflow has been failing repeatedly on the `warp-macos-26-arm64-12x` runner because `tcl 8.6.16` (pinned via nixpkgs rev `09061f74…`) does not cross-compile to `{x86_64,aarch64}-unknown-linux-musl` from a Darwin build host. There are two independent tcl bugs:

1. `compat/mkstemp.c` is missing `<string.h>`: `strlen()` is used without a declaration; gcc 15.2.0 promotes `-Wimplicit-function-declaration` to a hard error. Host-independent.
2. `unix/configure` defines `TCL_WIDE_CLICKS` + `MAC_OSX_TCL` from the build host's `uname -s`: specifically, `configure.in:557` has an unguarded ``if test "`uname -s`" = "Darwin"`` branch that runs regardless of the `--host` triple. On a Darwin build host it unconditionally defines both macros, and `tclUnixTime.c` then pulls in `<mach/mach_time.h>` against a linux-musl sysroot. Darwin-build-host-only.

The cascade is `tcl → sqlite → cargo-package-deps → bindings-node-js-napi / mls-validation-service → devour-output`.

Why this only hits CI, not neekolas's laptop

`cache.nixos.org` hosts cross-compile outputs keyed by build-host. Hydra only builds the `x86_64-linux → {x86_64,aarch64}-linux-musl` cross chain, not `aarch64-darwin → *-linux-musl`. A developer on an Apple Silicon Mac who has ever had the target in their local store, uses a remote Linux builder, or populates `/nix/store` via `just backend/dev/up` can substitute the result and never touch the compile path. A cold macOS build is the only thing that reliably reproduces, which is exactly what #3408 made the fh-cache workflow do (active-build + invalidated WarpBuilds cache + new runner image + fresh `flake.lock`).

Previous attempt

Commit 4d17631 tried to patch tcl by exporting `tcl_cv_sys_version=Linux` in `preConfigure`, redirecting tcl's `SC_CONFIG_SYSTEM` autoconf macro. That was incomplete: the rogue `uname -s` check at `configure.in:557` is a separate code path that the autoconf cache variable does not touch. A correct tcl patch would need to regenerate `configure` with autoreconf or patch both the generated script and the `MAC_OSX_SRCS` makefile variable, which is fragile across nixpkgs revisions.

Fix

Rather than fight tcl, sidestep it. `sqlite` only depends on tcl for its `tclsqlite3` extension and its test harness. libxmtp consumes `libsqlite3` directly via `diesel`/`rusqlite`, so `--disable-tcl` is safe. sqlite's autosetup uses the bundled `autosetup/jimsh0.c` for its own code generation when tcl is absent (documented at `autosetup/sqlite-config.tcl:243`).

The new overlay in `nix/lib/default.nix`:

- Is gated on `stdenv.hostPlatform.isMusl`: no-op for every other target.
- Strips `--with-tcl=...` from `configureFlags` and appends `--disable-tcl` (a first-party supported configuration in the nixpkgs sqlite derivation; it's already enabled on `isStatic` at `sqlite/default.nix:85`).
- Filters `tcl*` packages out of `nativeBuildInputs`.
- Sets `doCheck = false`: sqlite's test suite runs `srctree-check.tcl` via tclsh and would pull tcl back in otherwise.

Validation (all local)

- Musl x86_64 cross sqlite: builds cleanly; closure has zero `tcl` entries (verified via `nix derivation show`: `inputs.drvs` contains no tcl, `env.configureFlags` has only `--disable-tcl`, `env.nativeBuildInputs` has no tcl).
- Musl aarch64 cross sqlite: builds cleanly.
- End-to-end: `mls-validation-service-x86_64-unknown-linux-musl` (the canonical downstream target in the failing workflow) builds successfully from a cold store. Output: `/nix/store/fybjfq0yjd5dk6hmd9jr2n2ymky4s9v0-mls-validation-service-x86_64-unknown-linux-musl-1.10.0`.
- Native sqlite invariance: native drv hashes identical with and without the overlay:
  - `x86_64-linux`: `53j5kr3aq86wnyczwlmlh0a0n48nhnk4-sqlite-3.51.2.drv` (both)
  - `aarch64-darwin`: `9lkza91vfmh6h5f2r3vdg1w04m3ymn0f-sqlite-3.51.2.drv` (both)

  ⇒ non-musl consumers continue to substitute from `cache.nixos.org` with no cache-miss regression.
- `nixfmt-rfc-style` clean on `nix/lib/default.nix`.
- CI (pending): the `Cache all Nix Outputs` workflow on `warp-macos-26-arm64-12x` is the only test that exercises the actual cold darwin-host cross path; this PR's CI run is the definitive end-to-end check.

Test plan

- `nix build` of musl cross sqlite for both x86_64 and aarch64 succeeds locally.
- `nix derivation show` confirms no tcl in the overridden sqlite's `inputs.drvs`, `env.configureFlags`, or `env.nativeBuildInputs`.
- `mls-validation-service-x86_64-unknown-linux-musl` builds end-to-end from a cold store.
- `nixfmt-rfc-style` clean.
- `Cache all Nix Outputs` workflow passes on `warp-macos-26-arm64-12x`.

🤖 Generated with Claude Code
Note
Disable TCL in sqlite builds on musl cross hosts
Adds a Nix overlay in nix/lib/default.nix that overrides the `sqlite` derivation when the host platform is musl. The override strips any `--with-tcl=` configure flag, appends `--disable-tcl`, removes tcl-related entries from `nativeBuildInputs`, and sets `doCheck = false`.

Macroscope summarized 2d99c88.
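
The override summarized in this note could be sketched roughly as below. This is a hedged illustration, not the contents of nix/lib/default.nix: the `isTcl` helper and its name-prefix test are assumptions about how tcl-related inputs are recognized, and `final`/`prev` are the conventional overlay argument names.

```nix
# Sketch only: drop tcl from sqlite when the host platform is musl.
final: prev:
let
  inherit (prev) lib stdenv;
  # Hypothetical helper: treat any package whose name starts with "tcl" as
  # tcl-related. Assumes every nativeBuildInputs entry is a derivation or string.
  isTcl = pkg: lib.hasPrefix "tcl" (lib.getName pkg);
in
{
  sqlite =
    if stdenv.hostPlatform.isMusl then
      prev.sqlite.overrideAttrs (old: {
        # Remove any --with-tcl=... flag, then disable tcl explicitly.
        configureFlags =
          builtins.filter (flag: !(lib.hasPrefix "--with-tcl=" flag)) (old.configureFlags or [ ])
          ++ [ "--disable-tcl" ];
        # Keep every build-time tool except tcl.
        nativeBuildInputs = builtins.filter (pkg: !(isTcl pkg)) (old.nativeBuildInputs or [ ]);
        # The test suite needs tclsh, so it is skipped here.
        doCheck = false;
      })
    else
      # Non-musl hosts get the untouched derivation, so its hash is unchanged.
      prev.sqlite;
}
```

Returning `prev.sqlite` unmodified in the non-musl branch, rather than calling `overrideAttrs` with a no-op, is one way to make sure native derivation hashes cannot change.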