fix(nix): drop tcl from sqlite on musl cross hosts to unblock fh-cache #3450

Merged
neekolas merged 3 commits into xmtp:main from xmtp-coder-agent:fix/issue-3444 on Apr 13, 2026

Conversation

xmtp-coder-agent (Contributor) commented Apr 10, 2026

Resolves #3444

Summary

The Cache all Nix Outputs workflow has been failing repeatedly on the warp-macos-26-arm64-12x runner because tcl 8.6.16 (pinned via nixpkgs rev 09061f74…) does not cross-compile to {x86_64,aarch64}-unknown-linux-musl from a Darwin build host. There are two independent tcl bugs:

  1. compat/mkstemp.c is missing <string.h>: strlen() is used without a declaration, and gcc 15.2.0 promotes -Wimplicit-function-declaration to a hard error. Host-independent.
  2. unix/configure defines TCL_WIDE_CLICKS + MAC_OSX_TCL from the build host's uname -s — specifically, configure.in:557 has an unguarded if test "`uname -s`" = "Darwin" branch that runs regardless of the --host triple. On a Darwin build host it unconditionally defines both macros, and tclUnixTime.c then pulls in <mach/mach_time.h> against a linux-musl sysroot. Darwin-build-host-only.

The cascade is tcl → sqlite → cargo-package-deps → bindings-node-js-napi / mls-validation-service → devour-output.

Why this only hits CI, not neekolas's laptop

cache.nixos.org hosts cross-compile outputs keyed by build-host. Hydra only builds the x86_64-linux → {x86_64,aarch64}-linux-musl cross chain, not aarch64-darwin → *-linux-musl. A developer on an Apple Silicon Mac who has ever had the target in their local store, uses a remote Linux builder, or populates /nix/store via just backend / dev/up can substitute the result and never touch the compile path. A cold macOS build is the only thing that reliably reproduces — which is exactly what #3408 made the fh-cache workflow do (active-build + invalidated WarpBuilds cache + new runner image + fresh flake.lock).

Previous attempt

Commit 4d17631 tried to patch tcl by exporting tcl_cv_sys_version=Linux in preConfigure, redirecting tcl's SC_CONFIG_SYSTEM autoconf macro. That was incomplete: the rogue uname -s check at configure.in:557 is a separate code path that the autoconf cache variable does not touch. A correct tcl patch would need to regenerate configure with autoreconf or patch both the generated script and the MAC_OSX_SRCS makefile variable — fragile across nixpkgs revisions.
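
For context, the superseded approach might have looked roughly like the overlay below. This is a reconstruction from the description above, not the contents of commit 4d17631. Only the preConfigure export is shown, and the overall shape of the overlay is an assumption.

```nix
# Sketch of the superseded tcl overlay (assumed shape, not the real commit).
final: prev: {
  tcl-8_6 = prev.tcl-8_6.overrideAttrs (old: {
    # Append to the existing preConfigure and pre-seed the autoconf cache
    # variable that SC_CONFIG_SYSTEM would otherwise derive from `uname -s`
    # on the build host. The leading "\n" lives inside optionalString so that
    # non-Linux hosts see an unchanged preConfigure (and an unchanged hash).
    preConfigure =
      (old.preConfigure or "")
      + prev.lib.optionalString prev.stdenv.hostPlatform.isLinux "\nexport tcl_cv_sys_version=Linux\n";
  });
  # Keep the unversioned alias pointing at the patched derivation.
  tcl = final.tcl-8_6;
}
```

As the paragraph above notes, this only redirects the cache variable. The separate uname -s branch at configure.in:557 is untouched by it, which is why the approach was dropped.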

Fix

Rather than fight tcl, sidestep it. sqlite only depends on tcl for its tclsqlite3 extension and its test harness. libxmtp consumes libsqlite3 directly via diesel/rusqlite, so --disable-tcl is safe. sqlite's autosetup uses the bundled autosetup/jimsh0.c for its own code generation when tcl is absent (documented at autosetup/sqlite-config.tcl:243).

The new overlay in nix/lib/default.nix:

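# Overlay: build sqlite without tcl, but only when the host platform is musl.
(
  final: prev:
  prev.lib.optionalAttrs prev.stdenv.hostPlatform.isMusl {
    sqlite = prev.sqlite.overrideAttrs (old: {
      # Drop any --with-tcl=... flag, then disable tcl explicitly.
      configureFlags =
        (prev.lib.filter (f: !(prev.lib.hasPrefix "--with-tcl=" f)) old.configureFlags)
        ++ [ "--disable-tcl" ];
      # Remove tcl* packages from the build-time tools.
      nativeBuildInputs = prev.lib.filter (p: !(prev.lib.hasPrefix "tcl" (p.pname or ""))) (
        old.nativeBuildInputs or [ ]
      );
      # The test suite needs tclsh, so skip it.
      doCheck = false;
    });
  }
)
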
  • Gated on stdenv.hostPlatform.isMusl — no-op for every other target.
  • Strips --with-tcl=... from configureFlags and appends --disable-tcl (a first-party supported configuration in the nixpkgs sqlite derivation; it's already enabled on isStatic at sqlite/default.nix:85).
  • Filters tcl* packages out of nativeBuildInputs.
  • Disables doCheck — sqlite's test suite runs srctree-check.tcl via tclsh and would pull tcl back in otherwise.
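
One way to check the properties in the bullets above at evaluation time is sketched here. It assumes a `<nixpkgs>` entry on `NIX_PATH` and uses `pkgsCross.musl64` as a stand-in for the repository's own cross package sets, which are not shown in this PR. The file name and the `noTclOnMusl` binding are hypothetical.

```nix
# inspect-musl-sqlite.nix (hypothetical file name)
# Evaluate with: nix-instantiate --eval --strict inspect-musl-sqlite.nix
let
  # The overlay from this PR, repeated so the file is self-contained.
  noTclOnMusl =
    final: prev:
    prev.lib.optionalAttrs prev.stdenv.hostPlatform.isMusl {
      sqlite = prev.sqlite.overrideAttrs (old: {
        configureFlags =
          (prev.lib.filter (f: !(prev.lib.hasPrefix "--with-tcl=" f)) old.configureFlags)
          ++ [ "--disable-tcl" ];
        nativeBuildInputs = prev.lib.filter (p: !(prev.lib.hasPrefix "tcl" (p.pname or ""))) (
          old.nativeBuildInputs or [ ]
        );
        doCheck = false;
      });
    };

  pkgs = import <nixpkgs> { overlays = [ noTclOnMusl ]; };

  # Assumption: pkgsCross.musl64 stands in for the x86_64 musl cross set.
  sqliteMusl = pkgs.pkgsCross.musl64.sqlite;
in
{
  # Expect "--disable-tcl" and no "--with-tcl=..." entry here.
  configureFlags = sqliteMusl.configureFlags;
  # Expect no name starting with "tcl" here.
  nativeBuildInputNames = map (p: p.pname or "") sqliteMusl.nativeBuildInputs;
}
```

Because the overlay keys on hostPlatform.isMusl, the build-platform package set used for build-time tools is left alone; only the package set whose host is musl gets the tcl-free sqlite.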

Validation (all local)

  • Musl x86_64 cross sqlite: builds cleanly; closure has zero tcl entries (verified via nix derivation show: inputs.drvs contains no tcl, env.configureFlags has only --disable-tcl, env.nativeBuildInputs has no tcl).

  • Musl aarch64 cross sqlite: builds cleanly.

  • End-to-end: mls-validation-service-x86_64-unknown-linux-musl (the canonical downstream target in the failing workflow) builds successfully from a cold store. Output: /nix/store/fybjfq0yjd5dk6hmd9jr2n2ymky4s9v0-mls-validation-service-x86_64-unknown-linux-musl-1.10.0.

  • Native sqlite invariance: native drv hashes identical with and without the overlay:

    • x86_64-linux: 53j5kr3aq86wnyczwlmlh0a0n48nhnk4-sqlite-3.51.2.drv (both)
    • aarch64-darwin: 9lkza91vfmh6h5f2r3vdg1w04m3ymn0f-sqlite-3.51.2.drv (both)

    ⇒ non-musl consumers continue to substitute from cache.nixos.org with no cache-miss regression.

  • nixfmt-rfc-style clean on nix/lib/default.nix.

  • CI (pending): the Cache all Nix Outputs workflow on warp-macos-26-arm64-12x is the only test that exercises the actual cold darwin-host cross path; this PR's CI run is the definitive end-to-end check.
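
The native invariance check listed above can also be expressed as a small evaluation, sketched here under the assumption that `<nixpkgs>` is on `NIX_PATH`. The overlay is trimmed to its gating logic, and the names `muslOnly`, `plain`, and `patched` are hypothetical.

```nix
# check-native-invariance.nix (hypothetical file name)
# Evaluate with: nix-instantiate --eval --strict check-native-invariance.nix
let
  # Same gating as the overlay in this PR, with a minimal override body.
  muslOnly =
    final: prev:
    prev.lib.optionalAttrs prev.stdenv.hostPlatform.isMusl {
      sqlite = prev.sqlite.overrideAttrs (old: {
        doCheck = false;
      });
    };

  # Two imports of the same nixpkgs: one without overlays, one with the gate.
  plain = import <nixpkgs> { overlays = [ ]; };
  patched = import <nixpkgs> { overlays = [ muslOnly ]; };
in
{
  # On a non-musl native system the overlay contributes no attributes,
  # so both imports should resolve sqlite to the same derivation path.
  sameDrv = plain.sqlite.drvPath == patched.sqlite.drvPath;
}
```

An identical drvPath implies an identical output path, which is what lets non-musl consumers keep substituting from cache.nixos.org.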

Test plan

  • nix build of musl cross sqlite for both x86_64 and aarch64 succeeds locally.
  • nix derivation show confirms no tcl in the overridden sqlite's inputs.drvs, env.configureFlags, or env.nativeBuildInputs.
  • mls-validation-service-x86_64-unknown-linux-musl builds end-to-end from a cold store.
  • Native sqlite drv hashes unchanged on x86_64-linux and aarch64-darwin (no regression for non-musl consumers).
  • nixfmt-rfc-style clean.
  • CI Cache all Nix Outputs workflow passes on warp-macos-26-arm64-12x.

🤖 Generated with Claude Code

Note

Disable TCL in sqlite builds on musl cross hosts

Adds a Nix overlay in nix/lib/default.nix that overrides the sqlite derivation when the host platform is musl. The override strips any --with-tcl= configure flag, appends --disable-tcl, removes tcl-related entries from nativeBuildInputs, and sets doCheck = false.

Macroscope summarized 2d99c88.

xmtp-coder-agent and others added 3 commits April 10, 2026 17:20
Two bugs in tcl 8.6.16 (pinned via nixpkgs 09061f74...) were cascading
into failures of mls-validation-service, bindings-node-js-napi, and the
devour-output aggregates on the Cache all Nix Outputs workflow:

 1. compat/mkstemp.c calls strlen() without including <string.h>, which
    gcc 15 promotes from warning to error under the musl cross toolchain
    (-Wimplicit-function-declaration). Hit on x86_64-unknown-linux-musl
    from any build host.

 2. unix/tcl.m4's SC_CONFIG_SYSTEM reads `uname -s` on the *build* host
    to set tcl_cv_sys_version. On the warp-macos-26-arm64-12x runner
    this becomes Darwin-*, which selects the MAC_OSX_TCL / MAC_OSX_OBJS
    code path and tries to compile tclMacOSXFCmd.c / include
    mach/mach_time.h against a linux-musl sysroot. Hit on
    aarch64-unknown-linux-musl from the macOS runner.

Add an overlay in nix/lib/default.nix (applied to both native and
mkCrossPkgs imports) that:

  - appends a postPatch to inject `#include <string.h>` into
    compat/mkstemp.c via substituteInPlace --replace-fail (fail-loud
    if the marker line ever changes upstream);
  - appends a preConfigure that exports tcl_cv_sys_version=Linux when
    stdenv.hostPlatform.isLinux, so cross-builds to linux targets
    never consult the build host's uname;
  - rebinds tcl to the patched tcl-8_6 so the alias and explicit
    version resolve to the same derivation.

Verified locally:
  - x86_64-unknown-linux-musl tcl cross-build succeeds from cold cache
    (Bug 1 regression).
  - aarch64-unknown-linux-musl tcl cross-build succeeds with a stacked
    overlay that simulates the Darwin host by forcing
    tcl_cv_sys_version=Darwin-24.0.0 before configure. Without the fix
    this reproduces the macOS header errors; with the fix applied on
    top, the export is overridden and the build completes (Bug 2
    regression).
  - Native x86_64-linux tcl still builds and runs (tclsh returns
    8.6.16).
  - Native aarch64-darwin tcl derivation's preConfigure is unchanged
    (still just 'cd unix'), so native Darwin builds are untouched.

Remove this override when the nixpkgs pin is bumped past a rev that
adds the missing include and honors the autoconf host triple in
SC_CONFIG_SYSTEM.

Resolves xmtp#3444

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The previous commit (4d17631) tried to patch tcl 8.6.16 in an overlay
by exporting tcl_cv_sys_version=Linux in preConfigure, redirecting the
SC_CONFIG_SYSTEM autoconf macro. That was incomplete: tcl's unix
configure script has a *separate* `uname -s` check at configure.in:557
that unconditionally defines TCL_WIDE_CLICKS + MAC_OSX_TCL on Darwin
build hosts, which causes tclUnixTime.c to include <mach/mach_time.h>
against a linux-musl sysroot regardless of the autoconf cache variable.

A correct tcl patch would require either regenerating configure with
autoreconf or patching both the generated configure script and the
MAC_OSX_SRCS makefile variable — fragile across nixpkgs revisions.

Also, neekolas pointed out on the issue that `nix build
.#validation-service-image` works on his Apple Silicon Mac. Reason:
cache.nixos.org hosts the x86_64-linux-host cross tcl outputs (Hydra
builds them) but not the aarch64-darwin-host cross outputs. So the tcl
bugs only materialize on cold darwin-host cross builds — which is
exactly what the fh-cache workflow started doing after PR xmtp#3408
activated "build while caching" on the new warp-macos-26-arm64-12x
runner with a fresh flake.lock that invalidated the WarpBuilds cache.

Rather than fight tcl, sidestep it: sqlite only depends on tcl for its
tclsqlite3 extension and its test harness. libxmtp consumes libsqlite3
directly, so --disable-tcl is safe. sqlite's autosetup uses the bundled
autosetup/jimsh0.c for its own code generation when tcl is disabled
(see sqlite autosetup/sqlite-config.tcl). This also means no cross-tcl
gets built for any libxmtp target, since sqlite is the only consumer.

The overlay is gated on stdenv.hostPlatform.isMusl so native sqlite on
linux and darwin keeps substituting from cache.nixos.org unchanged.
Verified: native sqlite drv hash is identical with and without the
overlay on both x86_64-linux (53j5kr3aq86...) and aarch64-darwin
(9lkza91vfmh...), so no cache-miss regression for non-musl consumers.

Validation (local):
- x86_64-unknown-linux-musl sqlite cross-builds cleanly, closure has
  zero tcl entries (verified via `nix derivation show`).
- aarch64-unknown-linux-musl sqlite cross-builds cleanly.
- End-to-end: mls-validation-service-x86_64-unknown-linux-musl builds
  successfully from a cold store, proving the full failure chain is
  broken.

See xmtp#3444 for full analysis.
xmtp-coder-agent changed the title from "fix(nix): patch tcl 8.6.16 for linux-musl cross-compile (mkstemp.c + SC_CONFIG_SYSTEM)" to "fix(nix): drop tcl from sqlite on musl cross hosts to unblock fh-cache" on Apr 10, 2026
neekolas marked this pull request as ready for review on April 10, 2026 23:49
neekolas requested a review from a team as a code owner on April 10, 2026 23:49
Comment thread nix/lib/default.nix
# aarch64-darwin), so builds from a darwin host hit the bugs cold.
# See https://github.com/xmtp/libxmtp/issues/3444.
#
# Symptoms seen in CI on warp-macos-26-arm64-12x:
Contributor

in reality we should probably replace all sqlite references with sqlcipher, since that's what we actually use, but this fix is ok for now

Contributor Author

Ack — and agreed that a sqlcipher migration would be a cleaner long-term story. For this PR I traced the actual closure to make sure the override lands in the right place:

$ nix-store --query --requisites \
    $(nix eval --raw .#packages.x86_64-linux.mls-validation-service-x86_64-unknown-linux-musl.drvPath) \
  | grep -E '(sqlite|sqlcipher|tcl)'

Findings:

  • sqlite-x86_64-unknown-linux-musl-3.51.2.drv is in the closure (pulled via libsqlite3-sys / rusqlite). This is the derivation the overlay targets, and nix derivation show on it confirms zero tcl in inputs.drvs, --disable-tcl in configureFlags, and no tcl in nativeBuildInputs.
  • sqlcipher is not in the closure at all for the mls-validation-service nix package — it's only referenced in nix/shells/{local,rust}.nix (dev shell) and nix/package/xnet-gui.nix. The bindings-node-js-napi and mls-validation-service musl builds link against plain libsqlite3 via libsqlite3-sys, not sqlcipher.
  • Native tcl-8.6.16.drv (the build-host tcl, not cross) remains in the closure — pulled by native sqlite as a build-time tool. That's fine because native tcl on darwin builds against darwin headers and the cross-compile bug is only tripped when --host=*-linux-musl meets tcl_cv_sys_version=Darwin.

So for the failing chain (tcl-*-unknown-linux-musl → sqlite-*-unknown-linux-musl → bindings-node-js-napi / mls-validation-service), dropping tcl from the musl cross sqlite is the minimum that breaks it. A full sqlcipher migration would be a bigger surface change and wouldn't move this issue forward: sqlcipher also has a tcl dep, it'd just be pulling native tcl (same as sqlite does today) rather than cross-tcl. Happy to open a follow-up issue for the sqlcipher unification if you'd like.

neekolas merged commit 940fbb8 into xmtp:main on Apr 13, 2026
32 checks passed
insipx pushed a commit that referenced this pull request Apr 13, 2026
…3472)

Resolves #3470

## Summary

- Pre-seeds `kyua_cv_getopt_plus=yes` in the atf package override when
cross-compiling (`buildPlatform != hostPlatform`), fixing the
`AC_RUN_IFELSE` failure that aborts `atf` configure with "cannot run
test program while cross compiling"
- Unblocks the `aarch64-apple-darwin` build chain: `atf -> libiconv ->
apple-sdk-14.4 -> bindings-node-js-napi-*`
- Gated on cross-compilation only, so native builds continue pulling
from `cache.nixos.org` unchanged

## Context

The `atf-0.23` configure script (`m4/module-application.m4`) uses
`AC_RUN_IFELSE` to check whether `getopt(3)` accepts a leading `+` for
POSIX behavior. During cross-compilation the compiled test binary cannot
execute on the build host, causing configure to abort. All target
platforms in this flake (Darwin, glibc Linux, musl Linux) support `+` in
getopt, so pre-seeding `yes` is correct.
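
A minimal sketch of what such a pre-seed could look like as a standalone overlay follows. The repository's actual atf override is not shown here, so the structure below is an assumption; only the cache variable name and the cross-compilation gate come from the description above.

```nix
# Sketch only: a standalone overlay, not the repository's actual atf override.
final: prev: {
  atf = prev.atf.overrideAttrs (old: {
    configureFlags =
      (old.configureFlags or [ ])
      # Pre-seed the cache variable so configure skips the AC_RUN_IFELSE
      # probe, which cannot run a host binary on the build machine.
      # The list is empty for native builds, leaving their flags unchanged.
      ++ prev.lib.optionals (prev.stdenv.buildPlatform != prev.stdenv.hostPlatform) [
        "kyua_cv_getopt_plus=yes"
      ];
  });
}
```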

This follows the same pattern as the sqlite/tcl cross-compilation fix
from #3450.

## Test plan

- [ ] CI "Cache all Nix Outputs" workflow passes on
`warp-macos-26-arm64-12x` runner
- [ ] Native Darwin builds still substitute from cache (no derivation
hash change for native atf)


> [!NOTE]
> ### Pre-seed `kyua_cv_getopt_plus` configure flag in `atf` to fix
cross-compilation
> During cross-compilation, autoconf cache variables are not
automatically populated, causing the `atf` build to fail. The overlay in
[default.nix](https://github.com/xmtp/libxmtp/pull/3472/files#diff-6fd175d36064b2e9b9371596932bf201514494fc02c317f3f4526243d72991f1)
now appends `kyua_cv_getopt_plus=yes` to `atf`'s `configureFlags` when
`buildPlatform != hostPlatform`. Native builds are unaffected.
>
> Macroscope summarized 4127f19.

Co-authored-by: xmtp-coder-agent <xmtp-coder-agent@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Flaky CI Failure: Cache all Nix Outputs fails — tcl cross-compile error (strlen implicit declaration)

3 participants