Commit 54c0729

jahooma and claude committed
Switch tree-sitter wasm embed from base64 string to with { type: 'file' }
The base64-in-source approach didn't survive `bun --compile` on Windows. The CI build's `verifyTreeSitterWasmEmbedded` step caught it:

    Embedded tree-sitter.wasm from D:\a\...\tree-sitter.wasm (205488 bytes → 273984 chars base64)
    [343ms] minify -16.58 MB
    Embedded tree-sitter wasm prefix not found in D:\a\...\codebuff.exe.

So the embed step wrote the bytes to disk and bun read them, but the 274KB string literal didn't end up in the compiled output; it was likely tree-shaken or transformed by the minifier on Windows. The same code worked on macOS and Linux locally and in CI.

Switch to Bun's documented asset-embed mechanism: import the wasm with `with { type: 'file' }`. Bun handles this through the bundler's asset pipeline rather than as a generic string literal, and the resulting binary contains the wasm bytes verbatim at a bunfs path.

- cli/src/pre-init/tree-sitter-wasm.ts: import the wasm path, set the env var (for the locateFile fallback), and try a synchronous read so Parser.init can take the wasmBinary fast path. If the read throws (some Windows configurations have done this), log loudly so user reports include the diagnostic, then fall through to the locateFile flow, which init-node.ts now accepts bunfs paths through, even when fs.existsSync misreports them.
- The --smoke-tree-sitter handler is now a top-level `await` instead of a fire-and-forget IIFE. Without that, commander.parse() ran synchronously in main() and failed on the unknown flag before the smoke handler could exit cleanly.
- cli/scripts/build-binary.ts: drop the base64 stub-overwrite step entirely. New verifyTreeSitterWasmEmbedded reads a 64-byte chunk from the *middle* of the source wasm and asserts it appears in the compiled binary. That proves *this specific* tree-sitter.wasm shipped, not just any wasm (OpenTUI also embeds tree-sitter language wasms, so a magic-bytes-only scan would false-pass).
- Delete cli/src/pre-init/tree-sitter-wasm-bytes.ts: no longer used.

Verified locally: build embeds tree-sitter.wasm via the file-attribute import, post-build verification finds the source bytes at offset 77319353 of the compiled binary, --smoke-tree-sitter exits 0 with "tree-sitter smoke ok (wasmBinary, 205488 bytes)".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
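The ordering problem behind the move from a fire-and-forget IIFE to a top-level `await` can be reproduced in isolation. The sketch below is illustrative only: `smokeHandler` and `parseArgs` are hypothetical stand-ins (for the async smoke check and for a synchronous `commander.parse()` call), not code from this repository.

```typescript
const events: string[] = []

// Hypothetical stand-in for an async handler, e.g. one that awaits Parser.init.
async function smokeHandler(): Promise<void> {
  events.push('handler:start')
  await Promise.resolve() // first suspension point; control returns to the caller
  events.push('handler:done')
}

// Hypothetical stand-in for a synchronous CLI parser such as commander.parse().
function parseArgs(): void {
  events.push('parse')
}

// Fire-and-forget: the handler runs only up to its first `await`, then the
// synchronous code that follows executes before the handler can finish.
void smokeHandler()
parseArgs()

// Snapshot taken at the moment the parser ran.
const orderAtParseTime = events.join(' -> ')
console.log(orderAtParseTime)
```

At the time `parseArgs()` runs, the handler has started but not finished, which matches the failure mode described above. Awaiting the handler at the top level of an ES module instead makes evaluation of dependent modules wait until it settles.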
1 parent ad6a900 commit 54c0729

3 files changed

Lines changed: 113 additions & 160 deletions

cli/scripts/build-binary.ts

Lines changed: 36 additions & 95 deletions
@@ -145,11 +145,6 @@ async function main() {
   patchOpenTuiAssetPaths()
   await ensureOpenTuiNativeBundle(targetInfo)
 
-  const treeSitterEmbed = embedTreeSitterWasmAsBase64()
-  // Restore the stub even on build failure so a developer's git working
-  // tree doesn't end up with a multi-megabyte modified file.
-  process.on('exit', treeSitterEmbed.restore)
-
   const outputFilename =
     targetInfo.platform === 'win32' ? `${binaryName}.exe` : binaryName
   const outputFile = join(binDir, outputFilename)
@@ -191,20 +186,12 @@ async function main() {
 
   runCommand('bun', buildArgs, { cwd: cliRoot })
 
-  // Build done — restore the stub so a developer's working tree doesn't show
-  // a multi-megabyte diff. (The exit handler above is a backstop for crashes;
-  // the eager call here keeps a successful build clean.)
-  treeSitterEmbed.restore()
-
-  // Fail the build if the wasm bytes didn't actually make it into the
-  // compiled binary. Catches silent regressions (e.g. bun dropping a huge
-  // string literal, or some future bundler optimization) before we ship a
-  // broken artifact to users.
-  verifyTreeSitterWasmEmbedded(
-    outputFile,
-    treeSitterEmbed.wasmBase64Prefix,
-    treeSitterEmbed.wasmByteLength,
-  )
+  // Fail the build if the wasm asset didn't actually make it into the
+  // compiled binary. The pre-init imports tree-sitter.wasm with `with {
+  // type: 'file' }`, which Bun should embed; this scan catches silent
+  // regressions (e.g. tree-shaking eliminating the import) before we ship
+  // a broken artifact.
+  verifyTreeSitterWasmEmbedded(outputFile)
 
   if (targetInfo.platform !== 'win32') {
     chmodSync(outputFile, 0o755)
@@ -225,39 +212,20 @@ main().catch((error: unknown) => {
 })
 
 /**
- * Inline the contents of `web-tree-sitter/tree-sitter.wasm` as a base64 string
- * literal in `cli/src/pre-init/tree-sitter-wasm-bytes.ts`. The committed
- * file is a stub; this overwrites it with the real bytes immediately before
- * `bun build --compile`, so the bytes get baked into the binary's text
- * segment instead of being placed at a bunfs path that has to be fs-read at
- * runtime.
+ * Sanity-check the compiled binary actually contains web-tree-sitter's
+ * tree-sitter.wasm. The pre-init imports it via `with { type: 'file' }`,
+ * which should bundle the asset at a bunfs path. If tree-shaking or a
+ * future bundler change drops the import, the binary still compiles but
+ * tree-sitter init fails at runtime — this scan fails the build before
+ * we upload that artifact.
  *
- * Returns a function that restores the stub. Always invoke it (success or
- * failure) so a developer's working tree doesn't show a multi-MB diff.
+ * Looks for the actual wasm bytes (a unique 64-byte chunk pulled from
+ * the source file's interior), not just the wasm magic header — OpenTUI
+ * embeds its own tree-sitter language wasms, so a magic-bytes-only scan
+ * would false-pass even without our import. A literal bytes match
+ * proves *this specific* wasm shipped.
  */
-function embedTreeSitterWasmAsBase64(): {
-  restore: () => void
-  wasmBase64Prefix: string
-  wasmByteLength: number
-} {
-  const stubPath = join(cliRoot, 'src', 'pre-init', 'tree-sitter-wasm-bytes.ts')
-  const originalStub = readFileSync(stubPath, 'utf8')
-  let restored = false
-  const restore = (): void => {
-    if (restored) return
-    restored = true
-    try {
-      writeFileSync(stubPath, originalStub)
-    } catch (error) {
-      console.error('Failed to restore tree-sitter-wasm-bytes stub:', error)
-    }
-  }
-
-  // Try multiple candidate locations because bun's hoisting differs by
-  // platform and install command — Windows CI does `bun install --cwd cli`
-  // which can leave web-tree-sitter in cli/node_modules, while monorepo
-  // root installs hoist it to ../node_modules. Fall back to createRequire
-  // last so any failure surfaces with the full search trail.
+function verifyTreeSitterWasmEmbedded(outputFile: string): void {
   const candidates = [
     join(cliRoot, 'node_modules', 'web-tree-sitter', 'tree-sitter.wasm'),
     join(cliRoot, '..', 'node_modules', 'web-tree-sitter', 'tree-sitter.wasm'),
@@ -270,64 +238,37 @@ function embedTreeSitterWasmAsBase64(): {
       wasmPath = cliRequire.resolve('web-tree-sitter/tree-sitter.wasm')
     } catch (err) {
       throw new Error(
-        `Could not locate web-tree-sitter/tree-sitter.wasm. Searched:\n - ` +
+        `Could not locate web-tree-sitter/tree-sitter.wasm to verify against. Searched:\n - ` +
           candidates.join('\n - ') +
           `\nAnd createRequire failed: ${err instanceof Error ? err.message : String(err)}`,
       )
     }
   }
 
-  const wasmBytes = readFileSync(wasmPath)
-  const base64 = wasmBytes.toString('base64')
-
-  const generated =
-    `// AUTO-GENERATED by cli/scripts/build-binary.ts during \`bun build --compile\`.\n` +
-    `// Restored to the empty stub after the build finishes — do not commit a\n` +
-    `// non-empty value here.\n` +
-    `export const TREE_SITTER_WASM_BASE64 = ${JSON.stringify(base64)}\n`
-
-  writeFileSync(stubPath, generated)
-  // Always-on log (not behind VERBOSE) so CI shows which path was used and
-  // whether the embed succeeded — this is the single most useful breadcrumb
-  // when the runtime check fails on a user machine.
-  logAlways(
-    `Embedded tree-sitter.wasm from ${wasmPath} (${wasmBytes.length} bytes → ${base64.length} chars base64)`,
-  )
-  return {
-    restore,
-    wasmBase64Prefix: base64.slice(0, 40),
-    wasmByteLength: wasmBytes.length,
-  }
-}
+  const wasm = readFileSync(wasmPath)
+  // Take a 64-byte slice from the middle of the file. The header has
+  // generic wasm magic + section markers; the tail can be padding. The
+  // middle is densely packed code/data unique to this specific wasm
+  // module.
+  const needleStart = Math.floor(wasm.length / 2)
+  const needle = wasm.subarray(needleStart, needleStart + 64)
 
-/**
- * Sanity-check the compiled binary actually contains the embedded base64.
- * If bun --compile ever silently drops a large string literal, or our embed
- * step's file write didn't take effect before the bundle ran, we want the
- * build to fail here instead of producing a binary that crashes for users.
- */
-function verifyTreeSitterWasmEmbedded(
-  outputFile: string,
-  wasmBase64Prefix: string,
-  wasmByteLength: number,
-): void {
   const binary = readFileSync(outputFile)
-  // Search as a Buffer so we don't have to load the whole binary as a UTF-8
-  // string (binaries are not valid UTF-8 and toString would corrupt bytes).
-  const needle = Buffer.from(wasmBase64Prefix, 'utf8')
   const idx = binary.indexOf(needle)
   if (idx === -1) {
     throw new Error(
-      `Embedded tree-sitter wasm prefix not found in ${outputFile}.\n` +
-        `Expected base64 prefix (first 40 chars): ${wasmBase64Prefix}\n` +
-        `Original wasm size: ${wasmByteLength} bytes.\n` +
-        `This means the build-binary.ts embed step ran but bun --compile\n` +
-        `did not include the bytes in the output. The runtime smoke test\n` +
-        `would fall back to path-based wasm resolution, which is broken on\n` +
-        `Windows.`,
+      `web-tree-sitter wasm content not found in ${outputFile}.\n` +
+        `Source wasm: ${wasmPath} (${wasm.length} bytes)\n` +
+        `Searched for 64 bytes from offset ${needleStart} of the source.\n` +
+        `Either the \`with { type: 'file' }\` import in the pre-init was\n` +
+        `tree-shaken out, or bun --compile didn't embed the asset on this\n` +
+        `platform. The runtime tree-sitter init would fail with\n` +
+        `"Internal error: tree-sitter.wasm not found".`,
     )
   }
-  logAlways(`Verified embedded wasm prefix at offset ${idx} of compiled binary.`)
+  logAlways(
+    `Verified embedded tree-sitter.wasm at offset ${idx} of compiled binary (source: ${wasmPath}).`,
+  )
 }
 
 function patchOpenTuiAssetPaths() {
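
To see why the new check searches for a chunk from the middle of the file rather than the wasm magic header, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: `findAssetInBinary` and `fakeWasm` are hypothetical helpers, and the "binaries" are small in-memory buffers, not real build output.

```typescript
import { Buffer } from 'node:buffer'

// Hypothetical helper: offset of a mid-file chunk of `asset` inside `binary`, or -1.
function findAssetInBinary(asset: Buffer, binary: Buffer, needleLength = 64): number {
  const start = Math.floor(asset.length / 2)
  const needle = asset.subarray(start, start + needleLength)
  return binary.indexOf(needle)
}

// The 8-byte header every wasm module starts with ("\0asm" + version 1).
const WASM_MAGIC = Buffer.from([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00])

// Synthetic "module": shared magic header plus a body that depends on `multiplier`.
function fakeWasm(multiplier: number, size = 256): Buffer {
  const body = Buffer.alloc(size)
  for (let i = 0; i < size; i++) body[i] = (i * multiplier) & 0xff
  return Buffer.concat([WASM_MAGIC, body])
}

const ours = fakeWasm(31) // stands in for tree-sitter.wasm
const other = fakeWasm(17) // stands in for some unrelated embedded wasm

const padding = Buffer.alloc(100, 0xee)
const binaryWithBoth = Buffer.concat([padding, other, ours])
const binaryWithOtherOnly = Buffer.concat([padding, other])

// A magic-bytes scan passes even when our module is missing...
const magicHit = binaryWithOtherOnly.indexOf(WASM_MAGIC)
// ...while the mid-file chunk only matches when our module is really there.
const missing = findAssetInBinary(ours, binaryWithOtherOnly)
const present = findAssetInBinary(ours, binaryWithBoth)

console.log({ magicHit, missing, present })
```

The returned offset points at the chunk, not at the start of the embedded file, which is consistent with the build script logging it only as a diagnostic.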

cli/src/pre-init/tree-sitter-wasm-bytes.ts

Lines changed: 0 additions & 16 deletions
This file was deleted.
cli/src/pre-init/tree-sitter-wasm.ts

Lines changed: 77 additions & 49 deletions
@@ -1,71 +1,99 @@
 // Embed tree-sitter.wasm into the bun-compile binary so the SDK's tree-sitter
 // parser singleton can find it at runtime. Must be the very first import in
 // `index.tsx`: subsequent imports (the SDK / code-map) eagerly construct the
-// parser, and its init reads what we publish here on `globalThis`.
+// parser, and its init reads what we publish here on `globalThis` and via
+// the env var.
 //
-// Why not `with { type: 'file' }` + a runtime fs read? That's what the prior
-// fix tried, and it silently failed on Windows: bun --compile reports the
-// embedded asset path as `B:\~BUN\root\...`, and on some Windows configs
-// `fs.readFileSync` of that path throws (caught silently), so the SDK fell
-// back to path-based resolution that also failed there.
-//
-// The base64 string in `tree-sitter-wasm-bytes.ts` is replaced with the real
-// wasm contents by `cli/scripts/build-binary.ts` right before `bun build
-// --compile` and restored after. The bytes end up in the binary's text
-// segment as a JS string literal — no filesystem step on the hot path. In
-// dev / unit tests the stub is empty and code-map falls back to the
-// node_modules wasm, which works because the file actually exists locally.
+// Why `with { type: 'file' }` rather than embedding base64 in TS source:
+// the latter doesn't survive `bun --compile` on Windows. The base64 string
+// gets dropped or transformed somewhere in the bundle/minify pipeline, so
+// the runtime sees an empty stub even though the build script wrote the
+// real bytes. `with { type: 'file' }` is Bun's documented asset-embed
+// path — the file gets placed at a bunfs location the runtime can read.
+
+import { readFileSync } from 'fs'
 
-import { TREE_SITTER_WASM_BASE64 } from './tree-sitter-wasm-bytes'
+// @ts-expect-error - Bun's `with { type: 'file' }` returns a string path; TS
+// has no loader for the .wasm subpath of web-tree-sitter's package exports.
+import treeSitterWasmPath from 'web-tree-sitter/tree-sitter.wasm' with {
+  type: 'file',
+}
 
 let embeddedWasm: Uint8Array | undefined
-if (TREE_SITTER_WASM_BASE64.length > 0) {
-  const buf = Buffer.from(TREE_SITTER_WASM_BASE64, 'base64')
-  embeddedWasm = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)
-  // globalThis is the only cross-bundle channel: the SDK pre-built bundle
-  // inlines its own copy of `init-node.ts`, so a module-level variable in
-  // the source package isn't visible to the singleton initialized via the
-  // SDK. Slice into a fresh Uint8Array view instead of handing over the
-  // Buffer's shared underlying ArrayBuffer.
-  ;(
-    globalThis as { __CODEBUFF_TREE_SITTER_WASM_BINARY__?: Uint8Array }
-  ).__CODEBUFF_TREE_SITTER_WASM_BINARY__ = embeddedWasm
+
+if (treeSitterWasmPath) {
+  // Path stays for the locateFile fallback in init-node.ts. That fallback
+  // accepts bunfs-style paths (`/~BUN/root/...`) without checking
+  // fs.existsSync, because fs.existsSync misreports those paths on Windows.
+  // emscripten's wasm loader will fs.readFile them through its own runtime.
+  process.env.CODEBUFF_TREE_SITTER_WASM_PATH = treeSitterWasmPath
+
+  // Also try a synchronous read so we can hand the bytes straight to
+  // Parser.init via wasmBinary — bypassing locateFile entirely is the most
+  // robust path. If readFileSync of the bunfs path throws on this OS (we've
+  // seen this happen on Windows in some configurations), log it loudly so
+  // the smoke check / user reports include the diagnostic, then fall
+  // through to the locateFile flow.
+  try {
+    const buf = readFileSync(treeSitterWasmPath)
+    embeddedWasm = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)
+    ;(
+      globalThis as { __CODEBUFF_TREE_SITTER_WASM_BINARY__?: Uint8Array }
+    ).__CODEBUFF_TREE_SITTER_WASM_BINARY__ = embeddedWasm
+  } catch (err) {
+    console.error(
+      '[tree-sitter pre-init] readFileSync failed for embedded wasm at',
+      treeSitterWasmPath,
+      '—',
+      err instanceof Error ? err.message : String(err),
+    )
+  }
 }
 
 // Deterministic CI gate: `<binary> --smoke-tree-sitter` proves the embed
 // shipped end-to-end. Lives here, in the very first import, on purpose:
 //
 // - We're testing whether the *embed* works. Going through commander +
-//   initTreeSitterForNode would also pass via the path-resolution
-//   fallback when the embed is empty (e.g. dev mode), giving false
-//   positives that mask a broken production build.
+//   initTreeSitterForNode would pass via the path-resolution fallback
+//   when the embed is empty (e.g. dev mode), giving false positives that
+//   mask a broken production build.
 // - Failing here, before any other module loads, gives a sharp signal:
-//   the embed either worked or it didn't. No render-loop timing, no
-//   commander wiring, no SDK init order to debug.
+//   either the wasm reached the runtime or it didn't.
 //
-// Async IIFE because Parser.init returns a promise; process.exit tears
-// the process down before parallel top-level imports can fire side
-// effects we'd have to clean up.
+// Top-level await (not a fire-and-forget IIFE) because subsequent module
+// evaluation has to *wait* — otherwise `commander.parse()` runs first and
+// fails on the unknown flag before our handler can exit cleanly.
 if (process.argv.includes('--smoke-tree-sitter')) {
-  void (async () => {
-    try {
-      if (!embeddedWasm) {
-        console.error(
-          'tree-sitter smoke FAIL: TREE_SITTER_WASM_BASE64 stub is empty — ' +
-            'the build-binary.ts embed step did not run or did not write the file.',
-        )
-        process.exit(1)
-      }
-      const { Parser } = await import('web-tree-sitter')
+  try {
+    const { Parser } = await import('web-tree-sitter')
+    // Prefer the wasmBinary path (no filesystem step). Fall back to
+    // letting Parser.init resolve the path via its locateFile callback,
+    // which init-node.ts wires up to accept bunfs paths even when
+    // fs.existsSync says otherwise.
+    if (embeddedWasm) {
       await Parser.init({ wasmBinary: embeddedWasm })
-      // Marker grepped by cli/scripts/smoke-binary.ts — keep this exact text.
       console.log(
-        `tree-sitter smoke ok (${embeddedWasm.byteLength} bytes wasm initialized)`,
+        `tree-sitter smoke ok (wasmBinary, ${embeddedWasm.byteLength} bytes)`,
+      )
+    } else if (treeSitterWasmPath) {
+      await Parser.init({
+        locateFile: (name: string) =>
+          name === 'tree-sitter.wasm' ? treeSitterWasmPath : name,
+      })
+      console.log(
+        `tree-sitter smoke ok (locateFile, path=${treeSitterWasmPath})`,
+      )
+    } else {
+      console.error(
+        'tree-sitter smoke FAIL: no embedded wasm path. The `with { type: ' +
+          "'file' }` import returned a falsy value, which means the bundler " +
+          'did not embed the asset.',
       )
-      process.exit(0)
-    } catch (err) {
-      console.error('tree-sitter smoke FAIL:', err)
       process.exit(1)
     }
-  })()
+    process.exit(0)
+  } catch (err) {
+    console.error('tree-sitter smoke FAIL:', err)
+    process.exit(1)
+  }
 }
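
The pre-init builds its `Uint8Array` with an explicit `byteOffset` and `byteLength` instead of wrapping `buf.buffer` directly. A small sketch of why that matters, using a synthetic buffer (the byte values and variable names are made up for illustration):

```typescript
import { Buffer } from 'node:buffer'

// Synthetic 8-byte buffer; the sub-range stands in for bytes read from disk.
const whole = Buffer.from([10, 11, 12, 13, 14, 15, 16, 17])
const part = whole.subarray(2, 6) // shares memory with `whole`, offset by 2 bytes

// Naive view: covers the entire backing ArrayBuffer, not just `part`.
const naive = new Uint8Array(part.buffer)

// Exact view: the same construction the pre-init uses for the wasm bytes.
const exact = new Uint8Array(part.buffer, part.byteOffset, part.byteLength)

console.log(naive.length >= whole.length, exact.length, Array.from(exact))
```

A Node `Buffer` may be a window into a larger, possibly pooled `ArrayBuffer`, so the three-argument constructor is the safe way to get a view over exactly the bytes that were read. The result is still a view rather than a copy, so it shares memory with the original buffer.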
