perf(ast): bulk-insert AST nodes via native Rust/rusqlite #651
carlos-alm merged 8 commits into main from perf/ast-node-bulk-insert
Conversation
Move AST node SQLite inserts from per-row JS iteration to a single native Rust transaction via napi-rs + rusqlite. The new bulkInsertAstNodes function opens the DB directly from Rust, pre-fetches parent node definitions, and inserts all rows in one transaction — eliminating the JS-native FFI overhead per row. The JS-side buildAstNodes tries the native fast path first (when all files have native astNodes arrays), falling back to the existing JS loop for WASM or mixed-engine scenarios. Target: astMs < 50ms on native full builds (was ~393ms).
Claude finished @carlos-alm's task — PR Review: Bulk AST Node Insertion Performance Optimization
Greptile Summary: This PR moves AST node insertion from a per-row JS loop into a single Rust/rusqlite transaction, targeting a drop in astMs from ~393ms to under 50ms. All five issues raised in the prior review round have been resolved.
Confidence Score: 5/5 — Safe to merge. All prior critical issues are resolved, the implementation is correct, and it degrades gracefully to the existing JS path. All five previously-flagged issues (silent error swallowing, missing busy_timeout, null end_line semantics mismatch, partial-commit data duplication, bundled SQLite documentation) are fully addressed. The remaining code is straightforward: the Rust function is correctly atomic, the JS fallback is intact, the parent-resolution logic is verified to match JS semantics across all edge cases, and serde is already a direct dependency so the derive macros compile cleanly. No new P0 or P1 issues found. No files require special attention.
Reviews (4): Last reviewed commit: "Merge branch 'perf/ast-node-bulk-insert'..."
if (!needsJsFallback) {
  const inserted = native.bulkInsertAstNodes(db.name, batches);
  debug(`AST extraction (native bulk): ${inserted} nodes stored`);
  return;
}
Unconditional early return swallows Rust-side failures silently
bulkInsertAstNodes returns 0 for two completely different situations: (a) legitimately zero rows to insert, and (b) any of several hard error conditions in the Rust layer (DB open failure, ast_nodes table missing, conn.transaction() fails with SQLITE_BUSY, prepared-statement error, or tx.commit() failure). The JS code currently treats both outcomes identically and returns early without ever attempting the JS fallback path.
In error scenario (b) every AST node in every file in the batch is silently dropped — no warning is emitted and no fallback occurs. Given that the Rust connection opens the same WAL file without a busy_timeout (see companion comment), a SQLITE_BUSY during conn.transaction() is a realistic trigger.
A minimal fix is to count expected nodes before calling and fall back if the returned count doesn't match:
if (!needsJsFallback) {
const expectedNodes = batches.reduce((s, b) => s + b.nodes.length, 0);
const inserted = native.bulkInsertAstNodes(db.name, batches);
if (inserted === expectedNodes) {
debug(`AST extraction (native bulk): ${inserted} nodes stored`);
return;
}
debug(`AST extraction (native bulk): expected ${expectedNodes}, got ${inserted} — falling back to JS`);
// fall through to JS path
}
Alternatively, change the Rust function signature to Result<u32, napi::Error> so errors are propagated as JS exceptions that can be caught and trigger the fallback.
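A minimal stdlib sketch of that alternative — String stands in for napi::Error here, since the real mapping of Err to a JS exception is napi-rs's job:

```rust
// Sketch: distinguish "zero rows" from "hard failure" in the return type
// instead of overloading 0. The input is simplified to a row-count slice;
// a 0 entry stands in for a DB open / transaction / commit failure.
fn bulk_insert_ast_nodes(rows: &[u32]) -> Result<u32, String> {
    if rows.iter().any(|&r| r == 0) {
        // Real code would propagate the rusqlite error here.
        return Err("transaction failed".to_string());
    }
    Ok(rows.len() as u32) // an empty batch is Ok(0), not an error
}
```

With this shape, Ok(0) unambiguously means "nothing to insert", and the JS caller only needs a try/catch around the native call to trigger the fallback.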
Fixed — the native bulk-insert call now computes expectedNodes before calling and compares it against the returned count. On mismatch (including Rust-side errors that return 0), we log a debug message and fall through to the JS fallback path instead of returning early.
crates/codegraph-core/src/ast_db.rs (Outdated)
// Match the JS-side performance pragmas
let _ = conn.execute_batch("PRAGMA synchronous = NORMAL");
Missing busy_timeout pragma turns SQLITE_BUSY into silent data loss
The comment says "Match the JS-side performance pragmas", but src/db/connection.ts also sets db.pragma('busy_timeout = 5000') on the JS connection. The Rust connection omits this, so conn.transaction() will immediately fail with SQLITE_BUSY if better-sqlite3 happens to hold a WAL write lock at the moment this function is called (e.g., during a concurrent write in the same build phase).
Without a timeout, conn.transaction() returns an error, the function returns 0, and — because the JS caller makes no distinction between 0-rows-success and 0-rows-error — all AST nodes are silently dropped.
// Match the JS-side performance pragmas
let _ = conn.execute_batch(
"PRAGMA synchronous = NORMAL; PRAGMA busy_timeout = 5000",
);
Fixed — added busy_timeout = 5000 to the Rust connection's pragma batch, matching the JS-side busy_timeout setting. This ensures WAL contention retries for 5 seconds instead of failing immediately with SQLITE_BUSY.
tree-sitter-php = "0.23"
tree-sitter-hcl = "1"
rayon = "1"
rusqlite = { version = "0.32", features = ["bundled"] }
bundled feature embeds a second independent copy of SQLite into the process
With features = ["bundled"], rusqlite statically compiles its own SQLite C library. better-sqlite3 already bundles its own copy. The process will therefore contain two separate SQLite runtime instances that each open the same .db file via OS-level file locking.
This is functionally correct — WAL-mode coordination is handled by the OS — but it has a few downsides:
- Binary size grows (SQLite is roughly 1 MB of compiled C)
- rusqlite's bundled SQLite version is independent of better-sqlite3's, requiring separate tracking when a SQLite CVE lands
If the target platform is guaranteed to have a system SQLite (Linux/macOS), removing bundled and relying on libsqlite3-sys with dynamic linking would keep a single SQLite instance. If portability is the priority (e.g., Windows CI without system SQLite), bundled is the safer choice — just worth a conscious decision here.
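If one did want bundled only where it is needed, Cargo's target-specific dependency tables could express that split — a sketch of the option under discussion, not what this PR does:

```toml
# Hypothetical split: link the system SQLite on non-Windows targets and
# bundle only on Windows, where CI has no system libsqlite3.
[target.'cfg(windows)'.dependencies]
rusqlite = { version = "0.32", features = ["bundled"] }

[target.'cfg(not(windows))'.dependencies]
rusqlite = "0.32"
```

The trade-off is build-matrix complexity: the two targets would then run different SQLite versions, which the PR ultimately avoided by keeping bundled everywhere.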
Added a comment documenting the rationale: bundled is intentional because Windows CI lacks a system SQLite, and WAL coordination between the two instances is handled safely at the OS level. Removing it would break Windows builds.
The Rust connection omitted busy_timeout = 5000 which the JS-side connection.ts sets. Without it, SQLITE_BUSY is returned immediately on WAL contention instead of retrying for 5 seconds.
Explain why bundled is intentional: Windows CI lacks system SQLite, and dual-instance WAL coordination is OS-safe.
fn find_parent_id(defs: &[NodeDef], line: u32) -> Option<i64> {
    let mut best_id: Option<i64> = None;
    let mut best_span = u32::MAX;
    for d in defs {
        if d.line <= line {
            if let Some(el) = d.end_line {
                if el >= line {
                    let span = el - d.line;
                    if span < best_span {
                        best_id = Some(d.id);
                        best_span = span;
                    }
                }
            }
        }
    }
    best_id
}
find_parent_id silently disagrees with JS findParentDef on null end_line
The JS findParentDef (ast.ts line 54) explicitly handles def.endLine == null as a valid enclosing parent:
if (def.line <= line && (def.endLine == null || def.endLine >= line)) {

A definition with endLine == null is treated as though it always encloses the queried line, and the tie-breaking span formula (def.endLine ?? 0) - def.line gives it a negative span, which actually makes it preferred over definitions that have an explicit endLine.
The Rust implementation skips every definition where end_line is None entirely — only the if let Some(el) = d.end_line branch executes. A node whose only enclosing definition has end_line = NULL in the nodes table will receive parent_node_id = NULL from the native path but a non-null value from the JS path.
This violates the PR's "no behavior change — same data inserted" guarantee. To match JS semantics, treat end_line = None as always enclosing with a negative sentinel span:
fn find_parent_id(defs: &[NodeDef], line: u32) -> Option<i64> {
let mut best_id: Option<i64> = None;
let mut best_span: i64 = i64::MAX;
for d in defs {
if d.line <= line {
let span: i64 = match d.end_line {
Some(el) if el >= line => (el - d.line) as i64,
Some(_) => continue,
None => -(d.line as i64), // mirrors JS: (null ?? 0) - line
};
if span < best_span {
best_id = Some(d.id);
best_span = span;
}
}
}
best_id
}
Fixed — updated find_parent_id to match JS findParentDef semantics: definitions with end_line = NULL are now treated as always-enclosing with a negative sentinel span, so they are preferred over wider definitions just like in the JS path.
The Rust find_parent_id skipped definitions with end_line = NULL, but the JS findParentDef treats them as always-enclosing with a negative span (preferred over wider defs). This caused parent_node_id mismatches between native and JS paths.
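To make those semantics concrete, the fixed logic can be exercised standalone — a sketch with NodeDef trimmed to just the fields the function reads. Note that a NULL-end_line definition wins even against a definition that explicitly encloses the queried line:

```rust
// Trimmed stand-in for the row shape the real function receives.
struct NodeDef {
    id: i64,
    line: u32,
    end_line: Option<u32>,
}

// Fixed parent resolution: end_line = None is treated as always-enclosing,
// with a negative sentinel span that mirrors JS's (def.endLine ?? 0) - def.line.
fn find_parent_id(defs: &[NodeDef], line: u32) -> Option<i64> {
    let mut best_id: Option<i64> = None;
    let mut best_span: i64 = i64::MAX;
    for d in defs {
        if d.line <= line {
            let span: i64 = match d.end_line {
                Some(el) if el >= line => (el - d.line) as i64,
                Some(_) => continue, // definition ends before the queried line
                None => -(d.line as i64), // negative => preferred over any real span
            };
            if span < best_span {
                best_id = Some(d.id);
                best_span = span;
            }
        }
    }
    best_id
}
```

For a query at line 20, a def spanning 2..=100 (span 98) loses to a def starting at line 10 with end_line = None (sentinel span -10), matching the JS path's preference.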
crates/codegraph-core/src/ast_db.rs (Outdated)
if insert_stmt
    .execute(params![
        &batch.file,
        node.line,
        &node.kind,
        &node.name,
        &node.text,
        &node.receiver,
        parent_id,
    ])
    .is_ok()
{
    total += 1;
}
}
Individual execute failures commit partial data and misfire JS fallback
insert_stmt.execute(…).is_ok() silently discards row-level errors and lets the outer transaction continue. If any row fails to insert (e.g., a NOT NULL violation on an unexpected NULL in incoming data, or a future schema constraint), the transaction will commit with total < expectedNodes. The JS caller interprets the mismatch as a Rust-side catastrophic failure and re-runs the entire JS path on top of the already-partially-committed rows. Because ast_nodes has no UNIQUE constraint, this produces duplicate rows for every successfully-inserted Rust row.
The fix is to treat any row-level execute failure as fatal for the whole transaction — returning 0 immediately so the transaction drops and rolls back, ensuring all-or-nothing semantics:
for node in &batch.nodes {
let parent_id = find_parent_id(defs, node.line);
match insert_stmt.execute(params![
&batch.file,
node.line,
&node.kind,
&node.name,
&node.text,
&node.receiver,
parent_id,
]) {
Ok(_) => total += 1,
Err(_) => return 0, // triggers JS fallback; tx rolls back on drop
}
}
Fixed — replaced .is_ok() with a match that returns 0 on the first row-level execute() failure. This drops the transaction (rolling back all inserts) and triggers the JS fallback cleanly, ensuring all-or-nothing semantics with no partial commits or duplicate rows.
Return 0 immediately on any insert_stmt.execute() failure so the transaction drops and rolls back, ensuring all-or-nothing semantics. Previously, .is_ok() silently swallowed row-level errors which could commit partial data and misfire the JS fallback causing duplicate rows.
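The rollback-on-drop behavior relies on Rust RAII: rusqlite's Transaction issues a ROLLBACK in its Drop impl unless commit() consumed it. A stdlib-only sketch of the same guard pattern — Tx here is a hypothetical stand-in, not rusqlite's type:

```rust
use std::cell::Cell;

// Hypothetical transaction guard: if `commit` is never called, Drop records
// a rollback — the same shape rusqlite's Transaction uses.
struct Tx<'a> {
    committed: bool,
    rolled_back: &'a Cell<bool>,
}

impl<'a> Tx<'a> {
    fn new(flag: &'a Cell<bool>) -> Self {
        Tx { committed: false, rolled_back: flag }
    }
    fn commit(mut self) {
        self.committed = true; // Drop still runs, but now takes the no-op path
    }
}

impl<'a> Drop for Tx<'a> {
    fn drop(&mut self) {
        if !self.committed {
            self.rolled_back.set(true); // a real impl would issue ROLLBACK here
        }
    }
}

// Mirrors the fixed insert loop: early-returning 0 mid-loop drops `tx`,
// which triggers the rollback path and discards all prior inserts.
fn insert_all(rows: &[bool], rolled_back: &Cell<bool>) -> u32 {
    let tx = Tx::new(rolled_back);
    let mut total = 0;
    for &ok in rows {
        if !ok {
            return 0; // tx dropped here => rollback
        }
        total += 1;
    }
    tx.commit();
    total
}
```

This is why the fix needs no explicit rollback call: returning early is enough to guarantee all-or-nothing semantics.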
…/codegraph into perf/ast-node-bulk-insert
Summary
- New bulk_insert_ast_nodes napi function in new crates/codegraph-core/src/ast_db.rs — opens SQLite directly from Rust via rusqlite, pre-fetches parent node definitions, and inserts all AST rows in a single transaction
- buildAstNodes tries the native fast path first (when all files have native astNodes arrays), falling back to the existing JS loop for WASM/mixed-engine scenarios
- Adds bulkInsertAstNodes to the NativeAddon interface in src/types.ts

Motivation
AST node insertion was the bottleneck in the analysis phase (~393ms), iterating per-row through JS with individual insertStmt.run() calls plus bulkNodeIdsByFile() per file. This moves the entire operation to a single Rust transaction, targeting astMs < 50ms on native full builds.

Test plan
- astMs timing in build output drops significantly on native engine builds