perf(insert-nodes): native Rust/rusqlite pipeline for node insertion#654
perf(insert-nodes): native Rust/rusqlite pipeline for node insertion#654carlos-alm merged 8 commits intomainfrom
Conversation
Four root causes fixed across both engines: WASM JS/TS: extractConstantsWalk only walked root-level children, missing constants inside top-level blocks (if, try, for). Now recurses into all non-function scopes, matching native's recursive walk. WASM JS/TS: query pattern only matched property_identifier for methods, missing private #methods. Added private_property_identifier pattern. WASM Rust: extractRustParameters included self_parameter nodes that native correctly skips. Now skips them. Native JS: find_parent_of_types missed generator_function_declaration and generator_function, leaking constants from generator function bodies. Added both types to the scope filter. Native Rust: function_item handler emitted a duplicate entry for trait default-impl functions already captured by trait_item handler. Added parent guard matching WASM's existing skip logic.
|
Claude finished @carlos-alm's task —— View job Reviewing PR: fix WASM-native engine parity gap Root Cause AnalysisProblem Being Solved: This PR addresses a 26-node divergence between WASM and native parsing engines by fixing 5 specific extraction issues across both engines. Why This Approach: The approach directly fixes root causes in the parsing logic rather than masking symptoms. Each fix targets a specific extraction inconsistency. Risk Assessment: ✅ LOW RISK - The changes are surgical fixes to known issues with solid test coverage. Backlog Compliance
Code Quality Review✅ Strong Points
Detailed Fix AnalysisWASM JS/TS Fixes (
Native JS Fix ( WASM Rust Fix ( Native Rust Fix ( ✅ Test Strategy
Final AssessmentThis is exemplary engineering work. The PR:
The approach follows the project's principle: "Never document bugs as expected behavior" - instead of accepting the divergence, it systematically fixes the underlying extraction issues. Rating: ⭐⭐⭐⭐⭐ (5/5) Action: APPROVE Reasoning: This PR directly improves the core foundation of codegraph (parsing accuracy) with zero risk, excellent documentation, and proper testing. The fixes are surgical, well-reasoned, and address real extraction bugs rather than masking symptoms. |
Greptile SummaryThis PR introduces a native Rust/rusqlite fast-path for the node-insertion pipeline stage, replacing the JS↔better-sqlite3 loop with a single multi-phase SQLite transaction executed entirely in Rust ( The PR also bundles several extractor correctness fixes:
Key findings:
Confidence Score: 5/5Safe to merge — the Rust transaction is atomic with a correct JS fallback, all remaining findings are P2 style/improvement suggestions with no correctness impact on the main path. Previous P0/P1 concerns (duplicate edges, duplicate constants) are confirmed fixed. The query_node_ids flat-key widening is a theoretical edge case matching the JS fallback's existing behaviour. The removed CFG/dataflow native paths are intentional scope decisions. No outstanding critical issues. crates/codegraph-core/src/insert_nodes.rs (flat key scope widening in query_node_ids); src/features/cfg.ts and src/features/dataflow.ts (silent removal of native fast-paths). Important Files Changed
Sequence DiagramsequenceDiagram
participant JS as insert-nodes.ts
participant Rust as bulk_insert_nodes (Rust)
participant SQLite as SQLite DB
JS->>JS: populate ctx.fileSymbols
JS->>JS: engineName === 'native'?
alt Native path
JS->>Rust: bulkInsertNodes(dbPath, batches, fileHashes, removed)
Rust->>SQLite: PRAGMA synchronous=NORMAL
Rust->>SQLite: BEGIN TRANSACTION
Rust->>SQLite: Phase 1 — INSERT OR IGNORE nodes (file + defs + exports)
Rust->>SQLite: Phase 1b — UPDATE nodes SET exported=1
Rust->>SQLite: Phase 2 — SELECT node IDs, INSERT OR IGNORE children
Rust->>SQLite: Phase 3 — re-SELECT IDs, build contains/parameter_of edges
Rust->>SQLite: INSERT OR IGNORE edges
Rust->>SQLite: Phase 4 — UPSERT file_hashes, DELETE removed
Rust->>SQLite: COMMIT
Rust-->>JS: true (success)
JS->>JS: record timing, return
else WASM / native unavailable / Rust returned false
JS->>SQLite: JS transaction (insertDefinitionsAndExports)
JS->>SQLite: JS transaction (insertChildrenAndEdges)
JS->>SQLite: JS transaction (updateFileHashes)
end
Reviews (4): Last reviewed commit: "fix(rust): use INSERT OR IGNORE for edge..." | Re-trigger Greptile |
src/extractors/javascript.ts
Outdated
| let declNode = child; | ||
| // Handle `export const …` — unwrap the export_statement to its declaration child | ||
| if (node.type === 'export_statement') { | ||
| const inner = node.childForFieldName('declaration'); | ||
| if (!inner) continue; | ||
| declNode = inner; | ||
| if (child.type === 'export_statement') { | ||
| const inner = child.childForFieldName('declaration'); | ||
| if (inner) declNode = inner; | ||
| } | ||
|
|
||
| const t = declNode.type; | ||
| if (t !== 'lexical_declaration' && t !== 'variable_declaration') continue; | ||
| if (!declNode.text.startsWith('const ')) continue; | ||
|
|
||
| for (let j = 0; j < declNode.childCount; j++) { | ||
| const declarator = declNode.child(j); | ||
| if (!declarator || declarator.type !== 'variable_declarator') continue; | ||
| const nameN = declarator.childForFieldName('name'); | ||
| const valueN = declarator.childForFieldName('value'); | ||
| if (!nameN || nameN.type !== 'identifier' || !valueN) continue; | ||
| // Skip functions — already captured by query patterns | ||
| const valType = valueN.type; | ||
| if ( | ||
| valType === 'arrow_function' || | ||
| valType === 'function_expression' || | ||
| valType === 'function' | ||
| ) | ||
| continue; | ||
| if (isConstantValue(valueN)) { | ||
| definitions.push({ | ||
| name: nameN.text, | ||
| kind: 'constant', | ||
| line: declNode.startPosition.row + 1, | ||
| endLine: nodeEndLine(declNode), | ||
| }); | ||
| if (t === 'lexical_declaration' || t === 'variable_declaration') { | ||
| if (declNode.text.startsWith('const ')) { | ||
| for (let j = 0; j < declNode.childCount; j++) { | ||
| const declarator = declNode.child(j); | ||
| if (!declarator || declarator.type !== 'variable_declarator') continue; | ||
| const nameN = declarator.childForFieldName('name'); | ||
| const valueN = declarator.childForFieldName('value'); | ||
| if (!nameN || nameN.type !== 'identifier' || !valueN) continue; | ||
| // Skip functions — already captured by query patterns | ||
| const valType = valueN.type; | ||
| if ( | ||
| valType === 'arrow_function' || | ||
| valType === 'function_expression' || | ||
| valType === 'function' | ||
| ) | ||
| continue; | ||
| if (isConstantValue(valueN)) { | ||
| definitions.push({ | ||
| name: nameN.text, | ||
| kind: 'constant', | ||
| line: declNode.startPosition.row + 1, | ||
| endLine: nodeEndLine(declNode), | ||
| }); | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Recurse into non-function children (blocks, if-statements, etc.) | ||
| extractConstantsWalk(child, definitions); |
There was a problem hiding this comment.
Exported constants counted twice
When child.type === 'export_statement', the code:
- Unwraps
inner = child.childForFieldName('declaration')(thelexical_declaration) - Processes
declNode = inner→ pushes the constant - Then unconditionally calls
extractConstantsWalk(child, definitions)— recursing into theexport_statementitself
During that recursion the lexical_declaration is found again as a direct child of export_statement (.child(i) returns all children including named-field children), declNode is set to it, and the constant is pushed a second time.
This only affects top-level export const X = value declarations, but every such constant will appear as a duplicate in the definitions array, producing ghost nodes in the graph.
The minimal fix at the bottom of the loop:
// Recurse into non-function, non-export-statement children
if (child.type !== 'export_statement') {
extractConstantsWalk(child, definitions);
}There was a problem hiding this comment.
Fixed — added a guard to skip recursion into export_statement nodes in extractConstantsWalk. After unwrapping and processing the inner lexical_declaration, the function now skips recursing into the export_statement itself, preventing the duplicate. See commit 9215de9.
Move the entire insert-nodes stage to Rust when the native engine is available. The new `bulk_insert_nodes` napi function receives marshalled definitions, exports, children, and file-hash data from JS and writes everything to SQLite via rusqlite in a single transaction — eliminating the JS↔better-sqlite3 FFI overhead that dominated the ~206ms timing. Rust path: prepare_cached + individual executes (zero per-row FFI cost). JS fallback preserved unchanged for WASM engine or native failure. Target: insertMs < 50 ms on native full builds.
The merge commit a8849c3 brought in `pub mod cfg_db` and `pub mod dataflow_db` declarations in lib.rs without the corresponding source files, causing Rust compilation to fail (E0583: file not found). This branch supersedes those modules with the unified insert_nodes.rs pipeline, so remove the stale mod declarations and their NativeAddon type signatures in types.ts.
|
Addressed both P2 findings from the Greptile review:
Also fixed the root cause of all Rust CI failures: the merge commit a8849c3 added |
| // ── Insert all edges ───────────────────────────────────────────── | ||
| { | ||
| let mut stmt = tx.prepare_cached( | ||
| "INSERT INTO edges (source_id, target_id, kind, confidence, dynamic) \ | ||
| VALUES (?1, ?2, ?3, ?4, ?5)", | ||
| )?; | ||
| for &(src, tgt) in &contains_edges { | ||
| stmt.execute(params![src, tgt, "contains", 1.0, 0])?; | ||
| } | ||
| for &(src, tgt) in ¶m_of_edges { | ||
| stmt.execute(params![src, tgt, "parameter_of", 1.0, 0])?; | ||
| } |
There was a problem hiding this comment.
Edge INSERT missing conflict handler — duplicates on incremental builds
The edges table has no UNIQUE constraint (confirmed in src/db/migrations.ts), so INSERT INTO edges never fails — it silently inserts duplicate edges whenever bulk_insert_nodes is called for a file whose edges were not fully cleaned up beforehand.
The incremental stage (incremental.ts:126) only deletes edges where source_id is in the changed file's nodes:
DELETE FROM edges WHERE source_id IN (SELECT id FROM nodes WHERE file = ?)Edges where a changed-file node appears as the target are not deleted. When the native path then re-inserts fresh containment/parameter_of edges, those target-side edges remain, and the new ones are appended — producing duplicate graph edges that accumulate with each incremental rebuild.
Suggest using INSERT OR IGNORE to match the behaviour of the node inserts:
let mut stmt = tx.prepare_cached(
"INSERT OR IGNORE INTO edges (source_id, target_id, kind, confidence, dynamic) \
VALUES (?1, ?2, ?3, ?4, ?5)",
)?;There was a problem hiding this comment.
Fixed — changed the edge INSERT to INSERT OR IGNORE in do_insert (commit 2b6e30d). This matches the node insert behaviour and prevents duplicate edge rows from accumulating on incremental builds where target-side edges weren't pre-cleaned.
Regarding the other two findings from the summary review:
-
precomputed.stat.mtimewithoutMath.floor: This is actually correct —precomputed.stat.mtimeis already pre-floored at the source indetect-changes.ts:444(mtime: Math.floor(c.stat.mtimeMs)), so applyingMath.flooragain would be a no-op. TherawStat.mtimeMsbranches needMath.floorbecause they come directly fromfs.statSync. -
query_node_idsflat key collision: Acknowledged as a pre-existing limitation carried over from the JS fallback. Same-name parameters in different functions would need a parent-qualified key to fully disambiguate — but this is a pre-existing concern, not introduced by this PR.
…cremental builds (#654) The edges INSERT in do_insert lacked a conflict handler. Since the edges table has no UNIQUE constraint, incremental builds where target-side edges weren't pre-cleaned could silently accumulate duplicate rows. Switch to INSERT OR IGNORE to match the node insert behaviour.
Summary
bulk_insert_nodesnapi function incrates/codegraph-core/src/insert_nodes.rsthat writes nodes, children, containment edges, exports, and file hashes directly to SQLite via rusqlite in a single transactionallSymbolsto lean batch format and delegate all DB writes to Rust — eliminating JS↔better-sqlite3 FFI overheadTarget:
insertMs< 50ms on native full builds (current: ~206ms)Test plan