perf(insert-nodes): native Rust/rusqlite pipeline for node insertion by carlos-alm · Pull Request #654 · optave/ops-codegraph-tool

carlos-alm · 2026-03-27T08:17:02Z

Summary

Add bulk_insert_nodes napi function in crates/codegraph-core/src/insert_nodes.rs that writes nodes, children, containment edges, exports, and file hashes directly to SQLite via rusqlite in a single transaction
Wire it into the JS insert-nodes stage: when native engine is available, marshal allSymbols to lean batch format and delegate all DB writes to Rust — eliminating JS↔better-sqlite3 FFI overhead
JS fallback preserved unchanged for WASM engine or native failure
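
The delegation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `NativeAddonLike`, `NodeBatch`, `marshalBatches`, and `insertNodes` are hypothetical names, and the argument order of `bulkInsertNodes` is taken from the review's sequence diagram rather than from the actual source.

```typescript
// Hypothetical shapes for illustration; the real types live in src/types.ts.
interface SymbolDef { name: string; kind: string; line: number; endLine: number }
interface FileSymbols { definitions: SymbolDef[] }
interface NodeBatch { file: string; definitions: SymbolDef[] }
interface FileHash { file: string; hash: string; mtime: number }

interface NativeAddonLike {
  // Assumed contract: true on success, false on any Rust-side failure.
  bulkInsertNodes(dbPath: string, batches: NodeBatch[], fileHashes: FileHash[], removed: string[]): boolean;
}

// Flatten the per-file symbol map into plain arrays of plain objects,
// so that only lean data crosses the JS/native boundary.
function marshalBatches(allSymbols: Record<string, FileSymbols>): NodeBatch[] {
  return Object.keys(allSymbols).map((file) => ({
    file,
    definitions: allSymbols[file].definitions.map((d) => ({
      name: d.name, kind: d.kind, line: d.line, endLine: d.endLine,
    })),
  }));
}

type InsertPath = 'native' | 'js';

// Try the native path first; fall back to the JS path when the addon is
// missing, returns false, or throws.
function insertNodes(
  native: NativeAddonLike | null,
  dbPath: string,
  allSymbols: Record<string, FileSymbols>,
  fileHashes: FileHash[],
  removed: string[],
  jsFallback: (batches: NodeBatch[]) => void,
): InsertPath {
  const batches = marshalBatches(allSymbols);
  if (native) {
    try {
      if (native.bulkInsertNodes(dbPath, batches, fileHashes, removed)) return 'native';
    } catch (_err) {
      // Fall through to the JS path.
    }
  }
  jsFallback(batches);
  return 'js';
}
```

The fallback receives the same marshalled batches here only to keep the sketch short; the real JS path may consume `allSymbols` directly.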

Target: insertMs < 50ms on native full builds (current: ~206ms)

Test plan

All 560 integration tests pass (WASM fallback exercised locally)
CI builds native addon and exercises the Rust insert path on Linux/macOS
Benchmark native full build to verify insertMs < 50ms target

Five root causes fixed across both engines:

WASM JS/TS: extractConstantsWalk only walked root-level children, missing constants inside top-level blocks (if, try, for). Now recurses into all non-function scopes, matching native's recursive walk.
WASM JS/TS: query pattern only matched property_identifier for methods, missing private #methods. Added private_property_identifier pattern.
WASM Rust: extractRustParameters included self_parameter nodes that native correctly skips. Now skips them.
Native JS: find_parent_of_types missed generator_function_declaration and generator_function, leaking constants from generator function bodies. Added both types to the scope filter.
Native Rust: function_item handler emitted a duplicate entry for trait default-impl functions already captured by trait_item handler. Added parent guard matching WASM's existing skip logic.

claude · 2026-03-27T08:17:13Z

Claude finished @carlos-alm's task

Reviewing PR: fix WASM-native engine parity gap

Root Cause Analysis

Problem Being Solved: This PR addresses a 26-node divergence between WASM and native parsing engines by fixing 5 specific extraction issues across both engines.

Why This Approach: The approach directly fixes root causes in the parsing logic rather than masking symptoms. Each fix targets a specific extraction inconsistency.

Risk Assessment: ✅ LOW RISK - The changes are surgical fixes to known issues with solid test coverage.

Backlog Compliance

Zero-dep: ✅ No new runtime dependencies
Foundation-aligned: ✅ Improves core parsing accuracy (P1: always-current)
Problem-fit: 5/5 - Directly addresses engine parity, critical for reliable parsing
Breaking: No - purely additive fixes to extraction logic
Tier: 1 (foundation-aligned ✅ + zero-dep ✅)

Code Quality Review

✅ Strong Points

Precise Root Cause Identification: Each fix addresses a specific, well-documented extraction issue
Symmetric Approach: Fixes applied to both WASM and native engines appropriately
Excellent Documentation: Clear commit message explains all 5 fixes with examples
Safe Changes: All modifications are surgical and maintain existing behavior for valid cases

Detailed Fix Analysis

WASM JS/TS Fixes (src/extractors/javascript.ts):

1. Constants extraction: extractConstantsWalk now recurses into all non-function scopes (lines 322-323), correctly matching native behavior for constants in blocks
2. Private methods: Added private_property_identifier pattern matching (line 270) for #methods

Native JS Fix (crates/codegraph-core/src/extractors/javascript.rs):
3. Generator functions: Added generator_function_declaration and generator_function to scope types (lines 297-298), preventing constant leakage

WASM Rust Fix (src/extractors/rust.ts):
4. Self parameters: Now correctly skips self_parameter nodes (line 230), matching native behavior

Native Rust Fix (crates/codegraph-core/src/extractors/rust_lang.rs):
5. Duplicate trait methods: Added parent guard for function_item inside traits (lines 44-47), preventing double emission
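
As a rough illustration of fix 4, the self-parameter skip amounts to a filter on node type. The sketch below uses a plain-object stand-in (`ParamNode`) instead of the real tree-sitter node API, and `extractParameterNames` is a hypothetical name, not the extractor's actual function.

```typescript
// Simplified stand-in for a tree-sitter parameter node (assumption for illustration).
interface ParamNode { type: string; name: string }

// Collect parameter names, skipping Rust's `self` receiver so that the
// result lists only ordinary parameters.
function extractParameterNames(params: ParamNode[]): string[] {
  return params
    .filter((p) => p.type !== 'self_parameter')
    .map((p) => p.name);
}

// Models `fn area(&self, scale: f64)`.
const areaParams: ParamNode[] = [
  { type: 'self_parameter', name: 'self' },
  { type: 'parameter', name: 'scale' },
];
const areaNames = extractParameterNames(areaParams); // ['scale']
```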

✅ Test Strategy

All 272 parser tests pass
All 560 integration tests pass
Manual verification shows gap reduced from 26 nodes to 0
Native rebuild needed in CI (appropriate for this type of fix)

Final Assessment

This is exemplary engineering work. The PR:

✅ Identifies specific root causes rather than treating symptoms
✅ Applies symmetric fixes across both engines appropriately
✅ Maintains backward compatibility while improving accuracy
✅ Has comprehensive test coverage
✅ Documents all changes clearly

The approach follows the project's principle: "Never document bugs as expected behavior" - instead of accepting the divergence, it systematically fixes the underlying extraction issues.

Rating: ⭐⭐⭐⭐⭐ (5/5)

Action: APPROVE

Reasoning: This PR directly improves the core foundation of codegraph (parsing accuracy) with zero risk, excellent documentation, and proper testing. The fixes are surgical, well-reasoned, and address real extraction bugs rather than masking symptoms.

greptile-apps · 2026-03-27T08:21:33Z

Greptile Summary

This PR introduces a native Rust/rusqlite fast-path for the node-insertion pipeline stage, replacing the JS↔better-sqlite3 loop with a single multi-phase SQLite transaction executed entirely in Rust (bulk_insert_nodes). When the native engine is active the JS side marshals allSymbols into lean batch structs and hands them to Rust; the transaction writes file nodes, definitions, export flags, child nodes, containment/parameter_of edges, file hashes, and removed-file cleanup atomically. A clean JS fallback is preserved for WASM builds or any Rust-side failure.

The PR also bundles several extractor correctness fixes:

extractConstantsWalk (JS): refactored to recurse into nested scopes while skipping function boundaries and export_statement children.
Generator function scopes (JS + Rust): generator_function_declaration / generator_function added to the function-scope guard.
self parameter (WASM-side Rust extractor, src/extractors/rust.ts): now skipped, matching native engine behaviour.
Trait default-impl deduplication (Rust): function_item inside trait_item declaration lists no longer double-emitted.
Private method capture (TS parser query): method_definition with private_property_identifier now matched.

Key findings:

The query_node_ids helper builds a flat per-file key (name|kind|line) now used across all definitions in a file — the JS fallback scoped this per-batch, so the Rust path widens the collision surface.
cfg_db.rs and dataflow_db.rs deleted without replacement; CFG and dataflow always use the JS path now, with no log or comment indicating this is intentional.
INSERT OR IGNORE INTO edges (from the previously fixed duplicate-edge issue) is correctly applied throughout.

Confidence Score: 5/5

Safe to merge — the Rust transaction is atomic with a correct JS fallback, all remaining findings are P2 style/improvement suggestions with no correctness impact on the main path.

Previous P0/P1 concerns (duplicate edges, duplicate constants) are confirmed fixed. The query_node_ids flat-key widening is a theoretical edge case matching the JS fallback's existing behaviour. The removed CFG/dataflow native paths are intentional scope decisions. No outstanding critical issues.

crates/codegraph-core/src/insert_nodes.rs (flat key scope widening in query_node_ids); src/features/cfg.ts and src/features/dataflow.ts (silent removal of native fast-paths).

Important Files Changed

Filename	Overview
crates/codegraph-core/src/insert_nodes.rs	New Rust bulk-insert pipeline: 4-phase transaction (nodes → export flag → children → edges → file hashes). INSERT OR IGNORE consistent throughout; one subtle key-collision risk in query_node_ids.
src/domain/graph/builder/stages/insert-nodes.ts	Adds tryNativeInsert fast-path with clean JS fallback; fileSymbols population correctly moved before the native branch.
crates/codegraph-core/src/cfg_db.rs	Deleted — native CFG bulk-insert removed; CFG operations now always fall back to the JS path.
crates/codegraph-core/src/dataflow_db.rs	Deleted — native dataflow bulk-insert removed; dataflow operations now always fall back to the JS path.
src/extractors/javascript.ts	extractConstantsWalk refactored to recurse + skip function scopes + fix export_statement duplicate; FUNCTION_SCOPE_TYPES now includes generator variants.
src/types.ts	NativeAddon interface updated: bulkInsertCfg and bulkInsertDataflow replaced by bulkInsertNodes returning boolean.
src/features/cfg.ts	Native bulk-insert fast-path for CFG removed; CFG stage now always uses the JS path.
src/features/dataflow.ts	Native bulk-insert fast-path for dataflow removed; dataflow stage now always uses the JS path.

Sequence Diagram

sequenceDiagram
    participant JS as insert-nodes.ts
    participant Rust as bulk_insert_nodes (Rust)
    participant SQLite as SQLite DB

    JS->>JS: populate ctx.fileSymbols
    JS->>JS: engineName === 'native'?

    alt Native path
        JS->>Rust: bulkInsertNodes(dbPath, batches, fileHashes, removed)
        Rust->>SQLite: PRAGMA synchronous=NORMAL
        Rust->>SQLite: BEGIN TRANSACTION
        Rust->>SQLite: Phase 1 — INSERT OR IGNORE nodes (file + defs + exports)
        Rust->>SQLite: Phase 1b — UPDATE nodes SET exported=1
        Rust->>SQLite: Phase 2 — SELECT node IDs, INSERT OR IGNORE children
        Rust->>SQLite: Phase 3 — re-SELECT IDs, build contains/parameter_of edges
        Rust->>SQLite: INSERT OR IGNORE edges
        Rust->>SQLite: Phase 4 — UPSERT file_hashes, DELETE removed
        Rust->>SQLite: COMMIT
        Rust-->>JS: true (success)
        JS->>JS: record timing, return
    else WASM / native unavailable / Rust returned false
        JS->>SQLite: JS transaction (insertDefinitionsAndExports)
        JS->>SQLite: JS transaction (insertChildrenAndEdges)
        JS->>SQLite: JS transaction (updateFileHashes)
    end

Reviews (4): Last reviewed commit: "fix(rust): use INSERT OR IGNORE for edge..."

greptile-apps · 2026-03-27T08:21:36Z

src/extractors/javascript.ts

+    let declNode = child;
    // Handle `export const …` — unwrap the export_statement to its declaration child
-    if (node.type === 'export_statement') {
-      const inner = node.childForFieldName('declaration');
-      if (!inner) continue;
-      declNode = inner;
+    if (child.type === 'export_statement') {
+      const inner = child.childForFieldName('declaration');
+      if (inner) declNode = inner;
    }

    const t = declNode.type;
-    if (t !== 'lexical_declaration' && t !== 'variable_declaration') continue;
-    if (!declNode.text.startsWith('const ')) continue;
-
-    for (let j = 0; j < declNode.childCount; j++) {
-      const declarator = declNode.child(j);
-      if (!declarator || declarator.type !== 'variable_declarator') continue;
-      const nameN = declarator.childForFieldName('name');
-      const valueN = declarator.childForFieldName('value');
-      if (!nameN || nameN.type !== 'identifier' || !valueN) continue;
-      // Skip functions — already captured by query patterns
-      const valType = valueN.type;
-      if (
-        valType === 'arrow_function' ||
-        valType === 'function_expression' ||
-        valType === 'function'
-      )
-        continue;
-      if (isConstantValue(valueN)) {
-        definitions.push({
-          name: nameN.text,
-          kind: 'constant',
-          line: declNode.startPosition.row + 1,
-          endLine: nodeEndLine(declNode),
-        });
+    if (t === 'lexical_declaration' || t === 'variable_declaration') {
+      if (declNode.text.startsWith('const ')) {
+        for (let j = 0; j < declNode.childCount; j++) {
+          const declarator = declNode.child(j);
+          if (!declarator || declarator.type !== 'variable_declarator') continue;
+          const nameN = declarator.childForFieldName('name');
+          const valueN = declarator.childForFieldName('value');
+          if (!nameN || nameN.type !== 'identifier' || !valueN) continue;
+          // Skip functions — already captured by query patterns
+          const valType = valueN.type;
+          if (
+            valType === 'arrow_function' ||
+            valType === 'function_expression' ||
+            valType === 'function'
+          )
+            continue;
+          if (isConstantValue(valueN)) {
+            definitions.push({
+              name: nameN.text,
+              kind: 'constant',
+              line: declNode.startPosition.row + 1,
+              endLine: nodeEndLine(declNode),
+            });
+          }
+        }
      }
    }
+
+    // Recurse into non-function children (blocks, if-statements, etc.)
+    extractConstantsWalk(child, definitions);


Exported constants counted twice

When child.type === 'export_statement', the code:

Unwraps inner = child.childForFieldName('declaration') (the lexical_declaration)

Processes declNode = inner → pushes the constant

Then unconditionally calls extractConstantsWalk(child, definitions) — recursing into the export_statement itself

During that recursion the lexical_declaration is found again as a direct child of export_statement (.child(i) returns all children including named-field children), declNode is set to it, and the constant is pushed a second time.

This only affects top-level export const X = value declarations, but every such constant will appear as a duplicate in the definitions array, producing ghost nodes in the graph.

The minimal fix at the bottom of the loop:

    // Recurse into non-function, non-export-statement children
    if (child.type !== 'export_statement') {
      extractConstantsWalk(child, definitions);
    }

Fixed — added a guard to skip recursion into export_statement nodes in extractConstantsWalk. After unwrapping and processing the inner lexical_declaration, the function now skips recursing into the export_statement itself, preventing the duplicate. See commit 9215de9.
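
The double count and the guard can be reproduced on a toy tree. This is a sketch only: `MockNode`, `mk`, and `collectConstants` are hypothetical stand-ins for the tree-sitter nodes and for `extractConstantsWalk`, and the contents of `FUNCTION_SCOPE_TYPES` below are an assumed list, not the one in src/extractors/javascript.ts.

```typescript
// Simplified stand-in for a tree-sitter node; `name` is set only on const declarations.
interface MockNode { type: string; name: string | null; children: MockNode[] }

function mk(type: string, children: MockNode[], name: string | null = null): MockNode {
  return { type, name, children };
}

// Assumed list of scope types that the walk must not descend into.
const FUNCTION_SCOPE_TYPES: string[] = [
  'function_declaration', 'function_expression', 'arrow_function',
  'method_definition', 'generator_function_declaration', 'generator_function',
];

function collectConstants(node: MockNode, out: string[], guardExports: boolean): void {
  for (const child of node.children) {
    if (FUNCTION_SCOPE_TYPES.indexOf(child.type) !== -1) continue;

    // Unwrap `export const ...` to its inner declaration, if present.
    let decl = child;
    if (child.type === 'export_statement') {
      const inner = child.children.filter((c) => c.type === 'lexical_declaration')[0];
      if (inner) decl = inner;
    }
    const name = decl.name;
    if (decl.type === 'lexical_declaration' && name !== null) out.push(name);

    // The inner declaration was handled above; recursing into the
    // export_statement would visit it a second time.
    if (guardExports && child.type === 'export_statement') continue;
    collectConstants(child, out, guardExports);
  }
}

// export const X = 1; if (c) { const Y = 2; } function f() { const Z = 3; }
const tree = mk('program', [
  mk('export_statement', [mk('lexical_declaration', [], 'X')]),
  mk('if_statement', [mk('statement_block', [mk('lexical_declaration', [], 'Y')])]),
  mk('function_declaration', [mk('statement_block', [mk('lexical_declaration', [], 'Z')])]),
]);

const unguarded: string[] = [];
collectConstants(tree, unguarded, false); // ['X', 'X', 'Y']

const guarded: string[] = [];
collectConstants(tree, guarded, true); // ['X', 'Y']
```

`Z` is absent from both results because the walk stops at function boundaries; only the exported constant is affected by the guard.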

Move the entire insert-nodes stage to Rust when the native engine is available. The new `bulk_insert_nodes` napi function receives marshalled definitions, exports, children, and file-hash data from JS and writes everything to SQLite via rusqlite in a single transaction — eliminating the JS↔better-sqlite3 FFI overhead that dominated the ~206ms timing. Rust path: prepare_cached + individual executes (zero per-row FFI cost). JS fallback preserved unchanged for WASM engine or native failure. Target: insertMs < 50 ms on native full builds.

carlos-alm · 2026-03-27T09:50:17Z

@greptileai

The merge commit a8849c3 brought in `pub mod cfg_db` and `pub mod dataflow_db` declarations in lib.rs without the corresponding source files, causing Rust compilation to fail (E0583: file not found). This branch supersedes those modules with the unified insert_nodes.rs pipeline, so remove the stale mod declarations and their NativeAddon type signatures in types.ts.

carlos-alm · 2026-03-27T19:39:50Z

Addressed both P2 findings from the Greptile review:

Stale bulkInsertCfg/bulkInsertDataflow declarations in types.ts: Removed in commit 1d4f193. These type signatures referenced Rust modules (cfg_db.rs, dataflow_db.rs) that this branch deletes in favor of the unified insert_nodes.rs pipeline.
allNative variable in cfg.ts: This is NOT dead code in the committed branch. It's still used at lines 292 and 314 to gate WASM parser initialization and control per-file processing. The working tree has dirty changes from another session that obscured this.

Also fixed the root cause of all Rust CI failures: the merge commit a8849c3 added pub mod cfg_db and pub mod dataflow_db declarations to lib.rs without the corresponding source files (the files came from the other merge parent but weren't carried forward). Removed those stale module declarations.

carlos-alm · 2026-03-27T19:40:20Z

@greptileai

greptile-apps · 2026-03-27T19:51:56Z

crates/codegraph-core/src/insert_nodes.rs

+    // ── Insert all edges ─────────────────────────────────────────────
+    {
+        let mut stmt = tx.prepare_cached(
+            "INSERT INTO edges (source_id, target_id, kind, confidence, dynamic) \
+             VALUES (?1, ?2, ?3, ?4, ?5)",
+        )?;
+        for &(src, tgt) in &contains_edges {
+            stmt.execute(params![src, tgt, "contains", 1.0, 0])?;
+        }
+        for &(src, tgt) in &param_of_edges {
+            stmt.execute(params![src, tgt, "parameter_of", 1.0, 0])?;
+        }


Edge INSERT missing conflict handler — duplicates on incremental builds

The edges table has no UNIQUE constraint (confirmed in src/db/migrations.ts), so INSERT INTO edges never fails — it silently inserts duplicate edges whenever bulk_insert_nodes is called for a file whose edges were not fully cleaned up beforehand.

The incremental stage (incremental.ts:126) only deletes edges where source_id is in the changed file's nodes:

DELETE FROM edges WHERE source_id IN (SELECT id FROM nodes WHERE file = ?)

Edges where a changed-file node appears as the target are not deleted. When the native path then re-inserts fresh containment/parameter_of edges, those target-side edges remain, and the new ones are appended — producing duplicate graph edges that accumulate with each incremental rebuild.

Suggest using INSERT OR IGNORE to match the behaviour of the node inserts:

    let mut stmt = tx.prepare_cached(
        "INSERT OR IGNORE INTO edges (source_id, target_id, kind, confidence, dynamic) \
         VALUES (?1, ?2, ?3, ?4, ?5)",
    )?;

Fixed — changed the edge INSERT to INSERT OR IGNORE in do_insert (commit 2b6e30d). This matches the node insert behaviour and prevents duplicate edge rows from accumulating on incremental builds where target-side edges weren't pre-cleaned.
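
The intended effect can be modelled in memory as below. One caveat applies: in SQLite, the OR IGNORE conflict clause skips a row only when inserting it would violate a constraint such as UNIQUE or PRIMARY KEY. The sketch therefore assumes a uniqueness key over (source_id, target_id, kind); that key, along with `Edge`, `edgeKey`, `insertAll`, and `insertOrIgnoreAll`, is an assumption for illustration and is not confirmed by this thread.

```typescript
interface Edge { sourceId: number; targetId: number; kind: string }

// Assumed uniqueness key for an edge row.
function edgeKey(e: Edge): string {
  return `${e.sourceId}|${e.targetId}|${e.kind}`;
}

// Plain insert: every incoming row is appended, duplicates included.
function insertAll(table: Edge[], incoming: Edge[]): Edge[] {
  return table.concat(incoming);
}

// Insert-or-ignore: an incoming row is dropped when its key is already present.
function insertOrIgnoreAll(table: Edge[], incoming: Edge[]): Edge[] {
  const seen: Record<string, boolean> = {};
  for (const e of table) seen[edgeKey(e)] = true;
  const out = table.slice();
  for (const e of incoming) {
    const key = edgeKey(e);
    if (seen[key]) continue;
    seen[key] = true;
    out.push(e);
  }
  return out;
}

// An edge left over from a previous build, then a rebuild that re-emits it.
const leftover: Edge[] = [{ sourceId: 1, targetId: 2, kind: 'contains' }];
const rebuilt: Edge[] = [
  { sourceId: 1, targetId: 2, kind: 'contains' },
  { sourceId: 2, targetId: 3, kind: 'parameter_of' },
];

const withDuplicates = insertAll(leftover, rebuilt);            // 3 rows
const withoutDuplicates = insertOrIgnoreAll(leftover, rebuilt); // 2 rows
```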

Regarding the other two findings from the summary review:

precomputed.stat.mtime without Math.floor: This is actually correct — precomputed.stat.mtime is already pre-floored at the source in detect-changes.ts:444 (mtime: Math.floor(c.stat.mtimeMs)), so applying Math.floor again would be a no-op. The rawStat.mtimeMs branches need Math.floor because they come directly from fs.statSync.

query_node_ids flat key collision: Acknowledged as a pre-existing limitation carried over from the JS fallback. Same-name parameters in different functions would need a parent-qualified key to fully disambiguate — but this is a pre-existing concern, not introduced by this PR.
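
To make the collision concrete, here is a small sketch of a flat `name|kind|line` key next to a parent-qualified one. `NodeRow`, its `parent` field, `flatKey`, `qualifiedKey`, and `indexBy` are hypothetical names; the actual lookup lives in `query_node_ids` on the Rust side.

```typescript
interface NodeRow { id: number; name: string; kind: string; line: number; parent: string }

// Flat key as described in the review: name|kind|line.
function flatKey(r: NodeRow): string {
  return `${r.name}|${r.kind}|${r.line}`;
}

// Parent-qualified variant (assumed shape) that keeps same-line, same-name rows apart.
function qualifiedKey(r: NodeRow): string {
  return `${r.parent}|${flatKey(r)}`;
}

// Build a key -> id index; a later row with the same key overwrites the earlier one.
function indexBy(rows: NodeRow[], keyOf: (r: NodeRow) => string): Record<string, number> {
  const index: Record<string, number> = {};
  for (const row of rows) index[keyOf(row)] = row.id;
  return index;
}

// Models `const f = (x) => x, g = (x) => x;` on line 10: two parameters named x.
const rows: NodeRow[] = [
  { id: 1, name: 'x', kind: 'parameter', line: 10, parent: 'f' },
  { id: 2, name: 'x', kind: 'parameter', line: 10, parent: 'g' },
];

const flatIndex = indexBy(rows, flatKey);           // one entry; id 1 is shadowed by id 2
const qualifiedIndex = indexBy(rows, qualifiedKey); // two entries
```

The collision needs the same name, kind, and line at once, which is why it is described as a theoretical edge case.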

…cremental builds (#654) The edges INSERT in do_insert lacked a conflict handler. Since the edges table has no UNIQUE constraint, incremental builds where target-side edges weren't pre-cleaned could silently accumulate duplicate rows. Switch to INSERT OR IGNORE to match the node insert behaviour.

carlos-alm · 2026-03-27T21:22:28Z

@greptileai

carlos-alm added 2 commits March 27, 2026 02:15

style: remove useless continue flagged by biome

e54f5a9

greptile-apps bot reviewed Mar 27, 2026

carlos-alm changed the title ~~fix(parser): close WASM-native engine parity gap~~ perf(insert-nodes): native Rust/rusqlite pipeline for node insertion Mar 27, 2026

carlos-alm added 2 commits March 27, 2026 03:37

fix: resolve merge conflicts with main

a8849c3

fix(parser): prevent double-counting export const declarations (#654)

9215de9

greptile-apps bot reviewed Mar 27, 2026

carlos-alm added 2 commits March 27, 2026 15:08

Merge branch 'main' into fix/wasm-engine-parity-649

636a70a

carlos-alm merged commit c194422 into main Mar 27, 2026
19 checks passed

carlos-alm deleted the fix/wasm-engine-parity-649 branch March 27, 2026 21:38

github-actions bot locked and limited conversation to collaborators Mar 27, 2026

carlos-alm merged 8 commits into main from fix/wasm-engine-parity-649
