
perf(db): bulk CFG and dataflow DB writes via rusqlite #653

Merged
carlos-alm merged 11 commits into main from perf/ast-node-bulk-insert
Mar 27, 2026

Conversation

@carlos-alm
Contributor

Summary

  • Add bulk_insert_cfg() and bulk_insert_dataflow() Rust napi functions that perform CFG block/edge and dataflow edge inserts via rusqlite in single transactions, bypassing JS iteration loops
  • Add native fast paths in buildCFGData() and buildDataflowEdges() that collect batches from pre-computed native extraction results and delegate to Rust bulk inserters
  • Both paths fall back to existing JS code when the native addon is unavailable (WASM engine)

Details

Same optimization pattern as bulk_insert_ast_nodes (6.9). The DB write phases for CFG (161ms) and dataflow (125ms) were identical JS code on both engines — this moves them to Rust when native engine is active.

Rust side:

  • cfg_db.rs — resolves function node IDs, deletes stale data, inserts blocks+edges with block index→row ID mapping
  • dataflow_db.rs — pre-builds node resolution cache (local-first, global fallback), inserts flows_to/returns/mutates edges
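
The local-first, global-fallback resolution described above can be sketched as follows. This is a minimal sketch with hypothetical shapes; the real cache in dataflow_db.rs is pre-built from SQL lookups over the nodes table, not passed in as maps:

```typescript
// Local-first, global-fallback node resolution sketch. Hypothetical
// shapes: the real Rust code pre-builds these caches from SQL queries.
function resolveNodeId(
  localByName: Map<string, number>,  // symbols defined in the current file
  globalByName: Map<string, number>, // symbols across the whole project
  name: string,
): number | undefined {
  // Prefer a definition in the same file; fall back to a project-wide match.
  return localByName.get(name) ?? globalByName.get(name);
}
```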

Target: cfgMs + dataflowMs < 50ms combined on native full builds (current ~286ms)
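
The shared fast-path shape, trying the Rust bulk inserter first and falling back to the JS loop on a count mismatch, can be sketched like this. Names are hypothetical; the real guards live in cfg.ts and dataflow.ts:

```typescript
// Sketch of the try-native-then-fallback shape used by both paths.
type BulkInsert = (batch: number[]) => number; // rows inserted; 0 on error

function writeBulk(
  native: BulkInsert | undefined, // undefined on the WASM engine
  batch: number[],
  jsFallback: (batch: number[]) => number,
): number {
  const expected = batch.length;
  if (native !== undefined) {
    const inserted = native(batch);
    // Success gate: exact count match, or nothing to insert at all.
    if (inserted === expected || expected === 0) return inserted;
    // Mismatch means a Rust-side error or partial result: fall through.
  }
  return jsFallback(batch);
}
```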

Test plan

  • CFG integration tests pass (11/11)
  • Dataflow integration tests pass (23/23)
  • CFG all-langs parser tests pass (40/40)
  • Dataflow JS parser tests pass (13/13)
  • Dataflow engine parity tests pass (13/13)
  • Biome lint clean on all changed files
  • Verify Rust compilation in CI (local MSVC linker env issue prevented native build)
  • Benchmark cfgMs + dataflowMs on native full builds after CI produces the addon

Move AST node SQLite inserts from per-row JS iteration to a single
native Rust transaction via napi-rs + rusqlite. The new
bulkInsertAstNodes function opens the DB directly from Rust,
pre-fetches parent node definitions, and inserts all rows in one
transaction — eliminating the JS-native FFI overhead per row.

The JS-side buildAstNodes tries the native fast path first (when
all files have native astNodes arrays), falling back to the existing
JS loop for WASM or mixed-engine scenarios.

Target: astMs < 50ms on native full builds (was ~393ms).

The Rust connection omitted busy_timeout = 5000, which the JS-side
connection.ts sets. Without it, SQLITE_BUSY is returned immediately
on WAL contention instead of retrying for 5 seconds.
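
What the missing pragma costs can be mocked as follows. This is a sketch of the retry behavior SQLite performs internally once busy_timeout is set, not the actual rusqlite code:

```typescript
// Mock of the retry behavior busy_timeout buys: keep retrying a "busy"
// result up to a budget instead of failing on the first attempt.
// Hypothetical sketch; SQLite does this internally once the pragma is set.
function withBusyRetry(
  attempt: () => number | "busy",
  maxRetries: number, // stand-in for the 5000 ms time budget
): number | "busy" {
  let result: number | "busy" = attempt();
  for (let i = 0; i < maxRetries && result === "busy"; i++) {
    result = attempt(); // with the pragma, SQLite retries until the budget expires
  }
  return result;
}
```

Without the pragma the first "busy" result is final, which is exactly the behavior the commit fixes.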

bulkInsertAstNodes returns 0 for both "nothing to insert" and hard
errors (DB open failure, SQLITE_BUSY, etc). Compare expected vs actual
count and fall through to the JS path on mismatch so errors don't
silently drop all AST nodes.

Explain why bundled SQLite is intentional: Windows CI lacks system
SQLite, and dual-instance WAL coordination is OS-safe.

The Rust find_parent_id skipped definitions with end_line = NULL,
but the JS findParentDef treats them as always-enclosing with a
negative span (preferred over wider defs). This caused parent_node_id
mismatches between native and JS paths.
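
Under my reading of the description above, the JS semantics the Rust side had to match look roughly like this (hypothetical shapes, not the real findParentDef):

```typescript
// Sketch of the described findParentDef behavior: a def with
// endLine = null is treated as always-enclosing and given a negative
// span so it beats any wider concrete def.
interface Def { id: number; startLine: number; endLine: number | null }

function findParentDef(defs: Def[], line: number): Def | undefined {
  let best: Def | undefined;
  let bestSpan = Number.POSITIVE_INFINITY;
  for (const d of defs) {
    const encloses =
      d.endLine === null || (line >= d.startLine && line <= d.endLine);
    if (!encloses) continue;
    const span = d.endLine === null ? -1 : d.endLine - d.startLine;
    if (span < bestSpan) { best = d; bestSpan = span; }
  }
  return best;
}
```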

Return 0 immediately on any insert_stmt.execute() failure so the
transaction drops and rolls back, ensuring all-or-nothing semantics.
Previously, .is_ok() silently swallowed row-level errors, which could
commit partial data and misfire the JS fallback, causing duplicate rows.
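
A mock illustration of the all-or-nothing semantics; the real code relies on a rusqlite transaction rolling back when it is dropped without a commit, not on staging in memory:

```typescript
// Mock of all-or-nothing bulk insert: any row failure returns 0 and
// commits nothing, mirroring a dropped (rolled-back) transaction.
// Hypothetical shapes, not the real rusqlite code.
function bulkInsertAllOrNothing(
  table: number[],
  rows: number[],
  execute: (row: number) => boolean, // per-row insert; false = failure
): number {
  const staged: number[] = [];
  for (const row of rows) {
    if (!execute(row)) return 0; // "roll back": staged rows are discarded
    staged.push(row);
  }
  table.push(...staged); // "commit" only after every row succeeded
  return staged.length;
}
```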

Move CFG block/edge and dataflow edge inserts from JS iteration to Rust
bulk operations, following the same pattern as bulk_insert_ast_nodes (6.9).

Rust side:
- cfg_db.rs: bulk_insert_cfg() resolves function node IDs, deletes stale
  data, inserts blocks+edges in a single rusqlite transaction
- dataflow_db.rs: bulk_insert_dataflow() pre-builds node resolution cache
  (local-first, global fallback), inserts edges in a single transaction

JS side:
- cfg.ts: native fast path collects CfgFunctionBatch[] and delegates to
  Rust when all CFG is pre-computed by the native engine
- dataflow.ts: native fast path converts DataflowResult (argFlows,
  assignments, mutations) into FileDataflowBatch[] for Rust insertion
- Both fall back to existing JS paths when native addon is unavailable

Target: cfgMs + dataflowMs < 50ms combined (from ~286ms with JS iteration)
@greptile-apps
Contributor

greptile-apps bot commented Mar 27, 2026

Greptile Summary

This PR extends the bulk-insert Rust optimization pattern (established in 6.9 for AST nodes) to the CFG and dataflow DB write phases, targeting a combined cfgMs + dataflowMs under 50 ms by bypassing JS iteration loops. The implementation follows the same shape as ast_db.rs: two new rusqlite-backed NAPI functions (bulk_insert_cfg, bulk_insert_dataflow), native fast paths in buildCFGData / buildDataflowEdges, and JS fallbacks for WASM environments.

Key findings:

- CFG path is correct and safe. cfg_db.rs always deletes stale rows before inserting (per function_node_id), so the JS fallback path — which also deletes before inserting — is idempotent even after a partial Rust commit. The fallback condition processed === batches.length || expectedFunctions === 0 from the previous review cycle is correctly implemented.

- Dataflow path has a data-integrity issue (P1). dataflow_db.rs does not delete existing rows before inserting. When any edge references a function name that can't be resolved (e.g. external lib calls), the Rust transaction commits a partial result and inserted < expectedEdges. The JS fallback then re-inserts all resolvable edges via a plain INSERT — the dataflow table has no UNIQUE constraint, so duplicate rows accumulate silently and appear in all dataflowData() query results. The fix mirrors the build-stmts.ts per-file DELETE FROM dataflow WHERE source_id IN (SELECT id FROM nodes WHERE file = ?) pattern and should be applied inside bulk_insert_dataflow before the insert transaction.

- The bulkInsertDataflow fallback signal semantics are otherwise correctly addressed vs. the prior review thread: inserted === expectedEdges || expectedEdges === 0 is the right success gate and the code falls through on mismatch.

Confidence Score: 3/5

Not safe to merge as-is: the dataflow bulk-insert path can create duplicate DB rows in normal usage, corrupting query results.

Prior P1 feedback on fallback logic is well-addressed for both cfg.ts and dataflow.ts. The CFG implementation is correct and idempotent. However, a new concrete P1 remains: dataflow_db.rs commits partial results without a preceding DELETE, and the JS fallback re-inserts the same resolved edges into an unconstrained table. This is not a hypothetical — any project with calls to external/unresolvable functions will trigger it. A focused one-function fix (add DELETE-before-insert in bulk_insert_dataflow matching the build-stmts.ts pattern) would bring this to a 4 or 5.

crates/codegraph-core/src/dataflow_db.rs and src/features/dataflow.ts — the duplicate-on-fallback path.
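
The effect of the suggested per-file DELETE-before-insert fix can be mocked like this (hypothetical shapes, mirroring the build-stmts.ts pattern with an in-memory stand-in for the dataflow table):

```typescript
// Mock sketch of per-file DELETE-before-insert: deleting a file's rows
// before inserting makes the JS fallback idempotent even after a
// partial native commit. Hypothetical shapes, not the real schema.
interface Edge { file: string; src: number; dst: number }

function insertFileEdges(table: Edge[], file: string, edges: Edge[]): void {
  // Delete any stale or partially committed rows for this file first...
  for (let i = table.length - 1; i >= 0; i--) {
    if (table[i].file === file) table.splice(i, 1);
  }
  // ...then insert, so re-running the fallback cannot duplicate rows.
  table.push(...edges);
}
```

Without the delete step, re-running the insert after a partial commit would append duplicates, which is the P1 described above.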

Important Files Changed

- crates/codegraph-core/src/cfg_db.rs: New bulk CFG inserter; correctly deletes stale rows before inserting per function_node_id, making the JS fallback idempotent. Returns total of resolved-node functions. Minor: 0 is overloaded (empty-batch AND error), which is tolerable given caller guards.
- crates/codegraph-core/src/dataflow_db.rs: New bulk dataflow inserter; local-first + global-fallback node resolution matches JS behavior. Missing: no per-file DELETE before inserting, so a partial-success commit followed by JS fallback produces duplicate rows in the dataflow table.
- src/features/cfg.ts: Native fast path guarded by allNative; fallback check (processed === batches.length || expectedFunctions === 0) is correctly implemented.
- src/features/dataflow.ts: Native fast path fallback logic addresses prior feedback, but falls through to a pure-INSERT JS path after a committed partial Rust result — no delete guard means duplicates in the dataflow table when unresolvable nodes cause inserted < expectedEdges.
- crates/codegraph-core/src/lib.rs: Adds cfg_db and dataflow_db module declarations; straightforward change.
- src/types.ts: Adds NativeAddon interface signatures for bulkInsertCfg and bulkInsertDataflow matching Rust struct definitions; no issues.

Sequence Diagram

sequenceDiagram
    participant JS as buildDataflowEdges (TS)
    participant Rust as bulk_insert_dataflow (Rust)
    participant SQLite as SQLite DB

    JS->>JS: Collect file batches from symbols.dataflow
    JS->>Rust: bulkInsertDataflow(db.name, batches)
    Rust->>SQLite: Open connection (READ_WRITE)
    Rust->>SQLite: Phase 1 — resolve node IDs into cache
    Rust->>SQLite: BEGIN TRANSACTION
    loop For each edge
        alt Node resolved
            Rust->>SQLite: INSERT INTO dataflow (...)
        else Node unresolvable
            Rust->>Rust: continue (skip, no delete)
        end
    end
    Rust->>SQLite: COMMIT
    Rust-->>JS: inserted count

    alt inserted === expectedEdges OR expectedEdges === 0
        JS->>JS: return (success)
    else inserted < expectedEdges (unresolvable nodes)
        note over SQLite: Partial edges already committed
        JS->>SQLite: JS fallback — INSERT same resolvable edges again
        note over SQLite: Duplicates created, no UNIQUE constraint
    end

Reviews (2): Last reviewed commit: "fix(db): add JS fallback on bulk-insert ..."

Comment on lines +357 to +363
if (!needsJsFallback) {
  const processed = native.bulkInsertCfg(db.name, batches);
  const withBlocks = batches.filter((b) => b.blocks.length > 0).length;
  if (processed > 0) {
    info(`CFG: ${withBlocks} functions analyzed (native bulk)`);
  }
  return;
Contributor

P1 Silent data loss on bulk-insert failure — no JS fallback

bulk_insert_cfg returning 0 indicates a Rust-side error (DB connection failure, missing table, failed transaction, etc.). The current code unconditionally returns after calling it, meaning any such failure silently drops all CFG writes without attempting the JS fallback path. This diverges from the established pattern in ast.ts, which explicitly checks the count and falls through on failure.

Compare ast.ts:

if (inserted === expectedNodes) {
  debug(`AST extraction (native bulk): ${inserted} nodes stored`);
  return;
}
debug(`AST extraction (native bulk): expected ${expectedNodes}, got ${inserted} — falling back to JS`);
// fall through to JS path

The CFG path should follow the same pattern — check the return value and only return on success:

if (!needsJsFallback) {
  const processed = native.bulkInsertCfg(db.name, batches);
  const withBlocks = batches.filter((b) => b.blocks.length > 0).length;
  if (processed > 0 || withBlocks === 0) {
    if (processed > 0) info(`CFG: ${withBlocks} functions analyzed (native bulk)`);
    return;
  }
  debug(`CFG: bulk insert returned 0 — falling back to JS`);
  // fall through
}

Contributor Author

Fixed — cfg.ts now checks processed === batches.length and falls through to the JS path on mismatch, matching the ast.ts reference pattern. Also handles the empty-blocks edge case by checking expectedFunctions === 0.

Comment on lines +333 to +337
if (!needsJsFallback) {
  const inserted = native.bulkInsertDataflow(db.name, batches);
  info(`Dataflow: ${inserted} edges inserted (native bulk)`);
  return;
}
Contributor

P1 Silent data loss on bulk-insert failure — no JS fallback

Same issue as in cfg.ts: bulk_insert_dataflow returning 0 silently drops all dataflow writes without attempting the JS fallback. If all batches have edges but Rust returns 0, no edges are written and no fallback is attempted while the caller observes a clean return.

The ast.ts pattern protects against this by checking inserted === expectedNodes. The dataflow path should do the same:

if (!needsJsFallback) {
  const inserted = native.bulkInsertDataflow(db.name, batches);
  const expectedEdges = batches.reduce((s, b) => s + b.edges.length, 0);
  if (inserted === expectedEdges || expectedEdges === 0) {
    info(`Dataflow: ${inserted} edges inserted (native bulk)`);
    return;
  }
  debug(
    `Dataflow: bulk insert expected ${expectedEdges} edges, got ${inserted} — falling back to JS`,
  );
  // fall through to JS path
}

Contributor Author

Fixed — dataflow.ts now checks inserted === expectedEdges and falls through to the JS path on mismatch, matching the ast.ts reference pattern.

Match the ast.ts reference pattern: check the Rust return value
against the expected count and fall through to the JS path when
they disagree, preventing silent data loss on Rust-side errors.
@carlos-alm
Contributor Author

Addressed all Greptile feedback in 1a20f46:

  • P1 (cfg.ts): Added count-mismatch fallback — processed === batches.length check before returning, with debug() log and fall-through to JS on mismatch.
  • P1 (dataflow.ts): Same pattern — inserted === expectedEdges check before returning.
  • P2 (cfg_db.rs empty-blocks count): The JS-side check now compares against batches.length (total functions including empty-block ones), so the Rust return value aligns correctly. The empty-blocks edge case is handled by the expectedFunctions === 0 guard.

All 11 CFG and 23 dataflow integration tests pass.

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit b302648 into main Mar 27, 2026
27 checks passed
@carlos-alm carlos-alm deleted the perf/ast-node-bulk-insert branch March 27, 2026 07:51
@github-actions github-actions bot locked and limited conversation to collaborators Mar 27, 2026
