Skip to content

feat(tracing): granular spans + auto-index migration for SQLite#10

Merged
anoop-narang merged 2 commits intomainfrom
feat/granular-tracing-spans
Mar 18, 2026
Merged

feat(tracing): granular spans + auto-index migration for SQLite#10
anoop-narang merged 2 commits intomainfrom
feat/granular-tracing-spans

Conversation

@anoop-narang
Copy link
Collaborator

Summary

  • Add granular tracing spans inside usearch_execute to break down the opaque single span into:
    • usearch_hnsw_search: pure HNSW index lookup (records k and dims)
    • usearch_sqlite_fetch: SQLite point lookup by keys (records fetch count)
    • usearch_attach_distances: distance column assembly
  • Auto-create index on key column for legacy SQLite tables that were built before the INTEGER PRIMARY KEY fix (PR fix(sqlite-provider): use caller-provided key column name #9). On startup, ensure_key_index() checks for an existing PK or secondary index and creates one if missing.

This fixed a 40x performance regression — legacy tables without a key index were doing full table scans (216ms for 5 rows from 1.2M), now 0.1ms.

Test plan

  • cargo fmt --check — clean
  • cargo clippy -- -D warnings — clean
  • cargo test --features sqlite-provider — all 7 tests pass
  • Verified auto-migration creates index on legacy table (no PK)
  • Verified spans appear in Jaeger via OTEL
  • Benchmarked: 250ms → 6ms end-to-end on 1.2M row dataset

…ch_distances

Break down the opaque usearch_execute span into sub-spans:
- usearch_hnsw_search: pure HNSW index lookup (with k and dims)
- usearch_sqlite_fetch: point lookup by keys (with fetch count)
- usearch_attach_distances: distance column assembly

Applied to both unfiltered and filtered (high-selectivity) paths.
@anoop-narang anoop-narang force-pushed the feat/granular-tracing-spans branch 2 times, most recently from bd02ce8 to 4cfb4cc Compare March 18, 2026 14:04
let has_index: bool = conn
.query_row(
"SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name=?1 AND sql LIKE ?2",
rusqlite::params![table_name, format!("%{}%", quote_ident(key_col))],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — LIKE wildcards in column name cause false positives

quote_ident(key_col) wraps the name in double-quotes but doesn't escape SQLite LIKE metacharacters (% and _). With the commonly-used key column _key, the pattern becomes %"_key"%, where _ is a LIKE wildcard matching any single character. This matches index SQL containing "akey", "bkey", etc. — any column whose name ends in key. If such an index exists on the table, has_index returns true and the migration is silently skipped.

Suggested fix — escape % and _ in the pattern and add an ESCAPE clause:

Suggested change
rusqlite::params![table_name, format!("%{}%", quote_ident(key_col))],
rusqlite::params![table_name, format!("%{}%", quote_ident(key_col).replace('%', "\\%").replace('_', "\\_")),

And change the query to:

... AND sql LIKE ?2 ESCAPE '\'

Alternatively, avoid SQL text matching entirely and use pragma_index_list / pragma_index_info to inspect indexes programmatically — that approach handles all edge cases (unquoted identifiers, multi-column indexes, auto indexes).

// Check if any index already covers the key column.
let has_index: bool = conn
.query_row(
"SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name=?1 AND sql LIKE ?2",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 — SQL text matching is brittle for manually created indexes

Matching against sqlite_master.sql with a LIKE pattern on the quoted identifier means indexes created without quoting (e.g. CREATE INDEX idx ON t(key_col)) won't be detected. The result is a harmless duplicate index (different name, SQLite allows it), but it's silent noise on every startup for any table where the index was created outside this code path.

More robust: use pragma_index_list(?1) to enumerate indexes on the table, then pragma_index_info(idx_name) to check column names — no SQL text matching needed.

claude[bot]
claude bot previously approved these changes Mar 18, 2026
col = quote_ident(key_col),
),
[],
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — Race condition: missing IF NOT EXISTS

If two processes start simultaneously against the same database (a realistic deployment pattern), both can pass the has_index check above and then both attempt CREATE INDEX "idx_table_key". The second attempt hits SQLITE_ERROR: index already exists and propagates as a DataFusionError, aborting provider initialization for that process.

The same failure occurs if the has_index LIKE check produces a false negative (see below) and an index with the same auto-generated name already exists in the database.

Suggested change
)
conn.execute(
&format!(
"CREATE INDEX IF NOT EXISTS {idx} ON {tn}({col})",
idx = quote_ident(&format!("idx_{table_name}_{key_col}")),
tn = quote_ident(table_name),
col = quote_ident(key_col),
),
[],
)
.map_err(|e| DataFusionError::Execution(format!("failed to create key index: {e}")))?;

Comment on lines +132 to +139
let has_index: bool = conn
.query_row(
"SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name=?1 AND sql LIKE ?2",
rusqlite::params![table_name, format!("%{}%", quote_ident(key_col))],
|row| row.get::<_, i64>(0),
)
.map(|n| n > 0)
.unwrap_or(false);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 (suggestion) — has_index check against sqlite_master.sql text is fragile

Two problems with the LIKE-on-SQL approach:

  1. False negatives: If an index was created with unquoted identifiers (CREATE INDEX idx ON t(col) rather than CREATE INDEX "idx" ON "t"("col")), the pattern %"col"% won't match even though the index exists. Combined with the missing IF NOT EXISTS (see other comment), this can cause a startup failure if the auto-generated index name collides with the existing one.

  2. LIKE wildcards in column names: quote_ident(key_col) doesn't escape % or _, so a column named e.g. key_100% would produce a LIKE pattern with unescaped metacharacters.

More robust approach — use pragma_index_list + pragma_index_info to check coverage structurally rather than parsing SQL text:

let has_index: bool = {
    let mut stmt = conn
        .prepare(&format!(
            "SELECT ii.name FROM pragma_index_list({tn}) AS il
             JOIN pragma_index_info(il.name) AS ii ON ii.name = ?1",
            tn = quote_ident(table_name),
        ))
        .unwrap_or_else(|_| return);
    stmt.exists(rusqlite::params![key_col]).unwrap_or(false)
};

Non-blocking suggestion — adding IF NOT EXISTS (other comment) makes the current check safe enough for this use case.

@anoop-narang anoop-narang force-pushed the feat/granular-tracing-spans branch from 37c3d92 to 6862b99 Compare March 18, 2026 14:12
On startup, ensure_key_index() checks whether the key column has a
PRIMARY KEY or secondary index. If neither exists, it creates one.
This prevents accidental full table scans if a table is ever built
or altered without a proper key index.
@anoop-narang anoop-narang force-pushed the feat/granular-tracing-spans branch from 6862b99 to 09f63d8 Compare March 18, 2026 14:12
.await
}
.instrument(tracing::info_span!(
"usearch_sqlite_fetch",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (non-blocking): The span name "usearch_sqlite_fetch" is baked into lookup_provider.fetch_by_keys(...), but lookup_provider is provider-agnostic — it could be a ParquetLookupProvider. This will appear as usearch_sqlite_fetch in traces even when no SQLite is involved, which may confuse consumers of the telemetry.

Consider "usearch_lookup_fetch" (or similar) here and at the corresponding site in adaptive_filtered_execute (~line 517).

@anoop-narang
Copy link
Collaborator Author

Addressed all bot review comments:

  • P1 (race condition): Added IF NOT EXISTS to CREATE INDEX statement.
  • P2 (LIKE wildcards): Replaced sqlite_master LIKE matching with pragma_index_list + pragma_index_info — inspects indexes programmatically, no SQL text matching.
  • P3 (brittle SQL matching): Same fix as P2.

@anoop-narang
Copy link
Collaborator Author

Re: span name usearch_sqlite_fetch — acknowledged, but keeping as-is for now. SQLite is the only lookup provider implementation and the specific name helps in trace analysis. If we add other backends we can rename then.

@anoop-narang anoop-narang merged commit a016358 into main Mar 18, 2026
5 checks passed
@anoop-narang anoop-narang deleted the feat/granular-tracing-spans branch March 18, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant