Skip to content

fix(sqlite-provider): use caller-provided key column name#9

Merged
anoop-narang merged 2 commits intomainfrom
fix/sqlite-provider-dynamic-key-col
Mar 18, 2026
Merged

fix(sqlite-provider): use caller-provided key column name#9
anoop-narang merged 2 commits intomainfrom
fix/sqlite-provider-dynamic-key-col

Conversation

@anoop-narang
Copy link
Collaborator

Summary

  • SqliteLookupProvider previously hardcoded row_idx as the key column name in CREATE TABLE and WHERE clauses
  • This caused no such column: row_idx errors when the caller's schema used a different key column name (e.g. _key)
  • Now derives the key column name from the first field in the provided schema, matching the existing convention that the first schema field is always the primary key

Test plan

  • All existing tests pass (they use row_idx as key col — still works since name comes from schema)
  • cargo fmt --check clean
  • cargo clippy --all-features -- -D warnings clean

a.iter()
.zip(b.iter())
.map(|(x, y)| (x - y) * (x - y))
.sum::<f32>()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — breaking API change bundled into a bug-fix PR.

Removing .sqrt() changes l2_distance from returning actual Euclidean distance to returning squared L2. While sort order is preserved (ranking is still correct), any user with a threshold-based filter silently gets wrong results:

-- before: this excluded rows farther than 1.5 actual units apart
-- after:  this excludes rows farther than sqrt(1.5) ≈ 1.22 actual units apart
WHERE l2_distance(vector, query) < 1.5

The consistency argument (match USearch's internal L2sq) is valid, but the right fix is to normalize the index path (apply sqrt to USearch's returned distances) — not to change the UDF's contract. The function name l2_distance strongly implies actual Euclidean distance.

Options:

  1. Revert this change and apply sqrt to USearch distances on the index path instead.
  2. If L2sq semantics are intentional, rename to l2sq_distance (or l2_distance_squared) so callers aren't silently broken.

This change should also be a separate PR from the SQLite key-column fix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR only contains the sqlite-provider key column fix. The l2_distance sqrt removal was in PR #8 (already merged and squash-merged to main). After rebasing, the diff for this PR contains only the sqlite_provider.rs changes. The bot appears to have reviewed a stale diff before the rebase.

let sql = format!(
"SELECT id FROM items WHERE label = 'nonexistent' ORDER BY l2_distance(vector, {Q}) ASC LIMIT 2"
);
let ids = collect_ids(&ctx, &sql).await;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — the actual bug (non-row_idx key column in SqliteLookupProvider) has no regression test.

The PR test plan notes that existing tests still use row_idx, which is coincidentally the default. A test that builds a SqliteLookupProvider with a schema whose first field is named something other than row_idx (e.g. _key) and then calls fetch_by_keys would directly exercise the fixed code path and guard against future regressions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 6934d3atest_custom_key_column_name exercises the provider with _key as the key column, testing both fetch_by_keys and projection.

Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Issues

P1 — src/udf.rs line 31: breaking semantic change to l2_distance

Removing .sqrt() changes the UDF's return value from actual Euclidean distance to squared L2. Sort order is preserved so ANN ranking is unaffected, but threshold-based filters silently produce wrong results for any threshold ≠ 1.0. This is a public API contract violation bundled into a SQLite bug-fix PR.

P2 — tests/execution.rs: no regression test for the stated bug fix

All SQLite-provider tests use row_idx as the key column, which happened to work before the fix. There is no test that exercises SqliteLookupProvider with a non-row_idx key column (e.g. _key), which is the exact scenario the PR claims to fix.

Action Required

  1. Either revert the l2_distance sqrt removal and normalize the index path instead, or rename the UDF to l2sq_distance to avoid silent breakage for callers with distance thresholds. Either way, split it into a separate PR.
  2. Add a SqliteLookupProvider test that uses a key column name other than row_idx.

The SqliteLookupProvider previously hardcoded "row_idx" as the key
column name in CREATE TABLE and WHERE clauses. This caused errors
when callers used a different key column name (e.g. "_key").

Now derives the key column name from the first field in the provided
schema, making the provider work with any key column name.
@anoop-narang anoop-narang force-pushed the fix/sqlite-provider-dynamic-key-col branch from f70c668 to e49e0d1 Compare March 18, 2026 10:53
parquet_col_indices: &[usize],
) -> DFResult<Self> {
// The first field in the schema is the key column (INTEGER PRIMARY KEY).
let key_col = schema.field(0).name().clone();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: all existing tests happen to use row_idx as the key column name, so the only coverage here is "dynamic name produces the same SQL as the old hardcoded string." Consider adding a test in sqlite_provider_test.rs that builds a provider with a schema whose first field is named something other than row_idx (e.g. _key) and confirms that fetch_by_keys returns the correct rows. That would give you a direct regression guard for the actual bug scenario.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added test_custom_key_column_name that uses _key as the key column name.

claude[bot]
claude bot previously approved these changes Mar 18, 2026
Exercises SqliteLookupProvider with "_key" as the key column (the
scenario used by runtimedb), verifying both fetch_by_keys and
projection work with non-default key column names.
let keys_vec = keys.to_vec();
let pool = self.pool.clone();
let table_name = self.table_name.clone();
let key_col = self.key_col.clone();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking suggestion: fetch_by_keys takes a _key_col: &str parameter (from the PointLookupProvider trait) that is silently ignored here in favour of self.key_col. This was already the pre-existing behaviour in HashKeyProvider, so this PR is consistent, but it means a caller who passes the wrong column name gets no feedback.\n\nConsider a debug_assert_eq!(_key_col, self.key_col) guard, or removing the parameter from the trait entirely in a follow-up.

@anoop-narang anoop-narang merged commit 42abb40 into main Mar 18, 2026
5 checks passed
@anoop-narang anoop-narang deleted the fix/sqlite-provider-dynamic-key-col branch March 18, 2026 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant