Skip to content

refactor(embeddings): unify update_index sync/async via shared helpers#246

Open
lstein wants to merge 1 commit into
masterfrom
lstein/refactor/update-index-unify
Open

refactor(embeddings): unify update_index sync/async via shared helpers#246
lstein wants to merge 1 commit into
masterfrom
lstein/refactor/update-index-unify

Conversation

@lstein
Copy link
Copy Markdown
Owner

@lstein lstein commented May 20, 2026

Summary

update_index and update_index_async were carrying ~80 lines of identical orchestration each — npz load + encoder-spec check, new/missing scan, filter, empty-input bail, process batch, noop early-return, combine + save, UMAP attach. A subtle drift between the two paths would silently misbehave on one but not the other.

Factored into two private helpers:

  • _load_existing_index_arrays() — opens the .npz, returns a typed _ExistingIndex NamedTuple, raises EmbeddingCacheMismatch when the stored encoder spec differs from the configured one. Replaces ~20 lines of duplicated load code on each path.

  • _finalize_index_update(filtered_existing, new_result, missing_count) — returns (IndexResult, did_rebuild). When did_rebuild=False (no new files, no removed files) the save is skipped entirely and the caller keeps using the existing on-disk index. Replaces ~25 lines of duplicated combine/save/noop logic.

Both helpers take an optional on_save_start callback so the async path can flip its progress tracker to "Saving updated index" only when there's actually going to be a save — without the hook, the noop path would briefly show a stale "Saving" message before the noop completion.

Both top-level methods now read as thin orchestrators of the same six steps. The sync↔async difference is localized to:

  • asyncio.to_thread around the blocking calls (scan, finalize, UMAP property access)
  • progress_tracker calls keyed on album_key

The shared logic that previously drifted silently is now structurally guaranteed to agree.

Behavioral note

Dropped a misleading log line in the sync path. The old code did:

combined_result.umap_embeddings = self.umap_embeddings   # full-set UMAP
assert new_result.umap_embeddings is not None
logger.info(f"UMAP index created with shape: {new_result.umap_embeddings.shape}")  # ← new-batch only!

new_result.umap_embeddings was the UMAP of the new batch only (computed inside _process_images_batch), but the result returned to the caller had the full-combined-set UMAP attached. The log line was reporting the wrong shape. The refactor logs the shape that's actually returned.

Test plan

  • ruff check photomap tests — clean
  • pytest tests/backend — 257 passed
  • npm test — 288 passed (not touched but ran for safety)
  • Manual: index a fresh album end-to-end and confirm the progress UI walks scanning → indexing → mapping correctly
  • Manual: re-trigger update_index on the same album with no file changes and confirm the noop path completes without "Saving updated index" flicker
  • Manual: add one image to an indexed album, rerun update — confirm the new image is searchable

Net +164 / −133 (one file). The line count is up slightly because the new helpers carry docstrings explaining the contracts, but ~80 lines of behaviorally-duplicated code is now gone.

🤖 Generated with Claude Code

`update_index` and `update_index_async` were carrying ~80 lines of
identical orchestration each:

* npz load + encoder-spec mismatch check
* scan for new/missing images
* filter missing images
* empty-input early exit
* process the new batch
* "no changes" noop-return that skips save + UMAP rebuild
* combine + save
* attach UMAP

Factored out two private helpers:

* `_load_existing_index_arrays()` — opens the npz, returns a typed
  `_ExistingIndex` NamedTuple, raises `EmbeddingCacheMismatch` when the
  stored encoder spec differs from the configured one
* `_finalize_index_update(filtered_existing, new_result, missing_count)`
  — returns `(IndexResult, did_rebuild)`. When nothing changed, no save
  runs and the caller keeps using the existing on-disk index. An
  optional `on_save_start` hook fires just before the save so the async
  path can flip its progress tracker to "Saving updated index" only
  when there's actually going to be a save (the noop path would
  otherwise briefly show a stale message).

Both top-level methods now read like thin orchestrators — the sync vs
async difference is localized to `asyncio.to_thread` wrapping and
`progress_tracker` calls. The shared logic that previously drifted
silently between paths is now structurally guaranteed to agree.

Also dropped a misleading log line: the sync method used to print the
UMAP shape from `new_result.umap_embeddings` (UMAP of the *new batch
only*), then attached `self.umap_embeddings` (UMAP of the *full
combined set*) to the returned result. The new version logs the shape
that's actually returned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant