refactor(embeddings): unify update_index sync/async via shared helpers #246
Open
lstein wants to merge 1 commit into
`update_index` and `update_index_async` were carrying ~80 lines of identical orchestration each:

* npz load + encoder-spec mismatch check
* scan for new/missing images
* filter missing images
* empty-input early exit
* process the new batch
* "no changes" noop-return that skips save + UMAP rebuild
* combine + save
* attach UMAP

Factored out two private helpers:

* `_load_existing_index_arrays()`: opens the npz, returns a typed `_ExistingIndex` NamedTuple, raises `EmbeddingCacheMismatch` when the stored encoder spec differs from the configured one
* `_finalize_index_update(filtered_existing, new_result, missing_count)`: returns `(IndexResult, did_rebuild)`. When nothing changed, no save runs and the caller keeps using the existing on-disk index. An optional `on_save_start` hook fires just before the save so the async path can flip its progress tracker to "Saving updated index" only when there's actually going to be a save (the noop path would otherwise briefly show a stale message).

Both top-level methods now read like thin orchestrators; the sync vs async difference is localized to `asyncio.to_thread` wrapping and `progress_tracker` calls. The shared logic that previously drifted silently between paths is now structurally guaranteed to agree.

Also dropped a misleading log line: the sync method used to print the UMAP shape from `new_result.umap_embeddings` (UMAP of the *new batch only*), then attached `self.umap_embeddings` (UMAP of the *full combined set*) to the returned result. The new version logs the shape that's actually returned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`update_index` and `update_index_async` were carrying ~80 lines of identical orchestration each: npz load + encoder-spec check, new/missing scan, filter, empty-input bail, process batch, noop early-return, combine + save, UMAP attach. A subtle drift between the two paths would silently misbehave on one but not the other.

Factored into two private helpers:

* `_load_existing_index_arrays()`: opens the .npz, returns a typed `_ExistingIndex` NamedTuple, raises `EmbeddingCacheMismatch` when the stored encoder spec differs from the configured one. Replaces ~20 lines of duplicated load code on each path.
* `_finalize_index_update(filtered_existing, new_result, missing_count)`: returns `(IndexResult, did_rebuild)`. When `did_rebuild=False` (no new files, no removed files) the save is skipped entirely and the caller keeps using the existing on-disk index. Replaces ~25 lines of duplicated combine/save/noop logic.

Both helpers take an optional `on_save_start` callback so the async path can flip its progress tracker to "Saving updated index" only when there's actually going to be a save. Without the hook, the noop path would briefly show a stale "Saving" message before the noop completion.

Both top-level methods now read as thin orchestrators of the same six steps. The sync↔async difference is localized to:

* `asyncio.to_thread` around the blocking calls (scan, finalize, UMAP property access)
* `progress_tracker` calls keyed on `album_key`

The shared logic that previously drifted silently is now structurally guaranteed to agree.
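For readers who want the shape of the refactor without opening the diff, here is a minimal sketch of the pattern, not the code in this PR. It assumes a toy `Indexer` class whose "disk" is a plain dict instead of an .npz file, filename lists instead of embedding arrays, and a hypothetical `_plan` helper standing in for the scan and filter steps. Only the helper names, the `(IndexResult, did_rebuild)` return shape, the `on_save_start` hook, and the `asyncio.to_thread` wrapping come from the description above.

```python
import asyncio
from typing import Callable, List, NamedTuple, Optional, Tuple


class EmbeddingCacheMismatch(Exception):
    """Stored encoder spec differs from the configured one."""


class _ExistingIndex(NamedTuple):
    filenames: List[str]
    encoder_spec: str


class IndexResult(NamedTuple):
    filenames: List[str]


class Indexer:
    """Toy stand-in: a dict plays the role of the on-disk .npz file."""

    def __init__(self, disk: dict, encoder_spec: str) -> None:
        self.disk = disk
        self.encoder_spec = encoder_spec
        self.saves = 0  # counts how often the "file" was rewritten

    def _load_existing_index_arrays(self) -> _ExistingIndex:
        existing = _ExistingIndex(
            list(self.disk["filenames"]), self.disk["encoder_spec"]
        )
        if existing.encoder_spec != self.encoder_spec:
            raise EmbeddingCacheMismatch(
                f"stored {existing.encoder_spec!r}, "
                f"configured {self.encoder_spec!r}"
            )
        return existing

    def _finalize_index_update(
        self,
        filtered_existing: List[str],
        new_result: List[str],  # simplified: just the new filenames
        missing_count: int,
        on_save_start: Optional[Callable[[], None]] = None,
    ) -> Tuple[IndexResult, bool]:
        # Noop path: nothing added, nothing removed, so skip the save.
        if not new_result and missing_count == 0:
            return IndexResult(filtered_existing), False
        if on_save_start is not None:
            on_save_start()  # fires only when a save will really happen
        combined = filtered_existing + new_result
        self.disk["filenames"] = combined
        self.saves += 1
        return IndexResult(combined), True

    def _plan(self, current: List[str]) -> Tuple[List[str], List[str], int]:
        # Hypothetical shared scan + filter step.
        existing = self._load_existing_index_arrays()
        present, known = set(current), set(existing.filenames)
        kept = [f for f in existing.filenames if f in present]
        new = [f for f in current if f not in known]
        return kept, new, len(existing.filenames) - len(kept)

    def update_index(self, current: List[str]) -> IndexResult:
        kept, new, missing = self._plan(current)
        result, _ = self._finalize_index_update(kept, new, missing)
        return result

    async def update_index_async(
        self, current: List[str], messages: List[str]
    ) -> IndexResult:
        # Same steps; blocking calls are pushed onto a worker thread.
        kept, new, missing = await asyncio.to_thread(self._plan, current)
        result, _ = await asyncio.to_thread(
            self._finalize_index_update, kept, new, missing,
            lambda: messages.append("Saving updated index"),
        )
        return result
```

In this sketch the two public methods differ only in the `asyncio.to_thread` wrapping and the progress message, which is the property the refactor is after: any change to the noop rule or the save step is made once, in `_finalize_index_update`.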
Behavioral note
Dropped a misleading log line in the sync path. The old code logged the UMAP shape from `new_result.umap_embeddings`, which was the UMAP of the new batch only (computed inside `_process_images_batch`), but the result returned to the caller had the full-combined-set UMAP attached. The log line was reporting the wrong shape. The refactor logs the shape that's actually returned.

Test plan
* `ruff check photomap tests`: clean
* `pytest tests/backend`: 257 passed
* `npm test`: 288 passed (not touched but ran for safety)
* Run `update_index` on the same album with no file changes and confirm the noop path completes without "Saving updated index" flicker

Net +164 / −133 (one file). The line count is up slightly because the new helpers carry docstrings explaining the contracts, but ~80 lines of behaviorally-duplicated code is now gone.
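The manual noop check in the test plan could plausibly be automated with a recording tracker. The sketch below is an assumption-heavy illustration, not a test from this PR: `RecordingTracker`, `finalize`, `run_update`, and the two completion messages are hypothetical stand-ins, and only the "Saving updated index" string and the `on_save_start` hook come from the description.

```python
from typing import Callable, List, Optional


class RecordingTracker:
    """Hypothetical progress tracker that remembers every message."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def update(self, message: str) -> None:
        self.messages.append(message)


def finalize(
    new_count: int,
    missing_count: int,
    on_save_start: Optional[Callable[[], None]] = None,
) -> bool:
    """Stand-in for the finalize step; returns did_rebuild."""
    if new_count == 0 and missing_count == 0:
        return False  # noop: no save, so the hook must not fire
    if on_save_start is not None:
        on_save_start()
    return True


def run_update(tracker: RecordingTracker, new_count: int, missing_count: int) -> bool:
    did_rebuild = finalize(
        new_count,
        missing_count,
        on_save_start=lambda: tracker.update("Saving updated index"),
    )
    # Completion messages here are made up for the example.
    tracker.update("Index updated" if did_rebuild else "Index already up to date")
    return did_rebuild


def test_noop_never_reports_saving() -> None:
    tracker = RecordingTracker()
    assert run_update(tracker, new_count=0, missing_count=0) is False
    assert "Saving updated index" not in tracker.messages
```

A real version would call `update_index_async` against a fixture album and inspect the actual progress tracker, whose interface is not shown in this description.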
🤖 Generated with Claude Code