fix: correct batch counting and squeeze dim in prott5_embedder.py#170
Open
haoyu-haoyu wants to merge 1 commit intoagemagician:masterfrom
Open
fix: correct batch counting and squeeze dim in prott5_embedder.py#170haoyu-haoyu wants to merge 1 commit intoagemagician:masterfrom
haoyu-haoyu wants to merge 1 commit intoagemagician:masterfrom
Conversation
- Fix double-counting in batch residue accumulation: `seq_len` was counted once via the batch list (after append) and once via the explicit `+ seq_len` term, making batches smaller than intended. Removed the redundant `+ seq_len`. - Specify dim in `.squeeze(0)` instead of bare `.squeeze()`. Without a dim argument, single-residue proteins have their per-residue embedding shape (1, 1024) silently collapsed to (1024,), making them indistinguishable from per-protein embeddings.
There was a problem hiding this comment.
Code Review
This pull request updates the batch residue counting logic and refines the embedding processing in Embedding/prott5_embedder.py. The batch residue count no longer includes the current sequence length before the batch limit check, and the squeeze operation on embeddings is now restricted to the first dimension to prevent accidental reduction of other singleton dimensions. I have no feedback to provide.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two bugs in
Embedding/prott5_embedder.py:Double-counting in batch residue accumulation (line 100): The current sequence is appended to
batchbefore computingn_res_batch, soseq_lenis counted once inside thesum()over the batch and again via the explicit+ seq_len. This causes batches to trigger then_res_batch >= max_residuesthreshold too early, resulting in smaller batches than intended and slower embedding.Bare
.squeeze()without dim (line 131): For single-residue proteins in per-residue mode, the embedding shape(1, 1024)is silently collapsed to(1024,), making it indistinguishable from a per-protein embedding. Changed to.squeeze(0)to only affect the batch dimension.Test plan
batch.append(...), the batch already contains the current sequence, so the separate+ seq_lenis redundant.squeeze(0)is a no-op when the first dimension is > 1, preserving existing behavior for all multi-residue proteins