fix: correct batch counting and squeeze dim in prott5_embedder.py by haoyu-haoyu · Pull Request #170 · agemagician/ProtTrans

haoyu-haoyu · 2026-03-26T20:28:57Z

Summary

Two bugs in Embedding/prott5_embedder.py:

1. Double-counting in batch residue accumulation (line 100): The current sequence is appended to `batch` before `n_res_batch` is computed, so `seq_len` is counted once inside the `sum()` over the batch and again via the explicit `+ seq_len`. As a result, the `n_res_batch >= max_residues` threshold triggers too early, producing smaller batches than intended and slower embedding.
2. Bare `.squeeze()` without a dim (line 131): For single-residue proteins in per-residue mode, the embedding shape (1, 1024) is silently collapsed to (1024,), making it indistinguishable from a per-protein embedding. Changed to `.squeeze(0)` so that only the batch dimension is affected.
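The effect of the first bug can be reproduced with a minimal sketch of the accumulation loop. This is not the code from `prott5_embedder.py`: the function `make_batches` and its `double_count` flag are hypothetical, sequences are reduced to their lengths, and only the residue threshold and end-of-input flush are modelled.

```python
def make_batches(seq_lens, max_residues, double_count):
    """Group sequence lengths into batches, flushing once the residue
    count reaches max_residues or the input is exhausted.

    double_count=True mimics the reported bug: the current sequence is
    already in the batch when the sum is taken, then added once more.
    """
    batches, batch = [], []
    for idx, seq_len in enumerate(seq_lens, 1):
        batch.append(seq_len)
        n_res_batch = sum(batch)  # already includes the current sequence
        if double_count:
            n_res_batch += seq_len  # the redundant term described in the PR
        if n_res_batch >= max_residues or idx == len(seq_lens):
            batches.append(batch)
            batch = []
    return batches


lens = [300, 300, 300, 300]
buggy = make_batches(lens, max_residues=1000, double_count=True)
fixed = make_batches(lens, max_residues=1000, double_count=False)
print(buggy)  # [[300, 300, 300], [300]]
print(fixed)  # [[300, 300, 300, 300]]
```

With the redundant term, the third sequence is counted as 900 + 300 = 1200 residues and the batch is flushed one sequence early, leaving a trailing batch of one.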

Test plan

- Verified the double-counting by tracing the logic: after `batch.append(...)`, the batch already contains the current sequence, so the separate `+ seq_len` is redundant.
- `.squeeze(0)` is a no-op when the first dimension is > 1, preserving existing behavior for all multi-residue proteins.
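The difference between the two squeeze forms can be checked on shapes alone. The helper below is a hypothetical, dependency-free model of how `torch.Tensor.squeeze` transforms a shape tuple; it does not call PyTorch, and the shapes are illustrative assumptions rather than values taken from the script.

```python
def squeeze_shape(shape, dim=None):
    """Model torch.Tensor.squeeze on a shape tuple.

    dim=None drops every size-1 dimension; an explicit dim drops that
    dimension only if its size is 1 and otherwise leaves the shape as is.
    """
    if dim is None:
        return tuple(n for n in shape if n != 1)
    dim %= len(shape)  # allow negative indices
    if shape[dim] == 1:
        return shape[:dim] + shape[dim + 1:]
    return shape


# A single-residue protein with a leading batch dimension (assumed layout).
print(squeeze_shape((1, 1, 1024)))     # (1024,)   residue dimension lost
print(squeeze_shape((1, 1, 1024), 0))  # (1, 1024) residue dimension kept

# Multi-residue embedding: squeeze(0) leaves it untouched.
print(squeeze_shape((7, 1024), 0))     # (7, 1024)

# No leading batch dimension: squeeze(0) removes the residue dimension.
print(squeeze_shape((1, 1024), 0))     # (1024,)
```

The last case suggests that whether `.squeeze(0)` preserves the (1, 1024) shape depends on whether a leading batch dimension is still present at the point of the call, which this page does not show. The no-op behavior is also specific to PyTorch tensors: NumPy's `squeeze(axis=0)` raises a `ValueError` when that axis is not of size 1.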

- Fix double-counting in batch residue accumulation: `seq_len` was counted once via the batch list (after append) and once via the explicit `+ seq_len` term, making batches smaller than intended. Removed the redundant `+ seq_len`.
- Specify dim in `.squeeze(0)` instead of bare `.squeeze()`. Without a dim argument, single-residue proteins have their per-residue embedding shape (1, 1024) silently collapsed to (1024,), making them indistinguishable from per-protein embeddings.

gemini-code-assist

Code Review

This pull request updates the batch residue counting logic and refines the embedding processing in Embedding/prott5_embedder.py. The batch residue count no longer includes the current sequence length before the batch limit check, and the squeeze operation on embeddings is now restricted to the first dimension to prevent accidental reduction of other singleton dimensions. I have no feedback to provide.

gemini-code-assist Bot reviewed Mar 26, 2026

haoyu-haoyu wants to merge 1 commit into agemagician:master from haoyu-haoyu:fix/embedder-bugs
