Add tcrblosum support to TCRdist #685

Open
felixpetschko wants to merge 21 commits into scverse:main from felixpetschko:feature/tcrblosum

Conversation

@felixpetschko
Collaborator

Until now, the TCRdist metric has used a distance matrix derived from the blosum62 substitution matrix. This PR extends TCRdistDistanceCalculator with a new base_matrix="tcrblosum" option alongside the existing default blosum62 behavior. With this option, the TCRdist metric is calculated from distance matrices based on the tcrblosum substitution matrices (separate matrices for the alpha and beta chains).

I illustrate how I derived the tcrblosum-based distance matrices in this Google Colab notebook.

The usage of the tcrblosum matrices was already discussed in #591.

@codecov

codecov Bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 20.00000% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 19.60%. Comparing base (38f05cd) to head (86f4514).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/scirpy/ir_dist/metrics.py | 18.75% | 26 Missing ⚠️ |
| src/scirpy/ir_dist/__init__.py | 33.33% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #685      +/-   ##
==========================================
- Coverage   19.75%   19.60%   -0.16%     
==========================================
  Files          51       51              
  Lines        4581     4607      +26     
==========================================
- Hits          905      903       -2     
- Misses       3676     3704      +28     
| Files with missing lines | Coverage Δ |
|---|---|
| src/scirpy/ir_dist/__init__.py | 21.21% <33.33%> (ø) |
| src/scirpy/ir_dist/metrics.py | 13.58% <18.75%> (-0.99%) ⬇️ |

@felixpetschko felixpetschko requested a review from grst April 6, 2026 10:01
@grst grst added the run-gpu-ci runs GPU CI label Apr 7, 2026
Collaborator

@grst grst left a comment

Implementation-wise this looks great!

What's still missing is

  • changelog update
  • Documentation update for the user-facing method (pp.ir_dist). Probably best to add a new metric, tcrblosum or tcrdist_tcrblosum.
  • Reference to the TCRblosum paper in the documentation
  • Maybe tutorial update?

Comment thread src/scirpy/ir_dist/metrics.py Outdated
# fmt: off
tcr_dict_distance_matrix = {('A', 'A'): 0, ('A', 'C'): 4, ('A', 'D'): 4, ('A', 'E'): 4, ('A', 'F'): 4, ('A', 'G'): 4, ('A', 'H'): 4, ('A', 'I'): 4, ('A', 'K'): 4, ('A', 'L'): 4, ('A', 'M'): 4, ('A', 'N'): 4, ('A', 'P'): 4, ('A', 'Q'): 4, ('A', 'R'): 4, ('A', 'S'): 3, ('A', 'T'): 4, ('A', 'V'): 4, ('A', 'W'): 4, ('A', 'Y'): 4, ('C', 'A'): 4, ('C', 'C'): 0, ('C', 'D'): 4, ('C', 'E'): 4, ('C', 'F'): 4, ('C', 'G'): 4, ('C', 'H'): 4, ('C', 'I'): 4, ('C', 'K'): 4, ('C', 'L'): 4, ('C', 'M'): 4, ('C', 'N'): 4, ('C', 'P'): 4, ('C', 'Q'): 4, ('C', 'R'): 4, ('C', 'S'): 4, ('C', 'T'): 4, ('C', 'V'): 4, ('C', 'W'): 4, ('C', 'Y'): 4, ('D', 'A'): 4, ('D', 'C'): 4, ('D', 'D'): 0, ('D', 'E'): 2, ('D', 'F'): 4, ('D', 'G'): 4, ('D', 'H'): 4, ('D', 'I'): 4, ('D', 'K'): 4, ('D', 'L'): 4, ('D', 'M'): 4, ('D', 'N'): 3, ('D', 'P'): 4, ('D', 'Q'): 4, ('D', 'R'): 4, ('D', 'S'): 4, ('D', 'T'): 4, ('D', 'V'): 4, ('D', 'W'): 4, ('D', 'Y'): 4, ('E', 'A'): 4, ('E', 'C'): 4, ('E', 'D'): 2, ('E', 'E'): 0, ('E', 'F'): 4, ('E', 'G'): 4, ('E', 'H'): 4, ('E', 'I'): 4, ('E', 'K'): 3, ('E', 'L'): 4, ('E', 'M'): 4, ('E', 'N'): 4, ('E', 'P'): 4, ('E', 'Q'): 2, ('E', 'R'): 4, ('E', 'S'): 4, ('E', 'T'): 4, ('E', 'V'): 4, ('E', 'W'): 4, ('E', 'Y'): 4, ('F', 'A'): 4, ('F', 'C'): 4, ('F', 'D'): 4, ('F', 'E'): 4, ('F', 'F'): 0, ('F', 'G'): 4, ('F', 'H'): 4, ('F', 'I'): 4, ('F', 'K'): 4, ('F', 'L'): 4, ('F', 'M'): 4, ('F', 'N'): 4, ('F', 'P'): 4, ('F', 'Q'): 4, ('F', 'R'): 4, ('F', 'S'): 4, ('F', 'T'): 4, ('F', 'V'): 4, ('F', 'W'): 3, ('F', 'Y'): 1, ('G', 'A'): 4, ('G', 'C'): 4, ('G', 'D'): 4, ('G', 'E'): 4, ('G', 'F'): 4, ('G', 'G'): 0, ('G', 'H'): 4, ('G', 'I'): 4, ('G', 'K'): 4, ('G', 'L'): 4, ('G', 'M'): 4, ('G', 'N'): 4, ('G', 'P'): 4, ('G', 'Q'): 4, ('G', 'R'): 4, ('G', 'S'): 4, ('G', 'T'): 4, ('G', 'V'): 4, ('G', 'W'): 4, ('G', 'Y'): 4, ('H', 'A'): 4, ('H', 'C'): 4, ('H', 'D'): 4, ('H', 'E'): 4, ('H', 'F'): 4, ('H', 'G'): 4, ('H', 'H'): 0, ('H', 'I'): 4, ('H', 'K'): 4, ('H', 'L'): 4, ('H', 'M'): 4, ('H', 
'N'): 3, ('H', 'P'): 4, ('H', 'Q'): 4, ('H', 'R'): 4, ('H', 'S'): 4, ('H', 'T'): 4, ('H', 'V'): 4, ('H', 'W'): 4, ('H', 'Y'): 2, ('I', 'A'): 4, ('I', 'C'): 4, ('I', 'D'): 4, ('I', 'E'): 4, ('I', 'F'): 4, ('I', 'G'): 4, ('I', 'H'): 4, ('I', 'I'): 0, ('I', 'K'): 4, ('I', 'L'): 2, ('I', 'M'): 3, ('I', 'N'): 4, ('I', 'P'): 4, ('I', 'Q'): 4, ('I', 'R'): 4, ('I', 'S'): 4, ('I', 'T'): 4, ('I', 'V'): 1, ('I', 'W'): 4, ('I', 'Y'): 4, ('K', 'A'): 4, ('K', 'C'): 4, ('K', 'D'): 4, ('K', 'E'): 3, ('K', 'F'): 4, ('K', 'G'): 4, ('K', 'H'): 4, ('K', 'I'): 4, ('K', 'K'): 0, ('K', 'L'): 4, ('K', 'M'): 4, ('K', 'N'): 4, ('K', 'P'): 4, ('K', 'Q'): 3, ('K', 'R'): 2, ('K', 'S'): 4, ('K', 'T'): 4, ('K', 'V'): 4, ('K', 'W'): 4, ('K', 'Y'): 4, ('L', 'A'): 4, ('L', 'C'): 4, ('L', 'D'): 4, ('L', 'E'): 4, ('L', 'F'): 4, ('L', 'G'): 4, ('L', 'H'): 4, ('L', 'I'): 2, ('L', 'K'): 4, ('L', 'L'): 0, ('L', 'M'): 2, ('L', 'N'): 4, ('L', 'P'): 4, ('L', 'Q'): 4, ('L', 'R'): 4, ('L', 'S'): 4, ('L', 'T'): 4, ('L', 'V'): 3, ('L', 'W'): 4, ('L', 'Y'): 4, ('M', 'A'): 4, ('M', 'C'): 4, ('M', 'D'): 4, ('M', 'E'): 4, ('M', 'F'): 4, ('M', 'G'): 4, ('M', 'H'): 4, ('M', 'I'): 3, ('M', 'K'): 4, ('M', 'L'): 2, ('M', 'M'): 0, ('M', 'N'): 4, ('M', 'P'): 4, ('M', 'Q'): 4, ('M', 'R'): 4, ('M', 'S'): 4, ('M', 'T'): 4, ('M', 'V'): 3, ('M', 'W'): 4, ('M', 'Y'): 4, ('N', 'A'): 4, ('N', 'C'): 4, ('N', 'D'): 3, ('N', 'E'): 4, ('N', 'F'): 4, ('N', 'G'): 4, ('N', 'H'): 3, ('N', 'I'): 4, ('N', 'K'): 4, ('N', 'L'): 4, ('N', 'M'): 4, ('N', 'N'): 0, ('N', 'P'): 4, ('N', 'Q'): 4, ('N', 'R'): 4, ('N', 'S'): 3, ('N', 'T'): 4, ('N', 'V'): 4, ('N', 'W'): 4, ('N', 'Y'): 4, ('P', 'A'): 4, ('P', 'C'): 4, ('P', 'D'): 4, ('P', 'E'): 4, ('P', 'F'): 4, ('P', 'G'): 4, ('P', 'H'): 4, ('P', 'I'): 4, ('P', 'K'): 4, ('P', 'L'): 4, ('P', 'M'): 4, ('P', 'N'): 4, ('P', 'P'): 0, ('P', 'Q'): 4, ('P', 'R'): 4, ('P', 'S'): 4, ('P', 'T'): 4, ('P', 'V'): 4, ('P', 'W'): 4, ('P', 'Y'): 4, ('Q', 'A'): 4, ('Q', 'C'): 4, ('Q', 'D'): 4, ('Q', 'E'): 2, ('Q', 
'F'): 4, ('Q', 'G'): 4, ('Q', 'H'): 4, ('Q', 'I'): 4, ('Q', 'K'): 3, ('Q', 'L'): 4, ('Q', 'M'): 4, ('Q', 'N'): 4, ('Q', 'P'): 4, ('Q', 'Q'): 0, ('Q', 'R'): 3, ('Q', 'S'): 4, ('Q', 'T'): 4, ('Q', 'V'): 4, ('Q', 'W'): 4, ('Q', 'Y'): 4, ('R', 'A'): 4, ('R', 'C'): 4, ('R', 'D'): 4, ('R', 'E'): 4, ('R', 'F'): 4, ('R', 'G'): 4, ('R', 'H'): 4, ('R', 'I'): 4, ('R', 'K'): 2, ('R', 'L'): 4, ('R', 'M'): 4, ('R', 'N'): 4, ('R', 'P'): 4, ('R', 'Q'): 3, ('R', 'R'): 0, ('R', 'S'): 4, ('R', 'T'): 4, ('R', 'V'): 4, ('R', 'W'): 4, ('R', 'Y'): 4, ('S', 'A'): 3, ('S', 'C'): 4, ('S', 'D'): 4, ('S', 'E'): 4, ('S', 'F'): 4, ('S', 'G'): 4, ('S', 'H'): 4, ('S', 'I'): 4, ('S', 'K'): 4, ('S', 'L'): 4, ('S', 'M'): 4, ('S', 'N'): 3, ('S', 'P'): 4, ('S', 'Q'): 4, ('S', 'R'): 4, ('S', 'S'): 0, ('S', 'T'): 3, ('S', 'V'): 4, ('S', 'W'): 4, ('S', 'Y'): 4, ('T', 'A'): 4, ('T', 'C'): 4, ('T', 'D'): 4, ('T', 'E'): 4, ('T', 'F'): 4, ('T', 'G'): 4, ('T', 'H'): 4, ('T', 'I'): 4, ('T', 'K'): 4, ('T', 'L'): 4, ('T', 'M'): 4, ('T', 'N'): 4, ('T', 'P'): 4, ('T', 'Q'): 4, ('T', 'R'): 4, ('T', 'S'): 3, ('T', 'T'): 0, ('T', 'V'): 4, ('T', 'W'): 4, ('T', 'Y'): 4, ('V', 'A'): 4, ('V', 'C'): 4, ('V', 'D'): 4, ('V', 'E'): 4, ('V', 'F'): 4, ('V', 'G'): 4, ('V', 'H'): 4, ('V', 'I'): 1, ('V', 'K'): 4, ('V', 'L'): 3, ('V', 'M'): 3, ('V', 'N'): 4, ('V', 'P'): 4, ('V', 'Q'): 4, ('V', 'R'): 4, ('V', 'S'): 4, ('V', 'T'): 4, ('V', 'V'): 0, ('V', 'W'): 4, ('V', 'Y'): 4, ('W', 'A'): 4, ('W', 'C'): 4, ('W', 'D'): 4, ('W', 'E'): 4, ('W', 'F'): 3, ('W', 'G'): 4, ('W', 'H'): 4, ('W', 'I'): 4, ('W', 'K'): 4, ('W', 'L'): 4, ('W', 'M'): 4, ('W', 'N'): 4, ('W', 'P'): 4, ('W', 'Q'): 4, ('W', 'R'): 4, ('W', 'S'): 4, ('W', 'T'): 4, ('W', 'V'): 4, ('W', 'W'): 0, ('W', 'Y'): 2, ('Y', 'A'): 4, ('Y', 'C'): 4, ('Y', 'D'): 4, ('Y', 'E'): 4, ('Y', 'F'): 1, ('Y', 'G'): 4, ('Y', 'H'): 2, ('Y', 'I'): 4, ('Y', 'K'): 4, ('Y', 'L'): 4, ('Y', 'M'): 4, ('Y', 'N'): 4, ('Y', 'P'): 4, ('Y', 'Q'): 4, ('Y', 'R'): 4, ('Y', 'S'): 4, ('Y', 'T'): 4, ('Y', 
'V'): 4, ('Y', 'W'): 2, ('Y', 'Y'): 0}
blosum62_distance_matrix = {('A', 'A'): 0, ('A', 'C'): 4, ('A', 'D'): 4, ('A', 'E'): 4, ('A', 'F'): 4, ('A', 'G'): 4, ('A', 'H'): 4, ('A', 'I'): 4, ('A', 'K'): 4, ('A', 'L'): 4, ('A', 'M'): 4, ('A', 'N'): 4, ('A', 'P'): 4, ('A', 'Q'): 4, ('A', 'R'): 4, ('A', 'S'): 3, ('A', 'T'): 4, ('A', 'V'): 4, ('A', 'W'): 4, ('A', 'Y'): 4, ('C', 'A'): 4, ('C', 'C'): 0, ('C', 'D'): 4, ('C', 'E'): 4, ('C', 'F'): 4, ('C', 'G'): 4, ('C', 'H'): 4, ('C', 'I'): 4, ('C', 'K'): 4, ('C', 'L'): 4, ('C', 'M'): 4, ('C', 'N'): 4, ('C', 'P'): 4, ('C', 'Q'): 4, ('C', 'R'): 4, ('C', 'S'): 4, ('C', 'T'): 4, ('C', 'V'): 4, ('C', 'W'): 4, ('C', 'Y'): 4, ('D', 'A'): 4, ('D', 'C'): 4, ('D', 'D'): 0, ('D', 'E'): 2, ('D', 'F'): 4, ('D', 'G'): 4, ('D', 'H'): 4, ('D', 'I'): 4, ('D', 'K'): 4, ('D', 'L'): 4, ('D', 'M'): 4, ('D', 'N'): 3, ('D', 'P'): 4, ('D', 'Q'): 4, ('D', 'R'): 4, ('D', 'S'): 4, ('D', 'T'): 4, ('D', 'V'): 4, ('D', 'W'): 4, ('D', 'Y'): 4, ('E', 'A'): 4, ('E', 'C'): 4, ('E', 'D'): 2, ('E', 'E'): 0, ('E', 'F'): 4, ('E', 'G'): 4, ('E', 'H'): 4, ('E', 'I'): 4, ('E', 'K'): 3, ('E', 'L'): 4, ('E', 'M'): 4, ('E', 'N'): 4, ('E', 'P'): 4, ('E', 'Q'): 2, ('E', 'R'): 4, ('E', 'S'): 4, ('E', 'T'): 4, ('E', 'V'): 4, ('E', 'W'): 4, ('E', 'Y'): 4, ('F', 'A'): 4, ('F', 'C'): 4, ('F', 'D'): 4, ('F', 'E'): 4, ('F', 'F'): 0, ('F', 'G'): 4, ('F', 'H'): 4, ('F', 'I'): 4, ('F', 'K'): 4, ('F', 'L'): 4, ('F', 'M'): 4, ('F', 'N'): 4, ('F', 'P'): 4, ('F', 'Q'): 4, ('F', 'R'): 4, ('F', 'S'): 4, ('F', 'T'): 4, ('F', 'V'): 4, ('F', 'W'): 3, ('F', 'Y'): 1, ('G', 'A'): 4, ('G', 'C'): 4, ('G', 'D'): 4, ('G', 'E'): 4, ('G', 'F'): 4, ('G', 'G'): 0, ('G', 'H'): 4, ('G', 'I'): 4, ('G', 'K'): 4, ('G', 'L'): 4, ('G', 'M'): 4, ('G', 'N'): 4, ('G', 'P'): 4, ('G', 'Q'): 4, ('G', 'R'): 4, ('G', 'S'): 4, ('G', 'T'): 4, ('G', 'V'): 4, ('G', 'W'): 4, ('G', 'Y'): 4, ('H', 'A'): 4, ('H', 'C'): 4, ('H', 'D'): 4, ('H', 'E'): 4, ('H', 'F'): 4, ('H', 'G'): 4, ('H', 'H'): 0, ('H', 'I'): 4, ('H', 'K'): 4, ('H', 'L'): 4, ('H', 'M'): 4, ('H', 
'N'): 3, ('H', 'P'): 4, ('H', 'Q'): 4, ('H', 'R'): 4, ('H', 'S'): 4, ('H', 'T'): 4, ('H', 'V'): 4, ('H', 'W'): 4, ('H', 'Y'): 2, ('I', 'A'): 4, ('I', 'C'): 4, ('I', 'D'): 4, ('I', 'E'): 4, ('I', 'F'): 4, ('I', 'G'): 4, ('I', 'H'): 4, ('I', 'I'): 0, ('I', 'K'): 4, ('I', 'L'): 2, ('I', 'M'): 3, ('I', 'N'): 4, ('I', 'P'): 4, ('I', 'Q'): 4, ('I', 'R'): 4, ('I', 'S'): 4, ('I', 'T'): 4, ('I', 'V'): 1, ('I', 'W'): 4, ('I', 'Y'): 4, ('K', 'A'): 4, ('K', 'C'): 4, ('K', 'D'): 4, ('K', 'E'): 3, ('K', 'F'): 4, ('K', 'G'): 4, ('K', 'H'): 4, ('K', 'I'): 4, ('K', 'K'): 0, ('K', 'L'): 4, ('K', 'M'): 4, ('K', 'N'): 4, ('K', 'P'): 4, ('K', 'Q'): 3, ('K', 'R'): 2, ('K', 'S'): 4, ('K', 'T'): 4, ('K', 'V'): 4, ('K', 'W'): 4, ('K', 'Y'): 4, ('L', 'A'): 4, ('L', 'C'): 4, ('L', 'D'): 4, ('L', 'E'): 4, ('L', 'F'): 4, ('L', 'G'): 4, ('L', 'H'): 4, ('L', 'I'): 2, ('L', 'K'): 4, ('L', 'L'): 0, ('L', 'M'): 2, ('L', 'N'): 4, ('L', 'P'): 4, ('L', 'Q'): 4, ('L', 'R'): 4, ('L', 'S'): 4, ('L', 'T'): 4, ('L', 'V'): 3, ('L', 'W'): 4, ('L', 'Y'): 4, ('M', 'A'): 4, ('M', 'C'): 4, ('M', 'D'): 4, ('M', 'E'): 4, ('M', 'F'): 4, ('M', 'G'): 4, ('M', 'H'): 4, ('M', 'I'): 3, ('M', 'K'): 4, ('M', 'L'): 2, ('M', 'M'): 0, ('M', 'N'): 4, ('M', 'P'): 4, ('M', 'Q'): 4, ('M', 'R'): 4, ('M', 'S'): 4, ('M', 'T'): 4, ('M', 'V'): 3, ('M', 'W'): 4, ('M', 'Y'): 4, ('N', 'A'): 4, ('N', 'C'): 4, ('N', 'D'): 3, ('N', 'E'): 4, ('N', 'F'): 4, ('N', 'G'): 4, ('N', 'H'): 3, ('N', 'I'): 4, ('N', 'K'): 4, ('N', 'L'): 4, ('N', 'M'): 4, ('N', 'N'): 0, ('N', 'P'): 4, ('N', 'Q'): 4, ('N', 'R'): 4, ('N', 'S'): 3, ('N', 'T'): 4, ('N', 'V'): 4, ('N', 'W'): 4, ('N', 'Y'): 4, ('P', 'A'): 4, ('P', 'C'): 4, ('P', 'D'): 4, ('P', 'E'): 4, ('P', 'F'): 4, ('P', 'G'): 4, ('P', 'H'): 4, ('P', 'I'): 4, ('P', 'K'): 4, ('P', 'L'): 4, ('P', 'M'): 4, ('P', 'N'): 4, ('P', 'P'): 0, ('P', 'Q'): 4, ('P', 'R'): 4, ('P', 'S'): 4, ('P', 'T'): 4, ('P', 'V'): 4, ('P', 'W'): 4, ('P', 'Y'): 4, ('Q', 'A'): 4, ('Q', 'C'): 4, ('Q', 'D'): 4, ('Q', 'E'): 2, ('Q', 
'F'): 4, ('Q', 'G'): 4, ('Q', 'H'): 4, ('Q', 'I'): 4, ('Q', 'K'): 3, ('Q', 'L'): 4, ('Q', 'M'): 4, ('Q', 'N'): 4, ('Q', 'P'): 4, ('Q', 'Q'): 0, ('Q', 'R'): 3, ('Q', 'S'): 4, ('Q', 'T'): 4, ('Q', 'V'): 4, ('Q', 'W'): 4, ('Q', 'Y'): 4, ('R', 'A'): 4, ('R', 'C'): 4, ('R', 'D'): 4, ('R', 'E'): 4, ('R', 'F'): 4, ('R', 'G'): 4, ('R', 'H'): 4, ('R', 'I'): 4, ('R', 'K'): 2, ('R', 'L'): 4, ('R', 'M'): 4, ('R', 'N'): 4, ('R', 'P'): 4, ('R', 'Q'): 3, ('R', 'R'): 0, ('R', 'S'): 4, ('R', 'T'): 4, ('R', 'V'): 4, ('R', 'W'): 4, ('R', 'Y'): 4, ('S', 'A'): 3, ('S', 'C'): 4, ('S', 'D'): 4, ('S', 'E'): 4, ('S', 'F'): 4, ('S', 'G'): 4, ('S', 'H'): 4, ('S', 'I'): 4, ('S', 'K'): 4, ('S', 'L'): 4, ('S', 'M'): 4, ('S', 'N'): 3, ('S', 'P'): 4, ('S', 'Q'): 4, ('S', 'R'): 4, ('S', 'S'): 0, ('S', 'T'): 3, ('S', 'V'): 4, ('S', 'W'): 4, ('S', 'Y'): 4, ('T', 'A'): 4, ('T', 'C'): 4, ('T', 'D'): 4, ('T', 'E'): 4, ('T', 'F'): 4, ('T', 'G'): 4, ('T', 'H'): 4, ('T', 'I'): 4, ('T', 'K'): 4, ('T', 'L'): 4, ('T', 'M'): 4, ('T', 'N'): 4, ('T', 'P'): 4, ('T', 'Q'): 4, ('T', 'R'): 4, ('T', 'S'): 3, ('T', 'T'): 0, ('T', 'V'): 4, ('T', 'W'): 4, ('T', 'Y'): 4, ('V', 'A'): 4, ('V', 'C'): 4, ('V', 'D'): 4, ('V', 'E'): 4, ('V', 'F'): 4, ('V', 'G'): 4, ('V', 'H'): 4, ('V', 'I'): 1, ('V', 'K'): 4, ('V', 'L'): 3, ('V', 'M'): 3, ('V', 'N'): 4, ('V', 'P'): 4, ('V', 'Q'): 4, ('V', 'R'): 4, ('V', 'S'): 4, ('V', 'T'): 4, ('V', 'V'): 0, ('V', 'W'): 4, ('V', 'Y'): 4, ('W', 'A'): 4, ('W', 'C'): 4, ('W', 'D'): 4, ('W', 'E'): 4, ('W', 'F'): 3, ('W', 'G'): 4, ('W', 'H'): 4, ('W', 'I'): 4, ('W', 'K'): 4, ('W', 'L'): 4, ('W', 'M'): 4, ('W', 'N'): 4, ('W', 'P'): 4, ('W', 'Q'): 4, ('W', 'R'): 4, ('W', 'S'): 4, ('W', 'T'): 4, ('W', 'V'): 4, ('W', 'W'): 0, ('W', 'Y'): 2, ('Y', 'A'): 4, ('Y', 'C'): 4, ('Y', 'D'): 4, ('Y', 'E'): 4, ('Y', 'F'): 1, ('Y', 'G'): 4, ('Y', 'H'): 2, ('Y', 'I'): 4, ('Y', 'K'): 4, ('Y', 'L'): 4, ('Y', 'M'): 4, ('Y', 'N'): 4, ('Y', 'P'): 4, ('Y', 'Q'): 4, ('Y', 'R'): 4, ('Y', 'S'): 4, ('Y', 'T'): 4, ('Y', 
'V'): 4, ('Y', 'W'): 2, ('Y', 'Y'): 0}
tcrblosum_alpha_distance_matrix = {('A', 'A'): 0, ('A', 'R'): 4, ('A', 'N'): 4, ('A', 'D'): 4, ('A', 'C'): 4, ('A', 'Q'): 4, ('A', 'E'): 4, ('A', 'G'): 4, ('A', 'H'): 4, ('A', 'I'): 4, ('A', 'L'): 4, ('A', 'K'): 4, ('A', 'M'): 4, ('A', 'F'): 4, ('A', 'P'): 4, ('A', 'S'): 4, ('A', 'T'): 4, ('A', 'W'): 4, ('A', 'Y'): 4, ('A', 'V'): 4, ('R', 'A'): 4, ('R', 'R'): 0, ('R', 'N'): 4, ('R', 'D'): 4, ('R', 'C'): 3, ('R', 'Q'): 4, ('R', 'E'): 4, ('R', 'G'): 4, ('R', 'H'): 4, ('R', 'I'): 4, ('R', 'L'): 4, ('R', 'K'): 4, ('R', 'M'): 4, ('R', 'F'): 4, ('R', 'P'): 4, ('R', 'S'): 4, ('R', 'T'): 4, ('R', 'W'): 4, ('R', 'Y'): 4, ('R', 'V'): 4, ('N', 'A'): 4, ('N', 'R'): 4, ('N', 'N'): 0, ('N', 'D'): 4, ('N', 'C'): 4, ('N', 'Q'): 4, ('N', 'E'): 4, ('N', 'G'): 4, ('N', 'H'): 4, ('N', 'I'): 4, ('N', 'L'): 4, ('N', 'K'): 3, ('N', 'M'): 4, ('N', 'F'): 4, ('N', 'P'): 4, ('N', 'S'): 4, ('N', 'T'): 4, ('N', 'W'): 4, ('N', 'Y'): 4, ('N', 'V'): 4, ('D', 'A'): 4, ('D', 'R'): 4, ('D', 'N'): 4, ('D', 'D'): 0, ('D', 'C'): 4, ('D', 'Q'): 4, ('D', 'E'): 4, ('D', 'G'): 4, ('D', 'H'): 4, ('D', 'I'): 4, ('D', 'L'): 4, ('D', 'K'): 4, ('D', 'M'): 4, ('D', 'F'): 4, ('D', 'P'): 4, ('D', 'S'): 4, ('D', 'T'): 4, ('D', 'W'): 4, ('D', 'Y'): 4, ('D', 'V'): 4, ('C', 'A'): 4, ('C', 'R'): 3, ('C', 'N'): 4, ('C', 'D'): 4, ('C', 'C'): 0, ('C', 'Q'): 4, ('C', 'E'): 4, ('C', 'G'): 4, ('C', 'H'): 4, ('C', 'I'): 4, ('C', 'L'): 4, ('C', 'K'): 4, ('C', 'M'): 4, ('C', 'F'): 4, ('C', 'P'): 4, ('C', 'S'): 4, ('C', 'T'): 4, ('C', 'W'): 4, ('C', 'Y'): 4, ('C', 'V'): 4, ('Q', 'A'): 4, ('Q', 'R'): 4, ('Q', 'N'): 4, ('Q', 'D'): 4, ('Q', 'C'): 4, ('Q', 'Q'): 0, ('Q', 'E'): 4, ('Q', 'G'): 4, ('Q', 'H'): 4, ('Q', 'I'): 4, ('Q', 'L'): 4, ('Q', 'K'): 3, ('Q', 'M'): 4, ('Q', 'F'): 4, ('Q', 'P'): 4, ('Q', 'S'): 4, ('Q', 'T'): 4, ('Q', 'W'): 4, ('Q', 'Y'): 4, ('Q', 'V'): 4, ('E', 'A'): 4, ('E', 'R'): 4, ('E', 'N'): 4, ('E', 'D'): 4, ('E', 'C'): 4, ('E', 'Q'): 4, ('E', 'E'): 0, ('E', 'G'): 4, ('E', 'H'): 3, ('E', 'I'): 4, ('E', 'L'): 4, 
('E', 'K'): 4, ('E', 'M'): 4, ('E', 'F'): 4, ('E', 'P'): 4, ('E', 'S'): 4, ('E', 'T'): 4, ('E', 'W'): 4, ('E', 'Y'): 4, ('E', 'V'): 4, ('G', 'A'): 4, ('G', 'R'): 4, ('G', 'N'): 4, ('G', 'D'): 4, ('G', 'C'): 4, ('G', 'Q'): 4, ('G', 'E'): 4, ('G', 'G'): 0, ('G', 'H'): 4, ('G', 'I'): 4, ('G', 'L'): 4, ('G', 'K'): 4, ('G', 'M'): 4, ('G', 'F'): 4, ('G', 'P'): 4, ('G', 'S'): 4, ('G', 'T'): 4, ('G', 'W'): 4, ('G', 'Y'): 4, ('G', 'V'): 4, ('H', 'A'): 4, ('H', 'R'): 4, ('H', 'N'): 4, ('H', 'D'): 4, ('H', 'C'): 4, ('H', 'Q'): 4, ('H', 'E'): 3, ('H', 'G'): 4, ('H', 'H'): 0, ('H', 'I'): 4, ('H', 'L'): 4, ('H', 'K'): 4, ('H', 'M'): 4, ('H', 'F'): 4, ('H', 'P'): 3, ('H', 'S'): 4, ('H', 'T'): 4, ('H', 'W'): 3, ('H', 'Y'): 4, ('H', 'V'): 4, ('I', 'A'): 4, ('I', 'R'): 4, ('I', 'N'): 4, ('I', 'D'): 4, ('I', 'C'): 4, ('I', 'Q'): 4, ('I', 'E'): 4, ('I', 'G'): 4, ('I', 'H'): 4, ('I', 'I'): 0, ('I', 'L'): 4, ('I', 'K'): 4, ('I', 'M'): 4, ('I', 'F'): 4, ('I', 'P'): 4, ('I', 'S'): 4, ('I', 'T'): 3, ('I', 'W'): 4, ('I', 'Y'): 4, ('I', 'V'): 4, ('L', 'A'): 4, ('L', 'R'): 4, ('L', 'N'): 4, ('L', 'D'): 4, ('L', 'C'): 4, ('L', 'Q'): 4, ('L', 'E'): 4, ('L', 'G'): 4, ('L', 'H'): 4, ('L', 'I'): 4, ('L', 'L'): 0, ('L', 'K'): 4, ('L', 'M'): 4, ('L', 'F'): 3, ('L', 'P'): 4, ('L', 'S'): 4, ('L', 'T'): 4, ('L', 'W'): 4, ('L', 'Y'): 4, ('L', 'V'): 4, ('K', 'A'): 4, ('K', 'R'): 4, ('K', 'N'): 3, ('K', 'D'): 4, ('K', 'C'): 4, ('K', 'Q'): 3, ('K', 'E'): 4, ('K', 'G'): 4, ('K', 'H'): 4, ('K', 'I'): 4, ('K', 'L'): 4, ('K', 'K'): 0, ('K', 'M'): 4, ('K', 'F'): 4, ('K', 'P'): 4, ('K', 'S'): 4, ('K', 'T'): 4, ('K', 'W'): 4, ('K', 'Y'): 4, ('K', 'V'): 4, ('M', 'A'): 4, ('M', 'R'): 4, ('M', 'N'): 4, ('M', 'D'): 4, ('M', 'C'): 4, ('M', 'Q'): 4, ('M', 'E'): 4, ('M', 'G'): 4, ('M', 'H'): 4, ('M', 'I'): 4, ('M', 'L'): 4, ('M', 'K'): 4, ('M', 'M'): 0, ('M', 'F'): 4, ('M', 'P'): 4, ('M', 'S'): 4, ('M', 'T'): 4, ('M', 'W'): 4, ('M', 'Y'): 4, ('M', 'V'): 4, ('F', 'A'): 4, ('F', 'R'): 4, ('F', 'N'): 4, ('F', 'D'): 4, 
('F', 'C'): 4, ('F', 'Q'): 4, ('F', 'E'): 4, ('F', 'G'): 4, ('F', 'H'): 4, ('F', 'I'): 4, ('F', 'L'): 3, ('F', 'K'): 4, ('F', 'M'): 4, ('F', 'F'): 0, ('F', 'P'): 4, ('F', 'S'): 4, ('F', 'T'): 4, ('F', 'W'): 4, ('F', 'Y'): 4, ('F', 'V'): 4, ('P', 'A'): 4, ('P', 'R'): 4, ('P', 'N'): 4, ('P', 'D'): 4, ('P', 'C'): 4, ('P', 'Q'): 4, ('P', 'E'): 4, ('P', 'G'): 4, ('P', 'H'): 3, ('P', 'I'): 4, ('P', 'L'): 4, ('P', 'K'): 4, ('P', 'M'): 4, ('P', 'F'): 4, ('P', 'P'): 0, ('P', 'S'): 4, ('P', 'T'): 4, ('P', 'W'): 4, ('P', 'Y'): 4, ('P', 'V'): 4, ('S', 'A'): 4, ('S', 'R'): 4, ('S', 'N'): 4, ('S', 'D'): 4, ('S', 'C'): 4, ('S', 'Q'): 4, ('S', 'E'): 4, ('S', 'G'): 4, ('S', 'H'): 4, ('S', 'I'): 4, ('S', 'L'): 4, ('S', 'K'): 4, ('S', 'M'): 4, ('S', 'F'): 4, ('S', 'P'): 4, ('S', 'S'): 0, ('S', 'T'): 4, ('S', 'W'): 4, ('S', 'Y'): 4, ('S', 'V'): 4, ('T', 'A'): 4, ('T', 'R'): 4, ('T', 'N'): 4, ('T', 'D'): 4, ('T', 'C'): 4, ('T', 'Q'): 4, ('T', 'E'): 4, ('T', 'G'): 4, ('T', 'H'): 4, ('T', 'I'): 3, ('T', 'L'): 4, ('T', 'K'): 4, ('T', 'M'): 4, ('T', 'F'): 4, ('T', 'P'): 4, ('T', 'S'): 4, ('T', 'T'): 0, ('T', 'W'): 4, ('T', 'Y'): 4, ('T', 'V'): 4, ('W', 'A'): 4, ('W', 'R'): 4, ('W', 'N'): 4, ('W', 'D'): 4, ('W', 'C'): 4, ('W', 'Q'): 4, ('W', 'E'): 4, ('W', 'G'): 4, ('W', 'H'): 3, ('W', 'I'): 4, ('W', 'L'): 4, ('W', 'K'): 4, ('W', 'M'): 4, ('W', 'F'): 4, ('W', 'P'): 4, ('W', 'S'): 4, ('W', 'T'): 4, ('W', 'W'): 0, ('W', 'Y'): 4, ('W', 'V'): 4, ('Y', 'A'): 4, ('Y', 'R'): 4, ('Y', 'N'): 4, ('Y', 'D'): 4, ('Y', 'C'): 4, ('Y', 'Q'): 4, ('Y', 'E'): 4, ('Y', 'G'): 4, ('Y', 'H'): 4, ('Y', 'I'): 4, ('Y', 'L'): 4, ('Y', 'K'): 4, ('Y', 'M'): 4, ('Y', 'F'): 4, ('Y', 'P'): 4, ('Y', 'S'): 4, ('Y', 'T'): 4, ('Y', 'W'): 4, ('Y', 'Y'): 0, ('Y', 'V'): 4, ('V', 'A'): 4, ('V', 'R'): 4, ('V', 'N'): 4, ('V', 'D'): 4, ('V', 'C'): 4, ('V', 'Q'): 4, ('V', 'E'): 4, ('V', 'G'): 4, ('V', 'H'): 4, ('V', 'I'): 4, ('V', 'L'): 4, ('V', 'K'): 4, ('V', 'M'): 4, ('V', 'F'): 4, ('V', 'P'): 4, ('V', 'S'): 4, ('V', 'T'): 4, 
('V', 'W'): 4, ('V', 'Y'): 4, ('V', 'V'): 0}
Collaborator

Did you make any modifications to the original TCRblosum matrix? e.g. the cap at 4?

Collaborator Author

@felixpetschko felixpetschko Apr 7, 2026

First of all, there is always a conversion from the substitution matrix (blosum62, tcrblosum) to a distance matrix.
In the tcrblosum paper, they just mention "Firstly, we transformed the tcrBLOSUM similarity matrix into a distance matrix according to the rules of TCRdist [12]".

In the original TCRdist paper they write:
"The mismatch distance is defined based on the BLOSUM62 (ref. 37) substitution matrix as follows: distance (a, a)=0; distance (a, b)=min (4, 4-BLOSUM62 (a, b)), where 4 is 1 unit greater than the most favourable BLOSUM62 score for a mismatch, and a and b are amino acids".

Now there would be two options:

  1. also use the constant cap 4 like in the original tcrdist paper
  2. use a cap that is one unit greater than the most favourable score

However, the most favourable mismatch score is 1 for tcrblosum_alpha and 2 for tcrblosum_beta, which would result in quite low caps of 2 and 3, respectively (see the matrices in the notebook). Therefore, I went with the fixed cap of 4 and used the original TCRdist transformation from substitution matrix to distance matrix:
distance(a, a) = 0 and distance(a, b) = min(4, 4 - score)
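
As a rough illustration of the fixed-cap rule, here is a minimal sketch. It is not the scirpy implementation: the function name `score_to_distance` and the `toy_scores` table are hypothetical, and the table lists only a handful of BLOSUM62 mismatch scores chosen for the example.

```python
def score_to_distance(a: str, b: str, score: int, cap: int = 4) -> int:
    """TCRdist-style mismatch distance.

    Identical residues get distance 0; otherwise the distance is
    min(cap, cap - score), so any score <= 0 saturates at the cap.
    """
    if a == b:
        return 0
    return min(cap, cap - score)


# A few BLOSUM62 mismatch scores, used here only as example input.
toy_scores = {
    ("A", "S"): 1,
    ("D", "E"): 2,
    ("F", "Y"): 3,
    ("A", "C"): 0,
    ("A", "W"): -3,
}

toy_distances = {
    pair: score_to_distance(pair[0], pair[1], score)
    for pair, score in toy_scores.items()
}

for pair, distance in toy_distances.items():
    print(pair, distance)
```

Only the pairs with a positive score end up below the cap; the zero and negative scores all collapse to 4.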

Unfortunately, I wasn't able to find out exactly how they did it in the tcrblosum paper. Another thing to consider is that the distance values for the alpha and beta chains might be compared later during clonotype clustering, so having two different caps might cause problems.

What do you think would be the best option?

Collaborator

Dear @apostovskaya,

we are working on integrating your tcrBLOSUM matrix into scirpy, which contains a reimplementation of the TCRdist algorithm.

We are unsure how you intended the matrix to be used with TCRdist. Could you please clarify what you mean by

Firstly, we transformed the tcrBLOSUM similarity matrix into a distance matrix according to the rules of TCRdist [12]

in your paper and how you would suggest setting the cap?

Collaborator

"The mismatch distance is defined based on the BLOSUM62 (ref. 37) substitution matrix as follows: distance (a, a)=0; distance (a, b)=min (4, 4-BLOSUM62 (a, b)), where 4 is 1 unit greater than the most favourable BLOSUM62 score for a mismatch, and a and b are amino acids".

tbh, I never understood this part about TCRdist. It skews the matrix quite a bit, basically assigning a distance of 4 to all mismatches that have a zero or negative score in BLOSUM62 (even if it's just -1). So the only way to get a distance $1 \leq d \leq 3$ is if you have one of the rare pairs with a positive mismatch score according to BLOSUM62.

Having a cap of 2 with TCRblosum doesn't make sense to me, because then the strong negative effect of mismatches with C would be completely gone, which defeats the purpose of TCRblosum. Using 4 is just as arbitrary.

@felixpetschko
Collaborator Author

I have now updated the approach according to what was discussed via email. The distance matrices are now computed from the original substitution matrices (blosum62, tcrblosum_alpha, tcrblosum_beta) rather than stored as precomputed distance tables.

In general, the calculation is now done as follows:
distance(a, a) = 0 and distance(a, b) = min(distance_cap, distance_offset - score)

The base_matrix (Literal["blosum62", "tcrblosum"], default "blosum62") and distance_cap (int | None | Literal["default"], default "default") parameters can be set in the calling ir_dist method. For blosum62, distance_cap=4 by default and distance_offset=4, as before. For the tcrblosum matrices, distance_cap=None (no cap) by default and distance_offset=3, which is the global maximum off-diagonal substitution score across the alpha and beta matrices (2) plus 1. This offset maps the most favorable off-diagonal score across both tcrblosum matrices (2) to distance 1. That way, the blosum62 defaults follow the original TCRdist paper, while the tcrblosum defaults should preserve the signal present in the sparser tcrblosum matrices.
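
The generalized rule could be sketched roughly like this. This is an illustration under stated assumptions and not the code in this PR: `build_distance_matrix` is a hypothetical helper, the three-letter alphabet is chosen for brevity, and the substitution scores are made up rather than taken from blosum62 or tcrblosum.

```python
import numpy as np


def build_distance_matrix(scores, alphabet, distance_offset, distance_cap=None):
    """Turn substitution scores into a symmetric distance lookup table.

    distance(a, a) = 0
    distance(a, b) = distance_offset - score, capped at distance_cap
                     when a cap is given (None means no cap).
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    dm = np.zeros((len(alphabet), len(alphabet)), dtype=np.int32)
    for (a, b), score in scores.items():
        if a == b:
            continue  # the diagonal stays at 0
        d = distance_offset - score
        if distance_cap is not None:
            d = min(distance_cap, d)
        dm[index[a], index[b]] = d
        dm[index[b], index[a]] = d
    return dm


alphabet = "ACS"
# Made-up scores for illustration only.
toy_scores = {("A", "S"): 2, ("A", "C"): -4, ("C", "S"): 0}

# tcrblosum-style defaults described above: offset 3, no cap.
uncapped = build_distance_matrix(toy_scores, alphabet, distance_offset=3)
# blosum62-style defaults described above: offset 4, cap 4.
capped = build_distance_matrix(
    toy_scores, alphabet, distance_offset=4, distance_cap=4
)

print(uncapped)
print(capped)
```

Without a cap, the strongly negative score keeps its large distance (7 in this toy case), whereas the capped variant flattens it to 4 together with the zero score.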

My intention for the user call would be the following rather than creating an additional tcrdist_tcrblosum metric:

tcrdist with blosum62

ir.pp.ir_dist(
    mdata,
    metric="tcrdist",
    sequence="aa",
    cutoff=15,
    # base_matrix="blosum62",  # set by default
    # distance_cap=4,  # set by default
)

tcrdist with tcrblosum matrices

ir.pp.ir_dist(
    mdata,
    metric="tcrdist",
    sequence="aa",
    cutoff=15,
    base_matrix="tcrblosum",
    # distance_cap=None,  # set by default
)

@felixpetschko felixpetschko marked this pull request as ready for review May 12, 2026 07:16
@felixpetschko felixpetschko requested a review from grst May 12, 2026 07:16
distance_matrix:
distance lookup matrix
"""
dm = np.zeros((len(alphabet), len(alphabet)), dtype=np.int32)
Collaborator

Does this imply that letters in the alphabet, but not in the matrix have a distance of 0 (e.g. X vs. A)?

Is this what we want? Or maybe rather Inf?

FWIW, there's also variants of BLOSUM62 that include these extended amino acids:

Image

Collaborator Author

@felixpetschko felixpetschko May 13, 2026

Does this imply that letters in the alphabet, but not in the matrix have a distance of 0 (e.g. X vs. A)?

Yes - this behavior comes from the original pwseqdist base implementation. Maybe this was the pragmatic approach to treat unknown residues as "uninformative" rather than as evidence of dissimilarity. One could see capturing true matches as more important than avoiding false matches.

The question is if we want to change this.

FWIW, there's also variants of BLOSUM62 that include these extended amino acids

However, we probably don't have anything like that for the tcrblosum matrices.
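
To make the behavior under discussion concrete: when the lookup table starts out as zeros and only pairs present in the substitution matrix are filled in, a letter without scores (such as X) keeps distance 0 to every other letter. The sketch below shows this with a toy alphabet and a made-up distance, plus one possible alternative using a fixed penalty. The helper `fill_lookup` and its `unknown_distance` parameter are hypothetical and not part of scirpy or pwseqdist.

```python
import numpy as np


def fill_lookup(known_distances, alphabet, unknown_distance=0):
    """Build a distance lookup table over `alphabet`.

    Pairs missing from `known_distances` get `unknown_distance`;
    with the default of 0 this mimics a zero-initialized table.
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    size = len(alphabet)
    dm = np.full((size, size), unknown_distance, dtype=np.int32)
    np.fill_diagonal(dm, 0)  # identical letters always have distance 0
    for (a, b), d in known_distances.items():
        dm[index[a], index[b]] = d
        dm[index[b], index[a]] = d
    return dm


alphabet = "ASX"
known = {("A", "S"): 3}  # made-up value; X has no entry at all

default_dm = fill_lookup(known, alphabet)
penalised_dm = fill_lookup(known, alphabet, unknown_distance=4)

x, a = alphabet.index("X"), alphabet.index("A")
print(default_dm[x, a])    # X vs. A with zero initialization
print(penalised_dm[x, a])  # X vs. A with a fixed penalty
```

Note that an int32 table cannot hold Inf, so using Inf for unknown letters would require a float dtype or a large sentinel value instead.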

Comment thread src/scirpy/ir_dist/metrics.py
@grst
Collaborator

grst commented May 13, 2026

Implementation-wise this looks good, and I also prefer to make the tcrdist implementation generic such that it works with arbitrary matrices rather than introducing another metric.

We just need to make sure to advertise it appropriately in the tutorial, otherwise the functionality is a bit hidden.

@felixpetschko
Collaborator Author

We just need to make sure to advertise it appropriately in the tutorial, otherwise the functionality is a bit hidden.

I made a short tutorial update to mention tcrblosum usage.

Collaborator

@grst grst left a comment

LGTM now!
