augur proximity by jameshadfield · Pull Request #1962 · nextstrain/augur

jameshadfield · 2026-02-15T22:41:52Z

A commonly requested use-case is finding genetically-similar strains for a small target (query) set. We used this extensively in the early years of COVID, and nextstrain/ncov#816 explains the desire to shift this to a generalised augur command.

See comments at the top of the python file for more details about the algorithm, as well as future directions.

This is designed to fit into our regular snakemake workflows as follows:

* `augur filter` creates your query set, and potentially also a contextual set if you know filters which will reduce it (e.g. clade, subtype, temporal range)
* `augur proximity` creates a strains text file
* `augur filter` consumes this strains file to produce the set of proximal sequences
* `augur merge` combines the samples together for analysis
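
As a rough illustration of how data moves through these four steps, here is a toy, in-memory sketch. It does not call augur. The strain names, the file name, and the one-name-per-line layout of the strains file are all assumptions made for the example, and the proximity step itself is replaced by a hard-coded stand-in.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def write_strains(path: Path, strains) -> None:
    """Write one strain name per line (assumed layout of the strains text file)."""
    path.write_text("".join(f"{name}\n" for name in sorted(strains)))


def read_strains(path: Path) -> set:
    """Read strain names back, ignoring blank lines."""
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


# Step 1 stand-in: a query set and a contextual set, as name -> sequence.
query = {"query1": "ACGT"}
context = {"ctx1": "TTTT", "ctx2": "ACGA", "ctx3": "ACGT"}

with TemporaryDirectory() as tmp:
    strains_file = Path(tmp) / "proximal_strains.txt"  # hypothetical file name
    # Step 2 stand-in: pretend the proximity step chose these two strains.
    write_strains(strains_file, ["ctx2", "ctx3"])
    keep = read_strains(strains_file)

# Step 3: keep only the contextual sequences named in the strains file.
proximal = {name: seq for name, seq in context.items() if name in keep}

# Step 4: combine the query set and the proximal set for analysis.
merged = {**query, **proximal}
```

The point of the intermediate strains file is that the proximity step only has to emit names, so the existing filter and merge steps can do the sequence and metadata handling.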

@victorlin how would you see this best fitting into augur subsample?

Performance

I used a query set of either n=1,000 or n=100 H5N1 PB2 sequences from North America since 2020. The context set was n=343,545: all influenza PB2 sequences which nextclade would align to an H5N1 reference under default settings. Sequences were xz-compressed.

This allows us to get proximal sequences in a couple of minutes, which is well suited to all our analysis workflows except ncov. (For ncov we can probably subsample the data to get a suitable set of contextual sequences using pango lineages etc.)

Checklist

Automated checks pass
Check if you need to add a changelog message
Check if you need to add tests
Check if you need to update docs

victorlin

Thanks for your patience! Threads below for 2 large refactoring suggestions and 2 nitpicks.

@victorlin

Extends the subsampling schema to allow proximity sampling using the (new) `augur proximity` command. Currently `augur proximity` sampling runs using a single thread, which is non-optional, but a subsequent commit will change this. The added zika alignment was computed using the default alignment settings from our zika repo. This commit includes contributions from @victorlin as suggested during review, especially <#1962 (comment)>

codecov · 2026-03-11T21:29:52Z

Codecov Report

❌ Patch coverage is 86.30631% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.06%. Comparing base (b6f862f) to head (8d32d1d).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
augur/proximity.py	80.56%	29 Missing and 12 partials ⚠️
augur/subsample.py	89.82%	17 Missing and 18 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1962      +/-   ##
==========================================
+ Coverage   74.51%   75.06%   +0.55%     
==========================================
  Files          82       83       +1     
  Lines        9204     9708     +504     
  Branches     1870     1969      +99     
==========================================
+ Hits         6858     7287     +429     
- Misses       2038     2084      +46     
- Partials      308      337      +29

victorlin

I still haven't looked at the hamming distance code, but this should be it for comments on subsample changes.

Requested in code review <#1962 (comment)> "I find the pull method easier to understand/debug with less shared state and no callbacks"

jameshadfield · 2026-03-16T00:54:52Z

CI failures addressed by #1973

victorlin

Some more subsample-related comments and suggestions

joverlee521

Still catching up, just did a first pass through.

Review suggestion: Debug messages can be simplified with augur.io.print.print_debug <#1962 (comment)>

joverlee521

This looks like it's in a good place to test out in some pathogen workflows. Any issues we encounter there can be added as future changes.

A commonly requested use-case is finding genetically-similar strains for a small target (query) set. We used this extensively in the early years of COVID, and <nextstrain/ncov#816> explains the desire to shift this to a generalised augur command.

This is designed to fit into our regular snakemake workflows as follows:

* `augur filter` creates your query set and potentially also a contextual set if you know filters which will reduce this (e.g. clade, subtype, temporal range etc)
* `augur proximity` creates a strains text file
* `augur filter` consumes this strains file to produce the set of proximal sequences
* `augur merge` combines the samples together for analysis

See comments at the top of the python file for more details about the algorithm, as well as future directions.

This uses a lot of the tricks used by <https://github.com/nextstrain/ncov/blob/master/scripts/get_distance_to_focal_set.py>; however, here our focus is on comparing queries directly to the contextual sequences without needing a reference sequence as a comparator.

An important caveat is that this approach loads all contextual sequences into memory (using numpy arrays, so 1 byte per character). This makes it unsuitable for ncov, but it should work for all our other pathogens. See comments in the file for ways to improve this memory bottleneck.

Claude code helped with this commit.
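
To make the idea of comparing queries directly against contextual sequences concrete, here is a minimal sketch of a Hamming-distance nearest-neighbour search. It is not the code in `augur/proximity.py`. It uses plain `bytes` (also 1 byte per character) rather than numpy so that it runs with only the standard library. Taking the `k` closest context strains per query is an assumed selection rule, and every mismatch counts as a difference, which ignores whatever handling of gaps and ambiguous bases the real command has. The alignment length used in the memory estimate is likewise an illustrative assumption.

```python
from heapq import nsmallest


def hamming(a: bytes, b: bytes) -> int:
    """Count the positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_context(queries: dict, context: dict, k: int) -> set:
    """Union, over all queries, of each query's k closest context strains."""
    selected = set()
    for q_seq in queries.values():
        closest = nsmallest(k, context, key=lambda name: hamming(q_seq, context[name]))
        selected.update(closest)
    return selected


def context_bytes(n_sequences: int, alignment_length: int) -> int:
    """Memory needed to hold the context set at 1 byte per character."""
    return n_sequences * alignment_length


queries = {"q1": b"ACGTACGT"}
context = {
    "c1": b"ACGTACGA",  # 1 difference from q1
    "c2": b"ACGTTTTT",  # 3 differences
    "c3": b"ACGTACGT",  # identical
    "c4": b"TTTTTTTT",  # 6 differences
}
proximal = nearest_context(queries, context, k=2)  # {"c1", "c3"}

# Illustrative only: 343,545 context sequences at an assumed aligned length of 2,341.
estimate = context_bytes(343_545, 2_341)  # 804,238,845 bytes, roughly 0.8 GB
```

The cost of this brute-force approach grows with (number of queries) x (number of context sequences) x (alignment length), which is why a small query set against a large in-memory context is the natural fit.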

This expands the concept of subsampling to allow a sample to use another sample as its inputs. Internally this necessitates a DAG for the samples and thus a more complex invocation of parallelism. We add two extra config keys, 'drop_sample' and 'target_sample', which are different from the other config parameters in that they don't map to `augur filter` arguments directly.

The usefulness of hierarchical sampling as implemented here is debatable, and while there are examples (e.g. RSV) it's probably not worth the added complexity to `augur subsample`. However, the next commit will add proximal subsampling and that needs this functionality, so it makes sense to first implement it for "normal" subsampling.

Claude Opus 4.6 used for lots of the code here, but I refactored / commented / changed / added code throughout.
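
For readers unfamiliar with why sample-to-sample inputs imply a DAG, the following sketch orders samples so that each one runs after the sample it reads from. It is an illustration using the standard-library `graphlib`, not the implementation in `augur/subsample.py`. The `depends_on` key and the sample names are hypothetical and are not the `drop_sample` / `target_sample` keys mentioned above, whose exact semantics are not described here.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(samples: dict) -> list:
    """Return sample names ordered so that dependencies come first."""
    graph = {}
    for name, options in samples.items():
        dependency = options.get("depends_on")  # hypothetical key
        if dependency is not None and dependency not in samples:
            raise ValueError(f"{name!r} depends on unknown sample {dependency!r}")
        graph[name] = {dependency} if dependency is not None else set()
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError(f"samples form a cycle: {err.args[1]}") from err


samples = {
    "regional_recent": {"depends_on": "regional"},
    "regional": {},
    "focal": {},
    "near_focal": {"depends_on": "focal"},
}
order = run_order(samples)
```

Samples with no dependency between them (here `regional` and `focal`) may appear in either relative order, which is exactly the freedom a parallel runner can exploit.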

@victorlin

Extends the subsampling schema to allow proximity sampling using the (new) `augur proximity` command. Currently `augur proximity` sampling runs using a single thread, which is non-optional, but a subsequent commit will change this. The added zika alignment was computed using the default alignment settings from our zika repo. This commit includes contributions from @victorlin as suggested during review, especially <#1962 (comment)>

Proximity is very parallelisable, and it's likely that proximity sampling steps will be the most computationally expensive part of subsampling schemes. Thus we want to run with multiple threads. To do so requires a more complex design for our concurrency model as we can no longer simply add jobs to the thread pool and let it manage when they actually run. We add a second layer of manual resource (thread) management so that we can run samples with varying resource (thread) requirements.
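
One way to picture this second layer of thread management is a scheduler loop that only starts a sample when its dependencies have finished and enough of the thread budget is free. The sketch below is a simplified illustration under those assumptions, not the scheduler in `augur/subsample.py`. The job tuple layout, function name, and per-sample thread counts are made up for the example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_samples(jobs: dict, total_threads: int) -> list:
    """Run jobs given as name -> (threads_needed, dependency_names, callable).

    Returns the names in completion order.
    """
    for name, (threads, _deps, _fn) in jobs.items():
        if threads > total_threads:
            raise ValueError(f"{name} needs {threads} threads, only {total_threads} available")

    pending = dict(jobs)
    running = {}       # future -> (name, threads claimed)
    completed = set()
    order = []
    free = total_threads

    # Each job occupies one pool worker; the claimed budget models the threads
    # the job would use internally (e.g. a multi-threaded subprocess).
    with ThreadPoolExecutor(max_workers=total_threads) as pool:
        while pending or running:
            # Start every pending job that is ready and fits in the free budget.
            for name in list(pending):
                threads, deps, fn = pending[name]
                if deps <= completed and threads <= free:
                    free -= threads
                    running[pool.submit(fn)] = (name, threads)
                    del pending[name]
            if not running:
                raise RuntimeError(f"cannot schedule: {sorted(pending)}")
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                name, threads = running.pop(future)
                future.result()  # re-raise any exception from the job
                free += threads
                completed.add(name)
                order.append(name)
    return order


log = []
jobs = {
    "focal": (1, set(), lambda: log.append("focal")),
    "context": (1, set(), lambda: log.append("context")),
    "proximal": (2, {"focal", "context"}, lambda: log.append("proximal")),
}
order = run_samples(jobs, total_threads=2)
```

Here `focal` and `context` can run side by side on one thread each, while `proximal` waits until both have finished and the full two-thread budget is free again.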

Requested in code review <#1962 (comment)> "I find the pull method easier to understand/debug with less shared state and no callbacks"

Review suggestion: Debug messages can be simplified with augur.io.print.print_debug <#1962 (comment)>

so that workflows can know whether aligned sequences must be provided. This is to let us prototype workflows where we conditionally align inputs based on the contents of the (customisable) subsampling configs.
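
A helper of this kind could look roughly like the sketch below, which inspects a subsampling config and reports whether any sample would need aligned input. This is a guess at the shape of such a function, not the helper added in this PR. The function name, the `samples` layout, and the `proximity_to` key are hypothetical, and the assumption that only proximity sampling requires an alignment is an inference from the surrounding discussion.

```python
PROXIMITY_KEY = "proximity_to"  # hypothetical config key marking a proximity sample


def needs_aligned_sequences(config: dict) -> bool:
    """Return True if any sample in the config uses proximity sampling."""
    samples = config.get("samples", {})
    return any(PROXIMITY_KEY in options for options in samples.values())


with_proximity = {
    "samples": {
        "focal": {"max_sequences": 100},
        "near_focal": {PROXIMITY_KEY: "focal"},
    }
}
without_proximity = {"samples": {"regional": {"max_sequences": 500}}}

align_first = needs_aligned_sequences(with_proximity)      # True
skip_alignment = needs_aligned_sequences(without_proximity)  # False
```

A workflow could branch on the returned boolean to decide whether to include an alignment rule before subsampling.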

victorlin reviewed Feb 17, 2026

Comment thread augur/proximity.py Outdated

victorlin mentioned this pull request Feb 25, 2026

subsample: Add helper function for setting --nthreads #1963

Open

5 tasks

jameshadfield force-pushed the james/proximity-sampling branch 3 times, most recently from d9cf947 to 443aa7d Compare February 26, 2026 03:21

victorlin reviewed Mar 10, 2026

Comment thread augur/subsample.py Outdated

Comment thread augur/subsample.py Outdated

Comment thread augur/subsample.py Outdated

Comment thread augur/subsample.py Outdated

jameshadfield force-pushed the james/proximity-sampling branch from ef3f36d to 0de072d Compare March 11, 2026 03:18

jameshadfield force-pushed the james/proximity-sampling branch from 0de072d to 32b0112 Compare March 11, 2026 21:10

victorlin reviewed Mar 12, 2026

Comment thread augur/proximity.py Outdated

victorlin reviewed Mar 13, 2026

Comment thread augur/subsample.py Outdated

victorlin reviewed Mar 13, 2026

Comment thread augur/subsample.py Outdated

Comment thread augur/subsample.py Outdated

victorlin reviewed Mar 13, 2026

Comment thread augur/subsample.py Outdated

jameshadfield force-pushed the james/proximity-sampling branch from 32b0112 to 0838475 Compare March 15, 2026 23:25

jameshadfield pushed a commit that referenced this pull request Mar 15, 2026

Run samples with scheduler loop

f6d1b4a

Requested in code review <#1962 (comment)> "I find the pull method easier to understand/debug with less shared state and no callbacks"

jameshadfield force-pushed the james/proximity-sampling branch from 0838475 to 6903412 Compare March 15, 2026 23:45

victorlin reviewed Mar 16, 2026

Comment thread devel/regenerate-subsample-schema Outdated

Comment thread augur/proximity.py

Comment thread augur/subsample.py Outdated

Comment thread augur/subsample.py Outdated

Comment thread augur/subsample.py Outdated

joverlee521 reviewed Mar 17, 2026

Comment thread docs/usage/cli/proximity.rst Outdated

Comment thread docs/usage/cli/proximity.rst

jameshadfield pushed a commit that referenced this pull request Mar 22, 2026

Run samples with scheduler loop

686245d

Requested in code review <#1962 (comment)> "I find the pull method easier to understand/debug with less shared state and no callbacks"

jameshadfield added a commit that referenced this pull request Mar 22, 2026

[subsample] use print_debug

a84a3c5

Review suggestion: Debug messages can be simplified with augur.io.print.print_debug <#1962 (comment)>

jameshadfield force-pushed the james/proximity-sampling branch from 6903412 to d79b8ca Compare March 22, 2026 22:57

joverlee521 approved these changes Mar 25, 2026

Comment thread docs/usage/cli/proximity.rst

jameshadfield added 3 commits April 14, 2026 20:32

jameshadfield and others added 5 commits April 14, 2026 20:32

Run samples with scheduler loop

fa98eeb

Requested in code review <#1962 (comment)> "I find the pull method easier to understand/debug with less shared state and no callbacks"

[subsample] use print_debug

eed2d44

Review suggestion: Debug messages can be simplified with augur.io.print.print_debug <#1962 (comment)>

[subsample] snakemake helper func

d8098b2

so that workflows can know whether aligned sequences must be provided. This is to let us prototype workflows where we conditionally align inputs based on the contents of the (customisable) subsampling configs.

Docs

8d32d1d

jameshadfield force-pushed the james/proximity-sampling branch from d79b8ca to 8d32d1d Compare April 14, 2026 08:35

jameshadfield mentioned this pull request Apr 14, 2026

Augur proximity follow-ups #1985

Open

jameshadfield merged commit 5f616ee into master Apr 14, 2026
35 checks passed

jameshadfield deleted the james/proximity-sampling branch April 14, 2026 09:01

jameshadfield mentioned this pull request Apr 14, 2026

allow protein analyses #1958

Merged

4 tasks

joverlee521 mentioned this pull request Apr 16, 2026

phylo: Implement proximity subsampling nextstrain/measles#121

Open

3 participants
