augur proximity#1962

Merged
jameshadfield merged 8 commits into master from james/proximity-sampling
Apr 14, 2026

@jameshadfield
Member

A commonly requested use-case is finding genetically-similar strains for a small target (query) set. We used this extensively in the early years of COVID, and nextstrain/ncov#816 explains the desire to shift this to a generalised augur command.

See comments at the top of the python file for more details about the algorithm, as well as future directions.

This is designed to fit into our regular snakemake workflows as follows:

  • augur filter creates your query set and, optionally, a contextual set, if you know filters that will reduce it (e.g. clade, subtype, temporal range)
  • augur proximity creates a strains text file
  • augur filter consumes this strains file to produce the set of proximal sequences
  • augur merge combines the samples together for analysis

@victorlin how would you see this best fitting into augur subsample?

Performance

I used a query set of either n=1,000 or n=100 H5N1 PB2 sequences from North America since 2020. The context set was n=343,545, comprising all influenza PB2 sequences that nextclade would align to an H5N1 reference under default settings. Sequences were xz-compressed.

[Figure: runtimes]

This allows us to get proximal sequences in a couple of minutes, which is well suited to all our analysis workflows except ncov. (For ncov we can probably subsample the data to get a suitable set of contextual sequences using pango lineages etc.)

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

Comment thread augur/proximity.py Outdated
@jameshadfield jameshadfield force-pushed the james/proximity-sampling branch 3 times, most recently from d9cf947 to 443aa7d Compare February 26, 2026 03:21
Member

@victorlin victorlin left a comment

Thanks for your patience! Threads below for 2 large refactoring suggestions and 2 nitpicks.

Comment thread augur/subsample.py Outdated
Comment thread augur/subsample.py Outdated
Comment thread augur/subsample.py Outdated
Comment thread augur/subsample.py Outdated
jameshadfield added a commit that referenced this pull request Mar 11, 2026
Extends the subsampling schema to allow proximity sampling using the
(new) `augur proximity` command. Currently `augur proximity` sampling
runs using a single thread, which is non-optional, but a subsequent
commit will change this.

The added zika alignment was computed using the default alignment settings
from our zika repo

This commit includes contributions from @victorlin as suggested during review,
especially <#1962 (comment)>
@jameshadfield jameshadfield force-pushed the james/proximity-sampling branch from ef3f36d to 0de072d Compare March 11, 2026 03:18
jameshadfield added a commit that referenced this pull request Mar 11, 2026
@jameshadfield jameshadfield force-pushed the james/proximity-sampling branch from 0de072d to 32b0112 Compare March 11, 2026 21:10
@codecov

codecov Bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 86.30631% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.06%. Comparing base (b6f862f) to head (8d32d1d).
⚠️ Report is 9 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| augur/proximity.py | 80.56% | 29 Missing and 12 partials ⚠️ |
| augur/subsample.py | 89.82% | 17 Missing and 18 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1962      +/-   ##
==========================================
+ Coverage   74.51%   75.06%   +0.55%     
==========================================
  Files          82       83       +1     
  Lines        9204     9708     +504     
  Branches     1870     1969      +99     
==========================================
+ Hits         6858     7287     +429     
- Misses       2038     2084      +46     
- Partials      308      337      +29     


Comment thread augur/proximity.py Outdated
Comment thread augur/subsample.py Outdated
Member

@victorlin victorlin left a comment

I still haven't looked at the hamming distance code, but this should be it for comments on subsample changes.

Comment thread augur/subsample.py Outdated
Comment thread augur/subsample.py Outdated
Comment thread augur/subsample.py Outdated
jameshadfield added a commit that referenced this pull request Mar 15, 2026
@jameshadfield jameshadfield force-pushed the james/proximity-sampling branch from 32b0112 to 0838475 Compare March 15, 2026 23:25
jameshadfield pushed a commit that referenced this pull request Mar 15, 2026
Requested in code review <#1962 (comment)>

"I find the pull method easier to understand/debug with less shared state and no callbacks"
@jameshadfield jameshadfield force-pushed the james/proximity-sampling branch from 0838475 to 6903412 Compare March 15, 2026 23:45
@jameshadfield
Member Author

CI failures addressed by #1973

Member

@victorlin victorlin left a comment

Some more subsample-related comments and suggestions

Comment thread devel/regenerate-subsample-schema Outdated
Comment thread augur/proximity.py
Comment thread augur/subsample.py Outdated
Comment thread augur/subsample.py Outdated
Comment thread augur/subsample.py Outdated
Contributor

@joverlee521 joverlee521 left a comment

Still catching up, just did a first pass through.

Comment thread docs/usage/cli/proximity.rst Outdated
Comment thread docs/usage/cli/proximity.rst
jameshadfield added a commit that referenced this pull request Mar 22, 2026
jameshadfield pushed a commit that referenced this pull request Mar 22, 2026
jameshadfield added a commit that referenced this pull request Mar 22, 2026
Review suggestion: Debug messages can be simplified with augur.io.print.print_debug
<#1962 (comment)>
@jameshadfield jameshadfield force-pushed the james/proximity-sampling branch from 6903412 to d79b8ca Compare March 22, 2026 22:57
Contributor

@joverlee521 joverlee521 left a comment

This looks like it's in a good place to test out in some pathogen workflows. Any issues we encounter there can be added as future changes.

Comment thread docs/usage/cli/proximity.rst
A commonly requested use-case is finding genetically-similar strains for
a small target (query) set. We used this extensively in the early years
of COVID, and <nextstrain/ncov#816> explains
the desire to shift this to a generalised augur command.

This is designed to fit into our regular snakemake workflows as follows:
* `augur filter` creates your query set and potentially also a
  contextual set if you know filters which will reduce this (e.g. clade,
  subtype, temporal range etc)
* `augur proximity` creates a strains text file
* `augur filter` consumes this strains file to produce the set of
  proximal sequences
* `augur merge` combines the samples together for analysis

See comments at the top of the python file for more details about the
algorithm, as well as future directions.

This uses a lot of the tricks used by
<https://github.com/nextstrain/ncov/blob/master/scripts/get_distance_to_focal_set.py>;
however, here our focus is on comparing queries directly to the
contextual sequences without needing a reference sequence as a
comparator.
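
The direct query-to-context comparison described above can be pictured with a minimal NumPy sketch. This is not the code in `augur/proximity.py`: the function names (`encode`, `nearest_context`), the toy sequences, and the choice to count every differing character (including `N` and gap characters) as a mismatch are assumptions made purely for illustration.

```python
import numpy as np


def encode(seqs):
    """Stack equal-length aligned sequences into an (n, L) uint8 array (1 byte per character)."""
    return np.array([np.frombuffer(s.upper().encode("ascii"), dtype=np.uint8) for s in seqs])


def nearest_context(query, context, k):
    """For each query row, return the indices of the k context rows with the
    smallest Hamming distance, plus the full (n_query, n_context) distance matrix."""
    # Handling one query at a time keeps the temporary boolean array at (n_context, L).
    dists = np.stack([(context != q).sum(axis=1) for q in query])
    # A stable sort resolves ties in favour of the lowest context index.
    order = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return order, dists


# Hypothetical aligned toy data; no reference sequence is involved.
context_names = ["ctx_a", "ctx_b", "ctx_c"]
context = encode(["ACGTACGT", "ACGTACGA", "TTTTACGT"])
query = encode(["ACGTACGA"])

idx, dists = nearest_context(query, context, k=2)
proximal = sorted({context_names[j] for row in idx for j in row})
print(dists.tolist())  # [[1, 0, 4]]
print(proximal)        # ['ctx_a', 'ctx_b']
```

The resulting list of names corresponds conceptually to the strains text file that a later `augur filter` step would consume.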

An important caveat is that this approach loads all contextual sequences
into memory (using numpy arrays, so 1 byte per character). This makes it
unsuitable for ncov, but it should work for all our other pathogens. See
comments in the file for ways to improve this memory bottleneck.
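
As a rough back-of-the-envelope check on this caveat: at 1 byte per character, holding an alignment in memory needs roughly (number of sequences) × (alignment length) bytes. The sketch below uses the n=343,545 PB2 context set from the PR description. The alignment length of 2,341 nt (the approximate PB2 segment length), the SARS-CoV-2-scale sequence count, and the helper name are all assumptions, and array overhead is ignored.

```python
def alignment_bytes(n_sequences: int, alignment_length: int) -> int:
    """Bytes needed to hold an alignment at 1 byte per character (array overhead ignored)."""
    return n_sequences * alignment_length


# Context set size from the PR description; the alignment length is an assumption.
pb2 = alignment_bytes(343_545, 2_341)
print(f"PB2 context: {pb2 / 1e9:.2f} GB")  # PB2 context: 0.80 GB

# Hypothetical SARS-CoV-2-scale input: the sequence count is illustrative only.
ncov = alignment_bytes(5_000_000, 29_903)
print(f"ncov-scale: {ncov / 1e9:.1f} GB")  # ncov-scale: 149.5 GB
```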

Claude code helped with this commit.
This expands the concept of subsampling to allow a sample to use another
sample as its inputs. Internally this necessitates a DAG for the samples
and thus a more complex invocation of parallelism. We add two extra
config keys, 'drop_sample' and 'target_sample', which are different from
the other config parameters in that they don't map to `augur filter`
arguments directly.
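
A minimal sketch of the run ordering that such a DAG implies, using Python's standard-library `graphlib`. The sample names and the `depends_on` mapping are hypothetical; they do not reflect the actual `augur subsample` config schema, the semantics of 'drop_sample' or 'target_sample', or the internals of this commit.

```python
from graphlib import TopologicalSorter

# Hypothetical samples: each key maps to the set of samples it uses as input.
depends_on = {
    "region": set(),
    "global": set(),
    "region_recent": {"region"},      # consumes the output of "region"
    "context": {"region", "global"},  # needs both upstream samples
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()  # raises graphlib.CycleError if the samples form a cycle

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # samples whose inputs are all complete
    batches.append(ready)               # members of one batch could run in parallel
    sorter.done(*ready)

print(batches)  # [['global', 'region'], ['region_recent', 'context']]
```

Samples within a batch have no dependencies on each other, which is what makes parallel execution possible while still respecting the input relationships.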

The usefulness of hierarchical sampling as implemented here is
debatable, and while there are examples (e.g. RSV) it's probably not
worth the added complexity to `augur subsample`. However the next commit
will add proximal subsampling and that needs this functionality, so it
makes sense to first implement it for "normal" subsampling.

Claude Opus 4.6 used for lots of the code here, but I refactored /
commented / changed / added code throughout.
Extends the subsampling schema to allow proximity sampling using the
(new) `augur proximity` command. Currently `augur proximity` sampling
runs using a single thread, which is non-optional, but a subsequent
commit will change this.

The added zika alignment was computed using the default alignment settings
from our zika repo

This commit includes contributions from @victorlin as suggested during review,
especially <#1962 (comment)>
jameshadfield and others added 5 commits April 14, 2026 20:32
Proximity is very parallelisable, and it's likely that proximity
sampling steps will be the most computationally expensive part of
subsampling schemes. Thus we want to run with multiple threads. To do so
requires a more complex design for our concurrency model as we can no
longer simply add jobs to the thread pool and let it manage when they
actually run. We add a second layer of manual resource (thread)
management so that we can run samples with varying resource (thread)
requirements.
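
One way to picture this second layer of resource management is a scheduling loop that tracks a thread budget itself and only submits a job once enough threads are free. The sketch below is an illustration under assumptions, not the implementation in `augur/subsample.py`: the job names, `run_job`, and `run_with_budget` are hypothetical, and dependencies between samples are left out. It waits for completed futures with `concurrent.futures.wait` instead of using callbacks.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_job(name, seconds):
    """Stand-in for real work (e.g. one sampling step); it just sleeps."""
    time.sleep(seconds)
    return name


def run_with_budget(jobs, max_threads):
    """Run jobs given as (name, threads_needed, seconds) so that the sum of
    threads_needed over running jobs never exceeds max_threads.
    Assumes threads_needed >= 1. Returns job names in completion order."""
    if any(threads > max_threads for _, threads, _ in jobs):
        raise ValueError("a job needs more threads than the budget allows")
    pending = list(jobs)
    running = {}  # future -> number of threads reserved for it
    free = max_threads
    finished = []
    # Each job occupies one pool worker; threads_needed models what it would use internally.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        while pending or running:
            # Start every pending job that fits in the remaining budget.
            for job in list(pending):
                name, threads, seconds = job
                if threads <= free:
                    pending.remove(job)
                    free -= threads
                    running[pool.submit(run_job, name, seconds)] = threads
            # Block until at least one running job completes, then reclaim its threads.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                free += running.pop(future)
                finished.append(future.result())
    return finished


# Hypothetical jobs: one 3-thread job and two single-thread jobs on a 4-thread budget.
jobs = [("proximity", 3, 0.05), ("filter_a", 1, 0.01), ("filter_b", 1, 0.01)]
print(run_with_budget(jobs, max_threads=4))
```

With this budget, "filter_b" has to wait until a thread is released, which is the situation a plain thread pool with fixed-cost jobs cannot express.
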
Requested in code review <#1962 (comment)>

"I find the pull method easier to understand/debug with less shared state and no callbacks"
Review suggestion: Debug messages can be simplified with augur.io.print.print_debug
<#1962 (comment)>
so that workflows can know whether aligned sequences must be provided.
This is to let us prototype workflows where we conditionally align
inputs based on the contents of the (customisable) subsampling configs.
@jameshadfield jameshadfield force-pushed the james/proximity-sampling branch from d79b8ca to 8d32d1d Compare April 14, 2026 08:35
@jameshadfield jameshadfield merged commit 5f616ee into master Apr 14, 2026
35 checks passed
@jameshadfield jameshadfield deleted the james/proximity-sampling branch April 14, 2026 09:01
@jameshadfield jameshadfield mentioned this pull request Apr 14, 2026
4 tasks

3 participants