Conversation
d9cf947 to
443aa7d
Compare
victorlin
left a comment
There was a problem hiding this comment.
Thanks for your patience! Threads below for 2 large refactoring suggestions and 2 nitpicks.
Extends the subsampling schema to allow proximity sampling using the (new) `augur proximity` command. Currently `augur proximity` sampling runs using a single thread, which is non-optional, but a subsequent commit will change this. The added zika alignment was computed using the default alignment settings from our zika repo This commit includes contributions from @victorlin as suggested during review, especially <#1962 (comment)>
ef3f36d to
0de072d
Compare
Extends the subsampling schema to allow proximity sampling using the (new) `augur proximity` command. Currently `augur proximity` sampling runs using a single thread, which is non-optional, but a subsequent commit will change this. The added zika alignment was computed using the default alignment settings from our zika repo This commit includes contributions from @victorlin as suggested during review, especially <#1962 (comment)>
0de072d to
32b0112
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1962 +/- ##
==========================================
+ Coverage 74.51% 75.06% +0.55%
==========================================
Files 82 83 +1
Lines 9204 9708 +504
Branches 1870 1969 +99
==========================================
+ Hits 6858 7287 +429
- Misses 2038 2084 +46
- Partials 308 337 +29 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
victorlin
left a comment
There was a problem hiding this comment.
I still haven't looked at the hamming distance code, but this should be it for comments on subsample changes.
Extends the subsampling schema to allow proximity sampling using the (new) `augur proximity` command. Currently `augur proximity` sampling runs using a single thread, which is non-optional, but a subsequent commit will change this. The added zika alignment was computed using the default alignment settings from our zika repo This commit includes contributions from @victorlin as suggested during review, especially <#1962 (comment)>
32b0112 to
0838475
Compare
Requested in code review <#1962 (comment)> "I find the pull method easier to understand/debug with less shared state and no callbacks"
0838475 to
6903412
Compare
|
CI failures addressed by #1973 |
victorlin
left a comment
There was a problem hiding this comment.
Some more subsample-related comments and suggestions
joverlee521
left a comment
There was a problem hiding this comment.
Still catching up, just did a first pass through.
Extends the subsampling schema to allow proximity sampling using the (new) `augur proximity` command. Currently `augur proximity` sampling runs using a single thread, which is non-optional, but a subsequent commit will change this. The added zika alignment was computed using the default alignment settings from our zika repo This commit includes contributions from @victorlin as suggested during review, especially <#1962 (comment)>
Requested in code review <#1962 (comment)> "I find the pull method easier to understand/debug with less shared state and no callbacks"
Review suggestion: Debug messages can be simplified with augur.io.print.print_debug <#1962 (comment)>
6903412 to
d79b8ca
Compare
joverlee521
left a comment
There was a problem hiding this comment.
This looks like it's in a good place to test out in some pathogen workflows. Any issues we encounter there can be added as future changes.
A commonly requested use-case is finding genetically-similar strains for a small target (query) set. We used this extensively in the early years of COVID, and <nextstrain/ncov#816> explains the desire to shift this to a generalised augur command. This is designed to fit into our regular snakemake workflows as follows: * `augur filter` creates your query set and potentially also a contextual set if you know filters which will reduce this (e.g. clade, subtype, temporal range etc) * `augur proximity` creates a strains text file * `augur filter` consumes this strains file to produce the set of proximal sequences * `augur merge` combines the samples together for analysis See comments at the top of the python file for more details about the algorithm, as well as future directions. This uses a lot of the tricks used by <https://github.com/nextstrain/ncov/blob/master/scripts/get_distance_to_focal_set.py> howerver here our focus is on comparing queries directly to the contextual sequences without needing a reference sequence as a comparitor. An important caveat is that this approach loads all contextual sequences into memory (using numpy arrays, so 1 byte per character). This makes it unsuitable for ncov, but it should work for all our other pathogens. See comments in the file for ways to improve this memory bottlenech. Claude code helped with this commit.
This expands the concept of subsampling to allow a sample to use another sample as its inputs. Internally this necessitates a DAG for the samples and this a more complex invocation of parallalism. We add two extra config keys, 'drop_sample' and 'target_sample', which are different from the other config parameters in that they don't map to `augur filter` arguments directly. The usefulness of hierarchical sampling as implemented here is debateable, and while there are examples (e.g. RSV) it's probably not worth the added complexity to `augur subsample`. However the next commit will add proximal subsampling and that needs this functionality, so it makes sense to first implement it for "normal" subsampling. Claude Opus 4.6 used for lots of the code here, but I refactored / commented / changed / added code throughout.
Extends the subsampling schema to allow proximity sampling using the (new) `augur proximity` command. Currently `augur proximity` sampling runs using a single thread, which is non-optional, but a subsequent commit will change this. The added zika alignment was computed using the default alignment settings from our zika repo This commit includes contributions from @victorlin as suggested during review, especially <#1962 (comment)>
Proximity is very parallalisable, and it's likely that proximity sampling steps will be the most computationally expensive part of subsampling schemes. Thus we want to run with multiple threads. To do so requires a more complex design for our concurrency model as we can no longer simply add jobs to the thread pool and let it manage when they actually run. We add a second layer of manual resource (thread) management so that we can run samples with varying resource (thread) requirements.
Requested in code review <#1962 (comment)> "I find the pull method easier to understand/debug with less shared state and no callbacks"
Review suggestion: Debug messages can be simplified with augur.io.print.print_debug <#1962 (comment)>
so that workflows can know whether aligned sequences must be provided. This is to let us prototype workflows where we conditionally align inputs based on the contents of the (customisable) subsampling configs.
d79b8ca to
8d32d1d
Compare
A commonly requested use-case is finding genetically-similar strains for a small target (query) set. We used this extensively in the early years of COVID, and nextstrain/ncov#816 explains the desire to shift this to a generalised augur command.
See comments at the top of the python file for more details about the algorithm, as well as future directions.
This is designed to fit into our regular snakemake workflows as follows:
augur filtercreates your query set and potentially also a contextual set if you know filters which will reduce this (e.g. clade, subtype, temporal range etc)augur proximitycreates a strains text fileaugur filterconsumes this strains file to produce the set of proximal sequencesaugur mergecombines the samples together for analysis@victorlin how would you see this best fitting into
augur subsample?Performance
I used a query set of either n=1,000 or n=100 H5N1 PB2 sequences from North America since 2020. The context set was n=343,545, which was all influenza PB2 sequences which nextclade would align to a H5N1 reference under default settings. Sequences were xz compressed.
This allows us to get proximal sequences in a couple of minutes which is well suited for all our analysis workflows except for ncov. (For ncov we can probably subsample the data to get a suitable set of contextual sequences using pango lineages etc.)
Checklist