Threading the VCF annotator #534
👍 this is great. I think I'd like to see the output sorted, although it might be worth checking if it's faster/easier to just run the output through |
@david4096 Hey! Good to hear from you. We did some integration a while ago and noticed the same thing: threading helps. We added threading to our wrapper. I'm curious what parameters (# of threads etc.) you used and how much it helped? https://docs.google.com/presentation/d/1YUTGW3CaXimUE44aMEe9DpP1qN_mESoL7guqDXTweYo/edit#slide=id.p
Hi @bwalsh !! I put in a cursory benchmark in the above issue. The code here uses your CPU count, which for me was 12. It took a little less than half the time (which I'm sure could be improved upon).
    if output_vcf_path and vcf_out:
        for k in additional_info_fields:
            record.info[k.value] = [
                value or k.default_value() for value in vrs_field_data[k.value]
This part I wasn't sure about
Closing for now -- we are investigating a few different VRS-Python-wide performance speedups but have bookmarked this in #266
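A modest performance improvement, at the cost of output ordering: the annotated records are no longer written in sorted order. This could be improved to keep the output sorted by processing the input in chunks.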
Includes commits from #533
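For reference, the chunking idea mentioned above could look roughly like the sketch below. This is not the code in this PR: `chunked`, `annotate_in_order`, the `annotate` callable, and `chunk_size` are hypothetical names used only for illustration. It relies on the standard-library behaviour that `Executor.map` returns results in submission order, so each chunk comes back in input order even when threads finish out of order. The worker count defaults to the CPU count, as in the comment above.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk


def annotate_in_order(records, annotate, chunk_size=1000, max_workers=None):
    """Apply `annotate` to each record in parallel, yielding results in input order.

    `annotate` is a hypothetical stand-in for the per-record annotation step.
    """
    max_workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in chunked(records, chunk_size):
            # Executor.map preserves submission order, so the chunk's
            # results come back in the same order as its inputs.
            yield from pool.map(annotate, chunk)
```

Only one chunk is held in memory at a time, so memory use is bounded by `chunk_size`. The trade-off is that workers can sit idle at the end of each chunk while the slowest record finishes, so a very small `chunk_size` would likely give back some of the speedup.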
performance on small VCF
OLD method
NEW method
Picking out a line to compare that things look the same
performance on larger VCF
NEW method
OLD method