Threading the VCF annotator by david4096 · Pull Request #534 · ga4gh/vrs-python

david4096 · 2025-03-31T21:09:39Z

A modest performance improvement, at the cost of the output no longer being written in sorted order. This could be improved to preserve the sort order by processing records in chunks.
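
As a rough illustration of the chunking idea (not the code in this PR), the sketch below annotates one chunk of records at a time with a thread pool and yields results in input order. The names `chunked`, `annotate_in_order`, `annotate`, and `chunk_size` are hypothetical; `annotate` stands in for whatever per-record work the annotator does.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def annotate_in_order(
    records: Iterable[T],
    annotate: Callable[[T], R],  # hypothetical per-record annotation function
    chunk_size: int = 1000,
    max_workers: Optional[int] = None,
) -> Iterator[R]:
    """Annotate records in parallel, yielding results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in chunked(records, chunk_size):
            # Executor.map returns results in the order of its inputs,
            # regardless of which worker finishes first.
            yield from pool.map(annotate, chunk)
```

Only one chunk is held in memory at a time, so memory use is bounded by `chunk_size`. The trade-off is that each chunk acts as a barrier: the pool drains before the next chunk starts.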

Includes commits from #533

performance on small VCF

OLD method

(env2) ➜  vrs-python git:(feature/annotator) ✗ time vrs-annotate vcf --vcf-out test-old.vcf NA12878-chr14-AKT1.vcf.gz
Annotating NA12878-chr14-AKT1.vcf.gz with the VCF Annotator...
VCF Annotator finished in 10.25358 seconds
vrs-annotate vcf --vcf-out test-old.vcf NA12878-chr14-AKT1.vcf.gz  7.88s user 2.59s system 98% cpu 10.613 total

NEW method

(env3) ➜  vrs-python git:(feature/thread-ann) ✗ time vrs-annotate vcf --vcf-out test-new.vcf NA12878-chr14-AKT1.vcf.gz
Annotating NA12878-chr14-AKT1.vcf.gz with the VCF Annotator...
VCF Annotator finished in 3.57795 seconds
vrs-annotate vcf --vcf-out test-new.vcf NA12878-chr14-AKT1.vcf.gz  10.65s user 8.09s system 488% cpu 3.835 total

Picking out one line from each output to check that the annotations look the same

< chr14	106779713	.	G	A	50	PASS	AC=1;AF=0.5;AN=2;DP=34;FS=3.133;MQ=250;MQRankSum=4.697;QD=1.47;ReadPosRankSum=2.451;SOR=0.313;FractionInformativeReads=0.971;R2_5P_bias=13.461;VRS_Allele_IDs=ga4gh:VA.R0Y_drBrtNKY97AgFMoOY4XN5SQHOKg2,ga4gh:VA.vpg7ue7_gkI1EL39jv88tdC9V35WqclM	GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB	0/1:12,21:0.636:33:7,9:5,12:46:85,0,45:50,0.00011882,47.602:0,34.77,37.77:8,4,11,10:7,5,11,10
> chr14	106779713	.	G	A	50	PASS	AC=1;AF=0.5;AN=2;DP=34;FS=3.133;MQ=250;MQRankSum=4.697;QD=1.47;ReadPosRankSum=2.451;SOR=0.313;FractionInformativeReads=0.971;R2_5P_bias=13.461;VRS_Allele_IDs=ga4gh:VA.R0Y_drBrtNKY97AgFMoOY4XN5SQHOKg2,ga4gh:VA.vpg7ue7_gkI1EL39jv88tdC9V35WqclM	GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB	0/1:12,21:0.636:33:7,9:5,12:46:85,0,45:50,0.00011882,47.602:0,34.77,37.77:8,4,11,10:7,5,11,10
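
Because the threaded output is not guaranteed to be in the same order, a line-by-line diff of the whole file can report differences that are only reorderings. A minimal sketch of an order-insensitive comparison follows; the helper names are hypothetical, and it assumes that comparing raw data lines (ignoring `#` header lines) is a sufficient check.

```python
import gzip
from collections import Counter
from typing import Iterable


def open_text(path: str):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def record_counts(lines: Iterable[str]) -> Counter:
    """Count data lines, skipping '#' header lines."""
    return Counter(line.rstrip("\n") for line in lines if not line.startswith("#"))


def same_records(path_a: str, path_b: str) -> bool:
    """True if both files hold the same data lines, in any order."""
    with open_text(path_a) as a, open_text(path_b) as b:
        return record_counts(a) == record_counts(b)
```

With the files from the runs above, this would be called as `same_records("test-old.vcf", "test-new.vcf")`. A `Counter` is used instead of a set so that duplicated records are also compared.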

performance on larger VCF

NEW method

(env3) ➜  vrs-python git:(feature/thread-ann) ✗ time vrs-annotate vcf --vcf-out test-new2.vcf data/ALL.chrX.BI_Beagle.20100804.sites.vcf.gz 
Annotating data/ALL.chrX.BI_Beagle.20100804.sites.vcf.gz with the VCF Annotator...
[W::bcf_hdr_check_sanity] PL should be declared as Number=G
VCF Annotator finished in 213.16505 seconds
vrs-annotate vcf --vcf-out test-new2.vcf   344.08s user 242.52s system 274% cpu 3:33.58 total

OLD method

(env2) ➜  vrs-python git:(feature/annotator) ✗ time vrs-annotate vcf --vcf-out test-old2.vcf data/ALL.chrX.BI_Beagle.20100804.sites.vcf.gz

Annotating data/ALL.chrX.BI_Beagle.20100804.sites.vcf.gz with the VCF Annotator...
[W::bcf_hdr_check_sanity] PL should be declared as Number=G
VCF Annotator finished in 365.42381 seconds
vrs-annotate vcf --vcf-out test-old2.vcf   285.39s user 77.96s system 99% cpu 6:05.75 total
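
For reference, the wall-clock totals reported by `time` above work out to roughly a 2.8x speedup on the small VCF and 1.7x on the larger one. The short sketch below does that arithmetic; `to_seconds` is a hypothetical helper for the `M:SS.ss` format that the shell prints.

```python
def to_seconds(wall: str) -> float:
    """Convert `time` totals such as '10.613' or '6:05.75' to seconds."""
    seconds = 0.0
    for part in wall.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# (old total, new total) as printed in the runs above
runs = {
    "small VCF": ("10.613", "3.835"),
    "larger VCF": ("6:05.75", "3:33.58"),
}

for name, (old, new) in runs.items():
    speedup = to_seconds(old) / to_seconds(new)
    print(f"{name}: {speedup:.2f}x faster")
```

Note that user plus system CPU time goes up in both threaded runs (10.47s to 18.74s on the small file, 363.35s to 586.60s on the larger one), so the gain comes from doing the work in parallel, not from doing less of it.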

jsstevenson · 2025-04-01T14:31:02Z

👍 this is great. I think I'd like to see the output sorted, although it might be worth checking if it's faster/easier to just run the output through bcftools sort.
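
A sketch of the post-sort option, assuming `bcftools` is installed and on `PATH`. The wrapper functions and the output filename in the usage note are hypothetical; `-O z` and `-o` are the standard `bcftools sort` options for output type and output path.

```python
import shutil
import subprocess
from typing import List


def bcftools_sort_command(unsorted_vcf: str, sorted_vcf: str) -> List[str]:
    """Build a `bcftools sort` command that writes compressed VCF."""
    # -O z: compressed VCF output; -o: output path.
    return ["bcftools", "sort", "-O", "z", "-o", sorted_vcf, unsorted_vcf]


def sort_vcf(unsorted_vcf: str, sorted_vcf: str) -> None:
    """Run bcftools sort, failing loudly if it is missing or errors."""
    if shutil.which("bcftools") is None:
        raise RuntimeError("bcftools not found on PATH")
    subprocess.run(bcftools_sort_command(unsorted_vcf, sorted_vcf), check=True)
```

For example, `sort_vcf("test-new.vcf", "test-new.sorted.vcf.gz")`. Timing this step alongside the annotator would show whether sorting afterwards is cheaper than keeping order inside the threaded code.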

bwalsh · 2025-04-01T16:43:11Z

@david4096 Hey! Good to hear from you.

We did some integration a while ago and noticed the same thing - threading helps. We added threading to our wrapper. I'm curious what parameters (# of threads etc) you used and how much it helped?

https://docs.google.com/presentation/d/1YUTGW3CaXimUE44aMEe9DpP1qN_mESoL7guqDXTweYo/edit#slide=id.p
https://docs.google.com/presentation/d/1hk-c2T2w6X2sh5Dlwqzxi9kjR-c91R2_snyh1q0XCLI/edit#slide=id.p

david4096 · 2025-04-01T18:28:29Z

Hi @bwalsh !! I put in a cursory benchmark in the above issue. The code here uses your CPU count, which for me was 12. It took a little less than half the time (which I'm sure could be improved upon).
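
One way to size a thread pool from the machine's CPU count is sketched below. This is an assumption about the general approach, not a copy of the PR's code; `default_workers` is a hypothetical helper.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_workers() -> int:
    """Use one worker per CPU reported by the OS, with a floor of 1."""
    # os.cpu_count() may return None when the count cannot be determined.
    return os.cpu_count() or 1


with ThreadPoolExecutor(max_workers=default_workers()) as pool:
    squares = list(pool.map(lambda n: n * n, range(5)))
```

Exposing the worker count as a parameter would make it easier to benchmark different settings, since the `time` output above shows well under 12 cores' worth of CPU utilization.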

david4096 · 2025-04-02T18:17:25Z

            if output_vcf_path and vcf_out:
-                for k in additional_info_fields:
-                    record.info[k.value] = [
-                        value or k.default_value() for value in vrs_field_data[k.value]


This part I wasn't sure about
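
For context, the lines marked for removal above use the `value or default` idiom to fill in a per-field default wherever a value is falsy. The self-contained sketch below shows that pattern in isolation. The `Field` enum, its member names, and the `"."` default are hypothetical stand-ins, not the library's actual definitions.

```python
from enum import Enum
from typing import Dict, List, Optional


class Field(Enum):
    """Hypothetical stand-in for the annotator's INFO field names."""

    START = "EX_Start"
    STATE = "EX_State"

    def default_value(self) -> str:
        # Assumed placeholder; "." is the VCF missing-value marker.
        return "."


def fill_defaults(field_data: Dict[str, List[Optional[object]]]) -> Dict[str, list]:
    """Replace falsy entries with each field's default, as `value or default` does."""
    return {
        field.value: [value or field.default_value() for value in field_data[field.value]]
        for field in Field
    }


data = {"EX_Start": [100, None, 0], "EX_State": ["A", "", None]}
filled = fill_defaults(data)
```

One property worth keeping in mind when deciding whether the lines are needed: `or` replaces every falsy value, so `0` and `""` are swapped for the default as well as `None`.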

bwalsh · Apr 2, 2025

@quinnwai FYI - can you take a look?

jsstevenson · 2026-02-06T18:17:38Z

Closing for now -- we are investigating a few different VRS-Python-wide performance speedups but have bookmarked this in #266

david4096 added 3 commits March 31, 2025 15:06

Add pysam to extras

db19edf

Add setuptools

08d455e

Thread the VCF annotator

322a330

jsstevenson mentioned this pull request Feb 6, 2026

Improve performance for VCFAnnotator #266 (Open)

jsstevenson closed this Feb 6, 2026

david4096 wants to merge 3 commits into ga4gh:main from david4096:feature/thread-ann
