I did a test run on 100kb of vcf with 98 genomes, using -polar 0.9. I scanned the tree file (actually codex did) for the counts of 0's and 1's at each snp and compared them to the biallelic vcf, expecting either identical or complementary counts. However, about 3% of sites showed a discrepancy, for example 162 1's in the vcf and 153 in the tree file. I asked codex to look into this, focusing on pos 5039, a position with 4 1's in the vcf file and 196 1's in the tree file.
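For reference, the per-site check described above can be sketched roughly as follows. This is only an illustration, not the script that was actually run: the function name `classify_site` is hypothetical, and it assumes diploid genomes (98 genomes, so 196 haplotypes) with no missing genotypes.

```python
def classify_site(vcf_ones, tree_ones, num_haplotypes):
    """Compare the count of 1 alleles at one site between the VCF and the tree file.

    Hypothetical helper: assumes a biallelic site and no missing genotypes.
    """
    if tree_ones == vcf_ones:
        return "identical"
    # A flipped polarization swaps 0 and 1, so the counts are complementary.
    if tree_ones == num_haplotypes - vcf_ones:
        return "complementary"
    return "mismatch"


NUM_HAPLOTYPES = 2 * 98  # assumption: 98 diploid genomes

# Counts reported in this post
status_5039 = classify_site(4, 196, NUM_HAPLOTYPES)
status_other = classify_site(162, 153, NUM_HAPLOTYPES)
# Count expected at pos 5039 once the mutations are applied correctly
status_fixed = classify_site(4, 192, NUM_HAPLOTYPES)
print(status_5039, status_other, status_fixed)
```

Under these assumptions, both reported sites fall into the "mismatch" category, while 192 1's at pos 5039 would count as complementary to the vcf.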
Here is the explanation from codex: " - At chr2L:5039 (relative 233), the tree has two top-level mutations (parent = -1):
- root node 2955: derived 1
- descendant node 1943: derived 0
- With both marked top-level, tskit applies them in table order; the root 1 overwrites everything,
yielding all 1’s in genotypes.
- The 0 on node 1943 is a back-mutation and must be a child of the root 1 mutation (its parent should
be the 1-mutation’s ID). Then tskit would produce 192 ones and 4 zeros, matching the VCF. "
So it looks like I can still work with this, but I thought it was worth mentioning.
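To make the quoted explanation concrete, here is a toy model of decoding genotypes by applying a site's mutations in table order. It is a simplified sketch and not tskit's actual code: the tree, the node names, and the `decode_genotypes` helper are all hypothetical, and the tree is shrunk to 6 samples.

```python
def decode_genotypes(num_samples, ancestral, mutations, samples_below):
    """Apply mutations in table order.

    Each (node, derived_state) pair overwrites the state of every
    sample below that node, so later rows win over earlier ones.
    """
    genotypes = [ancestral] * num_samples
    for node, derived in mutations:
        for sample in samples_below[node]:
            genotypes[sample] = derived
    return genotypes


# Hypothetical toy tree: "root" is above all 6 samples,
# "inner" is a descendant node above samples 0 and 1 only.
samples_below = {"root": [0, 1, 2, 3, 4, 5], "inner": [0, 1]}

# Back-mutation listed before the root mutation: the root 1 overwrites it.
root_last = [("inner", 0), ("root", 1)]
# Root mutation first, back-mutation second: the 0 survives below "inner".
root_first = [("root", 1), ("inner", 0)]

overwritten = decode_genotypes(6, 0, root_last, samples_below)
expected = decode_genotypes(6, 0, root_first, samples_below)
print(overwritten)
print(expected)
```

In this toy model the outcome depends only on row order. In tskit's data model a mutation's parent must come before it in the mutation table, so recording the back-mutation as a child of the root mutation also forces the second ordering. If the tables are otherwise valid and sorted, `TableCollection.compute_mutation_parents()` may be a way to fill in the parent column, though I have not tried it on this file.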