Hi all, I'm trying to figure out what non-default breseq settings might give us lower rates of false positive mutations in polymorphism mode, and wondering if there's any guidance for which settings are likely to be the most helpful?
I'm working with short-read sequences of experimentally evolved communities of P. aeruginosa with a few other species, although we're primarily interested in evolution in P. aeruginosa. For references, I'm using a P. aeruginosa genome assembled de novo with Autocycler from long-read sequencing of our ancestor and then annotated with bakta, plus ancestral genomes of the other three species, either taken directly from ATCC or assembled de novo.
We have some mutations that are clear true positives (e.g. are in genes we expect to be under selection, are at very high frequencies, etc). However, I'm seeing a ton of mutations at low to mid frequencies, many more than I might expect, and I'm wondering how best to go about tightening any filters to see if these hits go away and were just false positives.
I looked around for tips and saw that in Tutorial: Populations, it's noted that predicting polymorphisms is very prone to false-positives and "breseq often needs some tuning of parameters and statistical cutoffs depending on characteristics of the input data set in order to not predict either too many (false-positives) or too few (false-negatives) polymorphisms. In addition, it may be necessary to perform more complex analyses of multiple samples or of time courses to gain extra power for discriminating true polymorphisms from errors". However, I couldn't find anything more concrete about which parameters and cutoffs are the best to try.
Basically, I'm wondering which breseq settings are the best ones to vary in order to reduce false positives. I've looked at the arguments listed under Read Alignment, Bowtie2 Mapping/Alignment, and Polymorphism Read Alignment Evidence, but there are so many that it's difficult to know where to start, or which arguments are most likely to reduce false positives without eliminating true positives.
To give a sense of what my data looks like: I tried simply filtering on frequency post hoc, but that hasn't been sufficient because the putative false positives can be at frequencies above 30%, sometimes much higher than the putative true positives. For instance, here's a sample that has been experimentally evolving for just 24 hours (you can ignore the mutations with seq_ids matching 6a49b35b19134f7e_*, AX1_contig_*, PP203295, and e82cc082e40344a9_*, because those are the references for the other species, where we generally have very low coverage):
index.html
You can see that there are a huge number of mutations, often (but not always) low-ish frequency. One of the mutations (in PilS) is almost certainly a true positive (high frequency, in a gene we expect to see mutations):
DEL_410.html
However, there are other mutations, including some at pretty high frequencies, that I'm a little suspicious of (31% frequency for this one):
SNP_497.html
Indeed, breseq calls 6 tightly-clustered mutations in that exact region, relying on many of the same reads:
It seems like many of the putative false positives are tightly clustered together and supported by the same reads, which have many bases mismatched to the reference, but I'm not sure how best to keep those from coming through. Any advice appreciated!
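One post-hoc fallback would be to flag calls that sit in dense clusters so they can be reviewed separately. A minimal sketch, assuming the calls have already been parsed into (seq_id, position) tuples; the function name, the 50 bp window, and the minimum cluster size of 3 are illustrative assumptions, not breseq settings:

```python
from collections import defaultdict


def flag_clustered_calls(calls, window=50, min_cluster=3):
    """Return the set of (seq_id, position) calls that fall in a dense cluster.

    calls       -- iterable of (seq_id, position) tuples (hypothetical input,
                   e.g. parsed from whatever mutation table you export)
    window      -- maximum span in bp for calls to count as one cluster (assumed)
    min_cluster -- minimum number of calls within the window to flag (assumed)
    """
    # Group positions by reference sequence so clusters never span contigs.
    by_seq = defaultdict(list)
    for seq_id, pos in calls:
        by_seq[seq_id].append(pos)

    flagged = set()
    for seq_id, positions in by_seq.items():
        positions.sort()
        start = 0
        for end in range(len(positions)):
            # Slide the left edge until the span is at most `window` bp.
            while positions[end] - positions[start] > window:
                start += 1
            # Flag every call in the window once it is dense enough.
            if end - start + 1 >= min_cluster:
                for p in positions[start:end + 1]:
                    flagged.add((seq_id, p))
    return flagged


example_calls = [
    ("chr", 100), ("chr", 110), ("chr", 120),  # three calls within 20 bp
    ("chr", 5000),                             # isolated call
    ("other", 105),                            # different reference sequence
]
clustered = flag_clustered_calls(example_calls)
```

Flagged calls would still need a manual look rather than being dropped outright, since real mutations can also occur close together.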