-
Notifications
You must be signed in to change notification settings - Fork 20
Description
This is going to be a long post. I have included code details to replicate the problem, at the end.
I'm using sg_trace_striped_16 on two nearly identical strings of length ≈51Kbp. When I follow the cigar string to convert the alignment to the kind of pair of strings you'd see in e.g. UCSC MAF format, I notice a strange thing at the right end of the alignment. It has broken what ought to be a long match run by repeatedly inserting and deleting a single base. Below I show what part of that looks like. The three dots to the left represent many thousand bases that seem to be correctly aligned, with a few mismatches or indels where they should be. The three dots to the right represent many thousand bases that exhibit a similar pattern of delete-one-insert-one-match-8-or-so (I also see it at the end of the alignment, but I'm not sure if it exists throughout in between). Note that each deleted/inserted base is a match, not a mismatch, so there seems to be no mathematical reason to delete/insert.
reference: ...CTCATTTACCTTTCA-GTAAAATTG-TTAAATAAG-CAAAATAA-TTTGAGTT-AAAATTAGAAT-AAAAATTGT-CTTTTATTT-GGA...
query: ...CTCATTTACCTTTCAG-TAAAATTGT-TAAATAAGC-AAAATAATT-TGAGTTAAA-ATTAGAATA-AAAATTGTC-TTTTATTTG-GA...
I've confirmed that the problem is not in the conversion of cigar to text. If I examine the cigar for this alignment, I see
[(2890,'='),(1,'I'),
(12918,'='),(1,'X'),
(5026,'='),(1,'X'),
(2537,'='),(1,'X'),
(17,'='),(1,'X'),
(9395,'='),(1,'D'),(1,'I'),(8,'='),(1,'D'),(1,'I'),(8,'='),(1,'D'),(1,'I'),(7,'=') ...]
Moreover, if I remove that first 2890 bp match and single indel from my sequences and run again (i.e. remove 2891 from the reference and 2890 from the query), I get a cigar that agrees with the rest of that original cigar for a while, then diverges. And it, too, appears to eventually devolves into the delete-one-insert-one-match-8-or-so pattern. Mathematically, I would expect this alignment to be identical to the previous, except for the removal of those first two cigar ops.
[(12918, '='), (1, 'X'), (5026, '='), (1, 'X'), (2537, '='), (1, 'X'), (17, '='), (1, 'X'), (11628, '='), (2, 'D'), (656, '='), (1, 'D'), (2, '='), (1, 'I'), (8, '='), (1, 'D'), (1, 'I'), (8, '='), (1, 'D'), (1, 'I') ...]
I'm using (or at least I think I am using) a simple scoring model rewarding 1 for a match, penalizing 3 for a mismatch, 4 for a gap open and 1 for a gap extend. Aligning these two sequences with the same (I hope) scoring model in lastz yields an alignment covering the entirety of the strings, with 5 mismatches and 5 gapped bases.
The two input sequences are available at https://psu.box.com/s/lhwm1zaag8qt0l0ypu1ksfh4ueus1z7x
Details:
import parasail
subsMatrix = parasail.matrix_create("ACGT",1,-3)
refDB = parasail.sequences_from_file("example.chm13.fa")
qryDB = parasail.sequences_from_file("example.hg38.fa")
# extract the nt sequence from ref and qry here
ref = refDB[0]
qry = qryDB[0]
a = parasail.sg_trace_striped_16(ref.seq,qry.seq,4,1,subsMatrix)
aCigar = list(map(lambda v:(a.cigar.decode_len(v),a.cigar.decode_op(v).decode("utf-8")),a.cigar.seq))
len(aCigar) # 6669
aCigar[:20] # this is the first cigar list I show above
a2 = parasail.sg_trace_striped_16(refNucs[2891:], qryNucs[2890:],4,1,subsMatrix)
s2Cigar = list(map(lambda v:(a2.cigar.decode_len(v),a2.cigar.decode_op(v).decode("utf-8")),a2.cigar.seq))
len(a2Cigar) # 5628 I'd really expect it to just be two shorter than 6669
a2Cigar[:20] # this is the second cigar list I show aboveI also tried parasail.sg_trace_striped_32, thinking maybe some internal scoring was overflowing the 16-bit ints. But I got the same strange result. (I don't know if your implementation is using an overflow-protected recurrence like that in the Suzuki 2017 paper).
Not sure whether I am doing something wrong or not. Perhaps there's a limit to the size of strings allowed? That the pattern seems to repeat at 8 makes me wonder if it has to do with striping.