Skip to content

Repeatedly choosing back-to-back single base indels when a match is expected #59

@rsharris

Description

@rsharris

This is going to be a long post. I have included code details to replicate the problem, at the end.

I'm using sg_trace_striped_16 on two nearly identical strings of length ≈51Kbp. When I follow the cigar string to convert the alignment to the kind of pair of strings you'd see in e.g. UCSC MAF format, I notice a strange thing at the right end of the alignment. It has broken what ought to be a long match run by repeatedly inserting and deleting a single base. Below I show what part of that looks like. The three dots to the left represent many thousand bases that seem to be correctly aligned, with a few mismatches or indels where they should be. The three dots to the right represent many thousand bases that exhibit a similar pattern of delete-one-insert-one-match-8-or-so (I also see it at the end of the alignment, but I'm not sure if it exists throughout in between). Note that each deleted/inserted base is a match, not a mismatch, so there seems to be no mathematical reason to delete/insert.

reference: ...CTCATTTACCTTTCA-GTAAAATTG-TTAAATAAG-CAAAATAA-TTTGAGTT-AAAATTAGAAT-AAAAATTGT-CTTTTATTT-GGA...
query:     ...CTCATTTACCTTTCAG-TAAAATTGT-TAAATAAGC-AAAATAATT-TGAGTTAAA-ATTAGAATA-AAAATTGTC-TTTTATTTG-GA...

I've confirmed that the problem is not in the conversion of cigar to text. If I examine the cigar for this alignment, I see

[(2890,'='),(1,'I'),
 (12918,'='),(1,'X'),
 (5026,'='),(1,'X'),
 (2537,'='),(1,'X'),
 (17,'='),(1,'X'),
 (9395,'='),(1,'D'),(1,'I'),(8,'='),(1,'D'),(1,'I'),(8,'='),(1,'D'),(1,'I'),(7,'=') ...]

Moreover, if I remove that first 2890 bp match and single indel from my sequences and run again (i.e. remove 2891 from the reference and 2890 from the query), I get a cigar that agrees with the rest of that original cigar for a while, then diverges. And it, too, appears to eventually devolves into the delete-one-insert-one-match-8-or-so pattern. Mathematically, I would expect this alignment to be identical to the previous, except for the removal of those first two cigar ops.
[(12918, '='), (1, 'X'), (5026, '='), (1, 'X'), (2537, '='), (1, 'X'), (17, '='), (1, 'X'), (11628, '='), (2, 'D'), (656, '='), (1, 'D'), (2, '='), (1, 'I'), (8, '='), (1, 'D'), (1, 'I'), (8, '='), (1, 'D'), (1, 'I') ...]

I'm using (or at least I think I am using) a simple scoring model rewarding 1 for a match, penalizing 3 for a mismatch, 4 for a gap open and 1 for a gap extend. Aligning these two sequences with the same (I hope) scoring model in lastz yields an alignment covering the entirety of the strings, with 5 mismatches and 5 gapped bases.

The two input sequences are available at https://psu.box.com/s/lhwm1zaag8qt0l0ypu1ksfh4ueus1z7x

Details:

import parasail
subsMatrix = parasail.matrix_create("ACGT",1,-3)
refDB = parasail.sequences_from_file("example.chm13.fa")
qryDB = parasail.sequences_from_file("example.hg38.fa")
# extract the nt sequence from ref and qry here
ref = refDB[0]
qry = qryDB[0]
a = parasail.sg_trace_striped_16(ref.seq,qry.seq,4,1,subsMatrix)
aCigar = list(map(lambda v:(a.cigar.decode_len(v),a.cigar.decode_op(v).decode("utf-8")),a.cigar.seq))
len(aCigar) # 6669
aCigar[:20]  # this is the first cigar list I show above

a2 = parasail.sg_trace_striped_16(refNucs[2891:], qryNucs[2890:],4,1,subsMatrix)
s2Cigar = list(map(lambda v:(a2.cigar.decode_len(v),a2.cigar.decode_op(v).decode("utf-8")),a2.cigar.seq))
len(a2Cigar) # 5628 I'd really expect it to just be two shorter than 6669
a2Cigar[:20]  # this is the second cigar list I show above

I also tried parasail.sg_trace_striped_32, thinking maybe some internal scoring was overflowing the 16-bit ints. But I got the same strange result. (I don't know if your implementation is using an overflow-protected recurrence like that in the Suzuki 2017 paper).

Not sure whether I am doing something wrong or not. Perhaps there's a limit to the size of strings allowed? That the pattern seems to repeat at 8 makes me wonder if it has to do with striping.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions