Repeatedly choosing back-to-back single base indels when a match is expected

This is going to be a long post. I have included code  details to replicate the problem, at the end.

I'm using sg_trace_striped_16 on two nearly identical strings of length ≈51Kbp. When I follow the cigar string to convert the alignment to the kind of pair of strings you'd see in e.g. UCSC MAF format, I notice a strange thing at the right end of the alignment. It has broken what ought to be a long match run by repeatedly inserting and deleting a single base. Below I show what part of that looks like. The three dots to the left represent many thousand bases that seem to be correctly aligned, with a few mismatches or indels where they should be. The three dots to the right represent many thousand bases that exhibit a similar pattern of delete-one-insert-one-match-8-or-so (I also see it at the end of the alignment, but I'm not sure if it exists throughout in between). Note that each deleted/inserted base is a match, not a mismatch, so there seems to be no mathematical reason to delete/insert.
```
reference: ...CTCATTTACCTTTCA-GTAAAATTG-TTAAATAAG-CAAAATAA-TTTGAGTT-AAAATTAGAAT-AAAAATTGT-CTTTTATTT-GGA...
query:     ...CTCATTTACCTTTCAG-TAAAATTGT-TAAATAAGC-AAAATAATT-TGAGTTAAA-ATTAGAATA-AAAATTGTC-TTTTATTTG-GA...
```

I've confirmed that the problem is _not_ in the conversion of cigar to text. If I examine the cigar for this alignment, I see
```
[(2890,'='),(1,'I'),
 (12918,'='),(1,'X'),
 (5026,'='),(1,'X'),
 (2537,'='),(1,'X'),
 (17,'='),(1,'X'),
 (9395,'='),(1,'D'),(1,'I'),(8,'='),(1,'D'),(1,'I'),(8,'='),(1,'D'),(1,'I'),(7,'=') ...]
```

Moreover, if I remove that first 2890 bp match and single indel from my sequences and run again (i.e. remove 2891 from the reference and 2890 from the query), I get a cigar that agrees with the rest of that original cigar for a while, then diverges. And it, too, appears to eventually devolves into the delete-one-insert-one-match-8-or-so pattern. Mathematically, I would expect this alignment to be identical to the previous, except for the removal of those first two cigar ops.
`[(12918, '='), (1, 'X'),
   (5026, '='), (1, 'X'),
   (2537, '='), (1, 'X'), 
  (17, '='), (1, 'X'),
  (11628, '='), (2, 'D'),
  (656, '='), (1, 'D'), 
  (2, '='), (1, 'I'), (8, '='), (1, 'D'), (1, 'I'), (8, '='), (1, 'D'), (1, 'I') ...]
`

I'm using (or at least I think I am using) a simple scoring model rewarding 1 for a match, penalizing 3 for a mismatch, 4 for a gap open and 1 for a gap extend. Aligning these two sequences with the same (I hope) scoring model in lastz yields an alignment covering the entirety of the strings, with 5 mismatches and 5 gapped bases.

The two input sequences are available at https://psu.box.com/s/lhwm1zaag8qt0l0ypu1ksfh4ueus1z7x

Details:
```python3
import parasail
subsMatrix = parasail.matrix_create("ACGT",1,-3)
refDB = parasail.sequences_from_file("example.chm13.fa")
qryDB = parasail.sequences_from_file("example.hg38.fa")
# extract the nt sequence from ref and qry here
ref = refDB[0]
qry = qryDB[0]
a = parasail.sg_trace_striped_16(ref.seq,qry.seq,4,1,subsMatrix)
aCigar = list(map(lambda v:(a.cigar.decode_len(v),a.cigar.decode_op(v).decode("utf-8")),a.cigar.seq))
len(aCigar) # 6669
aCigar[:20]  # this is the first cigar list I show above

a2 = parasail.sg_trace_striped_16(refNucs[2891:], qryNucs[2890:],4,1,subsMatrix)
s2Cigar = list(map(lambda v:(a2.cigar.decode_len(v),a2.cigar.decode_op(v).decode("utf-8")),a2.cigar.seq))
len(a2Cigar) # 5628 I'd really expect it to just be two shorter than 6669
a2Cigar[:20]  # this is the second cigar list I show above
```

I also tried parasail.sg_trace_striped_32, thinking maybe some internal scoring was overflowing the 16-bit ints. But I got the same strange result. (I don't know if your implementation is using an overflow-protected recurrence like that in the Suzuki 2017 paper).

Not sure whether I am doing something wrong or not. Perhaps there's a limit to the size of strings allowed? That the pattern seems to repeat at 8 makes me wonder if it has to do with striping.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repeatedly choosing back-to-back single base indels when a match is expected #59

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Repeatedly choosing back-to-back single base indels when a match is expected #59

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions