Make faidx work with very long (>4 Gbyte!) lines #2008
     while ((l = hgetln(buf, 0x10000, fp)) > 0) {
-        uint32_t line_len, line_blen, n;
+        uint64_t line_len, line_blen, n;
It doesn't affect the behaviour, but this is a good opportunity to make n a plain int.
Agreed, it's now an int.
Although faidx should support very long references, writing one longer than 4Gbases on a single line broke it because it used a uint32_t field to store the line length. To make it work with such inputs, faidx1_t::line_blen is increased in size to uint64_t so the correct length can be stored. To avoid having to do the same for faidx1_t::line_len, which would make each entry quite a bit bigger for a fairly rare use-case, that field is changed so that it stores the number of bytes to be skipped at the end of each line instead of the full length. As this value will usually only be 1 or 2, a uint32_t is plenty big enough for it. Combined with the fact that the original structure had a four-byte hole in it (between line_blen and len), it's possible to store the longer line lengths while keeping faidx1_t exactly the same size as it had before.
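The layout argument above can be checked with a small sketch. The two structs below are illustrative stand-ins, not copies of the htslib definition: the names `old_entry`, `new_entry` and `line_skip`, and the exact field order, are assumptions based only on the description (a `uint32_t` pair followed by a four-byte hole before `len`). On typical 64-bit ABIs, where `uint64_t` is 8-byte aligned, both structs come out the same size. The helper `base_offset` shows how a file offset might be derived once the entry stores the per-line skip rather than the full line length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "before" layout: two 32-bit line fields, then a
 * 4-byte padding hole before the 8-byte aligned `len`. */
typedef struct {
    int id;
    uint32_t line_len;   /* full line length, including line terminator */
    uint32_t line_blen;  /* bases per line */
    /* 4 bytes of padding on ABIs that align uint64_t to 8 bytes */
    uint64_t len;
    uint64_t seq_offset;
    uint64_t qual_offset;
} old_entry;

/* Hypothetical "after" layout: the 32-bit field now holds only the
 * bytes to skip at the end of each line (usually 1 or 2), and the
 * widened line_blen occupies the former hole. */
typedef struct {
    int id;
    uint32_t line_skip;  /* assumed name: bytes skipped at end of line */
    uint64_t line_blen;  /* bases per line, may exceed 4G */
    uint64_t len;
    uint64_t seq_offset;
    uint64_t qual_offset;
} new_entry;

/* File offset of 0-based base `pos`: whole lines before it each cost
 * line_blen + line_skip bytes, plus the position within its line. */
static uint64_t base_offset(const new_entry *e, uint64_t pos)
{
    uint64_t full_line = e->line_blen + e->line_skip;
    return e->seq_offset + (pos / e->line_blen) * full_line
                         + (pos % e->line_blen);
}
```

With a 5,000,000,000-base line and a one-byte terminator, the first base of the second line would sit 5,000,000,001 bytes after `seq_offset`, which a 32-bit line length could not represent.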
da343ee to 60ac4ea
         return -1;
     else
-        return val.line_blen;
+        return (hts_pos_t) (val.line_blen <= HTS_POS_MAX ? val.line_blen : HTS_POS_MAX);
This should return -1 rather than saturating to HTS_POS_MAX: returning the wrong value here makes the file-offset calculation incorrect in anything that calls this function.
That said, it's a bit of a technicality and a moot point:
- Samtools promptly turns -1 into HTS_POS_MAX anyway, which is wrong and makes the subsequent file offset incorrect.
- We can't have files that big as it's more storage than anyone has! So we'd fail at a different point making it somewhat moot.
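The difference between the two return conventions discussed in this comment can be sketched as follows. This is a minimal illustration, not the htslib code: `pos_t` and `POS_MAX` are hypothetical stand-ins for `hts_pos_t` and `HTS_POS_MAX` (the real constant may have a different value), and both function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t pos_t;        /* hypothetical stand-in for hts_pos_t */
#define POS_MAX INT64_MAX     /* hypothetical stand-in for HTS_POS_MAX */

/* Saturating variant: an unrepresentable length is clamped, so the
 * caller receives a plausible-looking but wrong value. */
static pos_t line_length_saturating(uint64_t line_blen)
{
    return (pos_t)(line_blen <= (uint64_t)POS_MAX ? line_blen
                                                  : (uint64_t)POS_MAX);
}

/* Checked variant: an unrepresentable length is reported as -1, so the
 * caller can refuse to compute an offset from it. */
static pos_t line_length_checked(uint64_t line_blen)
{
    return line_blen <= (uint64_t)POS_MAX ? (pos_t)line_blen : -1;
}
```

A caller of the checked variant would need to test for a negative result before using the value in any offset arithmetic; converting -1 back to the maximum reintroduces the problem described above.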
I managed to break it with a malformed fai file: One problem here is perhaps the logic in
The faidx.c Eg |
Fixes samtools/samtools#2331