Skip to content

Make faidx work with very long (>4 Gbyte!) lines#2008

Open
daviesrob wants to merge 1 commit into
samtools:developfrom
daviesrob:faidx-very-long-lines
Open

Make faidx work with very long (>4 Gbyte!) lines#2008
daviesrob wants to merge 1 commit into
samtools:developfrom
daviesrob:faidx-very-long-lines

Conversation

@daviesrob
Copy link
Copy Markdown
Member

@daviesrob daviesrob commented May 12, 2026

Although faidx should support very long references, writing one longer than 4Gbases on a single line broke it because it used a uint32_t field to store the line length.

To make it work with such inputs, faidx1_t::line_blen is increased in size to uint64_t so the correct length can be stored. To avoid having to do the same for faidx1_t::line_len, which would make each entry quite a bit bigger for a fairly rare use-case, that field is changed so that it stores the number of bytes to be skipped at the end of each line instead of the full length. As this value will usually only be 1 or 2, a uint32_t is plenty big enough for it. Combined with the fact that the original structure had a four-byte hole in it (between line_blen and len), it's possible to store the longer line lengths while keeping faidx1_t exactly the same size as it had before.

Fixes samtools/samtools#2331

Comment thread faidx.c Outdated

while ((l = hgetln(buf, 0x10000, fp)) > 0) {
uint32_t line_len, line_blen, n;
uint64_t line_len, line_blen, n;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't affect the behaviour, but this is a good opportunity to make n a plain int.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's now an int.

Although faidx should support very long references, writing one
longer than 4Gbases on a single line broke it because it used
a uint32_t field to store the line length.

To make it work with such inputs, faidx1_t::line_blen is increased
in size to uint64_t so the correct length can be stored.  To
avoid having to do the same for faidx1_t::line_len, which would
make each entry quite a bit bigger for a fairly rare use-case,
that field is changed so that it stores the number of bytes to
be skipped at the end of each line instead of the full length.
As this value will usually only be 1 or 2, a uint32_t is plenty
big enough for it.  Combined with the fact that the original
structure had a four-byte hole in it (between line_blen and len),
it's possible to store the longer line lengths while keeping
faidx1_t exactly the same size as it had before.
Comment thread faidx.c
return -1;
else
return val.line_blen;
return (hts_pos_t) (val.line_blen <= HTS_POS_MAX ? val.line_blen : HTS_POS_MAX);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should return -1 rather than using saturated maths and returning HTS_POS_MAX, as providing the wrong value here will then cause a calculation of the offset to read from to be incorrect in anything that calls this function.

However that said, it's a bit of a technicality and a moot point.

  1. Samtools promptly turns -1 into HTS_POS_MAX anyway. Wrongly, causing the subsequent file offset to be incorrect.
  2. We can't have files that big as it's more storage than anyone has! So we'd fail at a different point making it somewhat moot.

@jkbonfield
Copy link
Copy Markdown
Contributor

I managed to break it with a malformed fai file:

foo	5000000000	5	-5000000000	-5000000000
$ samtools faidx long.fa foo:4999999999-
=================================================================
==1329386==ERROR: AddressSanitizer: negative-size-param: (size=-9999999998)
    #0 0x7f8997e84bad in memcpy /tmp/gcc-16.1.0/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117
    #1 0x00000079a2a2 in bgzf_read_small (/nfs/users/nfs_j/jkb/work/samtools_master/samtools/samtools+0x79a2a2)
    #2 0x00000079d566 in fai_retrieve (/nfs/users/nfs_j/jkb/work/samtools_master/samtools/samtools+0x79d566)
    #3 0x00000079d937 in fai_fetch64 (/nfs/users/nfs_j/jkb/work/samtools_master/samtools/samtools+0x79d937)
    #4 0x000000582293 in write_output /nfs/users/nfs_j/jkb/work/samtools_master/samtools/faidx.c:272
    #5 0x0000005860b1 in faidx_core /nfs/users/nfs_j/jkb/work/samtools_master/samtools/faidx.c:640
    #6 0x0000005866db in faidx_main /nfs/users/nfs_j/jkb/work/samtools_master/samtools/faidx.c:686
    #7 0x000000502dac in main /nfs/users/nfs_j/jkb/work/samtools_master/samtools/bamtk.c:254
    #8 0x7f8996e68d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #9 0x7f8996e68e3f in __libc_start_main_impl ../csu/libc-start.c:392
    #10 0x000000408794 in _start (/nfs/users/nfs_j/jkb/work/samtools_master/samtools/samtools+0x408794)

0x7f89957ad800 is located 0 bytes inside of 131072-byte region [0x7f89957ad800,0x7f89957cd800)
allocated by thread T0 here:
    #0 0x7f8997e871df in malloc /tmp/gcc-16.1.0/libsanitizer/asan/asan_malloc_linux.cpp:67
    #1 0x0000007952e7 in bgzf_read_init /nfs/users/nfs_j/jkb/work/samtools_master/htslib/bgzf.c:394

SUMMARY: AddressSanitizer: negative-size-param (/nfs/users/nfs_j/jkb/work/samtools_master/samtools/samtools+0x79a2a2) in bgzf_read_small
==1329386==ABORTING

One problem here is perhaps the logic in bgzf_read_small as the cast makes over-sized queries pass the < test:

154	    if ((ssize_t)length < fp->block_length - fp->block_offset) {
155	        // Short cut the common and easy mode
156	        memcpy((uint8_t *)data,
157	               (uint8_t *)fp->uncompressed_block + fp->block_offset,
158	               length);

length here is a very large unsigned value, but -9999999998 after the cast. Obviously it needs fixing upstream of this, but it does expose a potential weakness in other pieces of code that may be able to trigger the same case.

The faidx.c firstline_len = line_len - beg % val->line_blen line has firstline_len as signed and line_len as unsigned, which triggers an overflow, but it ought to have been weeded out in the initial setting of val->line_blen and val->line_extra in fai_insert_index.

Eg

diff --git a/faidx.c b/faidx.c
index f2539b67..7fb0ad98 100644
--- a/faidx.c
+++ b/faidx.c
@@ -95,6 +95,11 @@ static int fai_name2id(void *v, const char *ref)
 
 static inline int fai_insert_index(faidx_t *idx, const char *name, uint64_t len, uint64_t line_len, uint64_t line_blen, uint64_t seq_offset, uint64_t qual_offset)
 {
+    if ((hts_pos_t)line_len < 0 || (hts_pos_t)line_blen < 0) {
+        hts_log_error("Malformed line: line length is out of bounds");
+        return -1;
+    }
+
     if (!name) {
         hts_log_error("Malformed line: no name");
         return -1;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

samtools faidx incorrrent sequence extraction when scaffold length is very long (>2.1Gb)

3 participants