Skip to content

Bounds-check IFD count to prevent DoS on crafted TIFFs#1527

Merged
brendancol merged 2 commits intoxarray-contrib:mainfrom
brendancol:fix/ifd-count-bounds
May 9, 2026
Merged

Bounds-check IFD count to prevent DoS on crafted TIFFs#1527
brendancol merged 2 commits intoxarray-contrib:mainfrom
brendancol:fix/ifd-count-bounds

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Addresses security audit finding S2: parse_ifd in xrspatial/geotiff/_header.py previously trusted an entry's count field unconditionally and passed it straight into struct.unpack_from(f'{bo}{count}{fmt}', ...). A crafted multi-MB TIFF with count=5e7 on a LONG tag and a value pointer aimed at a region of valid file bytes forced the parser to materialize a ~1.4 GB tuple of Python ints before any caller-side validation ran.

Threat model

Untrusted TIFF input: web download, fsspec source, third-party catalog, user upload. An attacker only needs to ship a file large enough to cover the supposed value range (count * type_size bytes) to weaponize the count field.

Fix

Two guards added inside the per-entry loop in parse_ifd:

  1. Hard cap MAX_IFD_ENTRY_COUNT = 10_000_000 on count for tags that are not pixel-data offset/byte-count arrays. STRIP_OFFSETS, STRIP_BYTE_COUNTS, TILE_OFFSETS, TILE_BYTE_COUNTS, and COLORMAP are exempt because their counts legitimately scale with image dimensions (a real high-resolution tiled COG can have millions of tiles).
  2. For non-inline values, verify that value_offset + count * type_size <= len(data) before calling _read_value.

Both checks raise ValueError before any large allocation runs.

Test plan

  • pytest xrspatial/geotiff/tests/test_parse_ifd_bounds.py -x -q (5 new tests)
  • pytest xrspatial/geotiff/tests/test_header.py xrspatial/geotiff/tests/test_reader.py -x -q (34 passed)
  • Full geotiff suite passes apart from three pre-existing palette/deepcopy failures in test_features.py unrelated to this change
  • Manual reproduction of the audit scenario (count=5e7 LONG on ImageWidth against a 250 MB padded file) now raises ValueError with no large allocation

`parse_ifd` previously trusted an entry's `count` field unconditionally
and passed it straight into `struct.unpack_from(f'{bo}{count}{fmt}',
...)`. A crafted TIFF with `count=5e7` on a LONG tag and a value
pointer aimed at a multi-MB padding region forced the parser to
materialize a ~1.4 GB tuple of Python ints before any caller-side
validation ran.

This change adds two guards inside the per-entry loop:

1. Reject `count > MAX_IFD_ENTRY_COUNT` (10_000_000) for tags that are
   not pixel-data offset/byte-count arrays. STRIP_OFFSETS,
   STRIP_BYTE_COUNTS, TILE_OFFSETS, TILE_BYTE_COUNTS, and COLORMAP are
   exempt because their counts legitimately scale with image size.
2. For non-inline values, verify that
   `value_offset + count * type_size <= len(data)` before calling
   `_read_value`.

Both checks raise `ValueError` before any allocation. Threat model is
untrusted TIFF input (web download, fsspec source, third-party catalog,
user upload).

Tests added in test_parse_ifd_bounds.py:
- count over the cap on a non-pixel tag is rejected
- value range past EOF is rejected
- pixel-array tag with a huge count is allowed
- legal small IFDs still parse
- TileByteCounts SHORT path exercised
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 8, 2026
@brendancol brendancol requested a review from Copilot May 8, 2026 22:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the GeoTIFF header/IFD parser against memory-DoS attacks by validating IFD entry count values and referenced value ranges before decoding tag payloads.

Changes:

  • Add MAX_IFD_ENTRY_COUNT and enforce a hard cap for non–pixel-array tags in parse_ifd.
  • Add bounds checking for non-inline IFD value pointers to ensure value_offset + count * type_size stays within len(data).
  • Add a new test module covering count cap enforcement, EOF bounds checks, and pixel-tag exemptions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
xrspatial/geotiff/_header.py Adds count hard-cap for non-pixel tags and value-range bounds checks before _read_value / struct.unpack_from.
xrspatial/geotiff/tests/test_parse_ifd_bounds.py Introduces tests for the new guards and intended exemptions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread xrspatial/geotiff/tests/test_parse_ifd_bounds.py Outdated
Comment thread xrspatial/geotiff/_header.py Outdated
Comment thread xrspatial/geotiff/_header.py Outdated
Comment thread xrspatial/geotiff/tests/test_parse_ifd_bounds.py Outdated
Tighten the IFD-entry caps and clean up test scaffolding:

- Lower MAX_IFD_ENTRY_COUNT 10_000_000 -> 100_000. Even at the lower
  bound a malformed entry can still force ~3 MB of Python tuple
  allocation for LONG values; 100K is well above any legitimate
  metadata count and bounds the worst case to ~30 MB.
- Add MAX_IFD_ENTRY_BYTES = 256 KiB cap on `count * type_size` for
  non-pixel tags. Catches large-itemsize abuse (e.g. DOUBLE arrays)
  where the count alone passes but the bytes are excessive. Default
  caps are now independently useful: count cap stops integer-array
  attacks, byte cap stops large-element-size attacks.
- Remove the dead `ptr < 0` branch on the value-range check; ptr is
  unpacked via unsigned `I`/`Q` so it can never be negative.
- Reformat _build_tiff signature in the test to one-arg-per-line so
  the closing paren lines up cleanly (prior layout tripped E128/E124).
- Replace test_huge_pixel_array_count_allowed (which allocated 40 MB
  and forced a 10M-int tuple in CI) with monkeypatch-based tests that
  exercise the cap logic with tiny payloads. Add a second test that
  proves the byte cap fires at default cap values, so the production
  thresholds stay covered.
@brendancol brendancol merged commit 8a71679 into xarray-contrib:main May 9, 2026
11 checks passed
brendancol added a commit to brendancol/xarray-spatial that referenced this pull request May 9, 2026
`parse_all_ifds` walks the IFD chain via `next_ifd_offset` and
deduplicates with a `seen` set of offsets. That catches cycles, but
not long acyclic chains: a crafted BigTIFF can chain millions of
distinct IFD offsets that each point at small valid IFDs scattered
through a sparse multi-GB file, forcing O(N) memory in attacker-
controlled N.

This change adds a `MAX_IFDS = 256` ceiling and raises `ValueError`
once the chain hits it. 256 is generous: real-world COGs carry the
full-resolution IFD plus a handful of overview levels and (optionally)
per-band masks, so they sit well under 64 even for deep pyramids. The
cycle-detection `seen` set is preserved untouched.

Threat model is untrusted TIFF input (web download, fsspec source,
third-party catalog, user upload). Counterpart to S2 (xarray-contrib#1527), which
bounded per-IFD entry counts.

Tests in test_ifd_chain_cap.py:
- chain past the cap is rejected with a clear `ValueError`
- chain at MAX_IFDS - 1 still parses
- chain at MAX_IFDS hits the cap (off-by-one boundary)
- error message names MAX_IFDS and the threat
- big-endian chains hit the same cap
- a real COG with overview levels parses unaffected
brendancol added a commit that referenced this pull request May 9, 2026
* Cap IFD chain length in parse_all_ifds to prevent DoS

`parse_all_ifds` walks the IFD chain via `next_ifd_offset` and
deduplicates with a `seen` set of offsets. That catches cycles, but
not long acyclic chains: a crafted BigTIFF can chain millions of
distinct IFD offsets that each point at small valid IFDs scattered
through a sparse multi-GB file, forcing O(N) memory in attacker-
controlled N.

This change adds a `MAX_IFDS = 256` ceiling and raises `ValueError`
once the chain hits it. 256 is generous: real-world COGs carry the
full-resolution IFD plus a handful of overview levels and (optionally)
per-band masks, so they sit well under 64 even for deep pyramids. The
cycle-detection `seen` set is preserved untouched.

Threat model is untrusted TIFF input (web download, fsspec source,
third-party catalog, user upload). Counterpart to S2 (#1527), which
bounded per-IFD entry counts.

Tests in test_ifd_chain_cap.py:
- chain past the cap is rejected with a clear `ValueError`
- chain at MAX_IFDS - 1 still parses
- chain at MAX_IFDS hits the cap (off-by-one boundary)
- error message names MAX_IFDS and the threat
- big-endian chains hit the same cap
- a real COG with overview levels parses unaffected

* Address Copilot review on #1530

- Remove unused TAG_IMAGE_LENGTH import (F401).
- Fix test_chain_at_boundary_passes docstring: the summary said "Exactly
  MAX_IFDS is allowed" but the body and code show MAX_IFDS - 1 is the
  largest accepted chain. Reword the summary to match.

* Harmonize MAX_IFDS check with MAX_IFD_ENTRY_COUNT semantics

Switch parse_all_ifds from `len(ifds) >= MAX_IFDS` to
`len(ifds) > MAX_IFDS` so the cap matches the `> MAX_IFD_ENTRY_COUNT`
convention used elsewhere in this module: "MAX = N" reads as "up to and
including N is allowed". Update the boundary test to assert MAX_IFDS
parses cleanly and MAX_IFDS + 1 raises.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants