
fix: gap-fill and exact-range trim for post-threshold chunks #904

Open
nikooo777 wants to merge 3 commits into edge from fix/post-threshold-chunk-gap-fill

Conversation

@nikooo777
Collaborator

Summary

Fix a silent-corruption bug where ~arweave@2.9/raw= returned wrong bytes for any post-strict-data-split-threshold dataitem whose on-chain chunks didn't happen to align to a 256 KiB stride relative to the dataitem's data start. The advertised content-length was correct; the body was both shorter than that and contained bytes from outside the requested range at the tail.

Root cause is in get_chunk_range_fixed_size/3: it generated candidate query offsets at 256 KiB increments from the requested start, then used a one-extra-chunk fallback if the assembled length was short. That heuristic only covers the trivial "off by one trailing chunk" case — it misses interior gaps when the bucket grid drifts from the request stride, and assemble_chunks/2 made it worse by never trimming the trailing overshoot of the fallback chunk. Aligned dataitems happened to work and masked the bug.
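The failure mode above can be sketched in a few lines. This is an illustrative toy, not code from the diff: the chunk boundaries are invented, and the module and function names are hypothetical. It shows how a chunk smaller than the stride, sitting wholly between two 256 KiB candidate offsets, is never queried by the old heuristic.

```erlang
%% Illustrative sketch only -- chunk boundaries are invented to show why
%% fixed-stride candidate offsets can miss an interior chunk when the
%% on-chain chunk grid drifts from the request stride.
-module(gap_sketch).
-export([demo/0]).

-define(STRIDE, 262144). %% 256 KiB

%% Pretend chunk map: inclusive {StartOffset, EndOffset} of each chunk.
%% The middle chunk is smaller than the stride and lies wholly between
%% two stride points, so no candidate offset ever lands inside it.
chunks() ->
    [{0, 99999}, {100000, 149999}, {150000, 500000}].

chunk_for(Off) ->
    case [C || {S, E} = C <- chunks(), Off >= S, Off =< E] of
        [C | _] -> {ok, C};
        [] -> none
    end.

demo() ->
    %% Old heuristic: probe at Start, Start + STRIDE, ...
    Candidates = [0, ?STRIDE],
    Fetched = lists:usort([C || Off <- Candidates,
                                {ok, C} <- [chunk_for(Off)]]),
    %% The one-extra-chunk fallback only appends past the end; it never
    %% revisits an interior gap like this one.
    {fetched, Fetched, interior_gap, chunks() -- Fetched}.
```

Here `demo/0` reports `{100000, 149999}` as an interior gap: both probes hit the outer chunks, and nothing ever fetches the small middle one.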

Changes (all in src/dev_arweave.erl):

  • get_chunk_range_fixed_size/3 now defers to fill_gaps/4, the same iterative gap-fill helper the pre-threshold path already uses. Interior gaps of any shape get detected and refilled until the range is contiguous.
  • fill_gaps/4 trims trailing overshoot to exactly [Offset, EndOffset] after assembly; assemble_chunks/2 already trimmed the leading edge. The pre-threshold variable-size path also benefits.
  • get_chunk/3 routes global no-length chunk requests through a small new single_chunk_suffix/2 helper. Without this, the tightened trim would collapse no-length responses from "a chunk's suffix" to a single byte, regressing both the public chunk HTTP endpoint and the internal bundle_header/2 caller (which would then report invalid_bundle_header for every bundle).
  • fetch_chunk_range/4 docstring updated to describe the new exact-range contract.

Net diff: 51 insertions, 52 deletions in one file. No exported function signatures change; no schema or config changes.
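The exact-range trim in the second bullet boils down to clamping the assembled iolist to the inclusive range. A simplified sketch (the helper name is illustrative; in the diff this logic lives inside fill_gaps/4, which also drives the refetch loop):

```erlang
%% Simplified sketch of the trailing trim applied after assembly:
%% clamp the assembled binary to the inclusive range [Offset, EndOffset].
%% Leading overshoot is assumed already trimmed by assemble_chunks/2.
trim_to_range(Binaries, Offset, EndOffset) ->
    Bin = iolist_to_binary(Binaries),
    Expected = EndOffset - Offset + 1,
    {ok, binary:part(Bin, 0, min(Expected, byte_size(Bin)))}.
```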

Test plan

  • Index a block containing a post-threshold L2 dataitem whose chunks don't tile cleanly with a 256 KiB stride from its data start. Confirm the indexer detects bundles and writes per-item offsets as before.
  • Fetch the dataitem via ~arweave@2.9/raw=<id> and confirm the response body length matches the advertised content-length exactly and that the bytes decode end-to-end (e.g. an MP4 or WASM whose section/box walker reaches EOF cleanly). Before fix: same request returned ~33 KB short with non-payload bytes near the end.
  • Hit ~arweave@2.9/chunk?offset=... without a length parameter and confirm the response is a chunk's suffix (tens of KiB), not 1 byte. Confirms the no-length contract is preserved for both external HTTP callers and bundle_header/2.
  • rebar3 eunit: all dev_arweave tests pass (35/35), including the post-split chunk path tests. The only failures across the suite are three pre-existing network-dependent dev_name ARNS tests that reproduce identically on unmodified edge.

- Use fill_gaps in get_chunk_range_fixed_size for interior gaps
- Trim trailing overshoot inside fill_gaps to the exact range
- Route global no-length chunk requests through single_chunk_suffix
- Refresh fetch_chunk_range docstring to match new behavior
Comment thread src/dev_arweave.erl Outdated
%% @doc Fetch the single upstream chunk containing Offset and return its
%% suffix from Offset onward. Used when the chunk request has no explicit
%% length: the caller wants "a chunk's worth" rather than an exact range.
single_chunk_suffix(Offset, Opts) ->
Collaborator


Is this necessary for this PR? Did something break with the default Length=1 handling? (if it's not necessary we can also get rid of the HasExplicitLength check above)

Comment thread src/dev_arweave.erl Outdated
Comment on lines +503 to +505
Bin = iolist_to_binary(Binaries),
Expected = EndOffset - Offset + 1,
{ok, binary:part(Bin, 0, min(Expected, byte_size(Bin)))};
Collaborator


Do you know why this is needed here? There is some truncating that happens earlier in the stack - and since fill_gaps was previously called for pre-threshold ranges, is that flow path still good?

Collaborator


I'm still going through the logic, but on the offset 194_794_421_495_003 example, these are the values:

  • Expected 732228
  • byte_size(Bin) 840978

Going through the logic to make sure there is no extra code that we don't need.

Collaborator


All tests pass if we revert this change, because we have the one inside get_chunk to trim the binary.

@JamesPiechota
Collaborator

Extra test I recommend adding to dev_arweave:

get_post_split_mid_chunk_large_module_test_parallel() ->
    Offset = 194_794_421_495_003,
    ExpectedLength = 732_228,
    {ok, Data} = hb_ao:resolve(
        #{ <<"device">> => <<"arweave@2.9">> },
        #{
            <<"path">> => <<"chunk">>,
            <<"offset">> => Offset + 1,
            <<"length">> => ExpectedLength
        },
        #{}
    ),
    ?assertEqual(ExpectedLength, byte_size(Data)).

Comment thread src/dev_arweave.erl Outdated
Comment on lines +406 to +413
%% @doc Fetch a range of chunks in parallel. Determines the appropriate
%% algorithm to use based on offset, length, and an optional relative
%% transaction ID. For global (no relative TX) offsets, returns exactly the
%% bytes in the inclusive range [Offset, Offset + Length - 1]: both leading
%% and trailing overshoot are trimmed and any interior gaps in chunk
%% coverage are filled iteratively. For relative-TX offsets the legacy
%% concatenation path is used and the caller may receive more than Length
%% bytes (no trailing trim), so it must truncate the result itself.
Collaborator


I'm not sure if this is useful. The function isn't public, and the logic should be simple to understand and the code to read.

Collaborator

@JamesPiechota May 12, 2026


Agree. I generally remove all those explanatory comments that Claude adds.

@speeddragon force-pushed the fix/post-threshold-chunk-gap-fill branch from 83336c0 to 7ad7ad6 on May 12, 2026 16:44
Comment thread src/dev_arweave.erl
Comment on lines +1043 to +1045
%% For some reason, if we wait 1500 ms before the request this test doesn't
%% fail with connect_timeout when running in parallel.
timer:sleep(1500),
Collaborator


This is a flaky test: it passes 100% of the time when run as a single test, but fails 80%+ of the time when run in parallel (now the default) with checkout_timeout. The failure surfaces as a response of {ok, checkout_timeout}, which is odd because the error originates on the server side.

There is a separate bug where hackney configuration is applied when setting up test servers, to be fixed in another PR. Even so, increasing hackney's checkout timeout doesn't solve the issue; the request just waits until the timeout and then fails.

Collaborator

@JamesPiechota left a comment


LGTM

