fix: gap-fill and exact-range trim for post-threshold chunks #904

nikooo777 wants to merge 3 commits into
Conversation
- Use `fill_gaps` in `get_chunk_range_fixed_size` for interior gaps
- Trim trailing overshoot inside `fill_gaps` to the exact range
- Route global no-length chunk requests through `single_chunk_suffix`
- Refresh `fetch_chunk_range` docstring to match new behavior
```erlang
%% @doc Fetch the single upstream chunk containing Offset and return its
%% suffix from Offset onward. Used when the chunk request has no explicit
%% length: the caller wants "a chunk's worth" rather than an exact range.
single_chunk_suffix(Offset, Opts) ->
```
Is this necessary for this PR? Did something break with the default Length=1 handling? (if it's not necessary we can also get rid of the HasExplicitLength check above)
```erlang
Bin = iolist_to_binary(Binaries),
Expected = EndOffset - Offset + 1,
{ok, binary:part(Bin, 0, min(Expected, byte_size(Bin)))};
```
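The trim in the hunk above can be modeled in Python (a minimal sketch of the `binary:part(Bin, 0, min(Expected, byte_size(Bin)))` expression, not the actual implementation; the function name is a stand-in):

```python
def trim_to_range(assembled: bytes, offset: int, end_offset: int) -> bytes:
    """Cut trailing overshoot so the result covers exactly the
    inclusive range [offset, end_offset], mirroring the Erlang
    binary:part(Bin, 0, min(Expected, byte_size(Bin))) call."""
    expected = end_offset - offset + 1
    # min() guards the case where the assembled binary is already short.
    return assembled[: min(expected, len(assembled))]

# A final chunk that overshoots the requested end gets cut back:
trimmed = trim_to_range(b"payload__overshoot", 0, 6)  # -> b"payload"
```

The leading edge is not touched here; in the PR, `assemble_chunks/2` already trims leading overshoot before this point.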
Do you know why this is needed here? There is some truncating that happens earlier in the stack - and since fill_gaps was previously called for pre-threshold ranges, is that flow path still good?
I'm still going through the logic, but on the offset 194_794_421_495_003 example, these are the values:

`Expected = 732228`, `byte_size(Bin) = 840978`
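For what it's worth, the arithmetic on those two values is consistent with untrimmed trailing-chunk overshoot (a quick check, not something stated in the thread):

```python
# Values from the offset 194_794_421_495_003 example above.
expected = 732_228   # EndOffset - Offset + 1
actual = 840_978     # byte_size(Bin) before the trim

overshoot = actual - expected
print(overshoot)               # 108750 extra bytes past the requested end
print(overshoot < 256 * 1024)  # True: less than one full 256 KiB chunk
```

An overshoot smaller than one 256 KiB chunk fits the description of a final chunk extending past `EndOffset` rather than a whole extra chunk being appended.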
Going through the logic to make sure there is no extra code that we don't need.
All tests pass if we revert this change, because the trim inside get_chunk already truncates the binary.
Extra test I recommend adding to
```erlang
%% @doc Fetch a range of chunks in parallel. Determines the appropriate
%% algorithm to use based on offset, length, and an optional relative
%% transaction ID. For global (no relative TX) offsets, returns exactly the
%% bytes in the inclusive range [Offset, Offset + Length - 1]: both leading
%% and trailing overshoot are trimmed and any interior gaps in chunk
%% coverage are filled iteratively. For relative-TX offsets the legacy
%% concatenation path is used and the caller may receive more than Length
%% bytes (no trailing trim), so it must truncate the result itself.
```
I'm not sure this is useful. The function isn't public, and the logic should be simple enough to understand from reading the code.
Agree. I generally remove all those explanatory comments that Claude adds.
Force-pushed from 83336c0 to 7ad7ad6
```erlang
%% For some reason if we wait 1500ms before the request this test doesn't fail
%% with connect_timeout when running in parallel.
timer:sleep(1500),
```
This is a flaky test. It passes 100% of the time when run as a single test, but fails 80%+ of the time when run in parallel (the default now) with checkout_timeout. The error happens on the server side, so this is the response sent with {ok, checkout_timeout} (which is weird, because checkout_timeout is normally a client-side error).
There is a bug in the code where hackney configurations are applied when setting up test servers, to be fixed in another PR. Still, increasing the checkout timeout in hackney doesn't solve the issue; it just keeps waiting until the timeout and then fails.
Summary
Fix a silent-corruption bug where `~arweave@2.9/raw=` returned wrong bytes for any post-strict-data-split-threshold dataitem whose on-chain chunks didn't happen to align to a 256 KiB stride relative to the dataitem's data start. The advertised `content-length` was correct; the body was both shorter than that and contained bytes from outside the requested range at the tail.

Root cause is in `get_chunk_range_fixed_size/3`: it generated candidate query offsets at 256 KiB increments from the requested start, then used a one-extra-chunk fallback if the assembled length was short. That heuristic only covers the trivial "off by one trailing chunk" case; it misses interior gaps when the bucket grid drifts from the request stride, and `assemble_chunks/2` made it worse by never trimming the trailing overshoot of the fallback chunk. Aligned dataitems happened to work and masked the bug.

Changes (all in `src/dev_arweave.erl`):

- `get_chunk_range_fixed_size/3` now defers to `fill_gaps/4`, the same iterative gap-fill helper the pre-threshold path already uses. Interior gaps of any shape get detected and refilled until the range is contiguous.
- `fill_gaps/4` trims trailing overshoot to exactly `[Offset, EndOffset]` after assembly; `assemble_chunks/2` already trimmed the leading edge. The pre-threshold variable-size path also benefits.
- `get_chunk/3` routes global no-length chunk requests through a small new `single_chunk_suffix/2` helper. Without this, the tightened trim would collapse no-length responses from "a chunk's suffix" to a single byte, regressing both the public `chunk` HTTP endpoint and the internal `bundle_header/2` caller (which would then report `invalid_bundle_header` for every bundle).
- `fetch_chunk_range/4` docstring updated to describe the new exact-range contract.

Net diff: 51 insertions, 52 deletions in one file. No exported function signatures change; no schema or config changes.
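The gap-fill-then-trim flow described above can be sketched as a small Python model. This is a hedged illustration only: the `fetch_chunk` store lookup, the function names, and the signatures are stand-ins for the Erlang helpers (`fill_gaps/4`, `assemble_chunks/2`), not their real interfaces.

```python
def fetch_chunk(store: dict, offset: int):
    """Return (start_offset, data) for the chunk covering `offset`, if any.
    Stand-in for the upstream chunk lookup; real chunks need not align
    to the requested offset's 256 KiB stride."""
    for start, data in store.items():
        if start <= offset < start + len(data):
            return start, data
    return None

def fill_gaps(store: dict, offset: int, end_offset: int) -> bytes:
    """Iteratively fetch chunks until [offset, end_offset] is covered
    contiguously, then trim leading and trailing overshoot to the
    exact inclusive range."""
    pieces = []      # (start, data) in coverage order
    cursor = offset
    while cursor <= end_offset:
        hit = fetch_chunk(store, cursor)
        if hit is None:
            raise LookupError(f"no chunk covers offset {cursor}")
        start, data = hit
        pieces.append((start, data))
        cursor = start + len(data)   # the next gap (if any) starts here
    assembled = b"".join(d for _, d in pieces)
    lead = offset - pieces[0][0]             # leading overshoot
    expected = end_offset - offset + 1
    return assembled[lead : lead + expected]  # exact inclusive range

# Chunks deliberately misaligned with the requested range [2, 12]:
store = {0: b"abcdef", 6: b"ghij", 10: b"klmnop"}
result = fill_gaps(store, 2, 12)  # -> b"cdefghijklm" (11 bytes)
```

The key difference from the old heuristic is the loop: instead of guessing one extra trailing chunk, it advances a cursor past each fetched chunk and refetches at the first uncovered offset, so interior gaps of any shape are filled before the final trim.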
Test plan
- Request `~arweave@2.9/raw=<id>` and confirm the response body length matches the advertised `content-length` exactly and that the bytes decode end-to-end (e.g. an MP4 or WASM whose section/box walker reaches EOF cleanly). Before the fix: the same request returned ~33 KB short with non-payload bytes near the end.
- Request `~arweave@2.9/chunk?offset=...` without a `length` parameter and confirm the response is a chunk's suffix (tens of KiB), not 1 byte. Confirms the no-length contract is preserved for both external HTTP callers and `bundle_header/2`.
- `rebar3 eunit`: all `dev_arweave` tests pass (35/35), including the post-split chunk path tests. The only failures across the suite are three pre-existing network-dependent `dev_name` ARNS tests that reproduce identically on unmodified `edge`.