GH-3542 Fix LocalInputFile ByteBuffer read methods to support direct and non-zero-position buffers#3543
Open
arouel wants to merge 2 commits intoapache:masterfrom
Open
GH-3542 Fix LocalInputFile ByteBuffer read methods to support direct and non-zero-position buffers#3543arouel wants to merge 2 commits intoapache:masterfrom
LocalInputFile ByteBuffer read methods to support direct and non-zero-position buffers#3543arouel wants to merge 2 commits intoapache:masterfrom
Conversation
Summary Root cause (parquet-common/src/main/java/org/apache/parquet/io/LocalInputFile.java) Both read(ByteBuffer) and readFully(ByteBuffer) called: buf.put(buffer, buf.position() + buf.arrayOffset(), buf.remaining()); Two independent bugs: 1. Wrong put overload semantics. ByteBuffer.put(byte[] src, int offset, int length) reads src[offset..offset+length]. The code passed buf.position() + buf.arrayOffset() as the source offset, but the source is the freshly-allocated buffer array whose indices have no relationship to buf.position() or buf.arrayOffset(). On heap buffers with position=0 and arrayOffset=0 this happens to work; any other state reads from the wrong offset (or throws ArrayIndexOutOfBoundsException). 2. arrayOffset() is not universally defined. Direct buffers, memory-mapped buffers, and read-only views all throw UnsupportedOperationException from arrayOffset(). Any Parquet consumer passing such a buffer (which is exactly what the steampipe integration test triggers via ParquetFileReader.readFooter) blows up at the first call. 3. Partial-read accounting. read(ByteBuffer) unconditionally copied buf.remaining() bytes regardless of how many read(byte[]) actually returned, corrupting the destination buffer on short reads / EOF. Fix - readFully(ByteBuffer) → buf.put(buffer) (straight array-to-buffer copy, no arrayOffset()). - read(ByteBuffer) → only put read bytes when read > 0, return read (including -1 at EOF). Tests added TestLocalInputOutput gained 8 regression cases in parquet-common/src/test/java/org/apache/parquet/io/: - readFullyIntoHeapByteBuffer — baseline happy path. - readFullyIntoHeapByteBufferWithNonZeroPosition — exercises the old wrong-source-offset bug. - readFullyIntoDirectByteBuffer — direct buffers; arrayOffset() would have thrown. - readFullyIntoReadOnlyByteBuffer — read-only views; asserts ReadOnlyBufferException (correct JDK behaviour after the fix). - readIntoHeapByteBuffer — baseline. - readIntoByteBufferAdvancesPositionByBytesRead — exercises the partial-read accounting bug. - readIntoByteBufferReturnsMinusOneAtEof — EOF signalling. - readIntoDirectByteBuffer — direct buffer read path.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
LocalInputFile'sread(ByteBuffer)andreadFully(ByteBuffer)are broken for two independent reasons:buf.position() + buf.arrayOffset()as the source offset toByteBuffer.put(byte[] src, int offset, int length). The source is a freshly-allocated local byte[] whose indices have no relationship tobuf.position()orbuf.arrayOffset(), this reads from the wrong offset of the source array whenever the destination buffer's position is non-zero.buf.arrayOffset(), which throwsUnsupportedOperationExceptionon direct buffers, memory-mapped buffers, and read-only views.ParquetFileReader.readFooterpasses exactly such buffer shapes, so any consumer wrapping a Path with newLocalInputFile(path)can hit this bug.read(ByteBuffer)has an additional defect: it advances the destination bybuf.remaining()regardless of how many bytesread(byte[])actually returned, corrupting the buffer on short reads and at EOF.What changes are included in this PR?
readFully(ByteBuffer)now usesbuf.put(buffer), a straight array-to-buffer copy that works for every ByteBuffer shape and never calls arrayOffset().read(ByteBuffer)copies only the bytes actually read (buf.put(buffer, 0, read)whenread > 0) and returns the underlying read result, so-1at EOF propagates correctly and the destination's position is left untouched on EOF.Are these changes tested?
Yes.
TestLocalInputOutputgains eight regression tests that fail against the previous implementation and pass with the fix.Are there any user-facing changes?
No API signatures changed.
Closes #3542