
Conversation

@guan404ming (Member) commented Dec 4, 2025

Purpose of PR

Integrates Apache Arrow to enable efficient columnar data processing for quantum encoding operations. This addition provides a standardized path for handling structured data inputs alongside the existing raw &[f64] interface.
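As a rough illustration of the two input paths (the encoder calls shown in comments are hypothetical placeholders, not the exact API added by this PR):

use arrow::array::Float64Array;

fn example() {
    // Existing path: raw slice input.
    let raw: &[f64] = &[0.1, 0.2, 0.3, 0.4];
    // encoder.encode(raw);            // hypothetical existing slice-based call

    // New path: Arrow columnar input.
    let column = Float64Array::from(raw.to_vec());
    // encoder.encode_arrow(&column);  // hypothetical Arrow-based call
    let _ = column;
}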

Related Issues or PRs

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

@guan404ming guan404ming changed the base branch from main to dev-qdp December 4, 2025 11:29
@guan404ming guan404ming marked this pull request as ready for review December 4, 2025 11:33
@guan404ming guan404ming changed the title Integrate Apache Arrow for data processing [QDP] Integrate Apache Arrow for data processing Dec 4, 2025
@guan404ming guan404ming requested a review from rich7420 December 4, 2025 11:35
@guan404ming guan404ming marked this pull request as draft December 4, 2025 11:41
@rich7420 (Contributor) commented Dec 4, 2025

Thanks for the patch @guan404ming!!!
I think we should check the zero-copy parts, such as arrow_to_vec and Vec::push; they cause an extra memory copy and memory reallocation.
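For illustration, a minimal sketch of the copy-heavy path versus a borrowed view (arrow_to_vec is the function under discussion; arrow_as_slice is a hypothetical name, and the borrowed path assumes the array has no nulls):

use arrow::array::Float64Array;

// Copying path: materializes a new Vec<f64>, allocating and copying every value.
fn arrow_to_vec(array: &Float64Array) -> Vec<f64> {
    let mut out = Vec::new();
    for i in 0..array.len() {
        out.push(array.value(i)); // repeated Vec::push may also trigger reallocation
    }
    out
}

// Borrowed path: view the underlying Arrow buffer directly (valid when there are no nulls).
fn arrow_as_slice(array: &Float64Array) -> &[f64] {
    array.values() // no allocation, no copy
}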

@guan404ming guan404ming marked this pull request as ready for review December 4, 2025 16:38
@guan404ming (Member Author)

> Thanks for the patch @guan404ming!!!
> I think we should check the zero-copy parts, such as arrow_to_vec and Vec::push; they cause an extra memory copy and memory reallocation.

Nice suggestion, I've updated it with a much more optimized version. Thanks!

)]));

// Create Float64Array from slice
let array = Float64Array::from(Vec::from(data));
You could use Float64Array::from_iter_values(data.iter().copied()) to avoid the extra allocation.
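For reference, a sketch of the two variants side by side (wrapped in a function just so it compiles):

use arrow::array::Float64Array;

fn example(data: &[f64]) {
    // Current: copies the slice into a Vec, then hands ownership of the Vec to Arrow.
    let _copied = Float64Array::from(Vec::from(data));

    // Suggested: build the Arrow buffer directly from the iterator values.
    let _direct = Float64Array::from_iter_values(data.iter().copied());
}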

Comment on lines 191 to 194
pub fn read_parquet_to_arrow<P: AsRef<Path>>(path: P) -> Result<Float64Array> {
let data = read_parquet(path)?;
Ok(Float64Array::from(data))
}
Directly constructing an Arrow array via ParquetRecordBatchReader would avoid an extra copy.
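A rough sketch of that approach, assuming a single Float64 column per batch (the function name and error handling here are placeholders, not the actual io.rs API):

use std::fs::File;
use std::path::Path;

use arrow::array::{Array, Float64Array};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Read the first Float64 column of every record batch without round-tripping through Vec<f64>.
fn read_parquet_chunks<P: AsRef<Path>>(path: P) -> Result<Vec<Float64Array>, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    let mut chunks = Vec::new();
    for batch in reader {
        let batch = batch?;
        let column = batch
            .column(0)
            .as_any()
            .downcast_ref::<Float64Array>()
            .ok_or("expected a Float64 column")?
            .clone(); // cheap: clones the Arc'd buffers, not the values
        chunks.push(column);
    }
    Ok(chunks)
}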

@rich7420 (Contributor) commented Dec 5, 2025

Looks good, but the current implementation forces a memory copy (a Vec allocation) even when we want to use Arrow directly. We should refactor io.rs so that read_parquet_to_arrow is the base implementation, ensuring true zero-copy performance for the pipeline.
Current: Disk -> Arrow -> Vec (copy) -> Arrow (copy) -> GPU
Needed: Disk -> Arrow -> Arrow (zero-copy reference) -> GPU (via pointer)
I think so, please correct me if I'm wrong.

@400Ping commented Dec 5, 2025

> Looks good, but the current implementation forces a memory copy (a Vec allocation) even when we want to use Arrow directly. We should refactor io.rs so that read_parquet_to_arrow is the base implementation, ensuring true zero-copy performance for the pipeline.
> Current: Disk -> Arrow -> Vec (copy) -> Arrow (copy) -> GPU
> Needed: Disk -> Arrow -> Arrow (zero-copy reference) -> GPU (via pointer)
> I think so, please correct me if I'm wrong.

I agree with @rich7420 as well.

@guan404ming (Member Author) commented Dec 5, 2025

> Looks good, but the current implementation forces a memory copy (a Vec allocation) even when we want to use Arrow directly. We should refactor io.rs so that read_parquet_to_arrow is the base implementation, ensuring true zero-copy performance for the pipeline.
> Current: Disk -> Arrow -> Vec (copy) -> Arrow (copy) -> GPU
> Needed: Disk -> Arrow -> Arrow (zero-copy reference) -> GPU (via pointer)
> I think so, please correct me if I'm wrong.

I think you're right, thanks for pointing that out. I've updated the implementation so the data path is now:
Disk ----> Arrow Buffers (pointer only) -----> GPU
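Roughly, the idea is that the device upload reads straight from the Arrow buffer; the upload function below is a hypothetical stand-in for whatever the QDP kernel layer actually calls:

use arrow::array::Float64Array;

// Hypothetical GPU upload taking a host pointer and an element count;
// in practice this would be a cudaMemcpy (or equivalent) into device memory.
fn upload_f64_to_gpu(ptr: *const f64, len: usize) {
    let _ = (ptr, len);
}

fn encode_chunk(chunk: &Float64Array) {
    // No intermediate Vec: pass the Arrow buffer's pointer and length straight to the device copy.
    let values: &[f64] = chunk.values();
    upload_f64_to_gpu(values.as_ptr(), values.len());
}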

}

#[test]
fn test_chunked_zero_copy_api() {
@ryankert01 (Contributor) Dec 5, 2025


I know it's hard, but should we verify that the memory addresses are the same? It looks like the test only checks values.

It might be something like this (which only ensures the pointer lies within the mmap region):

// Check that the chunk's backing data lives inside the mmapped file region.
let chunk_ptr = chunks[0].values().as_ptr() as *const u8;
assert!(chunk_ptr >= mmap_ptr);
assert!(chunk_ptr < unsafe { mmap_ptr.add(mmap_len) });

@rich7420 (Contributor) Dec 5, 2025


@ryankert01 you're right. The current test verifies data integrity but not the zero-copy behavior. Strictly verifying memory addresses requires unsafe access to the underlying Arrow buffer pointers. We should add a test case that inspects array.values().as_ptr() to make sure it matches expectations, though fully verifying mmap alignment is tricky in tests.


This can be a follow-up too.

@rich7420 (Contributor) commented Dec 5, 2025

@guan404ming Great work on the io refactor! Thanks!
We need to update amplitude.rs because encode_from_parquet currently falls back to encode_chunked, which merges all chunks into one huge Vec on the CPU. This breaks zero-copy for large files.
We should override encode_chunked in AmplitudeEncoder to stream chunks directly to the GPU without merging.
What do you think?
I could send a PR for this part.
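A minimal sketch of the streaming idea; the function names and the GPU upload call below are assumptions for illustration, not the actual AmplitudeEncoder API:

use arrow::array::Float64Array;

// Hypothetical device upload; stands in for whatever the encoder's GPU layer provides.
fn upload_chunk_to_gpu(values: &[f64], offset: usize) {
    let _ = (values, offset);
}

// Stream each Arrow chunk to the GPU at its running offset instead of first
// concatenating everything into one huge Vec<f64> on the host.
fn encode_chunked_streaming(chunks: &[Float64Array]) {
    let mut offset = 0;
    for chunk in chunks {
        let values: &[f64] = chunk.values();
        upload_chunk_to_gpu(values, offset);
        offset += values.len();
    }
    // Amplitude normalization would then run on-device over the assembled state vector.
}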

@guan404ming (Member Author) commented Dec 5, 2025

> @guan404ming Great work on the io refactor! Thanks!
> We need to update amplitude.rs because encode_from_parquet currently falls back to encode_chunked, which merges all chunks into one huge Vec on the CPU. This breaks zero-copy for large files.
> We should override encode_chunked in AmplitudeEncoder to stream chunks directly to the GPU without merging.
> What do you think?
> I could send a PR for this part.

I previously planned to send a follow-up PR for this but forgot to add it to the PR description. You could definitely help with this, thanks!

@400Ping left a comment

Overall LGTM

@guan404ming (Member Author) commented Dec 5, 2025

After our offline discussion and investigation, we confirmed that Parquet cannot achieve true zero-copy from disk to memory because its data is stored in compressed, encoded form and must be decoded before use. We'll keep Parquet for now given its convenience and practicality. I've also updated the TODO to migrate the encoder to a chunk-based API.

I'll follow up on Arrow IPC as a potential path toward a true zero-copy data pipeline.
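For reference, a sketch of what the Arrow IPC read path could look like (the function name is hypothetical; FileReader decodes IPC record batches from any Read + Seek source, and a memory-mapped variant would be the thing to evaluate for the actual zero-copy follow-up):

use std::fs::File;

use arrow::array::{Array, Float64Array};
use arrow::ipc::reader::FileReader;

fn read_ipc_chunks(path: &str) -> Result<Vec<Float64Array>, Box<dyn std::error::Error>> {
    let reader = FileReader::try_new(File::open(path)?, None)?;

    let mut chunks = Vec::new();
    for batch in reader {
        let batch = batch?;
        let column = batch
            .column(0)
            .as_any()
            .downcast_ref::<Float64Array>()
            .ok_or("expected a Float64 column")?
            .clone(); // clones the Arc'd buffers, not the values
        chunks.push(column);
    }
    Ok(chunks)
}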

@guan404ming guan404ming changed the title [QDP] Integrate Apache Arrow for data processing [QDP] Integrate Apache Arrow and Parquet for data processing Dec 5, 2025
@guan404ming (Member Author)

cc @rich7420 @ryankert01 @400Ping

@rich7420 (Contributor) commented Dec 5, 2025

Sure! I will improve the RAM -> GPU path in #689.

@guan404ming (Member Author)

I'm going to merge this; please feel free to open a PR to refine it. Thanks for all the reviews and discussion!

@guan404ming guan404ming merged commit 64637fa into apache:dev-qdp Dec 5, 2025
2 checks passed
@guan404ming guan404ming deleted the integrate-arrow-rs branch December 5, 2025 15:15
@guan404ming (Member Author)

LOL

@rich7420 (Contributor) commented Dec 5, 2025

Speed as Rocket.

guan404ming added a commit to guan404ming/mahout that referenced this pull request Dec 11, 2025

* Integrate Apache Arrow & Parquet for data processing

* Optimize Arrow Float64Array handling in io

* Add chunked Arrow Float64Array support

* Refactor encoding to support chunked Arrow Float64Array input

* Refactor I/O and encoding documentation to remove zero-copy
