[QDP] Integrate Apache Arrow and Parquet for data processing #680
Conversation
Force-pushed 77ff73f to 8ba2362
Force-pushed 8ba2362 to 57ff8ee
Force-pushed 57ff8ee to edee2a3
Thanks for the patch @guan404ming!!!

Nice suggestion; I've updated it with a much more optimized version. Thanks!
qdp/qdp-core/src/io.rs
Outdated
    )]));

    // Create Float64Array from slice
    let array = Float64Array::from(Vec::from(data));
You could use Float64Array::from_iter_values(data.iter().copied()) to avoid the extra allocation.
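For illustration, a minimal comparison of the two constructions (a sketch assuming data: &[f64] and the arrow_array crate paths, as in the snippet above):

    use arrow_array::Float64Array;

    fn build(data: &[f64]) -> Float64Array {
        // Copies the slice into a temporary Vec before handing it to Arrow.
        let _via_vec = Float64Array::from(Vec::from(data));

        // Writes the values straight into Arrow's buffer, skipping the Vec.
        Float64Array::from_iter_values(data.iter().copied())
    }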
qdp/qdp-core/src/io.rs
Outdated
    pub fn read_parquet_to_arrow<P: AsRef<Path>>(path: P) -> Result<Float64Array> {
        let data = read_parquet(path)?;
        Ok(Float64Array::from(data))
    }
Directly constructing an Arrow array via ParquetRecordBatchReader would avoid an extra copy.
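A sketch of the suggested direction, assuming a single-column, single-batch file (the column handling and error type are illustrative, not the PR's final code):

    use std::fs::File;
    use std::path::Path;
    use arrow_array::{Array, Float64Array};
    use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

    pub fn read_parquet_to_arrow<P: AsRef<Path>>(
        path: P,
    ) -> Result<Float64Array, Box<dyn std::error::Error>> {
        let file = File::open(path)?;
        let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
        // The decoded RecordBatch already holds Arrow buffers,
        // so no intermediate Vec<f64> is built.
        let batch = reader.next().ok_or("empty parquet file")??;
        let array = batch
            .column(0)
            .as_any()
            .downcast_ref::<Float64Array>()
            .ok_or("expected a Float64 column")?
            .clone(); // cheap: clones the Arc'd buffers, not the values
        Ok(array)
    }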
Looks good, but the current implementation forces a memory copy (a Vec allocation) even when we want to use Arrow directly. We should refactor io.rs so that read_parquet_to_arrow is the base implementation, ensuring true zero-copy performance for the pipeline.
I agree with @rich7420 as well.
I think you're right, thanks for pointing that out. I've updated the implementation to use only the Arrow-based path.
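Building on the sketch above, the Vec-returning helper would then become a thin wrapper rather than the base (again a sketch, not the merged code):

    // read_parquet_to_arrow is the base; the Vec path is derived from it.
    pub fn read_parquet<P: AsRef<std::path::Path>>(
        path: P,
    ) -> Result<Vec<f64>, Box<dyn std::error::Error>> {
        Ok(read_parquet_to_arrow(path)?.values().to_vec())
    }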
qdp/qdp-core/tests/parquet_io.rs
Outdated
    }

    #[test]
    fn test_chunked_zero_copy_api() {
I know it's hard, but should we verify that the memory addresses are the same? It looks like it only does a value check.
It might be something like this (which only ensures the pointer is inside the mmap):
    let chunk_ptr = chunks[0].values().as_ptr() as *const u8;
    assert!(chunk_ptr >= mmap_ptr);
    assert!(chunk_ptr < unsafe { mmap_ptr.add(mmap_len) });
@ryankert01 you're right. The current test verifies data integrity but not the zero-copy behavior. Strictly verifying memory addresses requires unsafe access to the underlying Arrow buffer pointers. We should add a test case that inspects array.values().as_ptr() to ensure it aligns with expectations, though fully verifying mmap alignment is tricky in tests.
This can be a follow-up too.
@guan404ming Great work on the io refactor! Thanks!

I previously planned to send a follow-up PR for this but forgot to add it to the PR description. You could definitely help with this, thanks!
400Ping left a comment
Overall LGTM

After our offline discussion and investigation, we confirmed that Parquet cannot achieve true zero-copy from disk to memory because its data is stored in compressed, encoded form and must be decoded before use. We'll keep Parquet for now given its convenience and practicality. I've also updated the TODO to migrate the encoder to a chunk-based API, and I'll follow up on Arrow IPC as a potential path toward a real zero-copy data pipeline.
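For context on that follow-up, a minimal sketch of what the Arrow IPC path could look like (file name and schema are illustrative; this is not part of this PR). IPC stores buffers in Arrow's in-memory layout, so reading skips the decode step that Parquet requires:

    use std::fs::File;
    use std::sync::Arc;
    use arrow_array::{ArrayRef, Float64Array, RecordBatch};
    use arrow_ipc::reader::FileReader;
    use arrow_ipc::writer::FileWriter;
    use arrow_schema::{DataType, Field, Schema};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("values", DataType::Float64, false),
        ]));
        let col: ArrayRef = Arc::new(Float64Array::from_iter_values([1.0, 2.0, 3.0]));
        let batch = RecordBatch::try_new(schema.clone(), vec![col])?;

        // Write one batch in Arrow's native layout.
        let mut writer = FileWriter::try_new(File::create("data.arrow")?, &schema)?;
        writer.write(&batch)?;
        writer.finish()?;

        // Read it back; the buffers come off disk already in Arrow layout,
        // unlike Parquet's compressed, encoded pages.
        let reader = FileReader::try_new(File::open("data.arrow")?, None)?;
        for batch in reader {
            println!("{} rows", batch?.num_rows());
        }
        Ok(())
    }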
Sure! I will improve the RAM -> GPU path in #689.
I'm going to merge this; please feel free to open a PR to refine it. Thanks for all the review and discussion!
LOL

Speed as Rocket.
Purpose of PR
Integrates Apache Arrow to enable efficient columnar data processing for quantum encoding operations. This addition provides a standardized path for handling structured data inputs alongside the existing raw &[f64] interface.
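As an illustration of the two input paths this enables (the encode entry point below is hypothetical; the real signature lives in qdp-core):

    use arrow_array::Float64Array;

    // Hypothetical encoder entry point standing in for the qdp-core API.
    fn encode(data: &[f64]) {
        println!("encoding {} amplitudes", data.len());
    }

    fn main() {
        // Existing raw-slice path.
        encode(&[0.5, 0.25, 0.25]);

        // Columnar path: a Float64Array's value buffer derefs to &[f64],
        // so it can feed the same encoder without copying.
        let arr = Float64Array::from_iter_values([0.5, 0.25, 0.25]);
        encode(arr.values());
    }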
Related Issues or PRs
Changes Made
Breaking Changes
Checklist