Speed up binary kernels by XXX%, add `BooleanBuffer::from_bitwise_binary` #9022
Review comment on the following lines:

```rust
    F: FnMut(u64, u64) -> u64,
{
    // Safety: all valid bytes are valid u64s
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
```
Do you know how much of a difference the full u64 alignment really makes on current target architectures? I think unaligned loads on x86 are not really slower; maybe only if they cross a cache line, which should only happen on every 8th load. With that assumption, using `slice::as_chunks` and `u64::from_le_bytes` on each chunk might be simpler, and would then only require handling a suffix.

Staying with `align_to` and bailing on non-empty prefixes should also be fine; I would expect buffers to be aligned most of the time, but having a byte length that is a multiple of 8 might be less likely.

Other than this nit the code looks good to me 👍
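In code, the second suggestion (keep `align_to` but only take the fast path when there is no prefix) amounts to something like the following sketch. This is an illustration only, not arrow-rs code, and the helper name is made up:

```rust
/// Returns the buffer viewed as u64 words if it starts on an 8-byte boundary,
/// or None so the caller can fall back to the existing byte-wise path.
/// Any trailing bytes that do not fill a whole word still need separate handling.
fn u64_words(bytes: &[u8]) -> Option<(&[u64], &[u8])> {
    // SAFETY: any initialized bytes are valid u64s
    let (prefix, words, suffix) = unsafe { bytes.align_to::<u64>() };
    prefix.is_empty().then_some((words, suffix))
}
```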
TL;DR: I don't really know how much of a difference this makes, and I am struggling to measure the difference.

Right now the benchmarks seem super sensitive to any change. For example, in the most recent version of this PR I simply moved the code and added the check for alignment, and this results in consistently reproducible slowdowns on some of the sliced benchmarks:
```
and             1.00    211.5±1.92ns    ? ?/sec    1.30    275.5±4.61ns     ? ?/sec
and_sliced_1    1.00    1095.6±3.62ns   ? ?/sec    1.12    1228.6±12.32ns   ? ?/sec
and_sliced_24   1.39    337.9±3.92ns    ? ?/sec    1.00    243.4±1.74ns     ? ?/sec
```
> With that assumption, using `slice::as_chunks` and `u64::from_le_bytes` on each chunk might be simpler, and would then only require handling a suffix.
I tried a version like this and will see how it performs:
```rust
/// Like [`Self::from_bitwise_binary_op`] but optimized for the case where the
/// inputs are aligned to byte boundaries
///
/// Returns `None` if the inputs are not fully u64 aligned
fn try_from_aligned_bitwise_binary_op<F>(
    left: &[u8],
    right: &[u8],
    len_in_bits: usize,
    op: &mut F,
) -> Option<Self>
where
    F: FnMut(u64, u64) -> u64,
{
    // trim to length
    let len_in_bytes = ceil(len_in_bits, 8);
    let left = &left[0..len_in_bytes];
    let right = &right[0..len_in_bytes];
    // Safety: all valid bytes are valid u64s
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
    let (right_prefix, right_u64s, right_suffix) = unsafe { right.align_to::<u64>() };
    if left_prefix.is_empty()
        && right_prefix.is_empty()
        && left_suffix.is_empty()
        && right_suffix.is_empty()
    {
        // the buffers are word (64 bit) aligned, so use optimized Vec code.
        let result_u64s = left_u64s
            .iter()
            .zip(right_u64s.iter())
            .map(|(l, r)| op(*l, *r))
            .collect::<Vec<u64>>();
        Some(BooleanBuffer::new(
            Buffer::from(result_u64s),
            0,
            len_in_bits,
        ))
    } else {
        let (left_slices, left_remainder) = left.as_chunks::<8>();
        let (right_slices, right_remainder) = right.as_chunks::<8>();
        debug_assert_eq!(left_slices.len(), right_slices.len());
        debug_assert_eq!(left_remainder.len(), right_remainder.len());
        let mut mutable_result =
            MutableBuffer::with_capacity(left_slices.len() * 8 + left_remainder.len());
        mutable_result.extend_from_iter(
            left_slices
                .iter()
                .zip(right_slices.iter())
                .map(|(l, r)| op(u64::from_le_bytes(*l), u64::from_le_bytes(*r))),
        );
        if !left_remainder.is_empty() {
            let rem = op(
                u64::from_le_bytes({
                    let mut bytes = [0u8; 8];
                    bytes[..left_remainder.len()].copy_from_slice(left_remainder);
                    bytes
                }),
                u64::from_le_bytes({
                    let mut bytes = [0u8; 8];
                    bytes[..right_remainder.len()].copy_from_slice(right_remainder);
                    bytes
                }),
            );
            println!(
                "copying {} remainder bytes: {:?}",
                left_remainder.len(),
                &rem.to_le_bytes()[..left_remainder.len()]
            );
            mutable_result.extend_from_slice(&rem.to_le_bytes()[..left_remainder.len()]);
        }
        Some(BooleanBuffer {
            buffer: Buffer::from(mutable_result),
            offset: 0,
            len: len_in_bits,
        })
    }
}
```

Sadly, it seems like we are not going to be able to use `as_chunks`:
```text
error: current MSRV (Minimum Supported Rust Version) is 1.85.0 but this item is stable since `1.88.0`
   --> arrow-buffer/src/buffer/boolean.rs:335:54
    |
335 |             let (left_slices, left_remainder) = left.as_chunks::<8>();
    |                                                      ^^^^^^^^^^^^^^^^
    |
    = note: you may want to conditionally increase the MSRV considered by Clippy using the `clippy::msrv` attribute
    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#incompatible_msrv
    = note: `-D clippy::incompatible-msrv` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::incompatible_msrv)]`
```
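For what it is worth, `chunks_exact(8)` plus `u64::from_le_bytes` expresses the same loop and has been stable since Rust 1.31, so it fits the 1.85 MSRV. A minimal sketch of that idea, returning a plain `Vec<u8>` rather than a `MutableBuffer`; this is not the code in this PR:

```rust
/// Apply `op` to two equal-length packed bitmaps, 64 bits at a time,
/// using chunks_exact instead of as_chunks (MSRV-friendly).
fn bitwise_binary_chunks<F>(left: &[u8], right: &[u8], mut op: F) -> Vec<u8>
where
    F: FnMut(u64, u64) -> u64,
{
    assert_eq!(left.len(), right.len());
    let mut out = Vec::with_capacity(left.len());
    for (l, r) in left.chunks_exact(8).zip(right.chunks_exact(8)) {
        // chunks_exact(8) guarantees 8-byte chunks, so the conversion cannot fail
        let l = u64::from_le_bytes(l.try_into().unwrap());
        let r = u64::from_le_bytes(r.try_into().unwrap());
        out.extend_from_slice(&op(l, r).to_le_bytes());
    }
    // Zero-pad the trailing < 8 bytes into a u64 and keep only the valid bytes
    let l_rem = left.chunks_exact(8).remainder();
    let r_rem = right.chunks_exact(8).remainder();
    if !l_rem.is_empty() {
        let (mut lb, mut rb) = ([0u8; 8], [0u8; 8]);
        lb[..l_rem.len()].copy_from_slice(l_rem);
        rb[..r_rem.len()].copy_from_slice(r_rem);
        let v = op(u64::from_le_bytes(lb), u64::from_le_bytes(rb));
        out.extend_from_slice(&v.to_le_bytes()[..l_rem.len()]);
    }
    out
}
```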
I may try to rejigger the boolean kernel benchmarks to include multiple allocations / sizes to try and get a better sense of this code.
Ok, given that totally specializing the slicing didn't help, I am now 100% convinced that the difference in performance reported for _sliced24 is due to alignment or some other crazy thing. What I plan to do now is:
I made a PR to try and make the benchmarks more reproducible:
# Which issue does this PR close?

- Related to #8806
- Related to #8996

# Rationale for this change

When working on improving the boolean kernels, I have seen significant and unexplained noise from run to run. For example, just adding a fast path for `u64` aligned data resulted in a reported 30% regression in the speed of slice24 (code that is not affected by the change at all). For example, from #9022:

```
and             1.00    208.0±5.91ns    ? ?/sec    1.34    278.8±10.07ns   ? ?/sec
and_sliced_1    1.00    1100.2±6.53ns   ? ?/sec    1.12    1226.9±6.11ns   ? ?/sec
and_sliced_24   1.40    340.9±2.49ns    ? ?/sec    1.00    243.7±2.13ns    ? ?/sec
```

I also can't reproduce this effect locally or when I run the benchmarks individually.

Given the above, and the tiny amount of time spent in the benchmark (hundreds of nanoseconds), I believe what is happening is that by changing the allocation pattern during the benchmark runs (each kernel allocates output), data for subsequent iterations is allocated subtly differently (e.g. the exact alignment or some other factor is different). This results in different performance characteristics even when the code has not changed.

# What changes are included in this PR?

To reduce this noise, I want to change the benchmarks to pre-allocate the input.

# Are these changes tested?

I ran them manually.

# Are there any user-facing changes?

No, this is just a benchmark change.
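Concretely, the fix amounts to building the input arrays once, outside the measured closure, so every iteration operates on identical allocations. A minimal criterion sketch of that pattern; the array size and kernel are placeholders and this is not the actual benchmark code in that PR:

```rust
use arrow::array::BooleanArray;
use arrow::compute::and;
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_and(c: &mut Criterion) {
    // Allocate the inputs once, up front, rather than inside `b.iter`,
    // so allocation patterns do not change between iterations
    let left: BooleanArray = (0..8192).map(|i| Some(i % 3 == 0)).collect();
    let right: BooleanArray = (0..8192).map(|i| Some(i % 5 == 0)).collect();

    c.bench_function("and", |b| {
        // Only the kernel itself (and its output allocation) is measured
        b.iter(|| black_box(and(&left, &right).unwrap()))
    });
}

criterion_group!(benches, bench_and);
criterion_main!(benches);
```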
Sadly the `as_chunks` implementation needs a newer MSRV:

```diff
+        } else {
+            let (left_slices, left_remainder) = left.as_chunks::<8>();
+            let (right_slices, right_remainder) = right.as_chunks::<8>();
+            debug_assert_eq!(left_slices.len(), right_slices.len());
+            debug_assert_eq!(left_remainder.len(), right_remainder.len());
+            let mut mutable_result =
+                MutableBuffer::with_capacity(left_slices.len() * 8 + left_remainder.len());
+            mutable_result.extend_from_iter(
+                left_slices
+                    .iter()
+                    .zip(right_slices.iter())
+                    .map(|(l, r)| op(u64::from_le_bytes(*l), u64::from_le_bytes(*r))),
+            );
+            if !left_remainder.is_empty() {
+                let rem = op(
+                    u64::from_le_bytes({
+                        let mut bytes = [0u8; 8];
+                        bytes[..left_remainder.len()].copy_from_slice(left_remainder);
+                        bytes
+                    }),
+                    u64::from_le_bytes({
+                        let mut bytes = [0u8; 8];
+                        bytes[..right_remainder.len()].copy_from_slice(right_remainder);
+                        bytes
+                    }),
+                );
+                mutable_result.extend_from_slice(&rem.to_le_bytes()[..left_remainder.len()]);
+            }
+            BooleanBuffer {
+                buffer: Buffer::from(mutable_result),
+                offset: 0,
+                len: len_in_bits,
+            }
         }
```

```text
error: current MSRV (Minimum Supported Rust Version) is 1.85.0 but this item is stable since `1.88.0`
```
This reverts commit 03d060b.
🤔 Now I have no idea why the slice24 is slower. I will take this back up later this week, hopefully.
I debugged the problems:
# Which issue does this PR close?

- `Buffer::from_bitwise_unary` and `Buffer::from_bitwise_binary` me… #8854

This is the next step after:
- `not` kernel by 50%, add `BooleanBuffer::from_bitwise_unary` #8996

# Rationale for this change

Also, it is hard to find the code to create new Buffers by applying bitwise unary operations.

# What changes are included in this PR?

- Add `BooleanBuffer::from_bitwise_binary`
- Use it in `bitwise_bin_op_helper`

# Are these changes tested?

Yes, new tests are added.

Performance results show XXX performance improvements.

# Are there any user-facing changes?
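For illustration, a hypothetical usage sketch of the new constructor. The exact public signature is defined by this PR and may differ from what is assumed here (two packed byte slices, a bit length, and a `FnMut(u64, u64) -> u64` closure, mirroring the helper shown in the review diff above):

```rust
use arrow_buffer::BooleanBuffer;

// Hypothetical: assumes BooleanBuffer::from_bitwise_binary(left_bytes, right_bytes,
// len_in_bits, op) with op: FnMut(u64, u64) -> u64, and buffers with a zero bit offset.
fn and_buffers(left: &BooleanBuffer, right: &BooleanBuffer) -> BooleanBuffer {
    assert_eq!(left.len(), right.len());
    assert_eq!(left.offset(), 0);
    assert_eq!(right.offset(), 0);
    // Combine the two packed bitmaps one u64 word at a time
    BooleanBuffer::from_bitwise_binary(left.values(), right.values(), left.len(), |l, r| l & r)
}
```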