Perf: optimize take_scalar
#5723
CodSpeed Performance Report: merging #5723 will improve performance by 19.22%.
Codecov Report: all modified and coverable lines are covered by tests.
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
ecd0496 to 5783a6b
This change optimizes the scalar `take` compute function over slices.
It is pretty crazy how the compiler was not able to figure this out on
its own...
We can look at the codspeed numbers but on my machine this is ~35%
faster:
Original:
<details>
```
take_primitive fastest │ slowest │ median │ mean │ samples │ iters
├─ pvector_take_uniform │ │ │ │ │
│ ├─ 16 │ │ │ │ │
│ │ ├─ 1000 404.8 ns │ 6.394 µs │ 409.8 ns │ 413.1 ns │ 100000 │ 400000
│ │ ├─ 10000 3.689 µs │ 34.39 µs │ 3.709 µs │ 3.746 µs │ 100000 │ 100000
│ │ ╰─ 100000 36.43 µs │ 124 µs │ 36.63 µs │ 37 µs │ 100000 │ 100000
│ ├─ 256 │ │ │ │ │
│ │ ├─ 1000 404.8 ns │ 20.97 µs │ 409.8 ns │ 419.7 ns │ 100000 │ 400000
│ │ ├─ 10000 3.679 µs │ 81.82 µs │ 3.699 µs │ 3.78 µs │ 100000 │ 100000
│ │ ╰─ 100000 36.45 µs │ 122.8 µs │ 36.54 µs │ 36.95 µs │ 100000 │ 100000
│ ├─ 2048 │ │ │ │ │
│ │ ├─ 1000 409.8 ns │ 30.72 µs │ 429.8 ns │ 433.4 ns │ 100000 │ 100000
│ │ ├─ 10000 3.689 µs │ 21.1 µs │ 3.699 µs │ 3.718 µs │ 100000 │ 100000
│ │ ╰─ 100000 36.51 µs │ 124.3 µs │ 36.79 µs │ 37.2 µs │ 100000 │ 100000
│ ╰─ 8192 │ │ │ │ │
│ ├─ 1000 419.8 ns │ 5.489 µs │ 434.8 ns │ 437.4 ns │ 100000 │ 200000
│ ├─ 10000 3.699 µs │ 38.87 µs │ 3.749 µs │ 3.793 µs │ 100000 │ 100000
│ ╰─ 100000 36.55 µs │ 122.1 µs │ 36.95 µs │ 37.19 µs │ 100000 │ 100000
╰─ pvector_take_zipfian │ │ │ │ │
├─ 16 │ │ │ │ │
│ ├─ 1000 404.8 ns │ 2.012 µs │ 407.3 ns │ 409.6 ns │ 100000 │ 400000
│ ├─ 10000 3.689 µs │ 30.16 µs │ 3.719 µs │ 3.729 µs │ 100000 │ 100000
│ ╰─ 100000 36.53 µs │ 125.6 µs │ 36.78 µs │ 37.06 µs │ 100000 │ 100000
├─ 256 │ │ │ │ │
│ ├─ 1000 409.8 ns │ 4.949 µs │ 414.8 ns │ 415.4 ns │ 100000 │ 200000
│ ├─ 10000 3.689 µs │ 29.16 µs │ 3.719 µs │ 3.731 µs │ 100000 │ 100000
│ ╰─ 100000 36.52 µs │ 122.6 µs │ 36.77 µs │ 37 µs │ 100000 │ 100000
├─ 2048 │ │ │ │ │
│ ├─ 1000 399.8 ns │ 6.302 µs │ 404.8 ns │ 428.2 ns │ 100000 │ 400000
│ ├─ 10000 3.689 µs │ 25.36 µs │ 3.699 µs │ 3.736 µs │ 100000 │ 100000
│ ╰─ 100000 36.5 µs │ 121.6 µs │ 36.72 µs │ 36.96 µs │ 100000 │ 100000
╰─ 8192 │ │ │ │ │
├─ 1000 407.3 ns │ 1.914 µs │ 412.3 ns │ 415.2 ns │ 100000 │ 400000
├─ 10000 3.689 µs │ 13.83 µs │ 3.729 µs │ 3.744 µs │ 100000 │ 100000
╰─ 100000 36.47 µs │ 126.8 µs │ 37.41 µs │ 37.79 µs │ 100000 │ 100000
```
</details>
New:
<details>
```
take_primitive fastest │ slowest │ median │ mean │ samples │ iters
├─ pvector_take_uniform │ │ │ │ │
│ ├─ 16 │ │ │ │ │
│ │ ├─ 1000 279.8 ns │ 15.9 µs │ 299.8 ns │ 300.4 ns │ 100000 │ 100000
│ │ ├─ 10000 2.529 µs │ 22.95 µs │ 2.569 µs │ 2.592 µs │ 100000 │ 100000
│ │ ╰─ 100000 23.32 µs │ 115.6 µs │ 23.66 µs │ 23.95 µs │ 100000 │ 100000
│ ├─ 256 │ │ │ │ │
│ │ ├─ 1000 257.3 ns │ 1.302 µs │ 264.8 ns │ 267.2 ns │ 100000 │ 400000
│ │ ├─ 10000 2.519 µs │ 42.61 µs │ 2.569 µs │ 2.606 µs │ 100000 │ 100000
│ │ ╰─ 100000 23.59 µs │ 117.2 µs │ 23.71 µs │ 24.03 µs │ 100000 │ 100000
│ ├─ 2048 │ │ │ │ │
│ │ ├─ 1000 262.3 ns │ 3.857 µs │ 267.3 ns │ 270.4 ns │ 100000 │ 400000
│ │ ├─ 10000 2.559 µs │ 19.81 µs │ 2.599 µs │ 2.631 µs │ 100000 │ 100000
│ │ ╰─ 100000 23.48 µs │ 111 µs │ 23.62 µs │ 23.99 µs │ 100000 │ 100000
│ ╰─ 8192 │ │ │ │ │
│ ├─ 1000 292.3 ns │ 6.382 µs │ 302.3 ns │ 304.2 ns │ 100000 │ 400000
│ ├─ 10000 2.749 µs │ 21.31 µs │ 2.799 µs │ 2.819 µs │ 100000 │ 100000
│ ╰─ 100000 25.28 µs │ 118.3 µs │ 25.56 µs │ 25.9 µs │ 100000 │ 100000
╰─ pvector_take_zipfian │ │ │ │ │
├─ 16 │ │ │ │ │
│ ├─ 1000 257.3 ns │ 5.714 µs │ 264.8 ns │ 267.7 ns │ 100000 │ 400000
│ ├─ 10000 2.519 µs │ 20.1 µs │ 2.569 µs │ 2.593 µs │ 100000 │ 100000
│ ╰─ 100000 23.43 µs │ 105.6 µs │ 23.6 µs │ 23.87 µs │ 100000 │ 100000
├─ 256 │ │ │ │ │
│ ├─ 1000 259.8 ns │ 1.259 µs │ 264.8 ns │ 267.7 ns │ 100000 │ 400000
│ ├─ 10000 2.509 µs │ 36.64 µs │ 2.569 µs │ 2.763 µs │ 100000 │ 100000
│ ╰─ 100000 23.2 µs │ 107 µs │ 23.59 µs │ 23.87 µs │ 100000 │ 100000
├─ 2048 │ │ │ │ │
│ ├─ 1000 259.8 ns │ 1.957 µs │ 267.3 ns │ 275.8 ns │ 100000 │ 400000
│ ├─ 10000 2.569 µs │ 9.469 µs │ 2.609 µs │ 2.644 µs │ 100000 │ 100000
│ ╰─ 100000 23.72 µs │ 109.6 µs │ 23.99 µs │ 24.27 µs │ 100000 │ 100000
╰─ 8192 │ │ │ │ │
├─ 1000 269.8 ns │ 26.15 µs │ 284.8 ns │ 289.4 ns │ 100000 │ 200000
├─ 10000 2.709 µs │ 8.879 µs │ 2.739 µs │ 2.751 µs │ 100000 │ 100000
╰─ 100000 24.27 µs │ 107.2 µs │ 24.53 µs │ 24.74 µs │ 100000 │ 100000
```
</details>
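As a rough check on the two tables as posted, the relative improvement can be computed from any matching pair of rows. The sketch below uses the `pvector_take_uniform` / 16 / 100000 medians (36.63 µs before, 23.66 µs after); the helper names `time_reduction` and `speedup_factor` are made up for this illustration and are not part of the codebase.

```rust
/// Fraction of the original time saved when going from `before` to `after`.
/// Both arguments must be in the same unit (here: microseconds).
pub fn time_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before
}

/// How many times faster the new run is compared to the old one.
pub fn speedup_factor(before: f64, after: f64) -> f64 {
    before / after
}

/// Medians copied from the `pvector_take_uniform` / 16 / 100000 rows.
pub const MEDIAN_BEFORE_US: f64 = 36.63;
pub const MEDIAN_AFTER_US: f64 = 23.66;
```

With these two medians, `time_reduction` comes out at roughly 0.35 and `speedup_factor` at roughly 1.55. This only describes the numbers pasted here; other rows and other machines will give different ratios.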
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
This change optimizes the scalar `take` compute function over slices.
I wanted to optimize this before I start benchmarking the hand-written SIMD code we will bring back in #5722.
It is pretty crazy how the compiler was not able to figure this out on its own...
We can look at the CodSpeed numbers, but on my machine (AMD 7950X) this is ~10% faster. (The original and new benchmark outputs are attached as collapsed details.)
Edit: I accidentally used the wrong numbers; it seems to be about 10% faster in general.
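For readers unfamiliar with the operation: a scalar `take` over slices gathers `values[i]` for each `i` in an index slice. The diff itself is not shown on this page, so the sketch below is not the change made in this PR. It only illustrates the operation and one common kind of rewrite that a compiler will usually not do on its own (validating indices once, then gathering without per-element bounds checks). The function names and the use of `usize` indices are assumptions made for the example.

```rust
/// Straightforward scalar take: every lookup is bounds-checked.
/// Panics if any index is out of range.
pub fn take_checked<T: Copy>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&idx| values[idx]).collect()
}

/// Hypothetical variant: check the largest index once, then gather
/// without a bounds check on each element.
/// Panics if any index is out of range.
pub fn take_prevalidated<T: Copy>(values: &[T], indices: &[usize]) -> Vec<T> {
    // A single pass over the indices establishes that all of them are in range.
    if let Some(max_idx) = indices.iter().copied().max() {
        assert!(max_idx < values.len(), "take index out of bounds");
    }

    indices
        .iter()
        // SAFETY: the maximum index was checked against `values.len()` above,
        // so every `idx` is a valid position in `values`.
        .map(|&idx| unsafe { *values.get_unchecked(idx) })
        .collect()
}
```

Both functions return the same result for valid input. Whether the second is actually faster depends on the element type, the index distribution, and the target CPU, which is what benchmarks like the ones in this PR are meant to measure.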