Conversation
We pass an array of "samples" to `Stats::SD`. Previously an arithmetic
mean was used on those samples. Because these are rates (iterations per
second), we should use a harmonic mean instead.
That's a little abstract for me so here's an example:
```ruby
require "benchmark/ips"

i = 0
Benchmark.ips do |x|
  x.report "100 ips" do
    sleep 1 if i % 100 == 0
    i += 1
  end

  x.report "1000 ips" do
    sleep 0.001
    i += 1
  end
end
```
The first benchmark has high variance: 1 in 100 iterations takes 1
second, otherwise it's nearly instant. I think everyone would agree that
is 100 i/s.
The second has consistent timing, and every iteration takes 1ms. I think
everyone would agree that's 1000 i/s.
However, when using the arithmetic mean, we're incorrectly biased towards
the fast measurements.
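To make the bias concrete, here's a small sketch (with hypothetical per-window rates, not numbers from this PR) comparing the two means:

```ruby
# Hypothetical per-window rates (i/s) for the "100 ips" case: one window
# hits the sleep (slow), the other windows run nearly instantly (fast).
rates = [1.0, 10_000.0, 10_000.0, 10_000.0]

# Arithmetic mean is dominated by the fast windows.
arithmetic = rates.sum / rates.length  # ~7500 i/s

# Harmonic mean is dominated by the slow window, which matches the real
# throughput when each window runs the same number of iterations.
harmonic = rates.length / rates.sum { |r| 1.0 / r }  # ~4 i/s
```

The arithmetic mean reports thousands of i/s even though three quarters of the wall-clock time here would be spent in the slow window.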
Before (arithmetic mean):

```
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
Warming up --------------------------------------
             100 ips     1.000 i/100ms
            1000 ips    95.000 i/100ms
Calculating -------------------------------------
             100 ips      5.826M (±16.8%) i/s  (171.64 ns/i) -    459.000 in   5.000456s
            1000 ips    949.996 (± 0.0%) i/s    (1.05 ms/i) -     4.750k in   5.000024s
```
After (harmonic mean):

```
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
Warming up --------------------------------------
             100 ips     1.000 i/100ms
            1000 ips    95.000 i/100ms
Calculating -------------------------------------
             100 ips     91.778 (±6617320.3%) i/s   (10.90 ms/i) -    459.000 in   5.001200s
            1000 ips    950.020 (± 0.0%) i/s    (1.05 ms/i) -     4.845k in   5.099893s
```
This does make the stats calculation essentially a complicated way to
calculate `total_iterations / total_time`, which is the correct overall
rate.
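A quick sketch of that equivalence (hypothetical timings; assumes each sample measures the same fixed number of iterations per window, as benchmark-ips's cycles-per-100ms do):

```ruby
k = 100                        # iterations per measurement window (hypothetical)
times = [1.0, 0.1, 0.1, 0.1]   # seconds per window: one slow outlier

rates = times.map { |t| k / t }                       # per-window rates (i/s)
harmonic = rates.length / rates.sum { |r| 1.0 / r }   # harmonic mean of rates
overall  = (k * times.length) / times.sum             # total_iterations / total_time

# The two agree (up to floating point): ~307.7 i/s here.
raise "mismatch" unless (harmonic - overall).abs < 1e-6
```

With equal iterations per sample, each sample's time is `k / rate`, so summing times and dividing into total iterations is exactly the harmonic mean.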
lib/benchmark/timing.rb

```ruby
# @param [Array] samples Samples to calculate mean.
# @return [Float] Harmonic mean of given samples.
def self.harmonic_mean(samples)
  reciprocals = samples.inject(0) { |acc, i| acc + (1.0 / i) }
```
```suggestion
reciprocals = samples.sum { |i| (1.0 / i) }
```
would be better, as `Array#sum` reduces floating-point error (it uses a compensated summation algorithm for floats), while `+` in `inject` accumulates rounding errors.
But `Array#sum` was added in Ruby 2.4, and the CI supports 2.3, so it should be conditional.
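A sketch of what that conditional could look like (hypothetical, not the committed code; the `respond_to?(:sum)` guard is one possible way to branch):

```ruby
# Hypothetical version-guarded helper: prefer Array#sum (compensated float
# summation, available since Ruby 2.4) and fall back to inject on Ruby 2.3.
def harmonic_mean(samples)
  reciprocals =
    if samples.respond_to?(:sum)
      samples.sum { |i| 1.0 / i }
    else
      samples.inject(0.0) { |acc, i| acc + (1.0 / i) }
    end
  samples.size / reciprocals
end

harmonic_mean([100.0, 1000.0])  # => ~181.8
```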
Wow, amazing find, I sometimes saw that

Could we actually compute it like
Great find! The % in this report is so insane as to be useless to any human brain reading it. I'd suggest that when the stddev is pushed to levels of incredulity, we should report it differently, perhaps like:
Yep! Updated to do that. I think that's the best and most intuitive way to calculate this, though I'm glad to have gone the more complex route first as (at least to me) that helped clarify what the issue was.
I had the same thought opening this PR. How about I did try
Great, thanks.
I think it'd still be useful to get a sense of how unstable it is. For the vast majority of benchmarks it'd be < 100% anyway, and there it's fine. Also, the number is so big in this case because it's stddev, which is known as a non-robust estimator and is easily swayed by a few outliers (which are common in benchmarks due to GC, etc.). So I think it's best not to change that. I see no need for it in most cases, and for the cases where it does happen it's good to be able to see the number, to quantify how unstable it is.