Use harmonic mean to calculate IPS by jhawthorn · Pull Request #149 · evanphx/benchmark-ips

jhawthorn · 2026-03-18T18:03:56Z

We pass an array of "samples" to Stats::SD. Previously an arithmetic mean was used on those samples. Because these are rates (instructions per second), we should use a harmonic mean instead.

That's a little abstract for me so here's an example:

require "benchmark/ips"

i = 0
Benchmark.ips do |x|
  x.report "100 ips" do
    sleep 1 if i % 100 == 0
    i += 1
  end

  x.report "1000 ips" do
    sleep 0.001
    i += 1
  end
end

The first benchmark has a high variance, 1/100 iterations takes 1 second, otherwise it's nearly instant. I think everyone would agree that's 100 i/s. The second has consistent timing, and every iteration takes 1ms. I think everyone would agree that's 1000 i/s.

However when using the arithmetic mean, we're incorrectly biased towards the fast measurements.

Before (arithmetic mean):

ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
Warming up --------------------------------------
             100 ips     1.000 i/100ms
            1000 ips    95.000 i/100ms
Calculating -------------------------------------
             100 ips      5.826M (±16.8%) i/s  (171.64 ns/i) -    459.000 in   5.000456s
            1000 ips    949.996 (± 0.0%) i/s    (1.05 ms/i) -      4.750k in   5.000024s

After (harmonic mean):

ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
Warming up --------------------------------------
             100 ips     1.000 i/100ms
            1000 ips    95.000 i/100ms
Calculating -------------------------------------
             100 ips     91.778 (±6617320.3%) i/s   (10.90 ms/i) -    459.000 in   5.001200s
            1000 ips    950.020 (± 0.0%) i/s    (1.05 ms/i) -      4.845k in   5.099893s

This does make the stats calculation essentially a complicated way to calculate total_iterations / total_time, which is the correct overall rate.

We pass an array of "samples" to Stats::SD. Previously an arithmetic mean was used on those samples. Because these are rate, we should use a harmonic mean instead. That's a little abstract for me so here's an example: require "benchmark/ips" i = 0 Benchmark.ips do |x| x.report "100 ips" do sleep 1 if i % 100 == 0 i += 1 end x.report "1000 ips" do sleep 0.001 i += 1 end end The first benchmark has a high variance, 1/100 iterations takes 1 second, otherwise it's nearly instant. I think everyone would agree that is 100 i/s. The second has consistent timing, and every iteration takes 1ms. I think everyone would agree that's 1000 i/s. However when using the arithmetic mean, we're incorrectly biased towards the fast measurements. Before (arithmetic mean): ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux] Warming up -------------------------------------- 100 ips 1.000 i/100ms 1000 ips 95.000 i/100ms Calculating ------------------------------------- 100 ips 5.826M (±16.8%) i/s (171.64 ns/i) - 459.000 in 5.000456s 1000 ips 949.996 (± 0.0%) i/s (1.05 ms/i) - 4.750k in 5.000024s After (harmonic mean): ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux] Warming up -------------------------------------- 100 ips 1.000 i/100ms 1000 ips 95.000 i/100ms Calculating ------------------------------------- 100 ips 91.778 (±6617320.3%) i/s (10.90 ms/i) - 459.000 in 5.001200s 1000 ips 950.020 (± 0.0%) i/s (1.05 ms/i) - 4.845k in 5.099893s This does make the stats calculation essentially a complicated way to calculate total_iterations / total_time, which is the correct overall rate.

eregon · 2026-03-18T19:52:50Z

lib/benchmark/timing.rb

+    # @param [Array] samples Samples to calculate mean.
+    # @return [Float] Harmonic mean of given samples.
+    def self.harmonic_mean(samples)
+      reciprocals = samples.inject(0) { |acc, i| acc + (1.0 / i) }


Suggested change

reciprocals = samples.inject(0) { |acc, i| acc + (1.0 / i) }

reciprocals = samples.sum { |i| (1.0 / i) }

would be better as it reduces errors from floating-point operations, while + accumulates errors.
But Array#sum was added in 2.4, and the CI supports 2.3 so it should be conditional then.

eregon · 2026-03-18T19:53:54Z

Wow, amazing find, I sometimes saw that total_iterations in total_time like 459.000 in 5.001200s didn't seem to match the computed i/s but didn't have time to look further into it (and in hindsight I feel I should really have at least filed a bug 😅).

eregon · 2026-03-18T19:55:23Z

This does make the stats calculation essentially a complicated way to calculate total_iterations / total_time, which is the correct overall rate.

Could we actually compute it like total_iterations / total_time?
That's trivially correct and it is far less prone to floating-point operation errors (1.0 / i is probably already pretty bad in terms of how much error it introduces).

evanphx · 2026-03-18T20:26:11Z

Great find!

the % in this this report:

100 ips     91.778 (±6617320.3%) i/s   (10.90 ms/i) -    459.000 in   5.001200s

Is so insane as to be useless to any human brain reading it. I'd suggest that when the stddev is pushed levels of incredulity, we should report it like that, perhaps like:

100 ips     91.778 (+?!%) i/s   (10.90 ms/i) -    459.000 in   5.001200s

jhawthorn · 2026-03-19T00:00:55Z

Could we actually compute it like total_iterations / total_time?

Yep! Updated to do that. I think that's the best and most intuitive way to calculate this, though I'm glad to have gone the more complex route first as (at least to me) that helped clarify what the issue was.

I'd suggest that when the stddev is pushed levels of incredulity, we should report it like that, perhaps like:

I had the same thought opening this PR. How about (± Inf%), which makes it clear it's a very large number rather than just completely unknown:

             100 ips     42.834 (± Inf%) i/s   (23.35 ms/i) -    215.000 in   5.019397s
            1000 ips    792.174 (± 0.5%) i/s    (1.26 ms/i) -      4.029k in   5.086003s

I did try (± ∞ %), (± ?!%), and (± ⚠️ %) but they didn't feel quite right

eregon · 2026-03-19T09:19:47Z

Yep! Updated to do that

Great, thanks.

Is so insane as to be useless to any human brain reading it.

I think it'd still be useful to get a sense of how unstable it is.
Some benchmarks are very unstable and might have 100% stddev, for those I would want to see the number and hiding it with Inf is counter-productive.

For the vast majority of benchmarks it'd be < 100% though, and there it's fine anyway.

Also the number is so big in this case because it's stddev, which is known as a non-robust estimator and so easily swayed a lot by a few outliers (which is common in benchmarks due to GC, etc).

So I think it's best to not change that. At least I see no need for it in most cases, and for the cases it does happen it's good to be able to see the number to quantify how unstable it is.

eregon reviewed Mar 18, 2026

View reviewed changes

jhawthorn added 2 commits March 18, 2026 14:08

Calculate IPS as just total_iterations / total_time

d9823a1

Display +/- Inf% for errors > 99.9%

09d469d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use harmonic mean to calculate IPS#149

Use harmonic mean to calculate IPS#149
jhawthorn wants to merge 3 commits intoevanphx:masterfrom
jhawthorn:harmonic_mean

jhawthorn commented Mar 18, 2026 •

edited

Loading

Uh oh!

eregon Mar 18, 2026

Uh oh!

eregon commented Mar 18, 2026 •

edited

Loading

Uh oh!

eregon commented Mar 18, 2026

Uh oh!

evanphx commented Mar 18, 2026

Uh oh!

jhawthorn commented Mar 19, 2026

Uh oh!

eregon commented Mar 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	reciprocals = samples.inject(0) { \|acc, i\| acc + (1.0 / i) }
	reciprocals = samples.sum { \|i\| (1.0 / i) }

Conversation

jhawthorn commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eregon Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

eregon commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eregon commented Mar 18, 2026

Uh oh!

evanphx commented Mar 18, 2026

Uh oh!

jhawthorn commented Mar 19, 2026

Uh oh!

eregon commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jhawthorn commented Mar 18, 2026 •

edited

Loading

eregon commented Mar 18, 2026 •

edited

Loading

eregon commented Mar 19, 2026 •

edited

Loading