
Use harmonic mean to calculate IPS #149

Open
jhawthorn wants to merge 3 commits into evanphx:master from jhawthorn:harmonic_mean

Conversation

@jhawthorn

@jhawthorn jhawthorn commented Mar 18, 2026

We pass an array of "samples" to Stats::SD. Previously an arithmetic mean was used on those samples. Because these are rates (iterations per second), we should use a harmonic mean instead.

That's a little abstract for me, so here's an example:

    require "benchmark/ips"

    i = 0
    Benchmark.ips do |x|
      x.report "100 ips" do
        sleep 1 if i % 100 == 0
        i += 1
      end

      x.report "1000 ips" do
        sleep 0.001
        i += 1
      end
    end

The first benchmark has high variance: 1 in 100 iterations takes 1 second, otherwise it's nearly instant. I think everyone would agree that's 100 i/s. The second has consistent timing, and every iteration takes 1ms. I think everyone would agree that's 1000 i/s.

However, when using the arithmetic mean, we're incorrectly biased towards the fast measurements.
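To see the bias concretely, here's a small standalone sketch (not part of the patch; the sample values are made up to mirror the "100 ips" benchmark above) comparing the two means over hypothetical per-iteration timings:

```ruby
# 99 nearly-instant iterations (1 microsecond each) plus one 1-second
# iteration: intuitively this is ~100 iterations per second overall.
times = [1.0] + Array.new(99, 1e-6)
rates = times.map { |t| 1.0 / t } # per-sample rates in i/s

arithmetic = rates.sum / rates.size
harmonic   = rates.size / rates.sum { |r| 1.0 / r }

puts arithmetic # huge, dominated by the 99 fast samples
puts harmonic   # ~100 i/s, the intuitively correct rate
```

The arithmetic mean lets the many fast samples swamp the single slow one; the harmonic mean weights each sample by the time it actually took.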

Before (arithmetic mean):

    ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
    Warming up --------------------------------------
                 100 ips     1.000 i/100ms
                1000 ips    95.000 i/100ms
    Calculating -------------------------------------
                 100 ips      5.826M (±16.8%) i/s  (171.64 ns/i) -    459.000 in   5.000456s
                1000 ips    949.996 (± 0.0%) i/s    (1.05 ms/i) -      4.750k in   5.000024s

After (harmonic mean):

    ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
    Warming up --------------------------------------
                 100 ips     1.000 i/100ms
                1000 ips    95.000 i/100ms
    Calculating -------------------------------------
                 100 ips     91.778 (±6617320.3%) i/s   (10.90 ms/i) -    459.000 in   5.001200s
                1000 ips    950.020 (± 0.0%) i/s    (1.05 ms/i) -      4.845k in   5.099893s

This does make the stats calculation essentially a complicated way to calculate total_iterations / total_time, which is the correct overall rate.
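As a sanity check of that claim, here's a hedged sketch (the sample counts and timings are illustrative, not taken from the benchmark above) showing that, when every sample runs the same number of iterations, the harmonic mean of the per-sample rates algebraically reduces to total_iterations / total_time:

```ruby
# Each sample runs the same number of iterations; one sample hits a
# slow outlier, as a GC pause or sleep would cause.
iters_per_sample = 50
sample_times = [0.1, 0.1, 1.1, 0.1] # seconds per sample
rates = sample_times.map { |t| iters_per_sample / t }

harmonic = rates.size / rates.sum { |r| 1.0 / r }
overall  = (iters_per_sample * sample_times.size) / sample_times.sum

# harmonic and overall agree (up to floating-point rounding), since
# n / sum(1/r_i) = n * iters / sum(t_i) = total_iters / total_time.
puts harmonic
puts overall
```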

    # @param [Array] samples Samples to calculate mean.
    # @return [Float] Harmonic mean of given samples.
    def self.harmonic_mean(samples)
      reciprocals = samples.inject(0) { |acc, i| acc + (1.0 / i) }
      samples.size / reciprocals
    end
Suggested change:

    - reciprocals = samples.inject(0) { |acc, i| acc + (1.0 / i) }
    + reciprocals = samples.sum { |i| (1.0 / i) }

This would be better, as it reduces floating-point error: Array#sum compensates for rounding when summing floats, while accumulating with + lets the error build up. But Array#sum was only added in Ruby 2.4, and the CI supports 2.3, so it would have to be conditional.
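To illustrate the difference the reviewer is pointing at: Ruby's Array#sum uses compensated (Kahan-Babuska) summation for Float elements, while inject with + does plain left-to-right accumulation. The classic demonstration:

```ruby
# A huge term temporarily swallows the small ones; plain accumulation
# loses them, compensated summation recovers them.
values = [1.0, 1e100, 1.0, -1e100]

plain       = values.inject(0) { |acc, v| acc + v } # => 0.0
compensated = values.sum                            # => 2.0
```

The effect on benchmark samples is far smaller than this contrived case, but the direction of the argument is the same: sum loses less precision than inject(:+).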

@eregon
Contributor

eregon commented Mar 18, 2026

Wow, amazing find! I sometimes saw that the total_iterations in total_time part (like "459.000 in 5.001200s") didn't seem to match the computed i/s, but I didn't have time to look further into it (and in hindsight I feel I should really have at least filed a bug 😅).

@eregon
Contributor

eregon commented Mar 18, 2026

> This does make the stats calculation essentially a complicated way to calculate total_iterations / total_time, which is the correct overall rate.

Could we actually compute it as total_iterations / total_time?
That's trivially correct, and it is far less prone to floating-point error (1.0 / i probably already introduces quite a bit on its own).

@evanphx
Owner

evanphx commented Mar 18, 2026

Great find!

The % in this report:

    100 ips     91.778 (±6617320.3%) i/s   (10.90 ms/i) -    459.000 in   5.001200s

is so insane as to be useless to any human brain reading it. I'd suggest that when the stddev is pushed to such levels of incredulity, we should report it differently, perhaps like:

    100 ips     91.778 (+?!%) i/s   (10.90 ms/i) -    459.000 in   5.001200s

@jhawthorn
Author

> Could we actually compute it like total_iterations / total_time?

Yep! Updated to do that. I think that's the best and most intuitive way to calculate this, though I'm glad to have gone the more complex route first as (at least to me) that helped clarify what the issue was.

> I'd suggest that when the stddev is pushed to levels of incredulity, we should report it like that

I had the same thought opening this PR. How about (± Inf%), which makes it clear it's a very large number rather than just completely unknown:

             100 ips     42.834 (± Inf%) i/s   (23.35 ms/i) -    215.000 in   5.019397s
            1000 ips    792.174 (± 0.5%) i/s    (1.26 ms/i) -      4.029k in   5.086003s

I did try (± ∞ %), (± ?!%), and (± ⚠️ %) but they didn't feel quite right
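Purely as an illustration of the idea (this helper is hypothetical, not code from the PR, and the threshold is an arbitrary placeholder), a cap like that could look something like:

```ruby
# Hypothetical formatter: print the real percentage normally, but fall
# back to "Inf" once it passes an absurdity threshold.
def format_deviation(pct, threshold: 100_000)
  pct > threshold ? "(± Inf%)" : format("(±%5.1f%%)", pct)
end

format_deviation(0.5)         # => "(±  0.5%)"
format_deviation(6_617_320.3) # => "(± Inf%)"
```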

@eregon
Contributor

eregon commented Mar 19, 2026

> Yep! Updated to do that

Great, thanks.

> Is so insane as to be useless to any human brain reading it.

I think it'd still be useful to get a sense of how unstable a benchmark is.
Some benchmarks are very unstable and might have 100% stddev; for those I would want to see the number, and hiding it behind Inf is counter-productive.

For the vast majority of benchmarks it'd be < 100%, though, and there it's fine anyway.

Also, the number is so big in this case because it's stddev, which is known to be a non-robust estimator and so easily swayed by a few outliers (which are common in benchmarks due to GC, etc.).

So I think it's best not to change that. At least I see no need for it in most cases, and for the cases where it does happen it's good to be able to see the number to quantify how unstable the benchmark is.
