
[v2] Add more granular metrics to benchmarking framework: plugin-load time, static imports time, client creation time. #10292

Open
aemous wants to merge 9 commits into aws:v2 from aemous:plugin-load-time-perf

Conversation

Contributor

@aemous aemous commented May 8, 2026

Description of changes:

  • The benchmarking framework (callable from ./scripts/performance/run-benchmarks) now emits plugin-load time, static imports time, and client creation time.
  • Fixed a bug in the performance framework on IMDS-compatible instances: the CLI would attempt to reach the IMDS endpoints to retrieve credentials, but the stubbed responses did not support this. We now supply mock credentials in the stubbed config files for each of our stubbed JSON benchmarks.
  • In vendored botocore, emit before-create-client and after-create-client before and after client creation, respectively.
  • In create_clidriver, add a new optional event_hooks argument to allow callers to provide their own event emitter. This change was needed so the benchmark framework can register against the load_plugins event, which is emitted during create_clidriver.
  • Modified --debug-dir parameter in run-benchmarks script so that it supports relative paths (previously it only supported absolute paths).
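As a rough illustration of how a paired before/after event can be turned into a duration, here is a minimal sketch. `TinyEmitter` and `EventPairTimer` are hypothetical names made up for this example; `TinyEmitter` is a stand-in and not the botocore emitter API, and the benchmark framework in this PR may wire things up differently.

```python
import time
from collections import defaultdict


class TinyEmitter:
    """Hypothetical stand-in for an event emitter; not the botocore API."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, event_name, handler):
        self._handlers[event_name].append(handler)

    def emit(self, event_name, **kwargs):
        for handler in self._handlers[event_name]:
            handler(event_name=event_name, **kwargs)


class EventPairTimer:
    """Records elapsed seconds between a 'before' and an 'after' event."""

    def __init__(self, emitter, before_event, after_event,
                 clock=time.perf_counter):
        self._clock = clock
        self._started_at = None
        self.durations = []
        emitter.register(before_event, self._on_before)
        emitter.register(after_event, self._on_after)

    def _on_before(self, **kwargs):
        self._started_at = self._clock()

    def _on_after(self, **kwargs):
        # Ignore an 'after' event that has no matching 'before' event.
        if self._started_at is not None:
            self.durations.append(self._clock() - self._started_at)
            self._started_at = None


emitter = TinyEmitter()
timer = EventPairTimer(emitter, "before-create-client", "after-create-client")

emitter.emit("before-create-client", service_name="cloudwatch")
time.sleep(0.01)  # stands in for the real client construction work
emitter.emit("after-create-client", service_name="cloudwatch")
print(timer.durations)
```

Passing the emitter in from the caller is what makes this pattern possible: the timer has to be registered before the events of interest fire.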

Description of tests:

  • Manually ran the benchmarking framework locally and on an IMDS-compatible EC2 instance, and verified that the metrics are emitted and fall within the range expected from previous benchmarking efforts.
  • Manually tested the --debug-dir parameter with relative and absolute paths, and verified expected behavior in both cases.
  • Successfully ran an internal pre-prod build workflow.
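For reference, one common way to accept both relative and absolute directory arguments is to normalize the value at parse time. The sketch below is an assumption about the general technique, not the code in run-benchmarks; `resolve_debug_dir` is a hypothetical helper name.

```python
import argparse
from pathlib import Path


def resolve_debug_dir(raw_value, base_dir=None):
    """Return an absolute path for a relative or absolute directory argument."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = Path(raw_value).expanduser()
    if not path.is_absolute():
        # Relative paths are interpreted against the base directory.
        path = base / path
    return path.resolve()


parser = argparse.ArgumentParser()
parser.add_argument("--debug-dir", type=resolve_debug_dir, default=None)

# A relative value is anchored at the current working directory.
args = parser.parse_args(["--debug-dir", "out/debug"])
print(args.debug_dir)
```

Resolving once at the argument boundary means later code can treat the value as absolute even if the working directory changes.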

Example plugin-load time emission:

{
  "name": "cloudwatch.getmetricdata.plugins.import.time",
  "description": "Total time spent loading all built-in plugins.",
  "unit": "Seconds",
  "dimensions": [],
  "measurements": [
    0.08212709426879883
  ]
}
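A record with this shape could be assembled along the following lines. This is only a sketch: `build_metric` and its parameters are hypothetical, and the field names are copied from the sample emission above rather than from the framework's code.

```python
import json


def build_metric(command_prefix, suffix, description, seconds):
    """Build one metric entry shaped like the sample emission above."""
    return {
        "name": f"{command_prefix}.{suffix}",
        "description": description,
        "unit": "Seconds",
        "dimensions": [],
        "measurements": list(seconds),
    }


metric = build_metric(
    "cloudwatch.getmetricdata",
    "plugins.import.time",
    "Total time spent loading all built-in plugins.",
    [0.08212709426879883],
)
print(json.dumps(metric, indent=2))
```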

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aemous aemous requested a review from a team May 8, 2026 13:47
@hssyoo hssyoo assigned hssyoo and unassigned hssyoo May 20, 2026
@hssyoo hssyoo self-requested a review May 20, 2026 18:27
Contributor

@hssyoo hssyoo left a comment


Looks fine but the static import measurement seems a bit dubious. I'd like to understand how we plan on using this measurement, what we expect it to catch, how we would react.

Comment on lines +287 to +292
before_imports = start_time
# We import from awscli lazily to ensure import time is measured in
# total runtime.
from awscli.botocore.hooks import HierarchicalEmitter
from awscli.clidriver import AWSCLIEntryPoint, create_clidriver
after_imports = time.time()
Contributor


What are we measuring here exactly?

Contributor Author


This is the time to import all modules used by the AWS CLI at runtime. It's similar to the granular benchmarking that you previously implemented where you measured "Python Library Imports."

It lets us measure how much of the AWS CLI runtime (i.e., after the interpreter has started) is spent importing modules, before any initialization code runs.
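To make the idea concrete, the sketch below times a batch of imports with wall-clock timestamps. It uses stdlib modules as stand-ins for the CLI's own modules, and `time_imports` is a hypothetical helper. The measurement is only meaningful in a fresh interpreter, since modules already present in `sys.modules` return almost immediately.

```python
import importlib
import sys
import time


def time_imports(module_names):
    """Return wall-clock seconds spent importing the given modules."""
    started = time.time()
    for name in module_names:
        importlib.import_module(name)
    return time.time() - started


# Stdlib modules stand in for the CLI's own modules in this sketch.
modules = ["json", "decimal", "email.parser"]
# Record which modules were cached beforehand, since those cost almost nothing.
already_loaded = [name for name in modules if name in sys.modules]
elapsed = time_imports(modules)
print(f"imports took {elapsed:.6f}s; already cached: {already_loaded}")
```

For a per-module breakdown rather than a single total, CPython's `-X importtime` option is another way to gather similar data ad hoc.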

"Looks fine but the static import measurement seems a bit dubious. I'd like to understand how we plan on using this measurement, what we expect it to catch, how we would react."

It is largely prompted by our previous observation that prompt_toolkit took up a significant fraction of runtime. The underlying philosophy is: if module imports previously took up too much runtime, we should monitor the metric over time so we don't lose track of it and can prevent future regressions.

During my dev builds, we're observing 119 ms spent on static imports. This serves as a baseline against which we track changes over time.
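A baseline comparison of this kind might look like the sketch below. The `check_regression` helper and the 20 percent tolerance are assumptions chosen for illustration; neither the PR nor this discussion specifies a threshold.

```python
def check_regression(baseline_seconds, measured_seconds, tolerance=0.20):
    """Flag a measurement that exceeds the baseline by more than `tolerance`.

    `tolerance` is a fraction of the baseline (0.20 means 20 percent); the
    default is an arbitrary illustrative value.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    relative_change = (measured_seconds - baseline_seconds) / baseline_seconds
    return relative_change > tolerance, relative_change


baseline = 0.119  # the 119 ms static-import figure from this thread
regressed, change = check_regression(baseline, 0.150)  # 0.150 is made up
print(f"regressed={regressed}, change={change:+.1%}")
```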

It also automates the measurement that we previously performed ad hoc. When we benchmarked it ad hoc, we found useful results (particularly hefty imports). So in my view, if it was useful once, it should be useful to track in the long term.

"What we expect it to catch, how we would react"

For the most part, it should catch large regressions. If we notice this number change significantly, it would prompt us to investigate the regression, and consider refactoring code (e.g. lazy initialization). Overall, this should help us reduce the time between (A) unintentionally increasing command execution time due to a new hefty import and (B) allocating time to investigate and mitigate the regression.
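As an example of the kind of mitigation mentioned here, the sketch below defers an import until the module is first used. `LazyModule` is a hypothetical helper written for this illustration, and `fractions` is a stdlib stand-in for a heavy dependency; it is not how the CLI currently handles any particular import.

```python
import importlib
import sys


class LazyModule:
    """Defer importing a module until one of its attributes is first used."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, attribute):
        # Only called for attributes not found normally, i.e. the module's own.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, attribute)


# `fractions` is a stdlib stand-in for a heavy dependency in this sketch.
fractions = LazyModule("fractions")

# No import has been triggered by the proxy yet; it happens on first use.
half = fractions.Fraction(1, 2)
print(half + half)
```

The trade-off is that the import cost moves to the first call site, so it helps only for code paths that most commands never reach.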
