
Add CI timing and failure analysis script #474

Draft
cvolk wants to merge 8 commits into main from cvolk/ci-analysis-script

Conversation

@cvolk
Collaborator

cvolk commented Mar 6, 2026

Summary

Adds scripts/ci_analysis.py, a standalone tool for analysing GitHub Actions CI health for this repo. It fetches run and job data from the GitHub API and produces:

  • Per-job queue time and duration stats (mean, median, p90, stdev) over any lookback window
  • Weekly breakdown tables with median, p90, and sample counts per job, plus trend arrows comparing the last two completed weeks
  • Failure analysis: total failure rate, infra vs genuine classification with per-run detail (failing job + step name)
  • Optional trend plots (--plot) saved as PNGs
  • Raw data saved to JSON for further analysis
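As a rough illustration of the per-job stats bullet, queue time and duration statistics over a list of GitHub job objects could be computed like this. This is a sketch, not the script's actual API: `job_stats` and `_parse` are illustrative names; only the `created_at`/`started_at`/`completed_at` fields come from GitHub's job payloads.

```python
from datetime import datetime
from statistics import mean, median, stdev

def _parse(ts: str) -> datetime:
    # GitHub timestamps look like "2026-03-06T15:45:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def job_stats(jobs: list) -> dict:
    # Queue time: created_at -> started_at; duration: started_at -> completed_at
    queue = [(_parse(j["started_at"]) - _parse(j["created_at"])).total_seconds() / 60
             for j in jobs]
    dur = [(_parse(j["completed_at"]) - _parse(j["started_at"])).total_seconds() / 60
           for j in jobs]
    sv = sorted(queue)
    return {
        "queue_mean": mean(queue),
        "queue_median": median(queue),
        # nearest-rank style p90, clamped so small samples stay in range
        "queue_p90": sv[max(0, int(len(sv) * 0.9) - 1)],
        "dur_median": median(dur),
        "dur_stdev": stdev(dur) if len(dur) > 1 else 0.0,
    }
```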

Usage:

# Text report for the last 6 weeks
python3 scripts/ci_analysis.py --since 2026-01-23

# With trend plots
python3 scripts/ci_analysis.py --since 2026-01-23 --plot

Example output (weekly stats tables):

  QUEUE TIME — Median (min)
  Job                          W04     W05     W06     W07     W08     W09
  ──────────────────────────────────────────────────────────────────────────
  Pre-commit                   1.8     0.8     0.2     0.7     0.8     2.9  ↑
  Run tests                    0.0     0.0     0.0     0.0     0.0     0.6  ↑
  Policy tests (GR00T)         2.0     0.0     0.0     0.0     0.0     1.1  ↑
  Build docs                   2.6     0.1     0.0     0.0     0.0     0.2  →
  Build & push image           0.0     0.0     0.0     0.0     0.0     0.0  →
    Total queue                7.2     5.4     5.9     4.4     3.3     8.6  ↑

  QUEUE TIME — p90   (min)
  Job                          W04     W05     W06     W07     W08     W09
  ──────────────────────────────────────────────────────────────────────────
  Pre-commit                  12.6     9.6    13.1    19.3     8.6    31.1  ↑
  Run tests                   12.2     6.8     8.5     5.5     3.0    20.6  ↑
  Policy tests (GR00T)        14.9     9.5     6.4     5.8     4.6    13.3  ↑
  Build docs                  13.7    14.9    10.9     8.9     6.8    20.0  ↑
  Build & push image           0.0     0.0     0.0     0.0     0.0     0.0  →
    Total queue               60.1    33.2    31.1    39.0    22.4    68.7  ↑

  QUEUE TIME — n (samples)
  Job                          W04     W05     W06     W07     W08     W09
  ──────────────────────────────────────────────────────────────────────────
  Pre-commit                    23     133     104      95      42      87
  Run tests                     23     141     115      96      44      91
  Policy tests (GR00T)          15     136     110      97      42      97
  Build docs                    15     133     109      95      42      92
  Build & push image            15     142     116      97      45     101

  DURATION — Median (min)
  Job                          W04     W05     W06     W07     W08     W09
  ──────────────────────────────────────────────────────────────────────────
  Pre-commit                   0.9     0.8     0.9     0.9     1.2     1.3  →
  Run tests                   24.7    26.0    35.2    38.2    38.6    42.6  ↑
  Policy tests (GR00T)        28.2    25.7    28.2    32.0    32.1    33.3  ↑
  Build docs                   1.2     0.8     0.6     0.6     0.6     0.6  →
  Build & push image          40.2    29.2    29.4    28.7    31.7    30.0  ↓
    Total wall-clock          47.9    45.8    43.9    43.4    49.1    72.6  ↑

  DURATION — p90   (min)
  Job                          W04     W05     W06     W07     W08     W09
  ──────────────────────────────────────────────────────────────────────────
  Pre-commit                   1.2     0.9     1.0     1.2     1.6     1.9  →
  Run tests                   28.0    28.5    39.5    39.3    39.8    46.7  ↑
  Policy tests (GR00T)        30.9    27.3    33.8    33.4    33.5    34.8  ↑
  Build docs                   1.3     1.3     1.1     0.6     0.6     1.6  ↑
  Build & push image          40.2    29.2    29.5    28.9    31.9    32.8  ↑
    Total wall-clock          93.6    71.5    77.4    92.7    84.0   165.3  ↑

  DURATION — n (samples)
  Job                          W04     W05     W06     W07     W08     W09
  ──────────────────────────────────────────────────────────────────────────
  Pre-commit                    20     108      76      67      31      74
  Run tests                     10      52      39      35      17      56
  Policy tests (GR00T)           8      63      52      46      24      54
  Build docs                    12      93      63      61      29      70
  Build & push image             1       3       4       5       4      10

======================================================================

Current findings (Jan 23 – Mar 6):

  • 30% failure rate across 761 runs; of those failures, 95% genuine, 5% infra
  • "Run tests" median duration grew from ~25m → ~43m over 6 weeks (likely IsaacLab submodule updates)
  • W09 queue time spike: median total queue 3.3m → 8.6m, p90 22m → 69m
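For context on the infra vs genuine split above: the classification is pattern-based, roughly along these lines. The patterns below are invented for illustration; the script's actual INFRA_PATTERNS and function signature may differ.

```python
import re

# Illustrative patterns only -- the script's real INFRA_PATTERNS differ.
INFRA_PATTERNS = [
    r"runner.*lost communication",
    r"no space left on device",
    r"docker.*pull.*timed?\s*out",
]

def classify_failure(job_name: str, step_name: str) -> str:
    """Match infra patterns against both job names and step names
    (an earlier revision only checked step names); anything that
    does not match is assumed to be a genuine test failure."""
    text = f"{job_name} {step_name}".lower()
    for pat in INFRA_PATTERNS:
        if re.search(pat, text):
            return "infra"
    return "genuine"
```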

scripts/ci_analysis.py fetches GitHub Actions workflow run and job
data for the CI workflows and produces:
- Per-job queue time (created_at -> started_at) and duration stats
- Per-run total queue time (sum across all PR-gating jobs)
- Weekly trend plots: queue time and duration per job (PNG)
- Failure classification: infrastructure vs. genuine

Default time range is from Jan 1 of the current year to today, so
running the script regularly accumulates a growing year-to-date view.

Usage:
  python3 scripts/ci_analysis.py --plot            # year-to-date
  python3 scripts/ci_analysis.py --since 2026-03-01 --plot

Requires: gh CLI authenticated or GITHUB_TOKEN env var.
Generated PNGs and JSON are added to .gitignore.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>

ci_new.yml (GitHub workflow id 238099976) was a temporary rename of
ci.yml in Feb 2026 and no longer exists in the repo. GitHub keeps it
active in its API as long as run history exists, causing the analysis
script to count every run twice.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Mirrors the existing total queue time line on the queue plot:
the dashed black 'Total' line shows the weekly median of the
sum of PR-gating job durations per run.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>

- Duration plot: only count jobs with conclusion==success to avoid
  cancelled/failed jobs biasing the median downward
- Total wall-clock: only record for runs where ALL PR-gating jobs
  succeeded individually, not just run.conclusion==success
- Fix variable shadowing bug: job conclusion was overwriting run
  conclusion in the outer loop (renamed to job_conclusion)
- Remove failure rate bar chart (kept in text report instead)
- Simplify plots to median-only lines; move med/p90/σ breakdown
  into a weekly stats table printed to stdout
- Remove ci_new.yml from WORKFLOW_IDS (short-lived rename, no longer
  in repo, was double-counting all runs)

Signed-off-by: Clemens Volk <cvolk@nvidia.com>

- Exclude the current (partial) ISO week from both plots and the weekly
  stats table, so incomplete data does not distort trends
- Remove the "Total queue" line from the queue time plot; median-of-sums
  is misleading when per-run queues are correlated/bursty; per-job
  medians and the weekly table (med/p90/σ) give a clearer picture
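The current-week exclusion amounts to dropping the ISO week containing today before aggregating. A sketch, assuming records are already grouped by `(iso_year, iso_week)` keys; `_current_iso_week` matches the helper name in a later commit message, while `completed_weeks` is an illustrative name:

```python
from datetime import date

def _current_iso_week() -> tuple:
    # (ISO year, ISO week) for today; used to drop the in-progress
    # week so partial data does not distort trends.
    iso = date.today().isocalendar()
    return (iso[0], iso[1])

def completed_weeks(records_by_week: dict) -> dict:
    # Keep only weeks strictly before the current ISO week;
    # tuple comparison handles year boundaries correctly.
    now = _current_iso_week()
    return {wk: recs for wk, recs in records_by_week.items() if wk < now}
```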

Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Correctness:
- Fix p90 formula in fmt_stats to match pct() — was sv[int(n*0.9)]
  (100th percentile for n=10), now sv[max(0, int(n*p)-1)] via shared
  _pct() helper used in both places
- week_failure_counts["genuine"] was never incremented; now uses
  week_failure_counts[run_week][cat] += 1 to track all categories
- Clarify run_total_times comment: it is unfiltered (all runs with
  valid timestamps); only run_duration_records is restricted to
  fully-successful runs

Robustness:
- gh_get: distinguish permanent 4xx errors (401/403/404/422) from
  transient ones — fail fast with a clear hint instead of retrying;
  also change bare `raise exc` to `raise` to preserve traceback
- fetch_runs_since: add _MAX_PAGES=200 safety cap to prevent infinite
  pagination; check for GitHub error objects in response body
- analyze: count per-run job-fetch failures and print a summary
  warning; sys.exit if every run failed (signals auth/network issue)
- get_token: wrap subprocess.run with timeout=10 and handle
  FileNotFoundError/TimeoutExpired; replace bare except on YAML
  fallback with specific exceptions (ImportError, FileNotFoundError,
  KeyError) so each failure prints a diagnostic message
- main: guard --since date parsing with try/except ValueError and
  route through parser.error(); guard json.dump with try/except OSError

Comment accuracy:
- INFRA_PATTERNS block: "step names only" → "job names and step names"
- classify_failure docstring: same correction
- Module docstring output section: show <prefix> and <output> as
  configurable rather than fixed filenames
- to_minutes sanity cap: mention negative values (clock skew) too
- WORKFLOW_IDS comment: explain why omission == exclusion

Simplification:
- Extract _pct(), _current_iso_week(), _group_records_by_week()
  helpers; print_report() and plot_weekly_trends() now share the same
  week-grouping and current-week-exclusion logic (~40 lines removed)
- print_report(): use module-level defaultdict (drop redundant
  `from collections import defaultdict as _dd`); use PLOT_JOBS.items()
  instead of a separate KEY_JOBS list

Signed-off-by: Clemens Volk <cvolk@nvidia.com>

- week_failure_counts initializer was missing "unknown" key, causing a
  KeyError whenever classify_failure() returned "unknown"
- Per-job fetch: add GitHub error-envelope check ({"message": ...})
  so API errors are counted as fetch failures rather than silently
  becoming zero-job runs
- Per-job fetch: print a note when exactly 100 jobs are returned,
  signalling possible truncation (pagination not yet implemented)
- fetch_runs_since: add explicit isinstance check so a non-dict
  response raises RuntimeError with context rather than bare
  AttributeError
- analyze meta: include job_fetch_failures and
  job_fetch_completeness_pct so downstream consumers can detect
  partial data
- main: write JSON before generating plots so data is never lost
  if plot generation fails
- get_token: always warn when gh auth token exits non-zero, even
  when stderr is empty

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Replace the packed "med/p90/σ n=X" single-row format with separate
sub-tables: Median, p90, and n (samples), each with one value per cell.
Week labels shortened to "W07" style. Trend arrows (↑↓→) compare the
last two completed weeks per metric. Unicode horizontal rule separator
for visual clarity.
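The arrow logic boils down to comparing the last two completed weeks of each metric, with some tolerance so noise does not register as a trend. A sketch; the 10% relative threshold and the function name are assumptions, not necessarily what the script uses:

```python
def trend_arrow(prev: float, last: float, tolerance: float = 0.10) -> str:
    """Compare the last two completed weeks of a metric; relative
    changes within `tolerance` of the earlier week count as flat."""
    if prev == 0:
        return "↑" if last > 0 else "→"
    change = (last - prev) / prev
    if change > tolerance:
        return "↑"
    if change < -tolerance:
        return "↓"
    return "→"
```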

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
@cvolk changed the base branch from release/0.1.1 to main on March 6, 2026 at 15:45