diff --git a/.claude/agents/openshift-ci-analysis.md b/.claude/agents/openshift-ci-analysis.md index bc69dd9965..fe4a28aff1 100644 --- a/.claude/agents/openshift-ci-analysis.md +++ b/.claude/agents/openshift-ci-analysis.md @@ -1,8 +1,7 @@ --- name: openshift-ci-analysis -description: use the @openshift-ci-analysis when the user's prompt is a URL with this domain: https://prow.ci.openshift.org/** -model: sonnet -color: blue +description: Use the @openshift-ci-analysis when the user's prompt is a URL with this domain: https://prow.ci.openshift.org/** +allowed-tools: Bash, Read, Write, Glob, Grep --- # Goal: Reduce noise for developers by processing large logs from a CI test pipeline and correctly classifying fatal errors with a false-positive rate of 0.01% and false-negative rate of 0.5%. @@ -27,7 +26,7 @@ https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-ope ``` # Important Files -> IMPORTANT! All files in this list will be downloaded after running the +> IMPORTANT! All files in this list will be downloaded after running the `gcloud storage cp -r` command in step 1 of the Workflow. - `${TMP}/build-log.txt`: Log containing prow job output and most likely place to identify AWS infra related or hypervisor related errors. - `${STEP}/build-log.txt`: Each step in the CI job is individually logged in a build-log.txt file. - `./artifacts/${JOB_NAME}/openshift-microshift-infra-sos-aws/artifacts/sosreport-i-"${UNIQUE_ID}"-YYYY-MM-DD-"${UNIQUE_ID_2}".tar.xz`: Compressed archive containing select portions of the test host's filesystem, relevant logs, and system configurations. @@ -51,7 +50,8 @@ This link provides a diagram of the steps that make up the test. 
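The `view/gs/` segment of a prow URL maps directly onto the GCS path used by the `gcloud storage cp -r` command. A minimal sketch of that derivation — the URL below is illustrative, not a real job:

```shell
# Hypothetical sketch: derive the gcloud storage source path from a prow job URL.
# The job name and ID in this example URL are placeholders.
url="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-example-job/1234567890"

# Everything after "/view/gs/" is the object path inside the GCS bucket.
gs_path="gs://${url#*/view/gs/}"

# The last two path components are the job name and the job (build) ID.
job_id="${url##*/}"
without_id="${url%/*}"
job_name="${without_id##*/}"

echo "$gs_path"
echo "$job_name" "$job_id"
```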
Think about rea Create a temporary working directory to store artifacts for the current job: ```bash -mktemp -d /tmp/openshift-ci-analysis-XXXX +mkdir -p /tmp/analyze-ci-claude-workdir +mktemp -d /tmp/analyze-ci-claude-workdir/openshift-ci-analysis-XXXX ``` Fetch the high level summary of the failed prow job: @@ -73,16 +73,22 @@ gcloud storage cp -r gs://test-platform-results/logs/periodic-ci-openshift-micro 0. Create and use a temporary working directory. Use the mktemp -d command to create this directory, then add the directory to the claude context by executing @add-dir /tmp/NEW_TEMP_DIR. -1. **Scan for errors**: Start by scanning the top level `build-log.txt` file for errors and determine the step where the error occurred. Record each error with the filepath and line number for later reference. +1. **Download all artifacts**: Download all prow job artifacts using `gcloud storage cp -r` into the temporary working directory: + ```bash + gcloud storage cp -r gs://test-platform-results/logs/${JOB_NAME}/${JOB_ID}/ ${TMP}/ + ``` + This makes all build logs, step logs, and SOS reports available locally for analysis. -2. **Read context**: Iterate over each recorded error, locate the log file and line number, then read 50 lines before and 50 lines after the error. Use this information to characterize the error. Think about whether this error is transient and think about where in the stack the error occurs. Does it occur in the cloud infra, the openshift or prow ci-config, the hypvervisor, or is it a legitimate test failure? If it is a legitimate test failure, determine what stage of the test failed: setup, testing, teardown. +2. **Scan for errors**: Start by scanning the top level `build-log.txt` file for errors and determine the step where the error occurred. Record each error with the filepath and line number for later reference. -3. 
**Analyze the error**: Based on the context of the error, think hard about whether this error caused the test to fail, is a transient error, or is a red herring.
+3. **Read context**: Iterate over each recorded error, locate the log file and line number, then read 50 lines before and 50 lines after the error. Use this information to characterize the error. Think about whether this error is transient and think about where in the stack the error occurs. Does it occur in the cloud infra, the openshift or prow ci-config, the hypervisor, or is it a legitimate test failure? If it is a legitimate test failure, determine what stage of the test failed: setup, testing, teardown.
-   3.1 If it is a legitimate test error, analyze the test logs to determine the source of the error.
-   3.2 If the source of the error appears to be due to microshift or a workload running on microshift, analyze the sos report's microshift journal and pod logs.
+4. **Analyze the error**: Based on the context of the error, think hard about whether this error caused the test to fail, is a transient error, or is a red herring.
-4. **Produce a report**: Create a concise report of the error. The report MUST specify:
+   4.1 If it is a legitimate test error, analyze the test logs to determine the source of the error.
+   4.2 If the source of the error appears to be due to microshift or a workload running on microshift, analyze the sos report's microshift journal and pod logs.
+
+5. **Produce a report**: Create a concise report of the error. The report MUST specify:
   - Where in the pipeline the error occurred
   - The specific step the error occurred in
   - Whether the test failure was legitimate (i.e., a test failed) or due to an infrastructure failure (i.e., build image was not found, AWS infra failed due to quota, hypervisor failed to create test host VM, etc.)
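The scan-for-errors and read-context steps can be sketched in bash. The log contents, `TMP` location, and grep pattern below are illustrative, not the agent's exact commands:

```shell
# Hypothetical sketch: find error lines in build-log.txt, then print up to
# 50 lines of context on either side of a recorded match.
TMP="$(mktemp -d)"
printf '%s\n' "step ok" "fatal: connection refused" "step done" > "${TMP}/build-log.txt"

# Record each error with file path and line number (grep -n).
grep -nEi 'error|fatal|fail' "${TMP}/build-log.txt" > "${TMP}/errors.txt" || true

# For one recorded error, show 50 lines before and after for characterization.
line="$(head -n 1 "${TMP}/errors.txt" | cut -d: -f1)"
start=$(( line > 50 ? line - 50 : 1 ))
sed -n "${start},$(( line + 50 ))p" "${TMP}/build-log.txt"
```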
@@ -103,3 +109,25 @@ Step Name: {The specific step where the error occurred} Error: {The exact error, including additional log context if it relates to the failure} Suggested Remediation: {Based on where the error occurs, think hard about how to correct the error ONLY if it requires fixing. Infrastructure failures may not require code changes.} ``` + +After the human-readable report above, append a machine-readable block for downstream automation. This block MUST appear at the very end of the report, after all prose and analysis: + +```text +--- STRUCTURED SUMMARY --- +SEVERITY: {1-5, same as Error Severity above} +STACK_LAYER: {AWS Infra, build phase, deploy phase, test, teardown - same as Stack Layer above} +STEP_NAME: {same as Step Name above} +ERROR_SIGNATURE: {a concise, unique one-line description of the root cause - not the full error, just enough to identify and deduplicate this failure} +INFRASTRUCTURE_FAILURE: {true if Stack Layer is AWS Infra or the failure is due to CI infrastructure rather than product code, false otherwise} +JOB_URL: {the full prow job URL that was analyzed} +JOB_NAME: {the full job name extracted from the URL} +RELEASE: {the MicroShift release version extracted from the URL, e.g. 4.22} +FINISHED: {the job finish date in YYYY-MM-DD format, extracted from finished.json or build log timestamps} +--- END STRUCTURED SUMMARY --- +``` + +The ERROR_SIGNATURE field is critical for deduplication. It should capture the essence of the failure in a way that two jobs failing for the same reason produce identical or near-identical signatures. 
Examples: +- `greenboot timeout waiting for MicroShift to start after bootc upgrade` +- `OCP conformance test NetworkPolicy timeout` +- `AWS EC2 quota exceeded in us-east-1` +- `rpm-ostree upgrade failed: package conflict microshift-selinux` diff --git a/.claude/commands/analyze-ci-create-bugs-for-release.md b/.claude/commands/analyze-ci-create-bugs-for-release.md new file mode 100644 index 0000000000..04b6fc5353 --- /dev/null +++ b/.claude/commands/analyze-ci-create-bugs-for-release.md @@ -0,0 +1,397 @@ +--- +name: Create JIRA Bugs for Analyze CI +argument-hint: [--create] +description: Create JIRA bugs from analyze-ci failure reports (dry-run by default) +allowed-tools: Bash, Read, Write, Glob, Grep, Agent, mcp__jira__jira_search, mcp__jira__jira_create_issue, mcp__jira__jira_get_issue +--- + +# analyze-ci-create-bugs-for-release + +## Synopsis +```bash +/analyze-ci-create-bugs-for-release [--create] +``` + +## Description +Reads individual job analysis reports produced by `analyze-ci-for-release` and creates JIRA bugs in USHIFT for legitimate test failures. Operates in **dry-run mode by default** — it shows what bugs would be created without actually creating them. Use `--create` to perform actual issue creation. + +This command does NOT re-analyze CI jobs. It consumes the existing `/tmp/analyze-ci-claude-workdir/analyze-ci-release--job-*.txt` files that were previously generated by `analyze-ci-for-release`. + +## Arguments +- `$ARGUMENTS` (required): Release version, optionally followed by `--create` + - `` (required): Release version (e.g., `4.22`) + - `--create` (optional): Actually create JIRA issues. Without this flag, only a dry-run report is produced. 
+ +## Prerequisites + +- Job analysis files must already exist at `/tmp/analyze-ci-claude-workdir/analyze-ci-release--job-*.txt` + - These are produced by running `/analyze-ci-for-release ` beforehand +- Each job file must contain a `--- STRUCTURED SUMMARY ---` block (see below) +- MCP Jira server must be configured and accessible +- User must have permissions to create issues in USHIFT + +### STRUCTURED SUMMARY Block + +Each job analysis file produced by `openshift-ci-analysis` must end with a machine-readable block: + +```text +--- STRUCTURED SUMMARY --- +SEVERITY: <1-5> +STACK_LAYER: +STEP_NAME: +ERROR_SIGNATURE: +INFRASTRUCTURE_FAILURE: +JOB_URL: +JOB_NAME: +RELEASE: +--- END STRUCTURED SUMMARY --- +``` + +If a job file lacks this block, it is skipped with a warning. + +## Implementation Steps + +### Step 1: Parse Arguments and Locate Job Files + +**Actions**: +1. Parse `$ARGUMENTS` to extract `` and detect `--create` flag +2. Determine mode: if `--create` is present, set `MODE=create`; otherwise `MODE=dry-run` +3. Glob for job files: `/tmp/analyze-ci-claude-workdir/analyze-ci-release--job-*.txt` +4. If no files found, report error and stop + +**Error Handling**: +- No arguments: show usage and stop +- No job files found: suggest running `/analyze-ci-for-release ` first + +### Step 2: Parse STRUCTURED SUMMARY from Each Job File + +**Actions**: +1. For each job file, extract the `--- STRUCTURED SUMMARY ---` block +2. Parse key-value pairs: SEVERITY, STACK_LAYER, STEP_NAME, ERROR_SIGNATURE, INFRASTRUCTURE_FAILURE, JOB_URL, JOB_NAME, RELEASE +3. Also capture the full file content for use in the bug description (the error context and analysis above the structured block) +4. If a file lacks the structured block, log a warning and skip it + +**Parsing approach**: Use grep/sed in Bash to extract the block between `--- STRUCTURED SUMMARY ---` and `--- END STRUCTURED SUMMARY ---`, then parse each `KEY: value` line. 
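A minimal sketch of that extraction, assuming an illustrative file and a subset of the keys (a real block carries the full key set):

```shell
# Hypothetical sketch: extract the STRUCTURED SUMMARY block from a job
# analysis file and turn each "KEY: value" line into a shell variable.
file="$(mktemp)"
cat > "$file" <<'EOF'
...analysis prose...
--- STRUCTURED SUMMARY ---
SEVERITY: 4
ERROR_SIGNATURE: AWS EC2 quota exceeded in us-east-1
INFRASTRUCTURE_FAILURE: true
--- END STRUCTURED SUMMARY ---
EOF

# Keep only the lines between the markers (the markers themselves are dropped).
block="$(sed -n '/^--- STRUCTURED SUMMARY ---$/,/^--- END STRUCTURED SUMMARY ---$/p' "$file" | sed '1d;$d')"

if [ -z "$block" ]; then
  echo "WARNING: no STRUCTURED SUMMARY block in $file, skipping" >&2
else
  # KEY is everything before the first colon; the value is the rest.
  while IFS= read -r kv; do
    key="${kv%%:*}"
    value="${kv#*: }"
    printf -v "$key" '%s' "$value"   # e.g. sets $SEVERITY, $ERROR_SIGNATURE
  done <<< "$block"
fi
echo "$SEVERITY / $ERROR_SIGNATURE"
```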
+ +**Data structure per job**: +```text +{ + severity: number, + stack_layer: string, + step_name: string, + error_signature: string, + infrastructure_failure: boolean, + job_url: string, + job_name: string, + release: string, + analysis_text: string, # full file content for bug description + source_file: string # path to the job file +} +``` + +### Step 3: Filter Out Non-Bug-Worthy Failures + +**Actions**: +1. Remove entries where `SEVERITY <= 2` (minor/flaky issues) +2. Remove entries where `INFRASTRUCTURE_FAILURE=true` **AND** the failure is transient/external (not a CI configuration bug). Use `STACK_LAYER` to distinguish: + - **Filter out** (transient infrastructure — out of our control): + - `STACK_LAYER` contains `AWS Infra` (AWS quota, VM creation, networking) + - `STACK_LAYER` contains `External Infrastructure` (container registry outages, third-party services) + - `STACK_LAYER` is `build phase` (release image import timeouts, registry 404s) + - **Keep as bug-worthy** (CI configuration issues — need code fixes): + - `STACK_LAYER` is `test setup phase`, `Test Configuration`, or similar (missing test files, wrong directory mappings, broken scenario selection logic) + - Any `INFRASTRUCTURE_FAILURE=true` entry whose analysis text indicates the fix requires a code change in the repository (e.g., adding missing files, updating directory mappings, fixing CI scripts) +3. Log each filtered-out entry with reason. For kept CI configuration failures, log them as `"CI CONFIG (kept)"` to distinguish from product test failures. + +**Rationale**: Not all "infrastructure failures" are transient. A missing test scenario directory or broken CI script mapping is a configuration bug that requires a code fix — these should result in JIRA bugs, not be silently filtered out. Only truly external/transient failures (AWS outages, registry issues, image import timeouts) should be excluded. 
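The filter above can be sketched as a small bash decision per parsed entry. The variable values here are illustrative:

```shell
# Hypothetical sketch: decide whether one parsed entry is bug-worthy,
# following the severity and STACK_LAYER rules described above.
SEVERITY=3
INFRASTRUCTURE_FAILURE=true
STACK_LAYER="AWS Infra"

verdict="keep"
if [ "$SEVERITY" -le 2 ]; then
  verdict="filtered: low severity"
elif [ "$INFRASTRUCTURE_FAILURE" = "true" ]; then
  case "$STACK_LAYER" in
    *"AWS Infra"*|*"External Infrastructure"*|"build phase")
      verdict="filtered: transient infrastructure" ;;
    *)
      verdict="CI CONFIG (kept)" ;;   # configuration bug: still bug-worthy
  esac
fi
echo "$verdict"
```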
+ +**Output**: Filtered list of bug-worthy failures with count of filtered entries reported to user. + +### Step 4: Deduplicate by ERROR_SIGNATURE + +**Actions**: +1. Group remaining entries by `ERROR_SIGNATURE` similarity + - Exact matches are grouped together + - Near-matches (same error but slightly different wording) should also be grouped — use your judgment to identify when two signatures describe the same root cause +2. For each group, create a "bug candidate": + - **Representative signature**: the ERROR_SIGNATURE that best describes the group + - **Affected jobs**: list of all JOB_NAME + JOB_URL in the group + - **Max severity**: highest SEVERITY in the group + - **Step names**: unique STEP_NAME values in the group + - **Analysis text**: from the highest-severity job in the group + +**Output**: List of deduplicated bug candidates. + +### Step 5: Search Jira for Existing Bugs + +For each bug candidate, run ALL of the following searches. Each search is MANDATORY — do not skip any. + +**Search A — Keyword search**: +1. Extract 2-4 distinctive keywords from the error signature (avoid generic words like "error", "failed", "test") +2. Run: + ```python + mcp__jira__jira_search( + jql='((project = OCPBUGS AND component = MicroShift) OR project = USHIFT) AND text ~ "" AND status not in (Closed, Verified) ORDER BY created DESC', + limit=5 + ) + ``` + +**Search B — Test case ID search (MANDATORY when IDs are present)**: +Extract ALL numeric IDs from the error signature that could be test case references (typically 4-6 digit numbers like `68256`). For EACH numeric ID found, run TWO separate searches: +```text +# Search B1: bare number +jql: ... AND text ~ "68256" AND status not in (Closed, Verified) ... + +# Search B2: OCP-prefixed form (OpenShift Polarion convention) +jql: ... AND text ~ "OCP-68256" AND status not in (Closed, Verified) ... 
+``` +**Why both forms are required**: Jira's text indexer treats `OCP-68256` as a single token, so `text ~ "68256"` will NOT match issues containing `OCP-68256`, and vice versa. Skipping either form WILL cause missed duplicates. + +**After all searches**: +1. Merge and deduplicate results from all search queries (A, B1, B2) +2. If potential duplicates are found, fetch their details with `mcp__jira__jira_get_issue` to show summary and status + +**Note**: Run searches in parallel where possible. + +### Step 6: Present Bug Candidates to User + +**Actions**: +1. Display a numbered list of all bug candidates with: + - Summary (derived from error signature) + - Severity and affected job count + - Step name(s) where failure occurred + - List of affected job URLs + - Potential duplicate JIRAs found (if any), with key, summary, and status + - Mode indicator: `[DRY-RUN]` or `[WILL CREATE]` + +2. **In dry-run mode** (`--create` NOT specified): + - Display all candidates with `[DRY-RUN]` prefix + - After listing all candidates, show a summary: + ```text + DRY-RUN SUMMARY + Total job files parsed: N + Filtered out (infra/low severity): N + Unique bug candidates: N + Candidates with potential duplicates: N + Candidates ready to file: N + + To create these bugs, run: + /analyze-ci-create-bugs-for-release --create + ``` + - Do NOT prompt for any actions. Do NOT create any issues. Stop here. + +3. **In create mode** (`--create` specified): + - For each candidate, prompt the user: + ```text + Bug Candidate N/M: + Summary: "" + Severity: X (affects Y jobs) + Step: + Jobs: + - + - + Potential Duplicates: + - USHIFT-XXXXX: "" [Status] (or OCPBUGS-YYYYY) + (or "None found") + + Action? [c]reate / [s]kip / [l]ink-to-existing : + ``` + - **create**: Proceed to Step 7 + - **skip**: Skip this candidate, move to next + - **link-to-existing**: Validate the key by calling `mcp__jira__jira_get_issue(issue_key=)`. If the issue exists, record the key and move to next. 
If the call fails or returns not-found, show an error (e.g., `"JIRA key not found — check for typos"`) and re-prompt with the same `Action?` choices. + +### Step 7: Create Bug via MCP (create mode only) + +**Actions**: +For each candidate where user chose "create": + +1. **Construct the bug summary**: + - Format: `"MicroShift CI: "` (truncate to 100 chars if needed) + +2. **Construct the bug description** using **Markdown** format (the MCP Jira tool accepts Markdown and automatically converts it to Jira wiki markup — do NOT write Jira wiki markup directly): + ```text + ## Description of problem + + Periodic CI job failures detected for MicroShift . + + + + ## Version-Release number of selected component (if applicable) + + + + ## How reproducible + + Always (fails consistently in CI) + + ## Steps to Reproduce + + 1. Run the periodic CI job(s) listed below + 2. Observe failure in step: + + ## Actual results + + ```` + + ```` + + ## Expected results + + CI job should pass successfully. + + ## Additional info + + **Stack Layer:** + **CI Step:** + **Error Severity:** /5 + **Number of affected jobs:** + + **Affected Jobs:** + + - []() + + + **Source:** Auto-generated by /analyze-ci-create-bugs-for-release from analyze-ci-for-release output. + ``` + +3. **Create the issue**: + ```python + mcp__jira__jira_create_issue( + project_key="USHIFT", + summary="MicroShift CI: ", + issue_type="Bug", + description="", + components="MicroShift", + additional_fields={ + "versions": [{"name": ""}], + "labels": ["microshift-ci-ai-generated"], + "security": {"name": "Red Hat Employee"} + } + ) + ``` + +4. **Record the result**: Store the created issue key for the final report. + +**Error Handling**: +- If MCP call fails, report error, ask user if they want to retry or skip +- Do NOT retry automatically + +### Step 8: Generate Results Report + +**Actions**: +1. Run `mkdir -p /tmp/analyze-ci-claude-workdir` using the `Bash` tool +2. 
Save report to `/tmp/analyze-ci-claude-workdir/analyze-ci-create-bugs--.txt` +3. Display summary to user: + +**Dry-run report format**: +```text +=============================================================== +ANALYZE-CI CREATE BUGS - DRY-RUN REPORT +Release: +Date: YYYY-MM-DD +=============================================================== + +PARSING + Job files found: N + Successfully parsed: N + Skipped (no structured summary): N + +FILTERING + Infrastructure failures removed: N + Low severity (<=2) removed: N + Bug-worthy failures: N + +DEDUPLICATION + Unique bug candidates: N + +CANDIDATES + + 1. MicroShift CI: + Severity: X | Jobs: Y | Step: + Potential Duplicates: USHIFT-XXXXX, OCPBUGS-YYYYY (or "None") + + 2. MicroShift CI: + ... + +To create these bugs, run: + /analyze-ci-create-bugs-for-release --create + +Report saved: /tmp/analyze-ci-claude-workdir/analyze-ci-create-bugs--.txt +=============================================================== +``` + +**Create mode report format**: +```text +=============================================================== +ANALYZE-CI CREATE BUGS - CREATION REPORT +Release: +Date: YYYY-MM-DD +=============================================================== + +RESULTS + + 1. USHIFT-12345 (CREATED) + MicroShift CI: + URL: https://redhat.atlassian.net/browse/USHIFT-12345 + + 2. SKIPPED + MicroShift CI: + Reason: User skipped + + 3. USHIFT-99999 (LINKED TO EXISTING) + MicroShift CI: + Reason: Duplicate of existing issue + +SUMMARY + Created: N + Skipped: N + Linked to existing: N + Failed: N + +Report saved: /tmp/analyze-ci-claude-workdir/analyze-ci-create-bugs--.txt +=============================================================== +``` + +## Examples + +### Example 1: Dry-Run (Default) +```bash +/analyze-ci-create-bugs-for-release 4.22 +``` +Shows what bugs would be created without creating anything. 
+ +### Example 2: Create Bugs +```bash +/analyze-ci-create-bugs-for-release 4.22 --create +``` +Interactively creates bugs, prompting for each candidate. + +### Example 3: No Job Files Found +```bash +/analyze-ci-create-bugs-for-release 4.19 +``` +```text +Error: No job analysis files found at /tmp/analyze-ci-claude-workdir/analyze-ci-release-4.19-job-*.txt + +Run the analysis first: + /analyze-ci-for-release 4.19 +``` + +## Notes + +- This command does NOT run CI analysis — it only consumes existing analysis files from `/tmp/analyze-ci-claude-workdir` +- Dry-run is the default to prevent accidental bug creation +- The `--create` flag triggers interactive mode where each candidate requires user confirmation +- Transient infrastructure failures (AWS, VM creation, quota, registry outages) are automatically filtered out +- CI configuration failures (missing test files, broken directory mappings) are kept as bug-worthy even if marked as INFRASTRUCTURE_FAILURE=true +- Bugs are created in USHIFT with component "MicroShift"; duplicate search covers both USHIFT and OCPBUGS +- All created bugs are labeled with `microshift-ci-ai-generated` for tracking +- Security level is set to "Red Hat Employee" on all created issues +- The STRUCTURED SUMMARY block in job files is required — this is a contract with `openshift-ci-analysis` + +## Related Skills + +- **analyze-ci-for-release**: Produces the job analysis files consumed by this command +- **analyze-ci-for-release-manager**: Orchestrator that will integrate this command +- **openshift-ci-analysis**: Agent that produces individual job reports with STRUCTURED SUMMARY +- **jira:create-bug**: Interactive bug creation (not used here — we call MCP directly) diff --git a/.claude/commands/analyze-ci-for-pull-requests.md b/.claude/commands/analyze-ci-for-pull-requests.md index ec0f2b6c29..d514c980f2 100644 --- a/.claude/commands/analyze-ci-for-pull-requests.md +++ b/.claude/commands/analyze-ci-for-pull-requests.md @@ -2,7 +2,7 @@ name: 
Analyze CI for Pull Requests argument-hint: [--rebase] [--limit N] description: Analyze CI for open MicroShift pull requests and produce a summary of failures -allowed-tools: Skill, Bash, Read, Write, Glob, Grep, Agent +allowed-tools: Bash, Read, Write, Glob, Grep, Agent --- # analyze-ci-for-pull-requests @@ -20,7 +20,7 @@ This command orchestrates the analysis workflow by: 1. Fetching the list of open PRs and their failed jobs using `.claude/scripts/microshift-prow-jobs-for-pull-requests.sh --mode detail` 2. Filtering to only PRs that have at least one failed job 3. Analyzing each failed job individually using the `openshift-ci-analysis` agent -4. Aggregating results into a summary report saved to `/tmp` +4. Aggregating results into a summary report saved to `/tmp/analyze-ci-claude-workdir` ## Arguments - `--rebase` (optional): Only analyze rebase PRs (titles containing `NO-ISSUE: rebase-release-`) @@ -75,7 +75,7 @@ bash .claude/scripts/microshift-prow-jobs-for-pull-requests.sh --mode detail --f 1. For each failed job URL from Step 1: - Call the `openshift-ci-analysis` agent with the job URL **in parallel** - Capture the analysis result (failure reason, error summary) - - Store all intermediate analysis files in `/tmp` + - Store all intermediate analysis files in `/tmp/analyze-ci-claude-workdir` 2. Progress reporting: - Show "Analyzing job X/Y: (PR #NNN)" for each job @@ -92,8 +92,9 @@ For each job analysis, extract: - Affected test scenarios (if applicable) **File Storage**: -All intermediate analysis files are stored in `/tmp` with naming pattern: -- `/tmp/analyze-ci-prs-job--pr-.txt` +Before writing any files, run `mkdir -p /tmp/analyze-ci-claude-workdir` using the `Bash` tool. 
+All intermediate analysis files are stored in `/tmp/analyze-ci-claude-workdir` with naming pattern: +- `/tmp/analyze-ci-claude-workdir/analyze-ci-prs-job--pr-.txt` ### Step 3: Aggregate Results and Identify Patterns @@ -101,7 +102,7 @@ All intermediate analysis files are stored in `/tmp` with naming pattern: **Actions**: 1. Collect results from all parallel job analyses - - Read individual job analysis files from `/tmp` + - Read individual job analysis files from `/tmp/analyze-ci-claude-workdir` - Extract key findings from each analysis 2. Group failures by PR: @@ -118,7 +119,7 @@ All intermediate analysis files are stored in `/tmp` with naming pattern: **Actions**: 1. Aggregate all job analysis results from parallel execution 2. Identify common patterns and group by PR and failure type -3. Generate summary report and save to `/tmp/analyze-ci-prs-summary..txt` +3. Generate summary report and save to `/tmp/analyze-ci-claude-workdir/analyze-ci-prs-summary..txt` 4. Display the summary to the user **Report Structure**: @@ -133,7 +134,7 @@ OVERVIEW PRs with Failures: 2 Total Failed Jobs: 9 Analysis Date: 2026-03-15 - Report: /tmp/analyze-ci-prs-summary.20260315-143022.txt + Report: /tmp/analyze-ci-claude-workdir/analyze-ci-prs-summary.20260315-143022.txt PER-PR BREAKDOWN @@ -170,7 +171,7 @@ COMMON PATTERNS (across PRs) ═══════════════════════════════════════════════════════════════ -Individual job reports: /tmp/analyze-ci-prs-job-*.txt +Individual job reports: /tmp/analyze-ci-claude-workdir/analyze-ci-prs-job-*.txt ``` ## Examples @@ -213,7 +214,7 @@ Individual job reports: /tmp/analyze-ci-prs-job-*.txt - **Network Usage**: Each job analysis fetches logs from GCS - **Parallelization**: All job analyses run in parallel for maximum efficiency - **Use --limit**: For quick checks, use --limit flag to analyze a subset -- **File Storage**: All intermediate and report files are stored in `/tmp` directory +- **File Storage**: All intermediate and report files are stored in 
`/tmp/analyze-ci-claude-workdir` directory ## Prerequisites @@ -253,7 +254,7 @@ Please ensure you're in the microshift project directory. - This skill focuses on **presubmit** PR jobs (not periodic/postsubmit) - Analysis is read-only - no modifications to CI data or PRs -- Results are saved in files in /tmp directory with a timestamp +- Results are saved in files in /tmp/analyze-ci-claude-workdir directory with a timestamp - Provide links to the jobs in the summary - Only present a concise analysis summary for each job - PRs with no Prow jobs (e.g., drafts without triggered tests) are skipped diff --git a/.claude/commands/analyze-ci-for-release-manager.md b/.claude/commands/analyze-ci-for-release-manager.md index e3c39c1f96..7dfadea206 100644 --- a/.claude/commands/analyze-ci-for-release-manager.md +++ b/.claude/commands/analyze-ci-for-release-manager.md @@ -2,7 +2,7 @@ name: Analyze CI for Release Manager argument-hint: description: Analyze CI for multiple MicroShift releases and produce an HTML summary -allowed-tools: Skill, Bash, Read, Write, Glob, Grep, Agent +allowed-tools: Bash, Read, Write, Glob, Grep, Agent --- # analyze-ci-for-release-manager @@ -13,7 +13,7 @@ allowed-tools: Skill, Bash, Read, Write, Glob, Grep, Agent ``` ## Description -Accepts a comma-separated list of MicroShift release versions, runs the `analyze-ci-for-release` skill for each release and the `analyze-ci-for-pull-requests --rebase` skill for open rebase PRs, and produces a single HTML summary file consolidating all results. The HTML report uses tabs to separate Periodics (per-release) and Pull Requests sections. +Accepts a comma-separated list of MicroShift release versions, runs the `analyze-ci-for-release` command for each release and the `analyze-ci-for-pull-requests --rebase` command for open rebase PRs, and produces a single HTML summary file consolidating all results. The HTML report uses tabs to separate Periodics (per-release) and Pull Requests sections. 
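Splitting the comma-separated release list can be sketched as follows — the literal list stands in for `$ARGUMENTS`:

```shell
# Hypothetical sketch: split a comma-separated release list and trim spaces.
releases_arg="4.19, 4.20,4.21,4.22"   # stands in for $ARGUMENTS
IFS=',' read -ra releases <<< "$releases_arg"
for r in "${releases[@]}"; do
  echo "${r// /}"
done
```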
## Arguments - `$ARGUMENTS` (required): Comma-separated list of release versions (e.g., `4.19,4.20,4.21,4.22`) @@ -34,13 +34,11 @@ Accepts a comma-separated list of MicroShift release versions, runs the `analyze ### Step 2: Analyze Each Release (Periodics) **Actions**: -1. For each release version from the parsed list, invoke the `analyze-ci-for-release` skill: +1. For each release version from the parsed list, launch the `analyze-ci-for-release` command as an **Agent** (using the `Agent` tool, NOT the `Skill` tool): ```text - Skill: analyze-ci-for-release, args: "" + Agent: subagent_type=general-purpose, prompt="Run /analyze-ci-for-release " ``` -2. Run releases **sequentially** (each skill invocation is a full analysis) -3. After each skill completes, note the summary report file path it produced (typically `/tmp/analyze-ci-release--summary.*.txt`) -4. Track which releases succeeded and which failed +2. Launch all releases **in parallel** as separate agents — do NOT wait for one to finish before starting the next **Progress Reporting**: ```text @@ -50,222 +48,156 @@ Analyzing release X/Y: ### Step 3: Analyze Rebase Pull Requests **Actions**: -1. Invoke the `analyze-ci-for-pull-requests` skill with `--rebase` argument: +1. Launch the `analyze-ci-for-pull-requests` command as an **Agent** (using the `Agent` tool, NOT the `Skill` tool) with `--rebase` argument: ```text - Skill: analyze-ci-for-pull-requests, args: "--rebase" + Agent: subagent_type=general-purpose, prompt="Run /analyze-ci-for-pull-requests --rebase" ``` -2. After the skill completes, note the summary report file path (typically `/tmp/analyze-ci-prs-summary.*.txt`) -3. If no rebase PRs are found, note "No open rebase PRs" for the report - -**Progress Reporting**: -```text -Analyzing rebase pull requests... -``` +2. Launch this agent **in parallel** with the release agents in Step 2 — do NOT wait for Step 2 agents to finish first ### Step 4: Collect All Results **Actions**: -1. 
After all analyses complete, gather all summary files: - - Periodics: `/tmp/analyze-ci-release--summary.*.txt` for each version - - Pull Requests: `/tmp/analyze-ci-prs-summary.*.txt` - - Per-job files: `/tmp/analyze-ci-release--job-*.txt` and `/tmp/analyze-ci-prs-job-*.txt` -2. Read each summary file to extract the analysis content -3. If a summary file is missing for a release, note it as "Analysis failed or produced no output" -4. If no PR summary file exists, note "No open rebase PRs or no failures found" +1. **IMPORTANT**: Wait until ALL agents from Steps 2 and 3 are confirmed complete +2. Track which releases succeeded and which failed +3. If no rebase PRs were found, note "No open rebase PRs" for the report +4. Gather per-job files: + - Per-job files: `/tmp/analyze-ci-claude-workdir/analyze-ci-release--job-*.txt` for each version + - PR per-job files: `/tmp/analyze-ci-claude-workdir/analyze-ci-prs-job-*.txt` +5. If no per-job files exist for a release, note it as "Analysis failed or produced no output" +6. If no PR job files exist, note "No open rebase PRs or no failures found" +7. Do NOT read the files manually — the Python script in Step 5 will read and parse them directly ### Step 5: Generate HTML Summary Report -**Goal**: Create a single HTML file at `/tmp/microshift-ci-release-manager-.html` that consolidates all analyses with tabbed navigation. +**Goal**: Create a single HTML file at `/tmp/analyze-ci-claude-workdir/microshift-ci-release-manager-.html` that consolidates all analyses with tabbed navigation. The report must be **concise at the top level** with expandable details per job. **Actions**: -1. Generate the HTML report with the structure described below -2. Save to `/tmp/microshift-ci-release-manager-.html` where `` is `YYYYMMDD-HHMMSS` -3. **IMPORTANT**: Use the `Bash` tool with `cat <<'HTMLEOF' > /tmp/microshift-ci-release-manager-.html` (heredoc) to write the file, NOT the `Write` tool. 
This ensures the absolute `/tmp` path is used and avoids permission prompts.
-4. Display the file path to the user in the end.
+1. Write a Python script to `/tmp/analyze-ci-claude-workdir/gen_html.py` and execute it. Using Python (not a heredoc) is required because it handles HTML escaping and URL-to-link conversion properly.
+2. Save output to `/tmp/analyze-ci-claude-workdir/microshift-ci-release-manager-.html` where `` is `YYYYMMDD-HHMMSS`
+3. Display the file path to the user at the end, AFTER the summary

**HTML Structure**:
-The HTML file must be a self-contained, single-file document with embedded CSS and JS. Use the following structure:
-
-```html
-... removed HTML skeleton: title "MicroShift CI Release Manager Report - YYYY-MM-DD"; page header with a "Generated: YYYY-MM-DD HH:MM:SS UTC" line; overview cards showing failed-job counts ("N ... Failed Jobs") for each release and for rebase PRs; "Periodics" and "Pull Requests" tab buttons; a "Releases Analyzed" section of collapsible "Release X.YY / N failed jobs" entries holding the periodics analysis content; collapsible "PR #NNN: title / N failed jobs" entries holding the PR analysis content; and a "No open rebase pull requests found." empty state ...
- - - - +function toggleDetail(id) { + var row = document.getElementById(id); + row.classList.toggle('show'); + row.previousElementSibling.classList.toggle('active'); +} ``` -**Content Guidelines**: -- Do NOT re-analyze or reinterpret the data from `analyze-ci-for-release` or `analyze-ci-for-pull-requests` - use their output as-is -- Convert the plain text analysis reports into HTML-formatted content, preserving all information -- Ensure all Prow job URLs from the original analyses remain clickable links in the HTML -- Use appropriate badge colors: - - `badge-ok`: 0 failed jobs - - `badge-issues`: 1+ failed jobs - - `badge-critical`: 5+ failed jobs or CRITICAL severity issues present - - `badge-nodata`: analysis failed or no data -- Make per-job details collapsible to keep the page manageable -- Each collapsible job header in the Periodics tab MUST include the job's finish date (from the Prow job listing) displayed on the right side using `YYYY-MM-DD`. Example: `
1. e2e-aws-tests-nightly - Root Cause Summary 2026-03-17
` -- The overview cards should show the number of failed jobs per release and for rebase PRs at a glance -- The **Periodics** tab contains the per-release periodic job analyses (same as before) -- The **Pull Requests** tab contains the rebase PR analyses grouped by PR - ### Step 6: Report Completion **Actions**: -1. Display the path to the generated HTML file -2. Provide a brief text summary listing each release and its failed job count, plus rebase PR status +1. Provide a brief text summary listing each release and its failed job count, plus rebase PR status +2. Display the path to the generated HTML file **Example Output**: ```text -HTML report generated: /tmp/microshift-ci-release-manager-20260315-143022.html - Summary: Periodics: Release 4.19: 3 failed periodic jobs @@ -274,6 +206,8 @@ Summary: Release 4.22: 12 failed periodic jobs Pull Requests: 2 rebase PRs with 5 total failed jobs + +HTML report generated: /tmp/analyze-ci-claude-workdir/microshift-ci-release-manager-20260315-143022.html ``` ## Examples @@ -294,12 +228,11 @@ Summary: ``` ## Notes -- Each release analysis uses the `analyze-ci-for-release` skill - this command does NOT duplicate that logic -- Rebase PR analysis uses the `analyze-ci-for-pull-requests --rebase` skill +- Each release analysis launches `analyze-ci-for-release` as an **Agent** (not a Skill) - this command does NOT duplicate that logic +- Rebase PR analysis launches `analyze-ci-for-pull-requests --rebase` as an **Agent** (not a Skill) +- All agents (releases + PR analysis) are launched in parallel for maximum efficiency - The HTML report is self-contained (no external CSS/JS dependencies) -- All intermediate files from `analyze-ci-for-release` and `analyze-ci-for-pull-requests` remain available in `/tmp` -- Releases are analyzed sequentially since each invocation is resource-intensive -- The rebase PR analysis runs after all releases are analyzed +- All intermediate files from `analyze-ci-for-release` and 
`analyze-ci-for-pull-requests` remain available in `/tmp/analyze-ci-claude-workdir` - The HTML file can be opened in any browser for convenient examination - If a release analysis fails, it is noted in the report but does not block other releases - If no rebase PRs are open, the Pull Requests tab shows "No open rebase pull requests found" diff --git a/.claude/commands/analyze-ci-for-release.md b/.claude/commands/analyze-ci-for-release.md index 05f82a4da2..48502550fa 100644 --- a/.claude/commands/analyze-ci-for-release.md +++ b/.claude/commands/analyze-ci-for-release.md @@ -2,7 +2,7 @@ name: Analyze CI for a Release argument-hint: description: Analyze CI for a MicroShift release using openshift-ci-analysis agent and produce a summary -allowed-tools: Skill, Bash, Read, Write, Glob, Grep, Agent +allowed-tools: Bash, Read, Write, Glob, Grep, Agent --- # analyze-ci-for-release @@ -65,7 +65,7 @@ Found 17 failed periodic jobs for release 4.22 - Call the `openshift-ci-analysis` agent with the job URL **in parallel** - Capture the analysis result (failure reason, error summary) - Track common patterns across jobs - - Store all intermediate analysis files in `/tmp` + - Store all intermediate analysis files in `/tmp/analyze-ci-claude-workdir` 2. Progress reporting: - Show "Analyzing job X/Y: " for each job @@ -89,8 +89,9 @@ For each job analysis, extract: - Affected test scenarios (if applicable) **File Storage**: -All intermediate analysis files are stored in `/tmp` with naming pattern: -- `/tmp/analyze-ci-release--job--.txt` +Before writing any files, run `mkdir -p /tmp/analyze-ci-claude-workdir` using the `Bash` tool. +All intermediate analysis files are stored in `/tmp/analyze-ci-claude-workdir` with naming pattern: +- `/tmp/analyze-ci-claude-workdir/analyze-ci-release--job--.txt` ### Step 3: Aggregate Results and Identify Patterns @@ -98,7 +99,7 @@ All intermediate analysis files are stored in `/tmp` with naming pattern: **Actions**: 1. 
Collect results from all parallel job analyses - - Read individual job analysis files from `/tmp` + - Read individual job analysis files from `/tmp/analyze-ci-claude-workdir` - Extract key findings from each analysis 2. Group jobs by failure type: @@ -123,7 +124,7 @@ All intermediate analysis files are stored in `/tmp` with naming pattern: **Actions**: 1. Aggregate all job analysis results from parallel execution 2. Identify common patterns and group by failure type -3. Generate summary report and save to `/tmp/analyze-ci-release--summary..txt` +3. Generate summary report and save to `/tmp/analyze-ci-claude-workdir/analyze-ci-release--summary..txt` 4. Display the summary to the user **Report Structure**: @@ -136,7 +137,7 @@ MICROSHIFT 4.22 RELEASE - FAILED JOBS ANALYSIS 📊 OVERVIEW Total Failed Jobs: 17 Analysis Date: 2026-03-14 - Report saved to: /tmp/analyze-ci-release-4.22-summary.txt + Report saved to: /tmp/analyze-ci-claude-workdir/analyze-ci-release-4.22-summary..txt 📋 FAILURE BREAKDOWN Build Failures: 0 jobs @@ -178,7 +179,7 @@ MICROSHIFT 4.22 RELEASE - FAILED JOBS ANALYSIS ═══════════════════════════════════════════════════════════════ Individual job reports available in: - /tmp/analyze-ci-release-4.22-job-*.txt + /tmp/analyze-ci-claude-workdir/analyze-ci-release-4.22-job-*.txt ``` ## Examples @@ -221,7 +222,7 @@ Individual job reports available in: - **Network Usage**: Moderate to high - all jobs analyzed in parallel fetch logs from GCS simultaneously - **Parallelization**: All jobs are analyzed in parallel for maximum efficiency - **Use --limit**: For quick checks, use --limit flag to analyze subset -- **File Storage**: All intermediate and report files are stored in `/tmp` directory +- **File Storage**: All intermediate and report files are stored in `/tmp/analyze-ci-claude-workdir` directory ## Prerequisites @@ -285,7 +286,7 @@ Run periodically and compare summaries over time to identify regression patterns - This skill focuses on **periodic** jobs only 
(not presubmit/postsubmit) - Analysis is read-only - no modifications to CI data -- Results are saved in files in /tmp directory with a timestamp +- Results are saved in files in /tmp/analyze-ci-claude-workdir directory with a timestamp - Provide links to the jobs in the summary - Only present a concise analysis summary for each job - Pattern detection improves with more jobs analyzed (avoid limiting unless needed) diff --git a/.claude/settings.json b/.claude/settings.json index fb1b944373..29d5b410b5 100644 --- a/.claude/settings.json +++ b/.claude/settings.json @@ -15,6 +15,7 @@ "Bash(gh pr list:*)", "Bash(gh auth status:*)", "WebFetch(domain:prow.ci.openshift.org)", + "Skill(analyze-ci-create-release-report)", "Skill(analyze-ci-for-pull-requests)", "Skill(analyze-ci-for-release)", "Skill(analyze-ci-for-release-manager)", diff --git a/.gitignore b/.gitignore index ddbac5b896..97832b0f7b 100644 --- a/.gitignore +++ b/.gitignore @@ -1,12 +1,10 @@ .idea/ .vagrant/ .vscode/ -Vagrantfile -_output/* -_output -sshfile +.work/ +_output/ +__pycache__/ ansible/*.txt test/variables.yaml test/scenario_settings.sh -__pycache__ test/dev_overrides.sh
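
---

Reviewer sketch: Step 5 of the release-manager command argues for generating the report with a Python script (rather than a heredoc) because Python can do HTML escaping and URL-to-link conversion correctly. A minimal illustration of that transformation is below; the `linkify` helper name and the Prow URL regex are assumptions for illustration, not part of this patch or of `gen_html.py`.

```python
import html
import re

# Bare Prow job URL, up to the next whitespace character (illustrative pattern).
PROW_URL_RE = re.compile(r'(https://prow\.ci\.openshift\.org/\S+)')

def linkify(text: str) -> str:
    """Escape HTML metacharacters first, then wrap Prow URLs in anchor tags.

    Escaping before linkifying is the key ordering: it neutralizes <, >, &
    and quotes in the analysis text without mangling the inserted <a> tags.
    """
    escaped = html.escape(text)
    return PROW_URL_RE.sub(r'<a href="\1">\1</a>', escaped)
```

A shell heredoc would paste the analysis text verbatim, so a stray `<` or `&` in a log excerpt could break the page; routing every line through a function like this is what makes the Python approach safer.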