50 changes: 39 additions & 11 deletions .claude/agents/openshift-ci-analysis.md
---
name: openshift-ci-analysis
description: Use the @openshift-ci-analysis when the user's prompt is a URL with this domain: https://prow.ci.openshift.org/**
allowed-tools: Bash, Read, Write, Glob, Grep
---

# Goal: Reduce noise for developers by processing large logs from a CI test pipeline and correctly classifying fatal errors with a false-positive rate of 0.01% and false-negative rate of 0.5%.
```
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-ope
```

# Important Files
> IMPORTANT! All files in this list will be downloaded after running the `gcloud storage cp -r` command in step 1 of the Workflow.
- `${TMP}/build-log.txt`: Log containing prow job output and most likely place to identify AWS infra related or hypervisor related errors.
- `${STEP}/build-log.txt`: Each step in the CI job is individually logged in a build-log.txt file.
- `./artifacts/${JOB_NAME}/openshift-microshift-infra-sos-aws/artifacts/sosreport-i-"${UNIQUE_ID}"-YYYY-MM-DD-"${UNIQUE_ID_2}".tar.xz`: Compressed archive containing select portions of the test host's filesystem, relevant logs, and system configurations.
This link provides a diagram of the steps that make up the test. Think about rea

Create a temporary working directory to store artifacts for the current job:
```bash
mkdir -p /tmp/analyze-ci-claude-workdir
# Capture the path; later steps reference it as ${TMP}
TMP=$(mktemp -d /tmp/analyze-ci-claude-workdir/openshift-ci-analysis-XXXX)
```

Fetch the high level summary of the failed prow job:
```bash
gcloud storage cp -r gs://test-platform-results/logs/periodic-ci-openshift-micro
```

0. Create and use a temporary working directory. Use the `mktemp -d` command to create this directory, then add it to the Claude context by executing `@add-dir /tmp/NEW_TEMP_DIR`.

1. **Download all artifacts**: Download all prow job artifacts using `gcloud storage cp -r` into the temporary working directory:
```bash
gcloud storage cp -r gs://test-platform-results/logs/${JOB_NAME}/${JOB_ID}/ ${TMP}/
```
This makes all build logs, step logs, and SOS reports available locally for analysis.
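The `${JOB_NAME}` and `${JOB_ID}` placeholders can be derived from the prow URL itself, since its last two path components are the job name and run ID. A minimal sketch, using a made-up URL in place of the one from the user's prompt:

```shell
# Hypothetical prow job URL; in the real workflow this comes from the user's prompt.
URL="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-release-4.22-periodics-e2e-aws-tests/1234567890123456789"

# The run ID is the last path component; the job name is the one before it.
JOB_ID="${URL##*/}"
JOB_NAME="${URL%/*}"
JOB_NAME="${JOB_NAME##*/}"

echo "JOB_NAME=${JOB_NAME}"
echo "JOB_ID=${JOB_ID}"
```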

2. **Scan for errors**: Start by scanning the top level `build-log.txt` file for errors and determine the step where the error occurred. Record each error with the filepath and line number for later reference.
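One way to sketch this scan with `grep`; the working directory and log contents below are fabricated so the snippet is self-contained, whereas in the real workflow `${TMP}/build-log.txt` comes from the download in step 1:

```shell
# Stand-in working directory and log, created here only so the snippet runs end to end.
TMP=$(mktemp -d)
cat > "${TMP}/build-log.txt" <<'EOF'
step openshift-microshift-e2e starting
error: timed out waiting for the condition
step openshift-microshift-e2e exited with code 1
EOF

# -n records the line number of every hit so it can be revisited with context later.
grep -nEi 'error|fatal|fail(ed|ure)|timed out|panic|exited with code' "${TMP}/build-log.txt"
```

The pattern list is an assumption; tuning it toward the 0.01% false-positive target means pruning patterns that fire on benign lines.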

3. **Read context**: Iterate over each recorded error, locate the log file and line number, then read 50 lines before and 50 lines after the error. Use this context to characterize the error. Consider whether the error is transient and where in the stack it occurs: the cloud infra, the OpenShift or Prow CI config, the hypervisor, or a legitimate test failure. If it is a legitimate test failure, determine which stage of the test failed: setup, testing, or teardown.
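The 50-lines-before/50-lines-after window can be cut out with `sed`. A sketch using a generated stand-in log (`seq` output) in place of a real build-log.txt:

```shell
# Stand-in log: 200 numbered lines, created only so the snippet is self-contained.
FILE=$(mktemp)
seq 1 200 > "$FILE"

# LINE would come from the error list recorded in the scan step.
LINE=120
START=$(( LINE > 50 ? LINE - 50 : 1 ))   # clamp so we never ask sed for line 0
END=$(( LINE + 50 ))
sed -n "${START},${END}p" "$FILE"
```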

4. **Analyze the error**: Based on the context of the error, think hard about whether this error caused the test to fail, is a transient error, or is a red herring.

4.1 If it is a legitimate test error, analyze the test logs to determine the source of the error.
4.2 If the source of the error appears to be due to microshift or a workload running on microshift, analyze the sos report's microshift journal and pod logs.
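A sketch of unpacking the sos report and searching it for MicroShift traces. The archive and its contents are fabricated here so the snippet runs standalone; in the real workflow the tarball is the `sosreport-*.tar.xz` listed under Important Files, and the journal's exact path inside it varies:

```shell
# Stand-in ${TMP} with a fake sos report, created only so the snippet runs end to end.
TMP=$(mktemp -d)
mkdir -p "${TMP}/sos-src/var/log"
echo 'Jan 01 00:00:00 host microshift[123]: MicroShift starting' > "${TMP}/sos-src/var/log/microshift.log"
tar -cJf "${TMP}/sosreport-i-test-2025-01-01-abcd.tar.xz" -C "${TMP}" sos-src

# Locate and unpack the first sos report found under the working directory.
SOS_TARBALL=$(find "${TMP}" -name 'sosreport-*.tar.xz' | head -n 1)
mkdir -p "${TMP}/sos"
tar -xJf "${SOS_TARBALL}" -C "${TMP}/sos"

# Search the unpacked tree for MicroShift journal/pod log entries.
grep -rni 'microshift' "${TMP}/sos"
```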

5. **Produce a report**: Create a concise report of the error. The report MUST specify:
- Where in the pipeline the error occurred
- The specific step the error occurred in
- Whether the test failure was legitimate (i.e., a test failed) or due to an infrastructure failure (e.g., build image was not found, AWS infra failed due to quota, hypervisor failed to create the test host VM)
```text
Step Name: {The specific step where the error occurred}
Error: {The exact error, including additional log context if it relates to the failure}
Suggested Remediation: {Based on where the error occurs, think hard about how to correct the error ONLY if it requires fixing. Infrastructure failures may not require code changes.}
```

After the human-readable report above, append a machine-readable block for downstream automation. This block MUST appear at the very end of the report, after all prose and analysis:

```text
--- STRUCTURED SUMMARY ---
SEVERITY: {1-5, same as Error Severity above}
STACK_LAYER: {AWS Infra, build phase, deploy phase, test, teardown - same as Stack Layer above}
STEP_NAME: {same as Step Name above}
ERROR_SIGNATURE: {a concise, unique one-line description of the root cause - not the full error, just enough to identify and deduplicate this failure}
INFRASTRUCTURE_FAILURE: {true if Stack Layer is AWS Infra or the failure is due to CI infrastructure rather than product code, false otherwise}
JOB_URL: {the full prow job URL that was analyzed}
JOB_NAME: {the full job name extracted from the URL}
RELEASE: {the MicroShift release version extracted from the URL, e.g. 4.22}
FINISHED: {the job finish date in YYYY-MM-DD format, extracted from finished.json or build log timestamps}
--- END STRUCTURED SUMMARY ---
```
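For downstream automation, the delimited block can be pulled out of the report with `sed` and individual fields read with `awk`. A sketch against a fabricated report:

```shell
# Fabricated report text standing in for the agent's real output.
REPORT='...prose analysis...
--- STRUCTURED SUMMARY ---
SEVERITY: 3
ERROR_SIGNATURE: AWS EC2 quota exceeded in us-east-1
INFRASTRUCTURE_FAILURE: true
--- END STRUCTURED SUMMARY ---'

# Cut out the delimited block, then read one KEY: value pair from it.
echo "$REPORT" \
  | sed -n '/^--- STRUCTURED SUMMARY ---$/,/^--- END STRUCTURED SUMMARY ---$/p' \
  | awk -F': ' '$1 == "ERROR_SIGNATURE" { print $2 }'
```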

The ERROR_SIGNATURE field is critical for deduplication. It should capture the essence of the failure in a way that two jobs failing for the same reason produce identical or near-identical signatures. Examples:
- `greenboot timeout waiting for MicroShift to start after bootc upgrade`
- `OCP conformance test NetworkPolicy timeout`
- `AWS EC2 quota exceeded in us-east-1`
- `rpm-ostree upgrade failed: package conflict microshift-selinux`
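Because signatures are designed to collide for identical root causes, deduplication across many analyzed jobs reduces to a sort-and-count. A sketch over fabricated report files (the directory layout is an assumption):

```shell
# Fabricated reports, created only so the snippet runs standalone.
REPORTS=$(mktemp -d)
echo 'ERROR_SIGNATURE: AWS EC2 quota exceeded in us-east-1' > "${REPORTS}/job1.txt"
echo 'ERROR_SIGNATURE: AWS EC2 quota exceeded in us-east-1' > "${REPORTS}/job2.txt"
echo 'ERROR_SIGNATURE: OCP conformance test NetworkPolicy timeout' > "${REPORTS}/job3.txt"

# Identical signatures collapse into one counted bucket, highest count first.
grep -h '^ERROR_SIGNATURE:' "${REPORTS}"/*.txt | sort | uniq -c | sort -rn
```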