fix: increase bottlecap compile timeout and add retries to integration tests #1177
…n tests

- Increase bottlecap compile job timeout from 10m to 20m to reduce false timeout failures
- Add retry: 2 to integration-suite to handle transient AWS/Datadog API throttling

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
timeout: 10m
retry:
  max: 2
  when:
Pull request overview
This pull request aims to improve CI/CD reliability by addressing timeout issues in the bottlecap compilation job and adding retry logic to the integration-suite job. Specifically, it increases the bottlecap compilation timeout from 10 minutes to 20 minutes (to accommodate cold builds of the large Rust workspace) and adds retry configuration to the integration-suite job to handle transient AWS/Datadog API throttling failures.
Changes:
- Increases bottlecap compile job timeout from 10m to 20m to prevent spurious timeouts on cold builds
- Simplifies bottlecap retry configuration from explicit structured format to shorthand format (`retry: 2`)
- Adds `retry: 2` to the integration-suite job to handle transient API failures
.gitlab/templates/pipeline.yaml.tpl
Outdated
# This job sometimes times out on GitLab for unclear reasons.
# Set a timeout with retries to work around this.
timeout: 20m
retry: 2
The simplified retry configuration retry: 2 changes the retry behavior from the previous implementation. The original configuration only retried on stuck_or_timeout_failure, runner_system_failure, and script_failure. The simplified form retries on ALL failure conditions, which could cause unintended retries on failures that shouldn't be retried (e.g., configuration errors or other job failures). Consider explicitly specifying the when conditions to maintain the original behavior of retrying only on infrastructure/timeout issues.
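A minimal sketch of the explicit form this comment suggests. The job key `compile bottlecap` is a hypothetical placeholder, since the actual key in `pipeline.yaml.tpl` is not shown in this thread. The three `when` values are the ones the comment attributes to the original configuration:

```yaml
# Hypothetical job key; script, image, and other keys are omitted from this fragment.
compile bottlecap:
  timeout: 20m
  retry:
    max: 2
    when:
      # Retry only on the failure types the original configuration listed.
      - stuck_or_timeout_failure
      - runner_system_failure
      - script_failure
```

With this form, a failure whose reason is not in the `when` list fails the job immediately instead of consuming a retry.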
stage: integration-tests
tags: ["arch:amd64"]
image: ${CI_DOCKER_TARGET_IMAGE}:${CI_DOCKER_TARGET_VERSION}
retry: 2
The integration-suite job uses retry: 2 without specifying retry conditions. This will retry on ALL failure types, including legitimate test failures that shouldn't be retried. Given that the PR description indicates retries are intended to handle "transient failures from AWS/Datadog API throttling", consider using a more specific retry configuration with when conditions that targets infrastructure/timeout failures rather than test script failures.
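One possible shape for a narrower configuration, sketched under the assumption that only runner and timeout problems should trigger a retry. The job key `integration-suite` is taken from the discussion and may be templated differently in the real file:

```yaml
# Assumed job key; script and other keys are omitted from this fragment.
integration-suite:
  stage: integration-tests
  retry:
    max: 2
    when:
      # Infrastructure-level failures only; failing test scripts are not retried.
      - runner_system_failure
      - stuck_or_timeout_failure
```

There is a trade-off here. A throttled AWS or Datadog call inside a test would most likely make the script exit non-zero, which GitLab reports as `script_failure`, so this list would probably not retry the throttling cases the PR targets. Adding `script_failure` would cover them, at the cost of also retrying genuine test failures.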
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…n tests (#1177)

## Summary
- Increases bottlecap compile job timeout from 10m to 20m, since the job appears to time out frequently.
- Adds `retry: 2` to `integration-suite` to handle transient failures from AWS/Datadog API throttling

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Summary
- Adds `retry: 2` to `integration-suite` to handle transient failures from AWS/Datadog API throttling