Add retry with jitter to create_issue safe-output handler#26056
Conversation
…try logic
- Add jitterMs config option to withRetry in error_recovery.cjs (defaults to 0)
- Apply random jitter to each retry delay to prevent thundering herd
- Wrap main issues.create call in create_issue.cjs with withRetry (15-45s backoff, 10s jitter)
- Wrap parent issues.create call in create_issue.cjs with withRetry (same config)
- Add tests for jitter behavior in error_recovery.test.cjs
- Add retry tests with fake timers in create_issue.test.cjs
- Update error assertion to handle enhanced error from withRetry

Fixes #26067
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/54f4ec0f-3333-4b98-93ca-b11439da48d0
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
@copilot default jitter is 100ms
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/4e674a3b-b2a0-405d-802f-31d302024908 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Pull request overview
Adds jittered retry behavior around create_issue GitHub API calls to reduce synchronized bursts that trigger secondary rate limits (HTTP 403) when multiple workflows complete at the same time.
Changes:
- Extended withRetry to support a jitterMs option and updated the default retry config.
- Wrapped the githubClient.rest.issues.create calls in create_issue (main + parent issue creation) with withRetry using long delays + jitter.
- Added/updated tests for jitter behavior and create-issue retry behavior (using fake timers), and relaxed one error-string assertion.
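The backoff-plus-jitter math described in this overview can be sketched as a small pure helper. This is a hypothetical illustration (computeDelay is not the actual export of error_recovery.cjs; defaults are assumed from the discussion below):

```javascript
// Hypothetical sketch of the jittered backoff delay calculation;
// names and defaults are illustrative, not the repo's actual API.
function computeDelay(attempt, { initialDelayMs = 1000, maxDelayMs = 10000, backoffMultiplier = 2, jitterMs = 100 } = {}) {
  // Exponential backoff, capped at maxDelayMs...
  const base = Math.min(initialDelayMs * backoffMultiplier ** (attempt - 1), maxDelayMs);
  // ...plus up to jitterMs of random jitter to desynchronize concurrent callers.
  const jitter = jitterMs > 0 ? Math.floor(Math.random() * jitterMs) : 0;
  return base + jitter;
}
```

With the PR's create_issue config (15s initial, 45s cap, 10s jitter and jitter disabled for determinism), attempt 1 yields 15000ms and attempt 3 hits the 45000ms cap.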
| File | Description |
|---|---|
| actions/setup/js/error_recovery.cjs | Adds jitter to retry delays and updates default retry config. |
| actions/setup/js/error_recovery.test.cjs | Adds unit coverage for jitter behavior and updates default-config assertions. |
| actions/setup/js/create_issue.cjs | Wraps issue creation calls with withRetry using jittered backoff. |
| actions/setup/js/create_issue.test.cjs | Adds integration-style retry tests for rate-limit errors and relaxes an assertion due to enhanced error wrapping. |
Copilot's findings
Comments suppressed due to low confidence (2)
actions/setup/js/error_recovery.cjs:103
delayWithJitter can exceed maxDelayMs because the cap is applied before adding jitter (e.g., delay == maxDelayMs, jitter > 0). If maxDelayMs is intended to be a hard upper bound on sleep time, cap after adding jitter (or rename/clarify the semantics).
const jitter = fullConfig.jitterMs > 0 ? Math.floor(Math.random() * fullConfig.jitterMs) : 0;
const delayWithJitter = delay + jitter;
core.info(`Retry attempt ${attempt}/${fullConfig.maxRetries} for ${operationName} after ${delayWithJitter}ms delay`);
await sleep(delayWithJitter);
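The fix this comment suggests is straightforward: move the Math.min cap after the jitter is added. A minimal sketch (illustrative name, not the current code in error_recovery.cjs):

```javascript
// Sketch of the reviewer's suggestion: apply the maxDelayMs cap AFTER
// adding jitter, so maxDelayMs is a hard upper bound on sleep time.
function cappedJitteredDelay(delay, { maxDelayMs, jitterMs }) {
  const jitter = jitterMs > 0 ? Math.floor(Math.random() * jitterMs) : 0;
  return Math.min(delay + jitter, maxDelayMs);
}
```

With this shape, a delay already at the cap stays at the cap even when jitter is drawn.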
actions/setup/js/create_issue.cjs:609
withRetry retries a non-idempotent issues.create call on any isTransientError match (timeouts/network errors included). If the request actually succeeded but the client timed out, a retry can create duplicate issues. Consider narrowing shouldRetry here to rate-limit/abuse/5xx cases, or adding an idempotency check (e.g., search by temporary_id marker) before retrying.
const { data: issue } = await withRetry(
() =>
githubClient.rest.issues.create({
owner: repoParts.owner,
repo: repoParts.repo,
title,
body,
labels,
assignees,
}),
{ initialDelayMs: 15000, maxDelayMs: 45000, jitterMs: 10000 },
`create_issue in ${qualifiedItemRepo}`
);
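The narrower predicate the comment above asks for could look like this. A hedged sketch (the function name and error-shape assumptions are illustrative; Octokit errors carry a numeric `status`, but the repo's actual helper names may differ):

```javascript
// Illustrative retry predicate for a non-idempotent create call:
// retry only when we know the request was rejected, never when it
// may have silently succeeded (timeouts, connection resets).
function shouldRetryCreate(error) {
  const status = error.status || 0;
  const msg = String(error.message || "").toLowerCase();
  // Secondary rate limit / abuse detection responses arrive as 403s.
  if (status === 403 && (msg.includes("rate limit") || msg.includes("abuse"))) return true;
  // Server errors mean the create was not processed.
  return status >= 500 && status < 600;
}
```

A timeout like ETIMEDOUT carries no HTTP status here, so it falls through to false and the caller fails fast instead of risking a duplicate issue.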
- Files reviewed: 4/4 changed files
- Comments generated: 3
actions/setup/js/error_recovery.cjs
Outdated
* @property {number} initialDelayMs - Initial delay in milliseconds (default: 1000)
* @property {number} maxDelayMs - Maximum delay in milliseconds (default: 10000)
* @property {number} backoffMultiplier - Backoff multiplier for exponential backoff (default: 2)
* @property {number} jitterMs - Maximum random jitter in milliseconds added to each retry delay (default: 0)
The JSDoc says jitterMs defaults to 0, but DEFAULT_RETRY_CONFIG sets jitterMs: 100. Please align the documentation with the actual default (or change the default) so callers know whether jitter is enabled when they omit the option.
This issue also appears on line 100 of the same file.
Suggested change:
- * @property {number} jitterMs - Maximum random jitter in milliseconds added to each retry delay (default: 0)
+ * @property {number} jitterMs - Maximum random jitter in milliseconds added to each retry delay (default: 100)
const { data: parentIssue } = await withRetry(
  () =>
    githubClient.rest.issues.create({
      owner,
      repo,
      title: template.title,
      body: template.body,
      labels: labels,
    }),
  { initialDelayMs: 15000, maxDelayMs: 45000, jitterMs: 10000 },
  `create_parent_issue for group ${groupId}`
);

core.info(`Created new parent issue #${parentIssue.number}: ${parentIssue.html_url}`);
return parentIssue.number;
Same non-idempotency concern as the main issue creation: retrying issues.create for the parent issue on network/timeout errors can create multiple parent issues for the same group. Consider narrowing shouldRetry for this call or adding an idempotency guard (e.g., re-search for an existing parent marker before retrying).
This issue also appears on line 597 of the same file.
Suggested change:
- const { data: parentIssue } = await withRetry(
-   () =>
-     githubClient.rest.issues.create({
-       owner,
-       repo,
-       title: template.title,
-       body: template.body,
-       labels: labels,
-     }),
-   { initialDelayMs: 15000, maxDelayMs: 45000, jitterMs: 10000 },
-   `create_parent_issue for group ${groupId}`
- );
- core.info(`Created new parent issue #${parentIssue.number}: ${parentIssue.html_url}`);
- return parentIssue.number;
+ const parentIssueNumber = await withRetry(
+   async () => {
+     const existingParentOnRetry = await searchForExistingParent(githubClient, owner, repo, markerComment);
+     if (existingParentOnRetry) {
+       core.info(`Found existing parent issue for group ${groupId} during retry; reusing #${existingParentOnRetry}`);
+       return existingParentOnRetry;
+     }
+     const { data: parentIssue } = await githubClient.rest.issues.create({
+       owner,
+       repo,
+       title: template.title,
+       body: template.body,
+       labels: labels,
+     });
+     core.info(`Created new parent issue #${parentIssue.number}: ${parentIssue.html_url}`);
+     return parentIssue.number;
+   },
+   { initialDelayMs: 15000, maxDelayMs: 45000, jitterMs: 10000 },
+   `create_parent_issue for group ${groupId}`
+ );
+ return parentIssueNumber;
it("should retry issue creation on transient rate limit error and succeed", async () => {
  mockGithub.rest.issues.create = vi
    .fn()
    .mockRejectedValueOnce(new Error("Secondary rate limit hit"))
    .mockResolvedValue({
      data: {
        number: 456,
        html_url: "https://github.com/owner/repo/issues/456",
        title: "Retried Issue",
      },
    });

  const handler = await main({});
  const resultPromise = handler({
    title: "Retried Issue",
    body: "Test body",
  });

  await vi.runAllTimersAsync();
  const result = await resultPromise;

  expect(result.success).toBe(true);
  expect(result.number).toBe(456);
  expect(mockGithub.rest.issues.create).toHaveBeenCalledTimes(2);
});

it("should fail after exhausting retries on persistent rate limit error", async () => {
  mockGithub.rest.issues.create = vi.fn().mockRejectedValue(new Error("Secondary rate limit hit"));

  const handler = await main({});
  const resultPromise = handler({
    title: "Failing Issue",
    body: "Test body",
  });

  await vi.runAllTimersAsync();
  const result = await resultPromise;

  expect(result.success).toBe(false);
  expect(result.error).toBeDefined();
  // 1 initial + 3 retries = 4 calls
  expect(mockGithub.rest.issues.create).toHaveBeenCalledTimes(4);
});
The new retry tests only cover the success/failure paths, but don’t assert that the retry delays respect the configured bounds when jitter is enabled. Consider adding an assertion that the scheduled sleeps never exceed the intended maximum delay (especially given the new jitter behavior).
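One way to write the bounds assertion this comment asks for is to derive the full delay schedule for a config and check every entry. A self-contained sketch (illustrative helper, not the repo's actual test code; the repo would assert on delays captured via vitest fake timers instead):

```javascript
// Illustrative: compute the whole retry-delay schedule for a config so a
// test can assert every scheduled sleep stays within
// [initialDelayMs, maxDelayMs + jitterMs].
function scheduledDelays({ maxRetries, initialDelayMs, maxDelayMs, jitterMs, backoffMultiplier = 2 }, random = Math.random) {
  const delays = [];
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const base = Math.min(initialDelayMs * backoffMultiplier ** (attempt - 1), maxDelayMs);
    delays.push(base + (jitterMs > 0 ? Math.floor(random() * jitterMs) : 0));
  }
  return delays;
}
```

For the PR's config (15s initial, 45s cap, 10s jitter, 3 retries), every delay should land in [15000, 55000], and with jitter disabled the schedule is exactly 15000, 30000, 45000.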
🧪 Test Quality Sentinel Report
Test Quality Score: 83/100
✅ Excellent test quality
Test Classification Details
All 4 tests
Minor Observations (not blocking)
| File | Test additions | Production additions | Ratio |
|---|---|---|---|
| create_issue.test.cjs | +55 | +26 | 2.1:1 |
| error_recovery.test.cjs | +42 | +6 | 7.0:1 |
Both files exceed the 2:1 threshold. This accounts for a 10-point deduction. However, context matters here: the production changes are intentionally minimal (adding jitterMs to a config and wrapping two calls in withRetry), while the tests cover multiple scenarios of the new behavior — the inflation is justified by the feature's nature.
Language Support
Tests analyzed:
- 🐹 Go (*_test.go): 0 tests — no Go test files changed
- 🟨 JavaScript (*.test.cjs): 4 tests (vitest)
Verdict
✅ Check passed. 0% of new tests are implementation tests (threshold: 30%). All four new tests verify observable behavioral contracts of the retry-with-jitter feature. The inflation flag is raised as an observation but is contextually justified.
📖 Understanding Test Classifications
Design Tests (High Value) verify what the system does:
- Assert on observable outputs, return values, or state changes
- Cover error paths and boundary conditions
- Would catch a behavioral regression if deleted
- Remain valid even after internal refactoring
Implementation Tests (Low Value) verify how the system does it:
- Assert on internal function calls (mocking internals)
- Only test the happy path with typical inputs
- Break during legitimate refactoring even when behavior is correct
- Give false assurance: they pass even when the system is wrong
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.
🧪 Test quality analysis by Test Quality Sentinel
✅ Test Quality Sentinel: 83/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All four new tests verify observable behavioral contracts of the retry-with-jitter feature. See the full report in the comment above.
@copilot review all comments, fix https://github.com/github/gh-aw/actions/runs/24354035161/job/71116005206
… bounds test
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/3f2bffe7-5bf7-495b-82c0-1fc0e087e97a
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Addressed all review comments in 99a7bfd and 4df60c1:
The non-idempotency suggestion from the automated reviewer (adding …

Warning: Firewall rules blocked me from connecting to one or more addresses (expand for details). I tried to connect to the following addresses, but was blocked by firewall rules:

If you need me to access, download, or install something from one of these locations, you can either:
Hey, one thing flags this PR for focus: there's an unrelated change bundled in alongside the retry/jitter work:
If you'd like a hand splitting this out, you can assign this prompt to your coding agent:
create_issue fails with HTTP 403 when multiple daily workflows complete simultaneously and burst the rate limit within a ~3-minute window. Recurring pattern (3 incidents in 11 days), sole cause of "Degraded" safe-output health.

Changes

- error_recovery.cjs: Add jitterMs option to withRetry (default 100ms). When set, adds Math.floor(Math.random() * jitterMs) to each retry delay to desynchronize concurrent callers.
- create_issue.cjs: Wrap both issues.create calls (main + parent) with withRetry.

Non-retryable errors (e.g. "Issues disabled") still fail immediately via the isTransientError check.

Tests: Jitter unit tests in error_recovery.test.cjs (mocked Math.random); retry integration tests in create_issue.test.cjs (fake timers). Updated one existing assertion from exact match to toContain since withRetry now wraps non-retryable errors with enhanced context.
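The "mocked Math.random" technique used in the jitter unit tests can be shown self-contained. A sketch with plain stubbing (the repo's tests use vitest spies; this stand-in only illustrates the determinism idea):

```javascript
// Pin Math.random so the jitter is deterministic in a test, then restore it.
const originalRandom = Math.random;
Math.random = () => 0.5; // with jitterMs = 100, jitter is always 50
const jitterMs = 100;
const jitter = Math.floor(Math.random() * jitterMs);
Math.random = originalRandom; // always restore the global after the test
```

Pinning the generator lets a test assert an exact delay value instead of a range.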