Handle failing push better#1193
Conversation
git push/fetch over HTTPS spawns helpers (git-remote-https, credential helpers) that inherit the parent's stdio pipes. The default exec.CommandContext cancellation only SIGKILLs the direct child, so the helpers keep stdout open and CombinedOutput blocks on Read past the context deadline. tryPushSessionsCommon's 2-minute timeout was therefore unbounded in practice — observed in the wild as a runaway dot stream on [entire] Pushing entire/checkpoints/v1 to origin... that only Ctrl-C could end. Add execx.KillOnCancel(cmd, waitDelay) which sets Setpgid (or CREATE_NEW_PROCESS_GROUP on Windows), a Cancel that SIGKILLs the whole group, and WaitDelay as a pipe-drain backstop. newCommand in checkpoint/remote applies it to every git subprocess routed through the package. Tests: - TestExecCommandContext_HangsOnGrandchildHoldingPipe pins the stdlib behavior that motivates the fix (5s hang vs 200ms deadline). - TestKillOnCancel_TerminatesGrandchildHoldingPipe verifies the helper returns within WaitDelay in the same scenario. - TestNewCommand_AppliesKillOnCancel confirms the wiring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 24d37e721664
A colleague's stuck push surfaced four gaps in how doPushBranch
communicates failure. Fix each, plus add tests that pin the new
behaviour.
1. fetchAndRebaseSessionsCommon now returns errFetchTimedOut when its
inner deadline fires (distinct from server errors) and falls back to
the wrapped fetch error when git's output is empty — so users no
longer see a bare "Warning: couldn't sync ...: fetch failed:" with
nothing after the colon.
2. tryPushSessionsCommon emits errPushTimedOut on its inner deadline.
doPushBranch pairs this with " timed out" as the dot-line suffix, so
the user sees the failure mode instead of the dots silently trailing
off into a "Syncing" line that looks like normal progression.
3. Every push attempt logs at INFO with branch, target (display alias —
no URLs/tokens), elapsed, and a coarse class field. Doctor bundles
now record push history even at default log level; previously the
only push trace was a DEBUG line that nobody had enabled.
4. doPushBranch short-circuits in two cases that previously cascaded
confusingly:
- outer ctx cancelled (user Ctrl-C, hook timeout) → bail silently
instead of running fetch+rebase that will also fail.
- inner push timed out → skip the sync→retry cascade entirely
(same network condition would just trip the fetch timeout) and
print a hint pointing the user at log_level=DEBUG.
A pushStderr package writer replaces direct os.Stderr writes from
doPushBranch so tests can capture the new output without globally
swapping os.Stderr; the legacy captureStderr helper patches both vars
for backward compatibility with existing hint tests.
Tests in push_messaging_test.go cover each behavior change plus
classifyForLog. pushAttemptTimeout is now a var (not const) so tests
can shorten it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 6819da9e1737
Findings from /simplify cleanup pass:
- Extract reportPushFailure to dedupe the sync-failure and retry-failure
tails of doPushBranch (suffix selection, INFO log, warning, hint).
Drops ~15 lines and keeps the two branches in sync by construction.
- Replace classifyForLog's bare string literals with named constants
(pushClassTimedOut, pushClassProtectedRef, pushClassNonFastForward,
pushClassOther) so log consumers and tests share one source of truth.
- Fix aliasing smell in fetchAndRebaseSessionsCommon: drop the
`localCtx, cancel := …; ctx = localCtx` half-rename in favour of the
original `ctx, cancel := context.WithTimeout(…)` pattern.
- Rewrite pushStderr doc comment to describe the actual capture path
(captureStderr) and document why the standalone hint helpers stay on
os.Stderr — the function withPushStderr it referenced never existed.
- Strengthen TestFetchAndRebase_FetchFailure_IncludesUnderlyingError:
assert on "fetch failed: " prefix + non-empty trimmed detail instead of
just len(msg) > prefixLen, which would pass with any single character.
- Drop sleep 5 -> sleep 3 in the execx regression-pinning test. Keeps
the 2s assertion gap, saves ~2s on every test run (execx package time
dropped 5.2s -> 3.4s).
- Remove numbered // (1) // (2) // (4b) scaffolding from test docstrings
and trim WHAT-comments ("Var (not const)…", "Try pushing again after
rebase.", "Push failed — likely non-fast-forward…").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: bbc8a4322628
Two review findings against the previous two commits: Finding 1: outer-context cancellation was only honored after the first push attempt. doPushBranch checked ctx.Err() once before deciding to sync, but the sync-failure and retry-failure tails unconditionally flowed through reportPushFailure, so a Ctrl-C or hook deadline arriving during fetch+rebase or during the retry push still produced misleading "Warning: couldn't sync …" / "failed to push … after sync" output. Move the ctx.Err() check into reportPushFailure itself so every cancellation point in the cascade collapses to the same silent bail: empty dot suffix, no log, no warning. New TestReportPushFailure_BailsOnOuter- ContextCancel pins the contract. Finding 2: KillOnCancel on Windows only terminated the direct child. CREATE_NEW_PROCESS_GROUP isolates console signals but does not propagate SIGKILL-equivalent semantics to descendants, and Process.Kill walks no tree — git-remote-https and credential helpers survived the cancel and kept the inherited stdio pipes open. Replace the direct kill with taskkill.exe /F /T /PID which walks the descendant chain. Fall back to Process.Kill if taskkill is unavailable; WaitDelay still bounds the user-visible hang in that case. Cross-builds under GOOS=windows; runtime test would require a Windows environment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Entire-Checkpoint: d14d590faf48
TestReportPushFailure_BailsOnOuterContextCancel previously called t.Parallel() and then reassigned the package-global pushStderr, which races with any non-parallel test that uses captureStderr (the shared helper that also swaps pushStderr). Under the race detector this would surface as a data race; in practice the t.Cleanup ordering could leak a closed pipe into another test. Add an `out io.Writer` field to reportPushFailureArgs and have doPushBranch pass pushStderr at the two call sites. The test now injects a local strings.Builder and never touches the global, so t.Parallel() is sound. Verified with `go test -race`. pushStderr itself remains as the writer used directly by doPushBranch (dot lines, timed-out warning, hint). Plumbing it through to every call site is a larger refactor and out of scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Entire-Checkpoint: c87f60db628c
There was a problem hiding this comment.
Pull request overview
This PR makes checkpoint branch pushing (entire/checkpoints/v1) bounded and diagnosable by ensuring git subprocess cancellation reliably terminates descendant helper processes (preventing indefinite hangs), and by improving timeout/error messaging plus INFO-level logging so doctor bundles capture push attempts.
Changes:
- Add
execx.KillOnCancelto kill the full descendant process tree on context cancellation (Unix via process groups; Windows viataskkill /T), withWaitDelayas a pipe-drain backstop. - Improve
doPushBranchpush UX and control flow: distinct push/fetch timeout sentinels, clearer dot-line suffixes, skip sync→retry cascade on push timeout, and avoid emptyfetch failed:messaging. - Add INFO logging for push attempts and add regression/unit tests covering the cancellation and messaging behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| cmd/entire/cli/strategy/push_messaging_test.go | New tests covering timeout UX, cascade suppression, cancellation bail-out, INFO logging, and error classification. |
| cmd/entire/cli/strategy/push_common.go | Adds timeout sentinels, improved messaging/control flow, INFO logs, and shared failure-reporting helper. |
| cmd/entire/cli/strategy/push_common_test.go | Updates stderr capture helper to also redirect pushStderr for new output plumbing. |
| cmd/entire/cli/execx/killoncancel_windows.go | Windows implementation of KillOnCancel (process-group isolation + taskkill /T). |
| cmd/entire/cli/execx/killoncancel_unix.go | Unix implementation of KillOnCancel (Setpgid + kill process group). |
| cmd/entire/cli/execx/killoncancel_unix_test.go | Unix tests pinning the stdlib hang and verifying KillOnCancel unblocks promptly. |
| cmd/entire/cli/checkpoint/remote/git.go | Wires KillOnCancel into all git subprocesses created by newCommand. |
| cmd/entire/cli/checkpoint/remote/git_killoncancel_unix_test.go | Verifies newCommand applies KillOnCancel wiring on Unix. |
Comments suppressed due to low confidence (5)
cmd/entire/cli/execx/killoncancel_unix_test.go:59
- Same portability issue as above: the test guards with
LookPath("sh")but still invokes/bin/sh. Prefer using theLookPathresult (orshby name) so the test doesn’t depend on/bin/shexisting.
if _, err := exec.LookPath("sh"); err != nil {
t.Skip("sh not available")
}
ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
defer cancel()
cmd := exec.CommandContext(ctx, "/bin/sh", "-c", grandchildHoldingStdoutCmd)
KillOnCancel(cmd, 500*time.Millisecond)
cmd/entire/cli/strategy/push_messaging_test.go:167
- Same flakiness risk here:
pushAttemptTimeout = 1*time.Millisecondmay not reliably expire beforeremote.Pushreturns when pushing to a local bare repo. Use an immediate timeout (0 duration) or another deterministic approach so the test consistently exercises theerrPushTimedOutpath.
withShortPushTimeout(t, 1*time.Millisecond)
_, err := tryPushSessionsCommon(context.Background(), bareDir, paths.MetadataBranchName)
require.Error(t, err)
assert.ErrorIs(t, err, errPushTimedOut,
"err must wrap errPushTimedOut; got %v (class=%q)", err, classifyForLog(err))
}
cmd/entire/cli/strategy/push_common.go:137
push attempt failedlogs omit the target alias. Adding the same redacted/aliased target field used inpush attempt startwould make failure events diagnosable without relying on surrounding context.
logging.Info(ctx, "push attempt failed",
slog.String("branch", branchName),
slog.Duration("elapsed", time.Since(pushStart)),
slog.String("class", classifyForLog(err)),
)
cmd/entire/cli/strategy/push_common.go:209
push retry completedlogs omit the target alias, which makes it harder to interpret retries in aggregated logs. Consider logging the same redacted/aliased target field as the initial attempt.
logging.Info(ctx, "push retry completed",
slog.String("branch", branchName),
slog.Duration("elapsed", time.Since(retryStart)),
slog.Bool("up_to_date", retryResult.upToDate),
)
cmd/entire/cli/strategy/push_common.go:252
reportPushFailureINFO logs include branch/elapsed/class but not the push target. SincereportPushFailureArgsalready carriestarget, consider logging a redacted/aliased target value here as well to make sync/retry failures attributable in doctor bundles.
logging.Info(ctx, a.logMsg,
slog.String("branch", a.branch),
slog.Duration("elapsed", time.Since(a.start)),
slog.String("class", classifyForLog(a.err)),
)
| if _, err := exec.LookPath("sh"); err != nil { | ||
| t.Skip("sh not available") | ||
| } | ||
|
|
||
| ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond) | ||
| defer cancel() | ||
|
|
||
| cmd := exec.CommandContext(ctx, "/bin/sh", "-c", grandchildHoldingStdoutCmd) | ||
|
|
| // Force every push attempt to time out immediately. KillOnCancel + | ||
| // WaitDelay ensures the call returns within milliseconds. | ||
| withShortPushTimeout(t, 1*time.Millisecond) | ||
|
|
||
| restore := captureStderr(t) | ||
| err := doPushBranch(context.Background(), bareDir, paths.MetadataBranchName) | ||
| output := restore() | ||
|
|
| logging.Info(ctx, "push attempt completed", | ||
| slog.String("branch", branchName), | ||
| slog.Duration("elapsed", time.Since(pushStart)), | ||
| slog.Bool("up_to_date", result.upToDate), | ||
| ) |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 8246e78. Configure here.
| slog.String("error", fetchErr.Error()), | ||
| slog.Duration("after", fetchAttemptTimeout), | ||
| ) | ||
| return errFetchTimedOut |
There was a problem hiding this comment.
Fetch timeout check can't distinguish inner from outer deadline
Medium Severity
In fetchAndRebaseSessionsCommon, ctx is shadowed at line 538 by context.WithTimeout, so the new timeout check at line 572 (errors.Is(ctx.Err(), context.DeadlineExceeded)) cannot distinguish whether the inner fetch deadline or the outer context deadline fired. In contrast, tryPushSessionsCommon correctly avoids this by using localCtx and guarding with ctx.Err() == nil. If the outer context's deadline fires during the fetch (e.g., hook-level timeout shorter than fetchAttemptTimeout), this incorrectly returns errFetchTimedOut instead of letting the outer cancellation propagate naturally.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8246e78. Configure here.
There was a problem hiding this comment.
This sounds like we should fix it 😅
In fetchAndRebaseSessionsCommon, the timeout-wrapped context shadowed the outer ctx, so a hook-level deadline shorter than fetchAttemptTimeout was misreported as errFetchTimedOut. Mirror tryPushSessionsCommon's pattern: hold the wrapped context as localCtx and gate the sentinel on the outer ctx still being alive. Entire-Checkpoint: cce2e7fcfcd8
The tests guard with exec.LookPath("sh") but were invoking /bin/sh
directly, which would skip the guard's intent on systems where sh
lives outside /bin (some non-FHS layouts). Use the PATH-resolved
form so the guard's outcome actually governs the test.
Entire-Checkpoint: bd7c6a83418a
Forcing pushAttemptTimeout to 1ms still raced against a fast local-bare-repo push that could complete before the deadline fired. Setting it to 0 makes localCtx fire DeadlineExceeded synchronously, so the test always reaches the errPushTimedOut path. Entire-Checkpoint: faf57c42d164
Only "push attempt start" carried the target field; completion, failure, retry, sync-failure, and retry-failure entries omitted it. Doctor bundles collecting these had to infer the target from surrounding context, which breaks when multiple remotes (origin + checkpoint remote) push in the same hook invocation. Add target (redacted via displayPushTarget for URL targets) to every push-related log line. Entire-Checkpoint: 2c5cd80e2128
3dcadb8 to
7d7276b
Compare


https://entire.io/gh/entireio/cli/trails/365
Make a stuck
entire/checkpoints/v1push diagnosable and boundedA colleague's
git pushhung with an unbounded stream of dots on the pre-push hook's[entire] Pushing entire/checkpoints/v1 to origin...line. Only Ctrl-C broke it, and the resultingWarning: couldn't sync ...: fetch failed:had nothing after the colon — the doctor bundle showed no push events at all.Root cause
exec.CommandContext's default cancellation only SIGKILLs the direct child.git pushover HTTPS spawnsgit-remote-https+ credential helpers that inherit our stdio pipes; killing git leaves them alive holding stdout, socmd.CombinedOutput()blocks pasttryPushSessionsCommon's 2-minute deadline. The dots are 1-Hz progress emitted in-process; with no cancellation, they tick forever.Fixes
execx.KillOnCancel(cmd, waitDelay)— Unix usesSetpgid+Cancelkilling the process group; Windows usesCREATE_NEW_PROCESS_GROUP+taskkill.exe /F /T /PIDfor descendant-tree termination.WaitDelaybackstops pipe drain.newCommandincheckpoint/remoteapplies it to every git subprocess.doPushBranchUX: returns distincterrPushTimedOut/errFetchTimedOut; ends the dot line with" timed out"instead of trailing silently; skips the sync→retry cascade on timeout (same network condition would just trip the fetch deadline); falls back to the wrapped fetch error when git's output is empty so no more barefetch failed:.Tests
execx: pins the stdlib hang (TestExecCommandContext_HangsOnGrandchildHoldingPipe) and verifiesKillOnCancelreturns withinWaitDelay(~200 ms vs 5 s).strategy: covers timeout suffix, sync-cascade skip, outer-cancel bail, INFO log writes, sentinel propagation,classifyForLog, and thereportPushFailurechokepoint under-race.Net effect for the colleague's symptom
Next time GitHub stalls a push, the 2-minute timeout actually fires, the dot line ends with
timed out, a hint points atlog_level=DEBUG, and the bundle records the attempt. No infinite dots, no Ctrl-C needed.Note
Medium Risk
Touches git push/fetch execution and hook messaging, including new process-group kill behavior and timeout handling; while scoped to checkpoint syncing, it could change cancellation semantics across platforms and affect reliability if misconfigured.
Overview
Prevents checkpoint
git push/fetchfrom hanging indefinitely on cancellation. All git subprocesses created incheckpoint/remote/newCommandare now configured with a new cross-platformexecx.KillOnCancelhelper that kills the entire process tree/process group and applies a boundedWaitDelayto unblockCombinedOutputwhen helpers keep stdio pipes open.Improves checkpoint push UX and diagnosability.
doPushBranchnow logs push attempts at INFO with a sanitizedclassfield, introduces explicit push/fetch timeout sentinels, prints clearer timeout messaging (including a DEBUG logging hint), skips the sync→retry cascade on push timeouts, bails quietly when the outer context is cancelled, and ensures fetch failures include a non-empty underlying error when git produced no output.Adds focused regression tests. New unix-only tests pin the stdlib hang behavior and verify
KillOnCancelwiring, plus strategy tests covering timeout messaging, cancellation behavior, logging, and error classification.Reviewed by Cursor Bugbot for commit 8246e78. Configure here.