test(evm): cap child_process.exec in lib.js to surface stalled commands#3526
test(evm): cap child_process.exec in lib.js to surface stalled commands#3526wen-coding wants to merge 2 commits into
Conversation
The Autobahn EVM integration matrix has hit multiple 30-minute job timeouts where hardhat went silent between test files with no error. The orphan-process listing at cleanup consistently shows a docker exec (running `seid tx evm send` or similar) still alive — `child_process.exec` has no timeout by default, so a stalled CLI invocation eats the entire job budget instead of surfacing as a clear error. Add a default 60s timeout to execCommand with a SIGKILL kill signal. On expiry, throw an error containing the offending command so the next occurrence is a 60s actionable error instead of a 30-minute mystery. Override via EXEC_TIMEOUT_MS for tests that legitimately need longer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR SummaryLow Risk Overview
Intent: turn indefinite stalls (observed as ~30m CI job timeouts with orphaned Reviewed by Cursor Bugbot for commit c1920c4. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3526 +/- ##
==========================================
- Coverage 59.15% 58.30% -0.85%
==========================================
Files 2213 2140 -73
Lines 182710 174341 -8369
==========================================
- Hits 108077 101650 -6427
+ Misses 64930 63667 -1263
+ Partials 9703 9024 -679
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Per cursor bugbot: error.signal is set on any signal-terminated process (external OOM kill, runner cleanup, etc.), not just timeout kills. error.killed is the precise Node-set-the-killer signal. Narrowing the condition prevents non-timeout signal deaths from being mis-reported as "execCommand timed out". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c1920c4. Configure here.
| const summary = command.length > 200 ? command.slice(0, 197) + "..." : command | ||
| reject(new Error(`execCommand timed out after ${timeoutMs}ms: ${summary}`)) | ||
| return | ||
| } |
There was a problem hiding this comment.
Timeout check misattributes maxBuffer-exceeded kills as timeouts
Low Severity
The comment states error.killed is set "only when Node killed the child via the timeout option," but this is incorrect. Node's exec internally calls child.kill() for both timeout expiry and maxBuffer exceeded (ERR_CHILD_PROCESS_STDIO_MAXBUFFER), setting error.killed = true in both cases. A maxBuffer overflow would be misreported as "execCommand timed out after …", hiding the real cause. Checking error.killed && !error.code would correctly distinguish timeout (where error.code is null) from maxBuffer exceeded (where error.code is 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER').
Reviewed by Cursor Bugbot for commit c1920c4. Configure here.


Summary
execCommandin contracts/test/lib.js wrapschild_process.execwith no timeout. The Autobahn EVM integration matrix has hit multiple 30‑minute job timeouts where the hardhat process went silent at a test boundary, with the orphan‑process listing at cleanup consistently showing a stalleddocker exec ... seid ...child still alive. The job then consumed its entire wall‑clock budget instead of surfacing a useful error.This caps
execCommandwith a configurable timeout (default 60s,SIGKILL). On expiry, the rejection includes the offending command so the next occurrence is a 60s actionable error instead of a 30‑minute mystery. Override viaEXEC_TIMEOUT_MSfor tests that legitimately need longer.Observed hang pattern (7 occurrences in 3 days)
Each ran for exactly ~30 minutes (job timeout). Investigation of one of the artifacts (26648905384): validators were perfectly healthy (continuing to produce blocks) the whole time, no test‑sender activity reached them during the silence — i.e., the hardhat process was alive but never sent another RPC, consistent with a stalled child process here.
Trade-off
docker execcalls in the suite complete in ~100–200ms; 60s is comfortably above the worst observed normal case.EXEC_TIMEOUT_MSoverrides per-invocation.Test plan
execCommand timed out … <command>error instead of a 30‑min cancel — that error itself becomes the diagnostic.🤖 Generated with Claude Code