fix: raft HA production hardening — leader fencing, log compaction, e… #2274
ci.yml
on: push
| Job | Duration |
|---|---|
| Determine Image Tag | 3s |
| test / Run Unit Tests | 2m 51s |
| test / Run Integration Tests | 2m 4s |
| test / Build All ev-node Binaries | 1m 49s |
| lint / golangci-lint | 1m 53s |
| lint / hadolint | 7s |
| lint / yamllint | 9s |
| lint / markdown-lint | 23s |
| lint / goreleaser-check | 9s |
| test / Go Mod Tidy Check | 33s |
| proto / buf-check | 5s |
| Matrix: docker / build-images | |
| test / Combine and Upload Coverage | 8s |
| test / Run E2E System Tests | 12m 57s |
| test / Run EVM Execution Tests | 2m 4s |
| docker-tests / Docker E2E Tests | 4m 52s |
| docker-tests / Docker Upgrade E2E Tests | 22s |
| docker-tests / Docker Compatibility E2E Tests | 2m 30s |
Annotations
9 errors, 7 warnings, and 1 notice
|
lint / hadolint:
tools/local-da/Dockerfile#L13
DL3003 warning: Use WORKDIR to switch to a directory
|
|
lint / hadolint:
apps/testapp/Dockerfile#L24
DL3003 warning: Use WORKDIR to switch to a directory
|
|
lint / hadolint:
apps/testapp/Dockerfile#L4
DL3008 warning: Pin versions in apt get install. Instead of `apt-get install <package>` use `apt-get install <package>=<version>`
|
|
lint / markdown-lint
Process completed with exit code 1.
|
|
lint / golangci-lint
it retains snapshot files on disk
|
|
test / Run E2E System Tests
it retains snapshot files on disk
  (NewFileSnapshotStore retain param), not log-entry frequency
- Extract buildRaftConfig helper from NewNode to enable config wiring tests
- Add TestNodeResignLeader_NotLeaderNoop (non-nil node, nil raft → noop)
- Add TestNewNode_SnapshotConfigApplied (table-driven, verifies
  SnapshotThreshold and TrailingLogs wiring with custom and zero values)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(raft): address code review issues - ShutdownTimeout, resign fence, election validation

- Add ShutdownTimeout field (default 5s) to raft Config so Stop() drains
  committed logs with a meaningful timeout instead of the 200ms SendTimeout
- Wrap ResignLeader() in a 3s goroutine+select fence in the SIGTERM handler
  so a hung leadership transfer cannot block graceful shutdown indefinitely
- Validate ElectionTimeout >= HeartbeatTimeout in RaftConfig.Validate() to
  prevent hashicorp/raft panicking at startup with an invalid config
- Fix stale "NewNode creates a new raft node" comment that had migrated onto
  buildRaftConfig after the function was extracted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style(raft): fix gci struct field alignment in node_test.go

gofmt/gci requires minimal alignment; excessive spaces in the
TestNewNode_SnapshotConfigApplied struct literal caused a lint failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: improve patch coverage for raft shutdown and resign paths

Add unit tests for lines flagged by Codecov:
- boltTxClosedFilter.Write: filter drops "tx closed", forwards others
- buildRaftConfig: ElectionTimeout > 0 applied, zero uses default
- FullNode.ResignLeader: nil raftNode no-op; non-leader raftNode no-op
- Syncer.Stop: raftRetriever.Stop is called when raftRetriever is set
- Syncer.RecoverFromRaft: GetHeader failure when local state is ahead of
  stale raft snapshot returns "cannot verify hash" error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(config): reject negative ElectionTimeout in RaftConfig.Validate

A negative ElectionTimeout was silently ignored (buildRaftConfig only
applies values > 0), allowing a misconfigured node to start with the
library default instead of failing fast. Add an explicit < 0 check
that returns an error; 0 remains valid as the "use library default"
sentinel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(raft): preserve stdlib logger writer in bolt filter; propagate ctx through ResignLeader

- suppressBoltNoise.Do now wraps log.Writer() instead of os.Stderr so any
  existing stdlib logger redirection is preserved rather than clobbered
- ResignLeader now accepts a context.Context: leadershipTransfer runs in a
  goroutine and a select abandons the caller at ctx.Done(), returning
  ctx.Err(); the goroutine itself exits once the inner raft transfer
  completes (bounded by ElectionTimeout)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(node): propagate context through LeaderResigner.ResignLeader interface

- LeaderResigner.ResignLeader() → ResignLeader(ctx context.Context) error
- FullNode.ResignLeader passes ctx down to raft.Node.ResignLeader
- run_node.go calls resigner.ResignLeader(resignCtx) directly - no wrapper
  goroutine/select needed; context.DeadlineExceeded vs other errors are
  logged distinctly
- Merge TestFullNode_ResignLeader_NilRaftNode and
  TestFullNode_ResignLeader_NonLeaderRaftNode into single table-driven test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(raft): abdicate leadership when store is significantly behind raft state

When a node wins election but its local store is more than 1 block behind
the raft FSM state, RecoverFromRaft cannot replay the gap (it only handles
the single latest block in the raft snapshot). Previously the node would
log "recovery successful" and start leader operations anyway, then stall
block production while the executor repeatedly failed to sync the missing
|
|
lint / goreleaser-check
Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: technote-space/get-diff-action@v6.1.2. Actions will be forced to run with Node.js 24 by default starting June 2nd, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
|
|
lint / yamllint, lint / golangci-lint
Same Node.js 20 deprecation warning as above, for technote-space/get-diff-action@v6.1.2.
|
|
lint / markdown-lint
Same Node.js 20 deprecation warning as above, for actions/setup-node@v3 and technote-space/get-diff-action@v6.1.2.
|
|
docker / Build ev-node-grpc, docker / Build ev-node-testapp, docker / Build ev-node-evm
Same Node.js 20 deprecation warning as above, for docker/login-action@v3.
|
|
Determine Image Tag
Using branch/tag-based tag: main
|
Artifacts
Produced during runtime
| Name | Size | Digest |
|---|---|---|
| evstack~ev-node~58ZW12.dockerbuild | 92.5 KB | sha256:a17739c01e901eded82d29cfc6f098640cdf3be82258d4ee764337a4a869e7da |
| evstack~ev-node~6LPD08.dockerbuild | 44.9 KB | sha256:50e1b5fb3a144aea8e5dfb4037915d4be3da1c9a1e192e14c266446b861d987e |
| evstack~ev-node~OFDFC7.dockerbuild | 87.3 KB | sha256:14a1da8c8e343eb82d841108584268889e8fe3ea243301ce11ee865eb16bb30c |
| evstack~ev-node~Z4KYWJ.dockerbuild | 95.9 KB | sha256:c38edcb2568755cecd0e6f6ade72d3a80b687cd2d120cb7f569f918be0e1e6ce |
| integration-test-coverage-report-d9046e6706be9c4b1008be08c30d611e16c4388c | 2.34 KB | sha256:3fcfb620e59b548bfd117ac49dd997d9bead1fcb6e8ffce1cb6a12e0a5a413ce |
| unit-test-coverage-report-d9046e6706be9c4b1008be08c30d611e16c4388c | 81.9 KB | sha256:a913ca0e3b12a95545bda7e7035f7a2bda11670eb79ec186f16de7d24890c098 |