Summary
Avalanchego on numbers-mainnet-validator-1 is currently started manually (not managed by systemd or any process supervisor). This means:
- No automatic restart on process crash
- No automatic start on VM reboot
- No standardized log management via journald
- Operators must SSH in and manually start the process after any disruption
Evidence
From conversation f877356c (2026-03-16, "Check last transaction status"):
Avalanchego on validator-1 runs from /home/bafuchen/avalanchego-v1.14.1/avalanchego with data in /home/bafuchen/.avalanchego/. Process is not managed by systemd — started manually.
The 2026-03-15 disk-full incident caused avalanchego to shut down automatically. Recovery required manual intervention to restart the process.
Proposed Approach
- Create a systemd unit file (avalanchego-validator.service) that:
  - Starts avalanchego with the correct flags and data directory
  - Sets Restart=on-failure with an appropriate RestartSec
  - Configures resource limits (memory, file descriptors)
  - Runs as a dedicated service user (not root)
- Add the unit file to the repository under systemd/ or avalanchego/configs/
- Document the installation and migration procedure
- Consider creating similar unit files for archive nodes (a1, a2)
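As a minimal sketch of the proposed unit, the Python snippet below renders a candidate avalanchego-validator.service. The binary path and data directory come from the evidence above. Everything else is an assumption: the service user name avalanchego, RestartSec=10, LimitNOFILE=65536 and MemoryMax=16G are placeholders, and --data-dir is the only flag passed, so the node's actual flags still need to be added. Running as a dedicated user would also mean moving or re-owning /home/bafuchen/.avalanchego/ during migration.

```python
"""Render a hypothetical avalanchego-validator.service unit file.

All values except the binary path and data directory are illustrative
assumptions, not settings confirmed for numbers-mainnet-validator-1.
"""
from dataclasses import dataclass


@dataclass
class UnitConfig:
    # Binary path and data directory as observed on validator-1.
    binary: str = "/home/bafuchen/avalanchego-v1.14.1/avalanchego"
    data_dir: str = "/home/bafuchen/.avalanchego"
    # Hypothetical dedicated, non-root service account.
    user: str = "avalanchego"
    # Hypothetical restart delay (seconds) and resource limits.
    restart_sec: int = 10
    limit_nofile: int = 65536
    memory_max: str = "16G"


def render_unit(cfg: UnitConfig) -> str:
    """Return the unit file text for the given configuration."""
    if cfg.user == "root":
        raise ValueError("the service should not run as root")
    lines = [
        "[Unit]",
        "Description=AvalancheGo mainnet validator",
        "After=network-online.target",
        "Wants=network-online.target",
        "",
        "[Service]",
        "Type=simple",
        f"User={cfg.user}",
        # Assumption: only --data-dir is shown; append the node's real flags here.
        f"ExecStart={cfg.binary} --data-dir={cfg.data_dir}",
        # Restart after a non-zero exit, unclean signal, timeout or watchdog failure.
        "Restart=on-failure",
        f"RestartSec={cfg.restart_sec}",
        # Hypothetical file-descriptor and memory caps.
        f"LimitNOFILE={cfg.limit_nofile}",
        f"MemoryMax={cfg.memory_max}",
        "",
        "[Install]",
        # Start automatically at boot once the unit is enabled.
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_unit(UnitConfig()), end="")
```

Writing the output to /etc/systemd/system/avalanchego-validator.service and running sudo systemctl daemon-reload followed by sudo systemctl enable --now avalanchego-validator would register the unit for boot and start it immediately; stdout and stderr then go to journald by default. One caveat: Restart=on-failure does not restart the unit after a clean exit (status 0), so it is worth confirming whether avalanchego exits non-zero on its disk-space shutdown. Restart=always avoids that dependency.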
Impact
High — the sole mainnet validator has no automatic recovery from process crashes. Combined with the existing disk monitoring gap (see #138), this creates compounding risk for chain availability.
Generated by NREM Mode with Omni