[NREM][Space] Adopt systemd for avalanchego process management on validator-1 #141

@numbers-official

Summary

Avalanchego on numbers-mainnet-validator-1 is currently started manually and is not managed by systemd or any other process supervisor. As a result:

  • No automatic restart on process crash
  • No automatic start on VM reboot
  • No standardized log management via journald
  • Operators must SSH in and manually start the process after any disruption

Evidence

From conversation f877356c (2026-03-16, "Check last transaction status"):

Avalanchego on validator-1 runs from /home/bafuchen/avalanchego-v1.14.1/avalanchego with data in /home/bafuchen/.avalanchego/. Process is not managed by systemd — started manually.

The 2026-03-15 disk-full incident caused avalanchego to shut itself down. Recovery required an operator to restart the process manually.

Proposed Approach

  1. Create a systemd unit file (avalanchego-validator.service) that:
    • Starts avalanchego with the correct flags and data directory
    • Sets Restart=on-failure with appropriate RestartSec
    • Configures resource limits (memory, file descriptors)
    • Runs as a dedicated service user (not root)
  2. Add the unit file to the repository under systemd/ or avalanchego/configs/
  3. Document the installation and migration procedure
  4. Consider creating similar unit files for archive nodes (a1, a2)
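
As a starting point for step 1, the sketch below renders a candidate avalanchego-validator.service from a small Python template. It is not the final unit: the service user (avalanche), the binary and data paths, and the numeric limits are hypothetical placeholders, and --data-dir is assumed to be the only flag needed. The real flags should be copied from the command line currently used on validator-1.

```python
from dataclasses import dataclass


@dataclass
class UnitParams:
    # All defaults are hypothetical placeholders, not values taken from validator-1.
    user: str = "avalanche"                       # dedicated non-root service user
    binary: str = "/opt/avalanchego/avalanchego"  # assumed install location
    data_dir: str = "/var/lib/avalanchego"        # assumed data directory
    restart_sec: int = 10                         # seconds to wait before a restart
    nofile_limit: int = 65536                     # file descriptor limit
    memory_max: str = "24G"                       # memory cap (needs cgroup v2)


UNIT_TEMPLATE = """\
[Unit]
Description=Avalanchego validator
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User={user}
ExecStart={binary} --data-dir={data_dir}
Restart=on-failure
RestartSec={restart_sec}
LimitNOFILE={nofile_limit}
MemoryMax={memory_max}

[Install]
WantedBy=multi-user.target
"""


def render_unit(params: UnitParams) -> str:
    """Fill the unit template with the given parameters."""
    return UNIT_TEMPLATE.format(**vars(params))


if __name__ == "__main__":
    # Print the unit so it can be reviewed before it is installed.
    print(render_unit(UnitParams()), end="")
```

Once reviewed, the output would typically be saved as /etc/systemd/system/avalanchego-validator.service, followed by `systemctl daemon-reload` and `systemctl enable --now avalanchego-validator.service`. One point to verify during migration: Restart=on-failure does not restart a process that exits cleanly with status 0, so if the disk-full shutdown path exits that way, Restart=always may be the better fit.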

Impact

High: the sole mainnet validator has no automatic recovery from process crashes. Combined with the existing disk monitoring gap (see #138), this creates compounding risk for chain availability.

Generated by NREM Mode with Omni

Metadata

Labels: nrem (NREM Mode finding), nrem:improvement (General improvement), nrem:space (Space-level nrem finding)
