fix: persist validator+miner state across container restarts #323

Merged
anderdc merged 1 commit into test from fix/persist-state-across-restarts on May 12, 2026

Conversation

@LandynDev (Collaborator)

Why

The validator running on lena tonight lost its entire `state.db` when
watchtower pulled the new v1.0.2 image — the writable layer of the old
container was destroyed by `cleanup=true`. Result: first post-restart
scoring round routed 100% of emissions to UID 53 (RECYCLE).

We rely on SQLite persistence for `rate_events`, `swap_outcomes`,
and `pending_confirms`. The code was written assuming `~/.allways/`
survives restarts; the deployment config was defeating that.

What

1. Volume mount (deploy fix)

`docker-compose.{vali,miner}.yml` were only binding `~/.bittensor/wallets`.
Add `./data/allways:/root/.allways` so `state.db` (validator) and the
`sent_cache` / rate-posted flags (miner) survive watchtower-driven
container recreation, as sketched below.
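
A minimal sketch of the change, assuming a typical compose layout (the service name and image here are illustrative, not taken from the repo):

```yaml
services:
  validator:
    image: allways/validator:latest                    # illustrative image name
    volumes:
      - ~/.bittensor/wallets:/root/.bittensor/wallets  # existing wallet mount
      - ./data/allways:/root/.allways                  # new: persists state.db
```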

2. Rate bootstrap from chain (code fix)

Even with the mount in place, a brand-new validator or one whose
state.db was wiped would still hit the same bug on its first scoring
round — `reconstruct_window_start_state` had no rate visible at
`window_start`, so every miner read as "no rate posted" and the
entire pool recycled.

Add `Validator.bootstrap_miner_rates()` (~30 lines), called once after
`event_watcher.initialize`. It reads the current on-chain commitments and
seeds one anchor rate event per (hotkey, direction) at
`cursor = current_block − SCORING_WINDOW_BLOCKS`, inserting only if no
event already exists at that cursor (idempotent across restarts). This
mirrors the active-flag anchoring that `event_watcher.initialize` already
does, and uses a normal contract storage read (no archive node needed).
A sketch follows.
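
A hypothetical sketch of that method, written here as a standalone function; the store accessors (`has_rate_event_at`, `insert_rate_event`) and the constant's value are assumptions, since the PR does not show the actual state.db API:

```python
from typing import Iterable, Protocol

SCORING_WINDOW_BLOCKS = 7200  # placeholder value; the real constant lives in the repo


class StateDB(Protocol):
    # Assumed interface over state.db; the real accessor names are not shown in the PR.
    def has_rate_event_at(self, hotkey: str, direction: str, block: int) -> bool: ...
    def insert_rate_event(self, hotkey: str, direction: str, block: int, rate: float) -> None: ...


def bootstrap_miner_rates(
    db: StateDB,
    current_block: int,
    commitments: Iterable[tuple[str, str, float]],  # (hotkey, direction, rate) read from chain
) -> None:
    """Seed one anchor rate event per (hotkey, direction) so that
    reconstruct_window_start_state sees a rate at window_start even on a
    cold start. Idempotent: existing anchors are left untouched."""
    cursor = current_block - SCORING_WINDOW_BLOCKS
    for hotkey, direction, rate in commitments:
        if db.has_rate_event_at(hotkey, direction, cursor):
            continue  # already seeded by a previous run; skip to stay idempotent
        db.insert_rate_event(hotkey=hotkey, direction=direction, block=cursor, rate=rate)
```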

Operator action after merge

For existing deployments, after pulling and recreating:

```bash
mkdir -p ./data/allways # next to docker-compose.yml
docker compose up -d # picks up the new mount
```

Subsequent watchtower updates preserve state automatically.

Test plan

  • `pytest tests/` → 411 passed
  • `ruff format && ruff check` clean
  • After deploy, confirm the next `V1 scoring` round on lena shows a non-zero distributed amount (i.e. UID 136/189 actually credited instead of 100% recycled)
  • Confirm `state.db` persists across a manual `docker compose up -d --force-recreate` (see the check sketched below)
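
A minimal way to run that last check, assuming the compose service is named `validator`:

```bash
docker compose up -d --force-recreate
# state.db should survive the recreate; the path assumes the new mount
docker compose exec validator ls -l /root/.allways/state.db
```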

Two compounding bugs were causing the first post-restart scoring round
to route 100% of emissions to RECYCLE_UID (UID 53):

1) docker-compose.{vali,miner}.yml only mounted ~/.bittensor/wallets,
   not ~/.allways. state.db (rate_events, swap_outcomes,
   pending_confirms for the validator; sent_cache + rate_posted flags
   for the miner) lived in the container's writable layer and got
   destroyed every time watchtower pulled a new image. Mount
   ./data/allways:/root/.allways so state survives container recreate.

2) Even after fixing the mount, a brand-new validator (or one whose
   state.db was wiped) had no rate event visible at window_start, so
   reconstruct_window_start_state returned an empty rates dict on the
   first scoring pass — every miner read as 'no rate posted', no
   crown was awarded, full pool recycled. Add bootstrap_miner_rates()
   which reads current on-chain commitments at init and seeds one
   anchor rate event per (hotkey, direction) at cursor. Mirrors what
   event_watcher.initialize already does for active flags.

Tested: 411 tests pass. Lena confirmed only the wallets dir was
bind-mounted, so the writable layer was being destroyed on every
watchtower update.
@xiao-xiao-mao (bot) added the bug label on May 12, 2026
@anderdc merged commit ae71c70 into test on May 12, 2026
3 checks passed
@anderdc deleted the fix/persist-state-across-restarts branch on May 12, 2026 at 23:14