fix: persist validator+miner state across container restarts #323

Merged
anderdc merged 1 commit into test from fix/persist-state-across-restarts on May 12, 2026

Conversation

@LandynDev (Collaborator)

Why

The validator running on lena tonight lost its entire `state.db` when
watchtower pulled the new v1.0.2 image — the writable layer of the old
container was destroyed by `cleanup=true`. Result: first post-restart
scoring round routed 100% of emissions to UID 53 (RECYCLE).

We rely on SQLite persistence for `rate_events`, `swap_outcomes`,
and `pending_confirms`. The code was written assuming `~/.allways/`
survives restarts; the deployment config was defeating that.

What

1. Volume mount (deploy fix)

`docker-compose.{vali,miner}.yml` were only binding `~/.bittensor/wallets`.
Add `./data/allways:/root/.allways` so `state.db` (validator) and the
`sent_cache` / rate-posted flags (miner) survive watchtower-driven
container recreation, as sketched below.
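
A minimal sketch of the change, assuming a typical compose layout (the service name and image here are illustrative, not taken from the repo):

```yaml
services:
  validator:
    image: allways/validator:latest                    # illustrative image name
    volumes:
      - ~/.bittensor/wallets:/root/.bittensor/wallets  # existing wallet mount
      - ./data/allways:/root/.allways                  # new: persists state.db
```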

2. Rate bootstrap from chain (code fix)

Even with the mount in place, a brand-new validator or one whose
state.db was wiped would still hit the same bug on its first scoring
round — `reconstruct_window_start_state` had no rate visible at
`window_start`, so every miner read as "no rate posted" and the
entire pool recycled.

Add `Validator.bootstrap_miner_rates()` (~30 lines), called once after
`event_watcher.initialize`. It reads the current on-chain commitments and
seeds one anchor rate event per (hotkey, direction) at
`cursor = current_block − SCORING_WINDOW_BLOCKS`, inserting only if no
event already exists at that cursor (idempotent across restarts). This
mirrors the active-flag anchoring that `event_watcher.initialize` already
does, and uses a normal contract storage read (no archive node needed).
A sketch follows.
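
A hypothetical sketch of that method, written here as a standalone function; the store accessors (`has_rate_event_at`, `insert_rate_event`) and the constant's value are assumptions, since the PR does not show the actual state.db API:

```python
from typing import Iterable, Protocol

SCORING_WINDOW_BLOCKS = 7200  # placeholder value; the real constant lives in the repo


class StateDB(Protocol):
    # Assumed interface over state.db; the real accessor names are not shown in the PR.
    def has_rate_event_at(self, hotkey: str, direction: str, block: int) -> bool: ...
    def insert_rate_event(self, hotkey: str, direction: str, block: int, rate: float) -> None: ...


def bootstrap_miner_rates(
    db: StateDB,
    current_block: int,
    commitments: Iterable[tuple[str, str, float]],  # (hotkey, direction, rate) read from chain
) -> None:
    """Seed one anchor rate event per (hotkey, direction) so that
    reconstruct_window_start_state sees a rate at window_start even on a
    cold start. Idempotent: existing anchors are left untouched."""
    cursor = current_block - SCORING_WINDOW_BLOCKS
    for hotkey, direction, rate in commitments:
        if db.has_rate_event_at(hotkey, direction, cursor):
            continue  # already seeded by a previous run; skip to stay idempotent
        db.insert_rate_event(hotkey=hotkey, direction=direction, block=cursor, rate=rate)
```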

Operator action after merge

For existing deployments, after pulling and recreating:

```bash
mkdir -p ./data/allways # next to docker-compose.yml
docker compose up -d # picks up the new mount
```

Subsequent watchtower updates preserve state automatically.

Test plan

  • `pytest tests/` → 411 passed
  • `ruff format && ruff check` clean
  • After deploy, confirm the next `V1 scoring` round on lena shows a non-zero distributed amount (i.e. UID 136/189 actually credited instead of 100% recycled)
  • Confirm `state.db` persists across a manual `docker compose up -d --force-recreate` (see the check sketched below)
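
A minimal way to run that last check, assuming the compose service is named `validator`:

```bash
docker compose up -d --force-recreate
# state.db should survive the recreate; the path assumes the new mount
docker compose exec validator ls -l /root/.allways/state.db
```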

Two compounding bugs were causing the first post-restart scoring round
to route 100% of emissions to RECYCLE_UID (UID 53):

1) docker-compose.{vali,miner}.yml only mounted ~/.bittensor/wallets,
   not ~/.allways. state.db (rate_events, swap_outcomes,
   pending_confirms for the validator; sent_cache + rate_posted flags
   for the miner) lived in the container's writable layer and got
   destroyed every time watchtower pulled a new image. Mount
   ./data/allways:/root/.allways so state survives container recreate.

2) Even after fixing the mount, a brand-new validator (or one whose
   state.db was wiped) had no rate event visible at window_start, so
   reconstruct_window_start_state returned an empty rates dict on the
   first scoring pass — every miner read as 'no rate posted', no
   crown was awarded, full pool recycled. Add bootstrap_miner_rates()
   which reads current on-chain commitments at init and seeds one
   anchor rate event per (hotkey, direction) at cursor. Mirrors what
   event_watcher.initialize already does for active flags.

Tested: 411 tests pass. Lena confirmed only the wallets dir was
bind-mounted, so the writable layer was being destroyed on every
watchtower update.
@xiao-xiao-mao (bot) added the bug label on May 12, 2026
@anderdc merged commit ae71c70 into test on May 12, 2026
3 checks passed
@anderdc deleted the fix/persist-state-across-restarts branch on May 12, 2026 at 23:14