fix(daemon): adaptive EMA integration + shard.ttl 3-threshold shape (Phase 6 soak) #146

Merged: githubrobbi merged 2 commits into main from fix/phase6-soak-ema-and-ttl-event-shape on May 7, 2026

@githubrobbi (Collaborator)

Summary

Two root-cause fixes from the 2026-05-07 Phase 6 24-h min_tier="WARM" soak plus a corresponding update to the soak validator so future runs measure the right thing.

Findings & fixes

1. DriveStats::decay_ema only decayed — never integrated new queries

The 24-h soak captured rate_qpm=0.0 across all 2882 shard.ttl events for the queried drive. Root cause: decay_ema only applied exponential decay; mark_query_at bumps were never folded into the EMA. The adaptive bonus formula in crate::cache::policy::warm_ttl therefore could never engage in production.

Fix (crates/uffs-daemon/src/cache/shard/drive_stats.rs): standard half-life EMA blend new = decay·prev + (1-decay)·sample with sample = delta_queries/elapsed_secs, expressed as mul_add for the suboptimal_flops gate. New atomic last_decay_queries_total tracks the queries-total snapshot at the last call; threaded through snapshot serde with #[serde(default)] so pre-fix snapshots restore cleanly as "first call after restore". Hot-path posture preserved — mark_query_at still touches only relaxed atomics.
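
The blend described above can be sketched as a pure function. This is a minimal illustration, not the code in `drive_stats.rs`: the name `blend_ema`, its signature, and the use of `f64` queries-per-second are assumptions, and the half-life is left as a parameter because the PR does not state its value. It only shows how `decay·prev + (1-decay)·sample` collapses into a single `mul_add`.

```rust
/// Hypothetical helper (not the daemon's actual API): fold one rate sample
/// into a half-life EMA.
///
/// `prev` is the previous estimate in queries/second, `delta_queries` is the
/// number of queries seen since the last call, and `elapsed_secs` is the time
/// since that call.
fn blend_ema(prev: f64, delta_queries: u64, elapsed_secs: f64, half_life_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 || half_life_secs <= 0.0 {
        // Nothing to integrate yet: keep the stored value unchanged.
        return prev;
    }
    // The weight of the old estimate halves every `half_life_secs`.
    let decay = 0.5_f64.powf(elapsed_secs / half_life_secs);
    // Average query rate over the interval since the last call.
    let sample = delta_queries as f64 / elapsed_secs;
    // decay * (prev - sample) + sample == decay * prev + (1 - decay) * sample
    decay.mul_add(prev - sample, sample)
}
```

With no new queries the sample is 0 and the expression reduces to `decay * prev`, which is the pure-decay behaviour the idle-run test pins.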

2. shard.ttl event reported tier-mixed fields

Pre-fix, the event emitted only chosen_ttl_sec — the outgoing edge of the drive's current tier. Drives in different tiers therefore reported different underlying values (Warm→Parked vs Parked→Cold), and the soak validator's chosen_ttl_sec exceeds peers assertion was structurally impossible to pass under the default ladder (warm cap < parked base).

Fix (crates/uffs-daemon/src/index/transitions.rs): emit all four fields — chosen_ttl_sec (back-compat preserved), hot_ttl_sec, warm_ttl_sec, parked_ttl_sec — on every event so consumers can pick a single edge and compare across drives regardless of tier. Applied to all three event paths: min-tier-clamp / idle-demote / below-ttl.
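
To make the comparison problem concrete, here is a small sketch of the event shape. The `Tier` enum, the `ShardTtlEvent` struct, and the `shard_ttl_event` constructor are illustrative assumptions; only the four field names come from this PR, and the real event is emitted through tracing rather than built as a struct.

```rust
/// Hypothetical tier model: each tier has one outgoing demotion edge
/// (Hot→Warm, Warm→Parked, Parked→Cold).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tier {
    Hot,
    Warm,
    Parked,
}

/// The four fields carried by every `shard.ttl` event after the fix.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ShardTtlEvent {
    chosen_ttl_sec: u64,
    hot_ttl_sec: u64,
    warm_ttl_sec: u64,
    parked_ttl_sec: u64,
}

/// Build an event: `chosen_ttl_sec` stays the outgoing edge of the current
/// tier (back-compat), while all three thresholds are always reported.
fn shard_ttl_event(tier: Tier, hot: u64, warm: u64, parked: u64) -> ShardTtlEvent {
    let chosen = match tier {
        Tier::Hot => hot,
        Tier::Warm => warm,
        Tier::Parked => parked,
    };
    ShardTtlEvent {
        chosen_ttl_sec: chosen,
        hot_ttl_sec: hot,
        warm_ttl_sec: warm,
        parked_ttl_sec: parked,
    }
}
```

For example, with made-up values, a queried drive in Warm with `warm = 900` and an idle peer in Parked with `warm = 600, parked = 3600` report `chosen_ttl_sec` of 900 and 3600 respectively, so the old assertion fails, while `warm_ttl_sec` (900 vs 600) compares the same edge on both drives.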

3. Soak validator alignment

scripts/dev/long-soak.rs now uses warm_ttl_sec (the rate-sensitive Warm→Parked edge present on every drive's events) and the max observed value per drive across the soak — robust against EMA decay between the synthetic-load window and the validation read.
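
A rough sketch of that validation step, assuming events have already been parsed into `(drive_letter, warm_ttl_sec)` pairs. The function names and the tuple representation are hypothetical and do not reflect the actual structure of `long-soak.rs`.

```rust
use std::collections::HashMap;

/// Reduce parsed `shard.ttl` events to the maximum `warm_ttl_sec` seen per
/// drive, so later EMA decay cannot hide a bonus that was observed earlier.
fn max_warm_ttl_per_drive(events: &[(char, u64)]) -> HashMap<char, u64> {
    let mut max_by_drive: HashMap<char, u64> = HashMap::new();
    for &(drive, warm_ttl_sec) in events {
        let entry = max_by_drive.entry(drive).or_insert(0);
        *entry = (*entry).max(warm_ttl_sec);
    }
    max_by_drive
}

/// True when the target drive's peak strictly exceeds every peer's peak.
/// Returns false if the target produced no events at all; with no peers
/// present the check passes vacuously.
fn target_exceeds_peers(max_by_drive: &HashMap<char, u64>, target: char) -> bool {
    let Some(&target_max) = max_by_drive.get(&target) else {
        return false;
    };
    max_by_drive
        .iter()
        .filter(|(drive, _)| **drive != target)
        .all(|(_, &peer_max)| target_max > peer_max)
}
```

Taking the per-drive maximum rather than the last observed value is what makes the check insensitive to when the validation read happens relative to the load window.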

Tests

  • decay_ema_integrates_new_queries_into_rate_estimate — pins the Phase 6 contract: 60 q over 60 s lifts the EMA above 0.
  • decay_ema_idle_run_only_decays — pins the pure-decay-when-no-new-queries property (no random walk introduced).
  • shard_ttl_event_emits_all_three_thresholds — pins the four-field shape on the live demote controller path. Extracted into a sibling shard_ttl_events module to keep idle_demote.rs under the 800-LOC file-size policy (no exception added).

Local validation

  • cargo nextest run -p uffs-daemon --lib: 281/281 passed
  • just lint-fast: fmt-check, file-size, typos, reuse, lint-ci, lint-prod, lint-tests
  • just check-windows: cargo xwin check --workspace --all-targets --all-features

Compliance audit (mandatory rules)

  1. No suppression hacks — file-size violation resolved by decomposition into shard_ttl_events.rs, not via file_size_exceptions.txt. The new #![expect(...)] block carries only the lints actually triggered (no blanket allows).
  2. Surgical, correct fixes — root-cause fix for the EMA (standard half-life blend), minimal additions for the tracing event (3 fields + 3 let bindings).
  3. Preserve behavior & contracts: chosen_ttl_sec field preserved for back-compat, first-call-returns-stored-value contract preserved (existing test still green), pre-fix snapshots restore cleanly.
  4. Improve tests, don't dodge them — three new tests added pinning the Phase 6 contracts; tolerances generous-but-meaningful so a regression that drops integration entirely fails decisively.

Follow-up

The Phase 6 24-h Windows soak should be re-run once this lands so we capture a clean warm_ttl_sec divergence between the queried target drive and idle peers.

…Phase 6 soak)

Two root-cause fixes from the 2026-05-07 Phase 6 24-h `min_tier="WARM"` soak:

1. `DriveStats::decay_ema` only ever decayed; new queries from `mark_query_at` were never integrated, so `rate_ema_micro_per_s` stayed at 0 regardless of search load. 24-h soak captured `rate_qpm=0.0` across all 2882 `shard.ttl` events for the queried drive — the adaptive bonus formula in `crate::cache::policy::warm_ttl` could never engage. Fix: standard half-life EMA blend `new = decay·prev + (1-decay)·sample` with `sample = delta_queries/elapsed_secs`, expressed as `mul_add` for the suboptimal_flops gate. New atomic `last_decay_queries_total` tracks the queries-total snapshot; threaded through snapshot serde with `#[serde(default)]` so pre-fix snapshots restore as "first call after restore". Hot-path posture preserved.

2. `shard.ttl` tracing event only emitted `chosen_ttl_sec` (the outgoing edge of the drive's *current* tier). Drives in different tiers reported different fields, and the soak validator's `chosen_ttl_sec exceeds peers` assertion was structurally impossible to pass under the default ladder (warm cap < parked base). Fix: emit all four fields — `chosen_ttl_sec` (back-compat), `hot_ttl_sec`, `warm_ttl_sec`, `parked_ttl_sec` — on every event so consumers can pick a single edge and compare across drives regardless of tier. Applied to all three event paths: min-tier-clamp / idle-demote / below-ttl.

3. Soak validator (`scripts/dev/long-soak.rs`) uses `warm_ttl_sec` (rate-sensitive Warm→Parked edge present on every drive) and the max observed value per drive — robust against EMA decay between the synthetic-load window and validation read.

Tests:

- `decay_ema_integrates_new_queries_into_rate_estimate` — 60 q over 60 s lifts EMA above 0.

- `decay_ema_idle_run_only_decays` — pure-decay-when-no-new-queries property.

- `shard_ttl_event_emits_all_three_thresholds` — four-field shape on the live demote controller path; extracted into a sibling `shard_ttl_events` module to keep `idle_demote.rs` under the 800-LOC file-size policy.

Mac gates green: cargo nextest (281/281), lint-fast, lint-tests, check-windows.

Two operational reliability fixes for the soak harness, driven by the 2026-05-07 Phase 7 attempt that failed with 'Daemon did not become ready in time' even though the daemon was up and IPC-listening 1.3 s after spawn (per the captured daemon.log).

1. Idempotent attach. New `Daemon::is_ready()` helper; `Daemon::start` now pre-checks Ready at the top and skips the spawn entirely when a healthy daemon is already running. Honours the operator principle "don't kill a healthy daemon, don't try to start a second one".

2. Race-tolerant spawn. The CLI's own `daemon start` readiness probe has a tighter wall-clock budget than Windows AF_UNIX socket bind takes (~1.3 s observed; the CLI gave up at `attempt 3/20` with 'request timed out'). The harness now treats a non-zero spawn exit as advisory: if the daemon reaches Ready via our own 180 s status poll, we accept it and emit a one-line warning so the operator can see the race without the soak failing.
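
The two behaviours above can be sketched together as one start routine. This is an illustration under stated assumptions, not the harness code: `start_daemon`, `StartOutcome`, and the closure-based probes are hypothetical stand-ins for `Daemon::start`, `Daemon::is_ready()`, and the CLI spawn, and the timeout is a parameter (the harness uses a 180 s status poll).

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
enum StartOutcome {
    /// A healthy daemon was already running; nothing was spawned.
    AlreadyRunning,
    /// Spawn succeeded and the daemon reached Ready.
    Spawned,
    /// Spawn reported failure, but the daemon reached Ready anyway.
    SpawnedAfterRace,
}

/// `is_ready` probes daemon status; `spawn` runs the start command and
/// returns whether it exited successfully.
fn start_daemon(
    mut is_ready: impl FnMut() -> bool,
    spawn: impl FnOnce() -> bool,
    ready_timeout: Duration,
    poll_interval: Duration,
) -> Result<StartOutcome, String> {
    // Idempotent attach: never start a second daemon next to a healthy one.
    if is_ready() {
        return Ok(StartOutcome::AlreadyRunning);
    }
    // Race-tolerant spawn: a failed exit is advisory, not fatal.
    let spawn_ok = spawn();
    let deadline = Instant::now() + ready_timeout;
    loop {
        if is_ready() {
            if !spawn_ok {
                eprintln!("warning: spawn exited non-zero but daemon reached Ready");
                return Ok(StartOutcome::SpawnedAfterRace);
            }
            return Ok(StartOutcome::Spawned);
        }
        if Instant::now() >= deadline {
            return Err("Daemon did not become ready in time".to_owned());
        }
        thread::sleep(poll_interval);
    }
}
```

Passing the probes in as closures keeps the sketch testable without a real daemon; the harness's own polling of `uffs daemon status` would take the place of `is_ready`.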

Phase 6 still calls `ensure_stopped() + start()` because it mutates daemon.toml. Phase 7 / ws-trace currently call `ensure_stopped()` too — that remains the operator's choice; the new `start` is safe either way.

Note: the underlying CLI bug (`UffsClientSync::await_ready` returning `Err(other)` on `ClientError::Timeout` instead of retrying) is not fixed here. The soak harness bypasses it by polling `uffs daemon status` directly. Filing a separate follow-up for the CLI-side fix.

githubrobbi enabled auto-merge (squash) on May 7, 2026, 23:14
githubrobbi merged commit fe1f2f1 into main on May 7, 2026 (26 checks passed)
githubrobbi deleted the fix/phase6-soak-ema-and-ttl-event-shape branch on May 7, 2026, 23:16