fix(daemon): adaptive EMA integration + shard.ttl 3-threshold shape (Phase 6 soak) #146
Merged
githubrobbi merged 2 commits into main on May 7, 2026
Conversation
…Phase 6 soak)

Two root-cause fixes from the 2026-05-07 Phase 6 24-h `min_tier="WARM"` soak:

1. `DriveStats::decay_ema` only ever decayed; new queries from `mark_query_at` were never integrated, so `rate_ema_micro_per_s` stayed at 0 regardless of search load. The 24-h soak captured `rate_qpm=0.0` across all 2882 `shard.ttl` events for the queried drive — the adaptive bonus formula in `crate::cache::policy::warm_ttl` could never engage. Fix: standard half-life EMA blend `new = decay·prev + (1-decay)·sample` with `sample = delta_queries/elapsed_secs`, expressed as `mul_add` for the `suboptimal_flops` gate. A new atomic `last_decay_queries_total` tracks the queries-total snapshot; it is threaded through snapshot serde with `#[serde(default)]` so pre-fix snapshots restore as "first call after restore". Hot-path posture preserved.

2. The `shard.ttl` tracing event only emitted `chosen_ttl_sec` (the outgoing edge of the drive's *current* tier). Drives in different tiers reported different fields, and the soak validator's `chosen_ttl_sec exceeds peers` assertion was structurally impossible to pass under the default ladder (warm cap < parked base). Fix: emit all four fields — `chosen_ttl_sec` (back-compat), `hot_ttl_sec`, `warm_ttl_sec`, `parked_ttl_sec` — on every event so consumers can pick a single edge and compare across drives regardless of tier. Applied to all three event paths: min-tier-clamp / idle-demote / below-ttl.

3. The soak validator (`scripts/dev/long-soak.rs`) now uses `warm_ttl_sec` (the rate-sensitive Warm→Parked edge present on every drive) and the max observed value per drive — robust against EMA decay between the synthetic-load window and the validation read.

Tests:

- `decay_ema_integrates_new_queries_into_rate_estimate` — 60 q over 60 s lifts the EMA above 0.
- `decay_ema_idle_run_only_decays` — pure-decay-when-no-new-queries property.
- `shard_ttl_event_emits_all_three_thresholds` — four-field shape on the live demote-controller path; extracted into a sibling `shard_ttl_events` module to keep `idle_demote.rs` under the 800-LOC file-size policy.

Mac gates green: cargo nextest (281/281), lint-fast, lint-tests, check-windows.
Two operational-reliability fixes for the soak harness, driven by the 2026-05-07 Phase 7 attempt that failed with 'Daemon did not become ready in time' even though the daemon was up and IPC-listening 1.3 s after spawn (per the captured daemon.log).

1. Idempotent attach. New `Daemon::is_ready()` helper; `Daemon::start` now pre-checks Ready at the top and skips the spawn entirely when a healthy daemon is already running. Honours the operator principle "don't kill a healthy daemon, don't try to start a second one".

2. Race-tolerant spawn. The CLI's own `daemon start` readiness probe has a tighter wall-clock budget than the Windows AF_UNIX socket bind takes (~1.3 s observed; the CLI gave up at `attempt 3/20` with 'request timed out'). The harness now treats a non-zero spawn exit as advisory: if the daemon reaches Ready via our own 180 s status poll, we accept it and emit a one-line warning so the operator can see the race without the soak failing.

Phase 6 still calls `ensure_stopped() + start()` because it mutates daemon.toml. Phase 7 / ws-trace currently call `ensure_stopped()` too — that remains the operator's choice; the new `start` is safe either way.

Note: the underlying CLI bug — `UffsClientSync::await_ready` returning `Err(other)` on `ClientError::Timeout` instead of retrying — is not fixed here. The soak harness bypasses it by polling `uffs daemon status` directly. A separate follow-up is being filed for the CLI-side fix.
Summary
Two root-cause fixes from the 2026-05-07 Phase 6 24-h `min_tier="WARM"` soak, plus a corresponding update to the soak validator so future runs measure the right thing.

Findings & fixes
1. `DriveStats::decay_ema` only decayed — never integrated new queries

The 24-h soak captured `rate_qpm=0.0` across all 2882 `shard.ttl` events for the queried drive. Root cause: `decay_ema` only applied exponential decay; `mark_query_at` bumps were never folded into the EMA. The adaptive bonus formula in `crate::cache::policy::warm_ttl` therefore could never engage in production.

Fix (`crates/uffs-daemon/src/cache/shard/drive_stats.rs`): standard half-life EMA blend `new = decay·prev + (1-decay)·sample` with `sample = delta_queries/elapsed_secs`, expressed as `mul_add` for the `suboptimal_flops` gate. A new atomic `last_decay_queries_total` tracks the queries-total snapshot at the last call; it is threaded through snapshot serde with `#[serde(default)]` so pre-fix snapshots restore cleanly as "first call after restore". Hot-path posture is preserved — `mark_query_at` still touches only relaxed atomics.

2.
`shard.ttl` event reported tier-mixed fields

Pre-fix, the event emitted only `chosen_ttl_sec` — the outgoing edge of the drive's current tier. Drives in different tiers therefore reported different underlying values (Warm→Parked vs Parked→Cold), and the soak validator's `chosen_ttl_sec exceeds peers` assertion was structurally impossible to pass under the default ladder (warm cap < parked base).

Fix (`crates/uffs-daemon/src/index/transitions.rs`): emit all four fields — `chosen_ttl_sec` (back-compat preserved), `hot_ttl_sec`, `warm_ttl_sec`, `parked_ttl_sec` — on every event so consumers can pick a single edge and compare across drives regardless of tier. Applied to all three event paths: min-tier-clamp / idle-demote / below-ttl.

3. Soak validator alignment
`scripts/dev/long-soak.rs` now uses `warm_ttl_sec` (the rate-sensitive Warm→Parked edge present on every drive's events) and the max observed value per drive across the soak — robust against EMA decay between the synthetic-load window and the validation read.

Tests
- `decay_ema_integrates_new_queries_into_rate_estimate` — pins the Phase 6 contract: 60 q over 60 s lifts the EMA above 0.
- `decay_ema_idle_run_only_decays` — pins the pure-decay-when-no-new-queries property (no random walk introduced).
- `shard_ttl_event_emits_all_three_thresholds` — pins the four-field shape on the live demote-controller path. Extracted into a sibling `shard_ttl_events` module to keep `idle_demote.rs` under the 800-LOC file-size policy (no exception added).

Local validation
- `cargo nextest run -p uffs-daemon --lib` — 281/281 passed
- `just lint-fast` — fmt-check, file-size, typos, reuse, lint-ci, lint-prod, lint-tests
- `just check-windows` — `cargo xwin check --workspace --all-targets --all-features`

Compliance audit (mandatory rules)
- File-size policy honoured by extracting `shard_ttl_events.rs`, not via `file_size_exceptions.txt`. The new `#![expect(...)]` block carries only the lints actually triggered (no blanket allows).
- … (`let` bindings).
- `chosen_ttl_sec` field preserved for back-compat; first-call-returns-stored-value contract preserved (existing test still green); pre-fix snapshots restore cleanly.

Follow-up
The Phase 6 24-h Windows soak should be re-run once this lands so we capture a clean `warm_ttl_sec` divergence between the queried target drive and idle peers.