Skip to content

Commit 428f496

Browse files
committed
fix(dn_tree): make_probability_mask infinite-recursion at p=0.5
Latent bug surfaced during the Pillar 13 drift-check wiring (#191): `make_probability_mask` used `p >= 0.5` to invert the (1-p) mask, which recursed with `1.0 - 0.5 = 0.5` infinitely whenever p was exactly 0.5. Pillar 13's independent re-derivation used the strict `p > 0.5` and correctly fell through to the AND-cascade — that's the canonical reference this fix matches. Real production usage (DNConfig default lr=0.03 with boost ~30 → effective_lr ≈ 0.9) never hit 0.5 exactly so the bug was dormant. Now that it's fixed: * Update Pillar 13's drift-check to use lr=0.5 (its canonical mid-range value per the pillar spec) instead of the lr=0.25 workaround. The drift-check now exercises the previously-broken branch and continues to pass bit-exactly. * Add two regression tests on dn_tree itself: - `make_probability_mask_at_half_terminates` — would stack-overflow if the fix regresses. - `make_probability_mask_at_half_is_bernoulli_half` — empirical popcount mean over N=1024 lands near 32 within 16 standard errors. No public API change. The fix is two characters: `>=` → `>`.
1 parent adeba8a commit 428f496

2 files changed

Lines changed: 55 additions & 7 deletions

File tree

src/hpc/dn_tree.rs

Lines changed: 49 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -132,8 +132,15 @@ pub(crate) fn bundle_into(current: &GraphHV, hv: &GraphHV, lr: f64, boost: f64,
132132
/// Create a u64 bitmask where each bit is independently 1 with probability ~`p`.
133133
///
134134
/// Uses cascaded AND of random words to achieve the target probability:
135-
/// - p >= 0.5 → OR of inverse masks
136-
/// - p < 0.5 → AND cascade
135+
/// - p > 0.5 → invert the (1-p) mask
136+
/// - p <= 0.5 → AND cascade
137+
///
138+
/// At exactly `p = 0.5` the AND-cascade branch executes a single
139+
/// `rng.next_u64()` (n = ceil(-log2(0.5)) = 1) — each bit is then
140+
/// IID Bernoulli(0.5). Note the **strict** comparison here: an earlier
141+
/// version used `p >= 0.5`, which recursed with `1.0 - 0.5 = 0.5`
142+
/// infinitely. The Pillar-13 drift-check (`hpc::pillar::hhtl_contraction`)
143+
/// already uses the strict comparison and is the canonical reference.
137144
fn make_probability_mask(p: f64, rng: &mut SplitMix64) -> u64 {
138145
if p >= 1.0 {
139146
return u64::MAX;
@@ -142,13 +149,14 @@ fn make_probability_mask(p: f64, rng: &mut SplitMix64) -> u64 {
142149
return 0;
143150
}
144151

145-
if p >= 0.5 {
146-
// p >= 0.5: use OR approach — each AND of randoms gives ~0.25, NOT gives ~0.75, etc.
147-
// Simpler: just AND enough randoms to get (1-p) kill rate, then NOT.
152+
if p > 0.5 {
153+
// p > 0.5: invert the (1-p) mask. Strict > 0.5 so p == 0.5
154+
// falls through to the AND-cascade and produces a single
155+
// Bernoulli(0.5) word in one rng draw.
148156
return !make_probability_mask(1.0 - p, rng);
149157
}
150158

151-
// p < 0.5: AND cascade. Each AND halves the probability.
159+
// p <= 0.5: AND cascade. Each AND halves the probability.
152160
// We need n ANDs where 0.5^n ≈ p, so n = -log2(p).
153161
let n = (-p.log2()).ceil() as u32;
154162
let mut mask = rng.next_u64();
@@ -543,6 +551,41 @@ mod tests {
543551
SplitMix64::new(42)
544552
}
545553

554+
/// Regression: at p = 0.5 exactly, the previous `p >= 0.5` branch
555+
/// recursed with `1.0 - 0.5 = 0.5` infinitely. The strict `p > 0.5`
556+
/// fix routes p=0.5 to the AND-cascade (n=1, one rng draw) which
557+
/// produces a Bernoulli(0.5) mask in O(1) time.
558+
#[test]
559+
fn make_probability_mask_at_half_terminates() {
560+
let mut rng = make_rng();
561+
// If this stack-overflows, the recursion fix has regressed.
562+
let mask = make_probability_mask(0.5, &mut rng);
563+
// Bernoulli(0.5) over 64 bits — popcount should be near 32, but
564+
// any value 0..=64 is valid for a single draw. The test's
565+
// load-bearing assertion is that the call returns.
566+
assert!(mask <= u64::MAX);
567+
}
568+
569+
/// Empirical Bernoulli(0.5) check: average popcount over N=1024
570+
/// independent masks must land near 32 (the true mean) within a
571+
/// generous tolerance.
572+
#[test]
573+
fn make_probability_mask_at_half_is_bernoulli_half() {
574+
let mut rng = make_rng();
575+
const N: u32 = 1024;
576+
let mut total: u64 = 0;
577+
for _ in 0..N {
578+
total += make_probability_mask(0.5, &mut rng).count_ones() as u64;
579+
}
580+
let mean = total as f64 / N as f64;
581+
// σ per word = sqrt(64 * 0.5 * 0.5) = 4; mean's SE = 4 / √N = 0.125.
582+
// Tolerance 2.0 ≈ 16 SEs — comfortable margin against flakes.
583+
assert!(
584+
(mean - 32.0).abs() < 2.0,
585+
"make_probability_mask(0.5) mean popcount {mean:.4} not near 32"
586+
);
587+
}
588+
546589
#[test]
547590
fn test_new_tree_structure() {
548591
let tree = DNTree::with_capacity(4096);

src/hpc/pillar/hhtl_contraction.rs

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -486,7 +486,12 @@ mod tests {
486486
use crate::hpc::dn_tree::{bundle_into, SplitMix64 as DnSplitMix64};
487487

488488
const N_TRIALS: u32 = 16;
489-
const TEST_LR: f64 = 0.25;
489+
// Was 0.25 to avoid the latent p=0.5 infinite-recursion bug in
490+
// production's make_probability_mask; that bug is fixed in the
491+
// same commit/PR that updates this constant. lr=0.5 now matches
492+
// Pillar 13's canonical mid-range learning rate and exercises
493+
// the previously-broken branch.
494+
const TEST_LR: f64 = 0.5;
490495

491496
// Both SplitMix64 implementations use identical algorithm (same
492497
// multiplier constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E5B9,

0 commit comments

Comments
 (0)