# Content-Addressable Memory (CAM) Proposal

## Context
ladybug-rs already has core::fingerprint::Fingerprint — a 10,000-bit VSA fingerprint (157 u64 words). This proposal elevates the fingerprint from an inline data type to a fixed-offset header component that enables O(1) dedup and two-phase search across all vector containers.
See also: holograph#1 for the storage-layer counterpart.
## The 64-word header

Every stored record gets a fixed 512-byte prefix:

```
┌───────────────┬─────────────────────┬──────────────────────┐
│ 32 meta words │ 32 fingerprint words│ N × 128 content words│
│ offset 0      │ offset 256 B        │ offset 512 B         │
└───────────────┴─────────────────────┴──────────────────────┘
←     HEADER: always 512 bytes       →←   variable quanta   →
```

Container envelope (1 quantum = 128 words = 8,192 bits):

```
MONO:  32 + 32 + 128       = 192 words   1.50 KB
DENSE: 32 + 32 + 128 + 128 = 320 words   2.50 KB
HOLO:  32 + 32 + 128×3     = 448 words   3.50 KB
```
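The envelope arithmetic above can be sanity-checked with a few constants. This is a minimal sketch: the constant names and the `envelope_words` helper are illustrative, not ladybug-rs API.

```rust
// Fixed-offset envelope arithmetic from the proposal.
const META_WORDS: usize = 32;   // 32 × u64 = 256 B
const SKETCH_WORDS: usize = 32; // 32 × u64 = 256 B
const HEADER_WORDS: usize = META_WORDS + SKETCH_WORDS; // 64 words = 512 B
const QUANTUM_WORDS: usize = 128; // 1 quantum = 128 words = 8,192 bits

/// Total envelope size in u64 words for a container holding `quanta` quanta.
const fn envelope_words(quanta: usize) -> usize {
    HEADER_WORDS + quanta * QUANTUM_WORDS
}

fn main() {
    // MONO / DENSE / HOLO carry 1, 2, and 3 quanta respectively.
    assert_eq!(envelope_words(1), 192); // 192 words × 8 B = 1.50 KB
    assert_eq!(envelope_words(2), 320); // 2.50 KB
    assert_eq!(envelope_words(3), 448); // 3.50 KB
    assert_eq!(HEADER_WORDS * 8, 512);  // header is always 512 bytes
}
```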
## How this maps to existing ladybug-rs types

| Existing type | Role in CAM | Notes |
|---|---|---|
| `Fingerprint` (10K) | Content region (MONO quantum) | Unchanged — still the core VSA vector |
| `Fingerprint::from_content()` | Fingerprint generation | XOR-fold 157 words → 32 words for the header sketch |
| `core::simd` | Hamming on fingerprint + content | Fingerprint scan is 32 words = 4 AVX-512 iterations |
| `core::index::VsaIndex` | CAM index integration | Fingerprint hash table for O(1) dedup before ANN |
| `cognitive::collapse_gate` | Meta field consumer | Container kind, ANI level, consciousness flags in meta |
| `cognitive::seven_layer` | Layer markers in meta | 7-layer state fits in meta words 13-16 |
## The 32-word fingerprint as content sketch

```rust
/// Generate a 2048-bit CAM fingerprint from a full `Fingerprint`.
pub fn cam_sketch(fp: &Fingerprint) -> [u64; 32] {
    let raw = fp.as_raw(); // &[u64; 157]
    let mut sketch = [0u64; 32];
    // XOR-fold: 157 words → 32 words
    for (i, &word) in raw.iter().enumerate() {
        sketch[i % 32] ^= word;
    }
    sketch
}

/// Fast pre-filter: Hamming distance on 2048-bit sketches.
pub fn sketch_distance(a: &[u64; 32], b: &[u64; 32]) -> u32 {
    a.iter().zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}
```

The sketch preserves Hamming proximity: if two full fingerprints are close, their sketches are close. The converse isn't guaranteed (false positives), but that's fine — the sketch is a pre-filter, not the final answer.
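The pre-filter guarantee follows from the fold itself: each raw bit maps to exactly one sketch bit, so differing bits can cancel in the sketch but never multiply, making the sketch distance a lower bound on the full Hamming distance. A minimal check of that property, using a bare `[u64; 157]` in place of `Fingerprint::as_raw()` so the snippet is self-contained:

```rust
// Self-contained check of the pre-filter property. A bare [u64; 157]
// stands in for the real Fingerprint's raw words.
fn cam_sketch(raw: &[u64; 157]) -> [u64; 32] {
    let mut sketch = [0u64; 32];
    for (i, &word) in raw.iter().enumerate() {
        sketch[i % 32] ^= word;
    }
    sketch
}

fn sketch_distance(a: &[u64; 32], b: &[u64; 32]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a = [0xDEAD_BEEF_u64; 157];
    let mut b = a;
    // Flip 3 bits in distinct words: full Hamming distance is exactly 3.
    b[0] ^= 1;
    b[40] ^= 1 << 7;
    b[156] ^= 1 << 63;
    let (sa, sb) = (cam_sketch(&a), cam_sketch(&b));
    // Each flipped raw bit toggles one sketch bit; toggles can only
    // cancel, so sketch distance never exceeds the full distance.
    assert!(sketch_distance(&sa, &sb) <= 3);
    // Identical content always yields identical sketches: no false negatives.
    assert_eq!(sketch_distance(&sa, &cam_sketch(&a)), 0);
}
```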
## Two-phase search integration

Currently `VsaIndex::search()` does full Hamming on all candidates. With CAM:

```
Phase 0: CAM sketch scan (32 words per candidate, ~32 cycles)
  → sketch_distance < threshold → promote to Phase 1
  → Rejects ~95% of candidates at ~1/5th the cost of a full scan

Phase 1: Full Hamming (157 words per candidate, ~157 cycles)
  → Only on Phase 0 survivors
  → Existing search pipeline unchanged
```
For 1M vectors: Phase 0 touches 32M words instead of 157M words. ~5× throughput improvement on the scan loop.
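The two phases compose as a simple filter chain. The sketch below is illustrative only: the `Record` struct, threshold parameters, and function names are hypothetical stand-ins for the real index layout, not ladybug-rs API.

```rust
// Hypothetical flat record layout: sketch and full vector side by side.
struct Record {
    sketch: [u64; 32],
    full: [u64; 157],
}

fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Phase 0: cheap 32-word sketch scan rejects most candidates.
/// Phase 1: full 157-word Hamming only on the survivors.
fn two_phase_search(
    records: &[Record],
    q_sketch: &[u64; 32],
    q_full: &[u64; 157],
    sketch_threshold: u32,
    full_threshold: u32,
) -> Vec<usize> {
    records
        .iter()
        .enumerate()
        .filter(|(_, r)| hamming(&r.sketch, q_sketch) < sketch_threshold)
        .filter(|(_, r)| hamming(&r.full, q_full) < full_threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let q_full = [0u64; 157];
    let q_sketch = [0u64; 32];
    let records = vec![
        Record { sketch: [0u64; 32], full: [0u64; 157] },   // exact match
        Record { sketch: [!0u64; 32], full: [!0u64; 157] }, // maximally far
    ];
    // The far record is rejected in Phase 0 without touching its 157 words.
    assert_eq!(two_phase_search(&records, &q_sketch, &q_full, 64, 300), vec![0]);
}
```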
## O(1) Dedup

```rust
use std::collections::HashMap;

struct CamIndex {
    /// fingerprint sketch → record offset
    dedup: HashMap<[u64; 32], usize>,
}

impl CamIndex {
    fn insert(&mut self, sketch: [u64; 32], offset: usize) -> bool {
        // O(1) exact-content dedup: returns true if the sketch was new
        self.dedup.insert(sketch, offset).is_none()
    }

    fn lookup(&self, sketch: &[u64; 32]) -> Option<usize> {
        self.dedup.get(sketch).copied()
    }
}
```

## Relationship to cognitive layer
The 32-word meta block carries cognitive state that the cognitive/ modules already produce:
- `collapse_gate`: FLOW/HOLD/BLOCK decision → meta word 0, bits 8-9
- `seven_layer`: layer activation pattern → meta words 13-16
- `style`: `ThinkingStyle` vector → meta words 5-8 (τ/σ/q compressed)
- `rung`: rung level (3-5) → meta word 1, bits 0-7
- `substrate`: substrate state hash → meta word 2
This means every stored vector carries its cognitive context inline, queryable without touching the content region.
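As a sketch of the bit layout above, here is hypothetical packing code for the collapse-gate and rung fields. The numeric encodings of FLOW/HOLD/BLOCK and the helper names are assumptions for illustration, not ladybug-rs definitions; only the word/bit offsets come from the proposal.

```rust
// Illustrative packing of two cognitive fields into the 32-word meta
// block. Encodings of Flow/Hold/Block are assumed, not ladybug-rs's.
#[derive(Debug, PartialEq, Clone, Copy)]
enum CollapseGate { Flow = 0, Hold = 1, Block = 2 }

/// collapse_gate lives in meta word 0, bits 8-9.
fn set_gate(meta: &mut [u64; 32], gate: CollapseGate) {
    meta[0] = (meta[0] & !(0b11 << 8)) | ((gate as u64) << 8);
}

fn get_gate(meta: &[u64; 32]) -> u64 {
    (meta[0] >> 8) & 0b11
}

/// rung level lives in meta word 1, bits 0-7.
fn set_rung(meta: &mut [u64; 32], rung: u8) {
    meta[1] = (meta[1] & !0xFF) | rung as u64;
}

fn main() {
    let mut meta = [0u64; 32];
    set_gate(&mut meta, CollapseGate::Hold);
    set_rung(&mut meta, 4); // rung levels are 3-5 per the proposal
    // The fields are queryable without touching the content region.
    assert_eq!(get_gate(&meta), CollapseGate::Hold as u64);
    assert_eq!(meta[1] & 0xFF, 4);
}
```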
## Implementation plan

- Add `cam.rs` to `core/` with `CamHeader`, `CamRecord`, `cam_sketch()`, `sketch_distance()`
- Extend `VsaIndex` with an optional CAM pre-filter
- Add `CamHeader` serialization to Arrow `FixedSizeBinary` for LanceDB storage
- Wire cognitive layer metadata into the meta block at store time
## Open questions

- Should `Fingerprint` grow a `.cam_sketch() -> [u64; 32]` method, or should the sketch stay external?
- The 32-word sketch is 2048 bits — enough discrimination? Or should we use 16 words (1024 bits) to save space?
- Integration with `extensions/hologram/bitchain` types — do they get CAM headers too?