feat: CAM (Content-Addressable Memory) header for Fingerprint-first search #89

Content-Addressable Memory (CAM) Proposal

Context

ladybug-rs already has core::fingerprint::Fingerprint — a 10,000-bit VSA fingerprint (157 u64 words). This proposal elevates the fingerprint from an inline data type to a fixed-offset header component that enables O(1) dedup and two-phase search across all vector containers.

See also: holograph#1 for the storage-layer counterpart.

The 64-word header

Every stored record gets a fixed 512-byte prefix:

 ┌───────────────┬────────────────┬──────────────────┐
 │ 32 meta       │ 32 fingerprint │ N × 128 content  │
 │ offset 0      │ offset 256 B   │ offset 512 B     │
 └───────────────┴────────────────┴──────────────────┘
 ←  HEADER: always 512 bytes     →←  variable quanta →

Container envelope (1 quantum = 128 words = 8,192 bits):

MONO:   32 + 32 + 128       = 192 words   1.50 KB
DENSE:  32 + 32 + 128 + 128 = 320 words   2.50 KB
HOLO:   32 + 32 + 128×3     = 448 words   3.50 KB
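
A minimal Rust sketch of the header (struct and field names here are illustrative, not existing ladybug-rs items; only the offsets and sizes come from the proposal):

/// Fixed 512-byte CAM header: 32 meta words followed by the
/// 32-word fingerprint sketch. Content quanta start at offset 512 B.
#[repr(C)]
pub struct CamHeader {
    pub meta: [u64; 32],        // offset 0
    pub fingerprint: [u64; 32], // offset 256 B
}

// Pin the layout at compile time.
const _: () = assert!(core::mem::size_of::<CamHeader>() == 512);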

How this maps to existing ladybug-rs types

Existing type                 Role in CAM                        Notes
Fingerprint (10K)             Content region (MONO quantum)      Unchanged; still the core VSA vector
Fingerprint::from_content()   Fingerprint generation             XOR-fold 157 words → 32 words for the header sketch
core::simd                    Hamming on fingerprint + content   Fingerprint scan is 32 words = 4 AVX-512 iterations
core::index::VsaIndex         CAM index integration              Fingerprint hash table for O(1) dedup before ANN
cognitive::collapse_gate      Meta field consumer                Container kind, ANI level, consciousness flags in meta
cognitive::seven_layer        Layer markers in meta              7-layer state fits in meta words 13-16

The 32-word fingerprint as content sketch

/// Generate 2048-bit CAM fingerprint from a full Fingerprint
pub fn cam_sketch(fp: &Fingerprint) -> [u64; 32] {
    let raw = fp.as_raw();  // &[u64; 157]
    let mut sketch = [0u64; 32];

    // XOR-fold: 157 words → 32 words
    for (i, &word) in raw.iter().enumerate() {
        sketch[i % 32] ^= word;
    }
    sketch
}

/// Fast pre-filter: Hamming on 2048-bit sketches
pub fn sketch_distance(a: &[u64; 32], b: &[u64; 32]) -> u32 {
    a.iter().zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

The sketch preserves Hamming locality. Because XOR-folding can only merge differing bits (and sometimes cancel them), the sketch distance never exceeds the full Hamming distance: if two full fingerprints are close, their sketches are at least as close. The converse isn't guaranteed (distant fingerprints can fold to similar sketches, producing false positives), but that's fine: the sketch is a pre-filter, not the final answer.
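
A quick property check of that bound, with a standalone fold over raw words so no Fingerprint instance is needed (fold() mirrors cam_sketch() above):

/// XOR-fold on a bare word array, identical to cam_sketch().
fn fold(raw: &[u64; 157]) -> [u64; 32] {
    let mut sketch = [0u64; 32];
    for (i, &word) in raw.iter().enumerate() {
        sketch[i % 32] ^= word;
    }
    sketch
}

/// Full 157-word Hamming distance.
fn full_distance(a: &[u64; 157], b: &[u64; 157]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

#[test]
fn sketch_distance_never_exceeds_full_distance() {
    let (mut a, mut b) = ([0u64; 157], [0u64; 157]);
    a[0] = 0xFF;  // differs in word 0 ...
    b[32] = 0xFF; // ... and word 32, which folds into the same sketch slot
    // Full distance is 16; the folded differences cancel to 0.
    assert!(sketch_distance(&fold(&a), &fold(&b)) <= full_distance(&a, &b));
}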

Two-phase search integration

Currently VsaIndex::search() does full Hamming on all candidates. With CAM:

Phase 0: CAM sketch scan (32 words per candidate, ~32 cycles)
  → sketch_distance < threshold → promote to Phase 1
  → Rejects ~95% of candidates at 1/5th the cost of full scan

Phase 1: Full Hamming (157 words per candidate, ~157 cycles)
  → Only on Phase 0 survivors
  → Existing search pipeline unchanged

For 1M vectors: Phase 0 touches 32M words instead of 157M words. ~5× throughput improvement on the scan loop.
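
A minimal sketch of the combined loop (the record layout and threshold handling are assumptions; fold(), sketch_distance(), and full_distance() as above):

/// Hypothetical two-phase scan: cheap sketch pre-filter, then full Hamming.
fn two_phase_search(
    query: &[u64; 157],
    records: &[([u64; 32], [u64; 157])], // (CAM sketch, full fingerprint)
    sketch_threshold: u32,
) -> Vec<(usize, u32)> {
    let q_sketch = fold(query);
    records
        .iter()
        .enumerate()
        // Phase 0: 32-word sketch scan rejects most candidates cheaply
        .filter(|(_, (sketch, _))| sketch_distance(&q_sketch, sketch) < sketch_threshold)
        // Phase 1: full 157-word Hamming only on the survivors
        .map(|(i, (_, full))| (i, full_distance(query, full)))
        .collect()
}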

O(1) Dedup

use std::collections::HashMap;

struct CamIndex {
    /// fingerprint hash → record offset
    dedup: HashMap<[u64; 32], usize>,
}

impl CamIndex {
    fn insert(&mut self, sketch: [u64; 32], offset: usize) -> bool {
        // O(1) dedup keyed on the 2048-bit sketch: identical content always
        // maps to the same key; distinct content colliding is vanishingly rare
        self.dedup.insert(sketch, offset).is_none()
    }

    fn lookup(&self, sketch: &[u64; 32]) -> Option<usize> {
        self.dedup.get(sketch).copied()
    }
}
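
Store-time usage could then look like this (the offset handling is whatever the storage layer already assigns; cam_sketch() from above):

/// Hypothetical store path: dedup before allocating a new record.
fn store_or_reuse(cam: &mut CamIndex, fp: &Fingerprint, new_offset: usize) -> usize {
    let sketch = cam_sketch(fp);
    match cam.lookup(&sketch) {
        Some(existing) => existing, // identical content already stored
        None => {
            cam.insert(sketch, new_offset); // first occurrence: index it
            new_offset
        }
    }
}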

Relationship to cognitive layer

The 32-word meta block carries cognitive state that the cognitive/ modules already produce:

  • collapse_gate: FLOW/HOLD/BLOCK decision → meta word 0, bits 8-9
  • seven_layer: Layer activation pattern → meta words 13-16
  • style: ThinkingStyle vector → meta words 5-8 (τ/σ/q compressed)
  • rung: Rung level (3-5) → meta word 1, bits 0-7
  • substrate: Substrate state hash → meta word 2

This means every stored vector carries its cognitive context inline, queryable without touching the content region.
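
As a bit-packing sketch for two of the assignments above (the enum values and helper names are hypothetical; the bit positions follow the list):

/// FLOW/HOLD/BLOCK decision, packed into meta word 0, bits 8-9.
#[derive(Clone, Copy)]
pub enum Gate { Flow = 0, Hold = 1, Block = 2 }

pub fn pack_gate(meta: &mut [u64; 32], gate: Gate) {
    meta[0] = (meta[0] & !(0b11 << 8)) | ((gate as u64) << 8);
}

/// Rung level (3-5), packed into meta word 1, bits 0-7.
pub fn pack_rung(meta: &mut [u64; 32], rung: u8) {
    meta[1] = (meta[1] & !0xFF) | rung as u64;
}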

Implementation plan

  1. Add cam.rs to core/ with CamHeader, CamRecord, cam_sketch(), sketch_distance()
  2. Extend VsaIndex with optional CAM pre-filter
  3. Add CamHeader serialization to Arrow FixedSizeBinary for LanceDB storage (a byte-flattening sketch follows this list)
  4. Wire cognitive layer metadata into meta block at store time
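
For step 3, one Arrow-version-independent approach is to flatten the 64 header words to little-endian bytes and hand those to a FixedSizeBinary(512) column:

/// Flatten the 64 header words into 512 little-endian bytes.
fn header_bytes(meta: &[u64; 32], fingerprint: &[u64; 32]) -> [u8; 512] {
    let mut out = [0u8; 512];
    for (i, &word) in meta.iter().chain(fingerprint.iter()).enumerate() {
        out[i * 8..i * 8 + 8].copy_from_slice(&word.to_le_bytes());
    }
    out
}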

Open questions

  1. Should Fingerprint grow a .cam_sketch() -> [u64; 32] method, or keep it external?
  2. The 32-word sketch is 2048 bits — enough discrimination? Or should we use 16 words (1024 bits) to save space?
  3. Integration with extensions/hologram/ bitchain types — do they get CAM headers too?
