7 changes: 7 additions & 0 deletions Cargo.lock


2 changes: 2 additions & 0 deletions Cargo.toml
@@ -71,6 +71,8 @@ members = [
"vendor/tri-mcp/rings/SR-02",
# CPU N-gram training (IGLA RACE Gate-2)
"crates/trios-train-cpu",
# JEPA-T ternary ingest pipeline (Wave-14a L-S50)
"crates/jepa_t_ingest",
# Trinity dePIN Mesh (Ch.35 PhD — L-DPC2/L-DPC3)
"crates/trios-mesh",
"crates/trios-mesh-node",
20 changes: 20 additions & 0 deletions crates/jepa_t_ingest/Cargo.toml
@@ -0,0 +1,20 @@
[package]
name = "jepa_t_ingest"
version = "0.1.0"
edition = "2021"
authors = ["Dmitrii Vasilev <admin@t27.ai>"]
license = "Apache-2.0"
description = "Plaintext → ternary triplet streaming pipeline for JEPA-T training on Trinity silicon (Wave-14a L-S50)"
repository = "https://github.com/gHashTag/trios"
readme = "README.md"
keywords = ["ternary", "jepa", "trinity", "quantization", "nlp"]
categories = ["science", "encoding"]

[[bin]]
name = "jepa_t_ingest"
path = "src/bin/jepa_t_ingest.rs"

[dependencies]
clap = { version = "4", features = ["derive"] }

[dev-dependencies]
138 changes: 138 additions & 0 deletions crates/jepa_t_ingest/README.md
@@ -0,0 +1,138 @@
# jepa_t_ingest

**Wave-14a L-S50** — Plaintext → ternary-quantized triplet streaming pipeline for JEPA-T training on Trinity silicon.

[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-2021--edition-orange.svg)](https://www.rust-lang.org/)

## Overview

`jepa_t_ingest` converts raw UTF-8 text corpora into binary streams of ternary triplets
(`anchor`, `positive`, `negative`) suitable for Joint Embedding Predictive Architecture
(JEPA-T) contrastive pretraining on Trinity ternary silicon.

### Ternary Anchor

- **Alphabet**: {−1, 0, +1}
- **Threshold**: φ⁻² in Q1.15 fixed-point = **12533** (0x30F5)
- **Identity**: φ² + φ⁻² = 3
- **DOI**: [10.5281/zenodo.19227877](https://doi.org/10.5281/zenodo.19227877)
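
The identity follows from the defining relation of the golden ratio, φ² = φ + 1. A short derivation, included here only as a worked check:

```latex
\begin{aligned}
\varphi^{2} &= \varphi + 1 \\
\varphi^{-1} &= \varphi - 1 \\
\varphi^{-2} &= (\varphi - 1)^{2} = \varphi^{2} - 2\varphi + 1 = 2 - \varphi \\
\varphi^{2} + \varphi^{-2} &= (\varphi + 1) + (2 - \varphi) = 3
\end{aligned}
```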

### Quantizer — Wave-9b RTL Byte-for-Byte Match

The core `quantize_phi_prior` function matches `phi_prior_quantizer.v` from Wave-9b exactly:

```
if fp_q15 >= +12533 → +1
if fp_q15 <= −12533 → −1
else → 0
```
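
The decision rule above can be sketched in Rust as follows. This is a minimal illustration, not the crate's source: the constant name `PHI_INV_SQ_Q15` is an assumption, and only the signature `quantize_phi_prior(fp_q15: i16) -> i8` and the threshold 12533 come from this README.

```rust
/// Threshold from this README as a Q1.15 integer. The constant name is hypothetical.
const PHI_INV_SQ_Q15: i16 = 12533;

/// Sketch of the three-way comparison described above.
#[inline]
pub fn quantize_phi_prior(fp_q15: i16) -> i8 {
    if fp_q15 >= PHI_INV_SQ_Q15 {
        1
    } else if fp_q15 <= -PHI_INV_SQ_Q15 {
        -1
    } else {
        0
    }
}
```

Both comparisons are inclusive, so the threshold values themselves map to ±1 and only the open interval between them maps to 0.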

## API

### `quantize_phi_prior(fp_q15: i16) -> i8`

Ternary quantizer with Wave-9b RTL parity.

```rust
use jepa_t_ingest::quantize_phi_prior;

assert_eq!(quantize_phi_prior(12533), 1); // at positive threshold
assert_eq!(quantize_phi_prior(-12533), -1); // at negative threshold
assert_eq!(quantize_phi_prior(12532), 0); // below threshold
assert_eq!(quantize_phi_prior(-12532), 0); // above -threshold
assert_eq!(quantize_phi_prior(0), 0); // zero
```

### `ingest_text(input: &str, cfg: &IngestConfig) -> Vec<Triplet>`

Streams a plaintext string into a sequence of ternary triplets.

```rust
use jepa_t_ingest::{ingest_text, IngestConfig};

let cfg = IngestConfig { window_size: 64, stride: 32 };
let triplets = ingest_text("your corpus text here ...", &cfg);
println!("{} triplets produced", triplets.len());
```

### `Triplet`

```rust
pub struct Triplet {
    pub anchor: [i8; 64],   // anchor context window
    pub positive: [i8; 64], // adjacent / overlapping window
    pub negative: [i8; 64], // non-overlapping window (hard negative)
}
```

Each element is in {−1, 0, +1}. Serialise to binary with `triplet.to_bytes()`, which yields 192 bytes (3 windows × 64 elements, one byte per element).
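
One way to picture the 192-byte layout is the sketch below. It redeclares a local `Triplet` with the fields listed above and a hypothetical `to_bytes` that concatenates the three windows, reinterpreting each `i8` as its two's-complement byte (so −1 becomes `0xFF`). The crate's real implementation is not shown in this README and may differ.

```rust
/// Local stand-in for the crate's `Triplet`, for illustration only.
pub struct Triplet {
    pub anchor: [i8; 64],
    pub positive: [i8; 64],
    pub negative: [i8; 64],
}

impl Triplet {
    /// Hypothetical serializer: anchor, then positive, then negative.
    pub fn to_bytes(&self) -> [u8; 192] {
        let mut out = [0u8; 192];
        let fields = [&self.anchor, &self.positive, &self.negative];
        // Each 64-byte slot of the output receives one window.
        for (slot, field) in out.chunks_exact_mut(64).zip(fields) {
            for (dst, &src) in slot.iter_mut().zip(field.iter()) {
                *dst = src as u8; // two's complement: -1 -> 0xFF, 0 -> 0x00, 1 -> 0x01
            }
        }
        out
    }
}
```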

### `IngestConfig`

```rust
pub struct IngestConfig {
    pub window_size: usize, // tokens per window (max 64)
    pub stride: usize,      // step between anchor windows
}
```

## CLI Binary

```
jepa_t_ingest --input corpus.txt --output triplets.bin [--window-size 64] [--stride 32]
```

### Output Format

Raw binary stream of packed 192-byte triplet records:

| Bytes | Content |
|-------|---------|
| 0–63 | anchor (64 × i8) |
| 64–127 | positive (64 × i8) |
| 128–191 | negative (64 × i8) |
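
A consumer can recover records by walking the stream in 192-byte chunks. The helper below is a hypothetical sketch (the name `parse_records` and the tuple return type are not part of the crate). It assumes the layout in the table and rejects trailing partial records and bytes outside {−1, 0, +1}.

```rust
/// One decoded record: (anchor, positive, negative).
pub type Record = ([i8; 64], [i8; 64], [i8; 64]);

/// Hypothetical decoder for the raw stream; returns `None` on malformed input.
pub fn parse_records(stream: &[u8]) -> Option<Vec<Record>> {
    if stream.len() % 192 != 0 {
        return None; // trailing partial record
    }
    let mut records = Vec::with_capacity(stream.len() / 192);
    for chunk in stream.chunks_exact(192) {
        let mut fields = [[0i8; 64]; 3];
        for (i, &byte) in chunk.iter().enumerate() {
            let value = byte as i8;
            if !(-1..=1).contains(&value) {
                return None; // not a ternary symbol
            }
            // Bytes 0..64 fill the anchor, 64..128 the positive, 128..192 the negative.
            fields[i / 64][i % 64] = value;
        }
        records.push((fields[0], fields[1], fields[2]));
    }
    Some(records)
}
```

Validating each byte on read is a design choice for this sketch: it catches truncated or misaligned files early, at the cost of one extra comparison per element.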

## Tests

```bash
# Run all tests (quantizer boundary + ingest golden integration)
cargo test -p jepa_t_ingest

# Build release binary
cargo build --release --bin jepa_t_ingest
```

### Quantizer Boundary Tests (`tests/quantize.rs`)

| Input | Expected | Notes |
|-------|----------|-------|
| +12532 | 0 | one below threshold |
| +12533 | +1 | at threshold (φ⁻²) |
| −12532 | 0 | one above −threshold |
| −12533 | −1 | at −threshold |
| 0 | 0 | zero |
| +0x7FFF | +1 | i16::MAX |
| −0x8000 | −1 | i16::MIN |

The exhaustive test `output_always_ternary_for_all_i16` checks all 65536 possible i16 inputs.
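
The body of that test is not reproduced in this README. A rough sketch of what an exhaustive check could look like is given below; it carries its own copy of the decision rule so it is self-contained, and the helper name `ternary_histogram` is hypothetical.

```rust
/// Local copy of the decision rule so the sketch is self-contained.
fn quantize_phi_prior(fp_q15: i16) -> i8 {
    if fp_q15 >= 12533 {
        1
    } else if fp_q15 <= -12533 {
        -1
    } else {
        0
    }
}

/// Counts how many i16 inputs map to -1, 0 and +1 (in that order),
/// panicking if any output falls outside the ternary alphabet.
fn ternary_histogram() -> [u32; 3] {
    let mut counts = [0u32; 3];
    for x in i16::MIN..=i16::MAX {
        let q = quantize_phi_prior(x);
        assert!((-1..=1).contains(&q), "non-ternary output {} for input {}", q, x);
        counts[(q + 1) as usize] += 1;
    }
    counts
}
```

The three counts sum to 65536, and the −1 bucket is one larger than the +1 bucket because `i16::MIN` has no positive counterpart.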

### Integration Test (`tests/ingest.rs`)

Uses a fixed 13-token golden corpus:

```
"the quick brown fox jumps over the lazy dog and a ternary world"
```

With `window_size=4, stride=2`, the corpus yields 5 windows and **3 triplets**.
Token hashes and ternary values are byte-compared against pre-computed golden values.
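
These counts are consistent with the usual sliding-window formula, windows = ⌊(n − w) / s⌋ + 1, together with the assumption that each triplet draws on three consecutive windows (so triplets = windows − 2). That pairing rule is inferred from the numbers and the `Triplet` field comments in this README, not documented by the crate, and the helper names below are hypothetical.

```rust
/// Number of full windows of `window_size` tokens taken every `stride` tokens.
pub fn window_count(tokens: usize, window_size: usize, stride: usize) -> usize {
    if window_size == 0 || stride == 0 || tokens < window_size {
        return 0;
    }
    (tokens - window_size) / stride + 1
}

/// Assumed rule: one triplet per run of three consecutive windows.
pub fn triplet_count(tokens: usize, window_size: usize, stride: usize) -> usize {
    window_count(tokens, window_size, stride).saturating_sub(2)
}
```

For the golden corpus, `window_count(13, 4, 2)` gives 5 and `triplet_count(13, 4, 2)` gives 3. With these settings, windows one stride apart overlap by two tokens while windows two strides apart do not overlap, which fits the "overlapping" positive and "non-overlapping" negative described under `Triplet`.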

## R1 CROWN Compliance

This crate is **Rust ONLY** — no Python, no shell scripts, no foreign-language source files.
The quantizer is a single `#[inline]` function with no dependencies beyond `core`.

## License

Apache-2.0 — Copyright 2024 Dmitrii Vasilev &lt;admin@t27.ai&gt;
133 changes: 133 additions & 0 deletions crates/jepa_t_ingest/src/bin/jepa_t_ingest.rs
@@ -0,0 +1,133 @@
//! # jepa_t_ingest — CLI binary
//!
//! Streams a plaintext corpus file into a binary file of ternary triplets
//! for JEPA-T training on Trinity silicon.
//!
//! ## Usage
//!
//! ```text
//! jepa_t_ingest --input corpus.txt --output triplets.bin [--window-size 64] [--stride 32]
//! ```
//!
//! ## Output format
//!
//! Sequence of packed triplets, each 192 bytes:
//! - bytes 0..=63    : anchor (i8 values in {-1, 0, +1})
//! - bytes 64..=127  : positive
//! - bytes 128..=191 : negative
//!
//! ## License
//!
//! Apache-2.0 — Author: Dmitrii Vasilev <admin@t27.ai>

use std::{
    fs,
    io::{self, Write},
    path::PathBuf,
    process,
};

use clap::Parser;
use jepa_t_ingest::{ingest_text, IngestConfig};

/// JEPA-T Ternary Ingest Pipeline (Wave-14a L-S50)
///
/// Converts a plaintext corpus into binary ternary triplets for JEPA-T training.
/// Output is a raw binary stream of packed 192-byte triplet records.
#[derive(Parser, Debug)]
#[command(
    name = "jepa_t_ingest",
    version = env!("CARGO_PKG_VERSION"),
    author = "Dmitrii Vasilev <admin@t27.ai>",
    about = "Plaintext → ternary triplet pipeline for JEPA-T training on Trinity silicon"
)]
struct Args {
    /// Input plaintext corpus file (UTF-8)
    #[arg(short, long, value_name = "FILE")]
    input: PathBuf,

    /// Output binary file for ternary triplets (192 bytes each)
    #[arg(short, long, value_name = "FILE")]
    output: PathBuf,

    /// Context window size in tokens (max 64)
    #[arg(long, default_value_t = 64, value_name = "N")]
    window_size: usize,

    /// Stride between successive windows in tokens
    #[arg(long, default_value_t = 32, value_name = "N")]
    stride: usize,
}

fn main() {
    let args = Args::parse();

    // Clamp to the supported range: 1 to 64 tokens per window, stride of at least 1.
    let cfg = IngestConfig {
        window_size: args.window_size.clamp(1, 64),
        stride: args.stride.max(1),
    };

    // Read corpus.
    let corpus = match fs::read_to_string(&args.input) {
        Ok(s) => s,
        Err(e) => {
            eprintln!(
                "jepa_t_ingest: cannot read '{}': {}",
                args.input.display(),
                e
            );
            process::exit(1);
        }
    };

    eprintln!(
        "jepa_t_ingest: read {} bytes from '{}'",
        corpus.len(),
        args.input.display()
    );

    // Ingest into triplets.
    let triplets = ingest_text(&corpus, &cfg);

    eprintln!("jepa_t_ingest: produced {} triplets", triplets.len());

    if triplets.is_empty() {
        eprintln!("jepa_t_ingest: warning — zero triplets produced (corpus too short?)");
    }

    // Write binary output.
    let out_file = match fs::File::create(&args.output) {
        Ok(f) => f,
        Err(e) => {
            eprintln!(
                "jepa_t_ingest: cannot create '{}': {}",
                args.output.display(),
                e
            );
            process::exit(1);
        }
    };
    let mut writer = io::BufWriter::new(out_file);

    let mut bytes_written = 0usize;
    for triplet in &triplets {
        let bytes = triplet.to_bytes();
        match writer.write_all(&bytes) {
            Ok(()) => bytes_written += bytes.len(),
            Err(e) => {
                eprintln!("jepa_t_ingest: write error: {}", e);
                process::exit(1);
            }
        }
    }

    // Flush explicitly: BufWriter discards any write error that occurs when it is dropped.
    if let Err(e) = writer.flush() {
        eprintln!("jepa_t_ingest: write error: {}", e);
        process::exit(1);
    }

    eprintln!(
        "jepa_t_ingest: wrote {} bytes to '{}'",
        bytes_written,
        args.output.display()
    );
    eprintln!(
        "jepa_t_ingest: done (window_size={}, stride={})",
        cfg.window_size, cfg.stride
    );
}