This document backs up the README claims with code evidence and honest tradeoffs.
SHAKE256 (d=256) cryptographic checksums for data integrity (FIPS 202).
The normalizer computes a SHAKE256 digest for every image file in the dataset:
- **Digest Computation**: For each image file, compute a SHAKE256 hash:

  ```
  hash = SHAKE256(file_bytes, length=256 bits)
  hex_string = hex_encode(hash)
  ```

  Code: `src/main.rs`, function `shake256_d256()` (lines 48-54), uses the `tiny-keccak` crate (FIPS 202 compliant).

- **Manifest Creation**: All hashes are written to a manifest file (`output/manifest.csv`):

  ```csv
  filename,sha256,size_bytes,category
  image001.png,a1b2c3d4e5…,15234,Original
  image001.png,f6a7b8c9d0…,14876,VAE
  image002.png,1c2d3e4f5a…,18912,Original
  ```

- **Verification**: Users can verify all files post-transfer:

  ```bash
  vae-normalizer verify --checksums -d /path/to/output
  ```

  This re-computes the hashes and compares them against the manifest. Any mismatch (bit flip, corruption, tampering) is detected and reported.

- **Formal Proof (Isabelle/HOL)**: The theorem in `VAEDataset_Splits.thy` (lines 120-140) proves that if all hashes match, the bijection property holds: every Original image has exactly one matching VAE image.
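The digest step above can be sketched in a few lines. This is an illustrative Python sketch using the standard library's `hashlib.shake_256` rather than the tool's Rust `tiny-keccak` implementation; the 64 KiB chunk size is an arbitrary choice:

```python
import hashlib

def shake256_d256(path: str) -> str:
    """SHAKE256 digest with d=256 bits (32 bytes), hex-encoded."""
    h = hashlib.shake_256()
    with open(path, "rb") as f:
        # Stream in 64 KiB chunks so large images need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest(32)  # 32 bytes -> 64 hex characters
```

Because SHAKE256 is deterministic, recomputing this digest after a transfer and comparing it to the manifest entry detects any single-bit change in the file.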
Code Evidence:

- SHAKE256 implementation: `src/main.rs` lines 48-54 (uses the `tiny-keccak` crate)
- Manifest generation: `src/metadata.rs` lines 36-85 (writes CSV with hashes)
- Hash verification: `src/main.rs` lines 200-250 (`verify` command)
- Isabelle proof: `theories/VAEDataset_Splits.thy` lines 120-140 (Isabelle/HOL theorem: matching hashes imply the bijection property)
SHAKE256 (not SHA-256) provides:

- Extensible output: 256 bits (32 bytes) by default, scalable to longer digests if needed
- FIPS 202 approval: meets regulatory compliance for scientific research
- Collision resistance: 2^128 birthday bound (forging collisions is computationally infeasible)
- Clear provenance: every image is linked to its original via a cryptographic digest
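The extensible-output point can be demonstrated directly. A short Python illustration (standard-library `hashlib.shake_256`, not the project's Rust code) showing that a longer digest of the same input extends the shorter one:

```python
import hashlib

data = b"example image bytes"
d256 = hashlib.shake_256(data).hexdigest(32)  # d=256 -> 64 hex chars
d512 = hashlib.shake_256(data).hexdigest(64)  # d=512 -> 128 hex chars

# SHAKE256 is an extendable-output function (XOF): the 256-bit digest
# is a prefix of the 512-bit digest of the same input.
assert len(d256) == 64
assert d512.startswith(d256)
```

This is why the manifest can later move to longer digests without changing the underlying hash function.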
A dataset with verified hashes is reproducible: "I trained on the exact files from commit abc123, whose hashes match the manifest."
The manifest CSV can be modified post-generation. Computing hashes proves data integrity, but does not prove the hashes themselves are original. If an attacker replaces both the images AND the manifest, checksums will match a corrupted dataset.
Mitigation: Sign the manifest with PGP/GPG (future feature). Users should verify signatures against the repository’s public key. For now, manifest + hashes are trusted if downloaded over HTTPS and verified immediately.
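The verification step can be sketched as follows. This illustrative Python sketch assumes the directory layout shown elsewhere in this document (`<output>/<category>/<filename>`) and the CSV columns from the manifest example; it is not the tool's actual `verify` implementation:

```python
import csv
import hashlib
import os

def verify_manifest(manifest_path, root):
    """Recompute each file's SHAKE256 (d=256) digest and compare it to the
    manifest entry. Returns the filenames that no longer match."""
    mismatches = []
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumed layout: files live under <root>/<category>/<filename>.
            path = os.path.join(root, row["category"], row["filename"])
            with open(path, "rb") as img:
                digest = hashlib.shake_256(img.read()).hexdigest(32)
            if digest != row["sha256"]:  # column name from the manifest example
                mismatches.append(row["filename"])
    return mismatches
```

An empty return value means every file matches; any entry signals a bit flip, corruption, or tampering — subject to the manifest-trust caveat above, since a matching digest only proves consistency with the manifest, not that the manifest itself is original.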
Train/Test/Val/Calibration splits (70/15/10/5) with formal proofs of correctness via Isabelle/HOL.
The normalizer partitions images deterministically into 4 disjoint subsets:
- **Random Split Algorithm** (default):

  ```rust
  let mut rng = ChaCha8Rng::seed_from_u64(seed); // fixed seed for reproducibility
  let n = images.len();
  let train_end = (n * 70) / 100;            // 70% = indices 0..train_end
  let test_end = train_end + (n * 15) / 100; // 15% = indices train_end..test_end
  let val_end = test_end + (n * 10) / 100;   // 10% = indices test_end..val_end
  // Remaining: Calibration (5%)
  ```

  Code: `src/main.rs` lines 100-150, function `split_random()`.
- **Stratified Split** (optional):
  - Groups images by file-size bucket (e.g., "small" = 0-10 KB, "medium" = 10-50 KB, etc.)
  - Ensures train/test/val each contain a representative range of sizes
  - Useful to prevent bias (e.g., training only on small images)

  Code: `src/main.rs` lines 160-200, function `split_stratified()`.
- **Output Files**: Four text files, one per split:

  ```
  output/splits/
  ├── random_train.txt        # 70% of filenames
  ├── random_test.txt         # 15%
  ├── random_val.txt          # 10%
  └── random_calibration.txt  # 5%
  ```
- **Formal Verification (Isabelle/HOL)**: The theory `VAEDataset_Splits.thy` (lines 1-50) proves three properties:

  1. Disjointness: ∀i. i ∈ Train ⟹ i ∉ Test ∧ i ∉ Val ∧ i ∉ Calibration
  2. Exhaustiveness: ∀i. i ∈ Dataset ⟹ i ∈ Train ∨ i ∈ Test ∨ i ∈ Val ∨ i ∈ Calibration
  3. Ratio Correctness: |Train| / |Dataset| ≈ 0.70 (within 1% tolerance)

  To verify:

  ```bash
  isabelle build -d . -b VAEDataset_Splits
  ```
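The split arithmetic and the three proven properties can be spot-checked with a small sketch. This Python version uses the standard library's `random.Random` in place of the Rust ChaCha8Rng (so it will not reproduce the tool's actual splits), but the index arithmetic mirrors the Rust snippet above:

```python
import random

def split_random(filenames, seed):
    """Deterministic 70/15/10/5 split using the same index arithmetic."""
    files = sorted(filenames)   # canonical order before shuffling
    rng = random.Random(seed)   # NOT ChaCha8 -- illustrative only
    rng.shuffle(files)
    n = len(files)
    train_end = (n * 70) // 100
    test_end = train_end + (n * 15) // 100
    val_end = test_end + (n * 10) // 100
    return (files[:train_end], files[train_end:test_end],
            files[test_end:val_end], files[val_end:])

names = [f"image{i:03}.png" for i in range(100)]
train, test, val, calib = split_random(names, seed=42)

# Disjointness + exhaustiveness: the four parts partition the dataset.
parts = train + test + val + calib
assert len(parts) == len(set(parts)) == len(names)
assert set(parts) == set(names)

# Ratio correctness for n = 100.
assert (len(train), len(test), len(val), len(calib)) == (70, 15, 10, 5)
```

Assigning the calibration split as "everything after `val_end`" is what guarantees exhaustiveness even when the percentages do not divide `n` evenly.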
Code Evidence:

- Random split: `src/main.rs` lines 100-150
- Stratified split: `src/main.rs` lines 160-200
- Formal proof: `theories/VAEDataset_Splits.thy` (complete Isabelle/HOL theory)
- Output schema: `src/metadata.rs` lines 90-120 (writes manifest with split assignments)
Formal verification of splits matters for ML research:

- Reproducibility: the same seed produces identical splits (critical for comparing model A vs. model B)
- Correctness proof: no accidental data leakage (test data in the train set causes overfitting)
- Publishing confidence: paper reviewers can verify the splits were computed correctly
The Isabelle proof assumes:

1. The ChaCha8 RNG behaves deterministically given the same seed
2. The split indices are computed correctly (no integer overflow)
3. The output files are written correctly (no data loss during I/O)
If the RNG implementation has a bug, or a cosmic-ray bit flip occurs during a file write, these assumptions fail and the proof's guarantees no longer apply to the produced files. Such failures are exceedingly rare in practice.
Mitigation: Test splits on small datasets first (manual inspection), then scale. If reproducibility is critical, store hashes of split files alongside the proof artifact.
| Layer | Technology | Reason |
|---|---|---|
| CLI Core | Rust | Memory safety, no `unsafe` code (`forbid`), performance for batch processing |
| Crypto | `tiny-keccak` | FIPS 202 SHAKE256, minimal dependencies |
| RNG | `rand` (ChaCha8Rng) | ChaCha8 CSPRNG, deterministic given seed |
| Image I/O | `image` | PNG/JPEG support, handles pixel format conversions |
| Manifest Schema | CUE language | Dublin Core metadata validation |
| Config | Nickel | Typed configuration language for flexible CLI options |
| Formal Proofs | Isabelle/HOL | Prove split properties, disjointness, ratio correctness |
| Training | Julia + Flux.jl | Contrastive learning model for VAE artifact detection |
| Persistence | Rust serde + JSON | Serialization of split metadata, portable across systems |
| Path | Purpose |
|---|---|
| `src/main.rs` | CLI entry point (Clap argument parser, command dispatch) |
| `src/metadata.rs` | `DublinCoreMetadata` struct, manifest generation (CSV writer) |
| | Random and stratified split algorithms |
| | SHAKE256 checksum computation and verification |
| | Diff encoding/decoding for space-efficient storage |
| `theories/VAEDataset_Splits.thy` | Isabelle/HOL proofs (disjointness, exhaustiveness, ratio) |
| | Julia utilities for loading split files and training models |
| | Contrastive learning model (detects VAE vs. original) |
| `examples/` | Example datasets and configs (test data for manual verification) |
| | Nickel configuration template |
| | Dublin Core CUE schema for validation |
| | Task runner (build, test, isabelle, train, evaluate) |
| `Cargo.toml` | Rust dependencies (image, rand, tiny-keccak, serde) |
| | Current project state (all features complete, Phase 1 ✓) |
| Standard | Usage | Status |
|---|---|---|
| ABI/FFI (Idris2 + Zig) | Split algorithm formally verified in Isabelle/HOL; future: Idris2 ABI for split proofs | Phase 2 (Idris2 FFI to Rust split module planned) |
| Hyperpolymath Language Policy | Rust (CLI), Julia (training), Isabelle (proofs), no TypeScript/Python/Go | Compliant; CUE and Nickel for config |
| PMPL-1.0-or-later License | Primary license; all Rust files carry header | Declared at repo root and in every .rs file |
| Formal Verification | Isabelle/HOL proofs guarantee split correctness (disjointness, exhaustiveness, ratio) | Complete; 3 theorems proven (…) |
| PanLL Integration | Pre-built monitoring panel for split statistics, model training progress | |
| Hypatia CI/CD | Clippy linting, cargo-audit for CVEs, Isabelle theorem checking in CI | 9 workflows active; formal proof verification on every commit |
| Interdependency Tracking | This project may use proven-types for verified array operations (future) | Declared in … |
- Normalize a small test dataset:

  ```bash
  vae-normalizer normalize -d examples/test-dataset -o output
  ```

- Inspect the manifest:

  ```bash
  cat output/manifest.csv
  # Observe SHAKE256 hashes (64 hex characters, 256 bits)
  ```

- Corrupt a file and verify detection:

  ```bash
  # Overwrite the first byte of one image
  printf '\xFF' | dd of=output/Original/image001.png bs=1 count=1 conv=notrunc

  # Verify
  vae-normalizer verify -o output --checksums
  # Error: image001.png hash mismatch — detected corruption
  ```
- Run the split:

  ```bash
  vae-normalizer normalize -d examples/test-dataset -o output
  ```

- Check for overlaps:

  ```bash
  # Count unique filenames across splits
  cat output/splits/*.txt | sort | uniq | wc -l
  # Should equal the total file count

  # Check for duplicates within a split
  sort output/splits/random_train.txt | uniq -d
  # Should be empty (no duplicates)
  ```
- Verify ratios:

  ```bash
  # Manual calculation
  train=$(wc -l < output/splits/random_train.txt)
  test=$(wc -l < output/splits/random_test.txt)
  val=$(wc -l < output/splits/random_val.txt)
  calib=$(wc -l < output/splits/random_calibration.txt)
  total=$((train + test + val + calib))

  echo "Train: $((100 * train / total))% (target 70%)"
  echo "Test:  $((100 * test / total))% (target 15%)"
  # Should be within ±1% of targets
  ```
- Install Isabelle:

  ```bash
  # On Fedora/RHEL
  dnf install isabelle

  # Or build from source
  git clone https://github.com/isabelle-prover/isabelle
  cd isabelle && ./build
  ```

- Verify the theorems:

  ```bash
  cd /var/mnt/eclipse/repos/zerostep
  isabelle build -d . -b VAEDataset_Splits
  # Output: Build session VAEDataset_Splits — 100% complete
  ```

- Inspect the proof:

  ```bash
  grep -E "theorem|lemma" theories/VAEDataset_Splits.thy | head
  # Lists all proven propositions
  ```
Open an issue at https://github.com/hyperpolymath/zerostep — all feedback welcome.