Merged
Commits
45 commits
080a733
add minimal model-training shims
harrisonstropkay-blip Mar 10, 2026
1ea20d4
data consolidation --include_ntokens
harrisonstropkay-blip Mar 11, 2026
4fa7715
gitignore Harrison's infra artifacts
harrisonstropkay-blip Mar 12, 2026
f28dbe0
20%, 40%, 60%, and 80% runs
harrisonstropkay-blip Mar 12, 2026
55e73d4
compressed results
harrisonstropkay-blip Mar 12, 2026
952f856
ntoken t-tests and results pkl
harrisonstropkay-blip Mar 13, 2026
aac3d1e
cache t-test results; add legend to fig; compute t-test stats
harrisonstropkay-blip Mar 13, 2026
c4bc3ca
catch empty df error
harrisonstropkay-blip Mar 15, 2026
2d91a8b
7 more runs
harrisonstropkay-blip Mar 19, 2026
e54be3e
results pkl
harrisonstropkay-blip Mar 19, 2026
4054577
t test cache
harrisonstropkay-blip Mar 19, 2026
d7c14d3
results: 12 runs total for ntokens
harrisonstropkay-blip Mar 19, 2026
9d4f3df
runs
harrisonstropkay-blip Mar 23, 2026
b4ba638
code
harrisonstropkay-blip Mar 23, 2026
f5ee9a9
data
harrisonstropkay-blip Mar 23, 2026
b5e355f
Add dataset-size analysis, sigmoid fit, embedding comparison, and pap…
jeremymanning Mar 24, 2026
ca28ff1
Add remote scripts for ntokens dataset-size sweep
jeremymanning Mar 24, 2026
2bd5d50
Update session notes, speckit artifacts, and constitution
jeremymanning Mar 24, 2026
ab0cba4
Update CLI, stats script, and README for new analyses
jeremymanning Mar 24, 2026
7ee439b
Update session notes with final status
jeremymanning Mar 24, 2026
76ea45c
Update paper text with embedding results (nomic 81%, bge-m3 76.2%)
jeremymanning Mar 24, 2026
b5885f4
Final session notes with Qwen resume instructions
jeremymanning Mar 25, 2026
9f3051d
Finalize response letter and paper text
jeremymanning Mar 25, 2026
27932d5
Audit and highlight all unverified claims in paper files
jeremymanning Mar 25, 2026
e5c65a5
Fix response letter paragraph spacing, add deps to requirements-dev.txt
jeremymanning Mar 25, 2026
c44f4a4
Apply black formatting and ruff fixes across entire codebase
jeremymanning Mar 25, 2026
5750684
Update session notes with comprehensive status
jeremymanning Mar 25, 2026
26e91a6
Add paper compile script (main + supplement + response letter)
jeremymanning Mar 25, 2026
a0e71da
Finalize all embedding results and update paper
jeremymanning Mar 25, 2026
f91fae5
Complete embedding integration into paper text
jeremymanning Mar 26, 2026
b31319f
Polish paper text: add embedding figure, fix redundancy, fix suppleme…
jeremymanning Mar 26, 2026
fb97e74
Fix all text overflow issues (3 overfull hbox warnings)
jeremymanning Mar 26, 2026
1cb50d2
Regenerate oz_losses figure — fix empty bottom panels
jeremymanning Mar 26, 2026
55791eb
Fix author block, add diff support, reorder supplement figures
jeremymanning Mar 26, 2026
ff47266
Strip narrative from supplement, integrate into main text
jeremymanning Mar 26, 2026
73c5f28
Fix supplement figure ordering — embedding figs now appear before table
jeremymanning Mar 26, 2026
b95d838
Add supplement diff, fix compile.sh cd issue
jeremymanning Mar 26, 2026
993e537
Expand cross-domain discussion with verified citations
jeremymanning Mar 27, 2026
ebc335b
Verify response letter quotes and fix section name mismatches
jeremymanning Mar 27, 2026
4bf6697
Proofread fixes + add speckit to .gitignore
jeremymanning Mar 27, 2026
8778b96
Additional proofread fixes from agent review
jeremymanning Mar 27, 2026
ec53d25
Fill in all 15 page number references in response letter
jeremymanning Mar 27, 2026
cf36173
Add .omc/ and .claude/ to .gitignore
jeremymanning Mar 27, 2026
d1443ad
paper updates
jeremymanning Mar 27, 2026
fd7eeef
Switch from parquet back to pkl.gz for GitHub compatibility
jeremymanning Mar 28, 2026
22 changes: 22 additions & 0 deletions .gitignore
@@ -49,3 +49,25 @@ model_weights_*.tar.gz

# HuggingFace models (uploaded to HuggingFace, not needed in git)
models_hf/

# Harrison's infra artifacts
runai_notes.md
start_gpu_workspace.sh

# Speckit artifacts
.specify/
specs/

# Oh My Claude Code artifacts
.omc/

# Claude Code artifacts
.claude/

# Embedding comparison results (large cached embeddings)
data/embedding_results/

# Cached data (regenerable)
data/model_results_ntokens.parquet
data/t_test_ntokens_cache/
notes/paper-review-plan.md
43 changes: 43 additions & 0 deletions .specify/memory/constitution.md
@@ -0,0 +1,43 @@
# LLM Stylometry Constitution

## Core Principles

### I. Scientific Accuracy
All analyses must produce correct, verifiable results. Every statistical claim must be backed by reproducible computation. No results may be manually adjusted or cherry-picked. When code produces unexpected results, investigate the cause before proceeding — do not paper over anomalies.

### II. Replicability
Every experiment must be fully reproducible from the repository alone. This means: (a) all data processing steps are scripted, not manual; (b) random seeds are fixed and documented; (c) environment requirements (Python version, package versions) are pinned; (d) pre-computed results include the exact commands used to generate them. A new researcher should be able to clone the repo and reproduce every figure and table.

### III. Robust Documentation
Code, analyses, and results must be documented at three levels: (a) inline comments for non-obvious logic; (b) docstrings for all public functions with parameter descriptions; (c) README and paper text that explain the *why* behind methodological choices. Documentation must be updated whenever code changes — stale docs are worse than no docs.

### IV. Data Purity
Training data must be strictly separated from evaluation data. When using language models, training from scratch is preferred over fine-tuning pre-trained models to guarantee that held-out texts were never seen during pre-training. All data provenance must be traceable (e.g., Project Gutenberg IDs).

### V. Statistical Rigor
All quantitative claims must include appropriate uncertainty estimates (confidence intervals, p-values, or bootstrap ranges). Multiple random seeds (minimum 10) must be used to estimate variability. Effect sizes must accompany significance tests. When fitting models to data, report goodness-of-fit metrics and visualize residuals.
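
A percentile bootstrap is one way to produce the required uncertainty estimates. The sketch below uses made-up accuracy numbers (not results from this repository) to illustrate the idea:

```python
import numpy as np

def bootstrap_ci(values, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)  # fixed seed, per Principle II
    values = np.asarray(values, dtype=float)
    # Resample with replacement and take the mean of each resample
    means = rng.choice(values, size=(n_boot, values.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Accuracies from 10 seeds (illustrative numbers only)
accs = [0.94, 0.97, 0.95, 0.96, 0.93, 0.98, 0.95, 0.96, 0.94, 0.97]
lo, hi = bootstrap_ci(accs)
print(f"mean = {np.mean(accs):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```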

### VI. Backward Compatibility
New analyses must not break existing functionality. The dual codebase (`code/` for paper scripts, `llm_stylometry/` for the package) must remain consistent. Legacy model naming conventions must continue to work alongside new conventions (e.g., models without `_ntokens=` in their name default to the full token budget).
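
The naming convention could be handled by a small parser along these lines (a sketch — `parse_model_name` is a hypothetical helper; 643041 is the full token budget used by the sweep scripts):

```python
import re

FULL_TOKEN_BUDGET = 643041  # full-corpus budget, per the sweep scripts

def parse_model_name(name: str) -> dict:
    """Parse names like 'austen_tokenizer=gpt2_ntokens=128608_seed=3'.

    Legacy names omit '_ntokens=' and default to the full token budget.
    """
    m = re.match(
        r"(?P<author>[a-z]+)_tokenizer=(?P<tokenizer>\w+?)"
        r"(?:_ntokens=(?P<ntokens>\d+))?_seed=(?P<seed>\d+)$",
        name,
    )
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    ntokens = int(m["ntokens"]) if m["ntokens"] else FULL_TOKEN_BUDGET
    return {"author": m["author"], "tokenizer": m["tokenizer"],
            "ntokens": ntokens, "seed": int(m["seed"])}

print(parse_model_name("austen_tokenizer=gpt2_seed=0")["ntokens"])  # 643041
print(parse_model_name("baum_tokenizer=gpt2_ntokens=128608_seed=3")["ntokens"])  # 128608
```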

## Quality Gates

- All tests must pass before merging (`pytest tests/ -v`)
- Code must be formatted (`black .`) and linted (`ruff check .`)
- New analyses require at least one test verifying correctness
- Figures must be generated programmatically, never manually edited
- Pickle files should include the generation command in comments or companion scripts
- Sensitive data (credentials, API keys) must never be committed

## Development Workflow

- Work on feature branches; merge to main via pull request
- Commit frequently with descriptive messages during development
- Pre-computed results (`.pkl`, `.pkl.gz`) should be regenerable from raw data + code
- When pandas version constraints exist for serialized data, prefer format-stable alternatives (CSV, Parquet) or document the constraint prominently
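
The format-stability point can be illustrated with a CSV round trip (a minimal sketch with toy data — CSV has no pandas-version pickle dependency):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"author": ["austen", "twain"], "accuracy": [0.96, 0.91]})

# Round-trip through CSV: readable by any pandas version, unlike pickle
path = os.path.join(tempfile.mkdtemp(), "results.csv")
df.to_csv(path, index=False)
restored = pd.read_csv(path)
assert restored.equals(df)  # same values and dtypes recovered
```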

## Governance

This constitution governs all code and analysis in the llm-stylometry repository. It supersedes informal conventions. Amendments require documentation of the rationale and must not weaken scientific rigor guarantees.

**Version**: 1.0.0 | **Ratified**: 2026-03-24 | **Last Amended**: 2026-03-24
72 changes: 70 additions & 2 deletions README.md
@@ -19,8 +19,11 @@ llm-stylometry/
├── tests/ # Test suite
├── run_llm_stylometry.sh # Main CLI wrapper
├── remote_train.sh # GPU cluster training
├── remote_train_ntokens.sh # Dataset-size sweep on GPU cluster
├── check_remote_status.sh # Monitor remote training
├── check_ntokens_status.sh # Monitor dataset-size sweep
├── sync_models.sh # Download trained models
└── sync_ntokens.sh # Download dataset-size sweep results
```

See folder-specific README files for detailed documentation.
@@ -153,6 +156,18 @@ Training 320 models (baseline + 3 variants) requires a CUDA GPU. See `models/REA
./run_llm_stylometry.sh -t -r # Resume from checkpoints
```

**Dataset-size experiments:**

Pre-computed results are available in `data/model_results_ntokens.pkl.gz`, so retraining is not required to generate figures or run analyses. To retrain locally at specific token levels:
```bash
N_TRAIN_TOKENS=128608 python code/main.py # ~20% of full corpus
N_TRAIN_TOKENS=257216 python code/main.py # ~40%
N_TRAIN_TOKENS=385825 python code/main.py # ~60%
N_TRAIN_TOKENS=514433 python code/main.py # ~80%
```

The full sweep uses 19 token levels from ~33k to ~643k. See `code/constants.py` for the complete list.

**Remote training:**

Requires GPU cluster with SSH access. Create `.ssh/credentials_mycluster.json`:
@@ -168,7 +183,56 @@ Then from local machine:
./sync_models.sh --cluster mycluster -a # Download when complete
```

**Remote dataset-size sweep:**
```bash
./remote_train_ntokens.sh --cluster mycluster # Train all 19 token levels
./remote_train_ntokens.sh --cluster mycluster --tokens 128608,257216 # Specific levels only
./remote_train_ntokens.sh --cluster mycluster -r # Resume from checkpoints
./check_ntokens_status.sh --cluster mycluster # Monitor sweep progress
./sync_ntokens.sh --cluster mycluster # Download results (configs + logs, not weights)
```

All remote scripts train in detached screen sessions on the GPU server. See script help (`-h`) for full options.

## Additional Analyses

### Sigmoid Fit Analysis

Fits a sigmoid curve to classification accuracy as a function of log training tokens, reporting the minimum dataset size needed for >=95% expected accuracy:

```bash
python code/fit_sigmoid.py
```

**Output:**
- `paper/figs/source/accuracy_vs_tokens_sigmoid.pdf` (Figure 6)
- `data/sigmoid_fit_results.json` (fit parameters and threshold)

Uses pre-computed results from `data/model_results_ntokens.pkl.gz` (no retraining needed).
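
The shape of such a fit can be sketched with `scipy.optimize.curve_fit` (illustrative only — the parameterization and accuracy values below are made up; the real analysis lives in `code/fit_sigmoid.py`):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_tokens, lower, upper, midpoint, slope):
    """Accuracy as a logistic function of log10(training tokens)."""
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (log_tokens - midpoint)))

# Illustrative accuracies (made up), rising with dataset size
tokens = np.array([2500.0, 10000.0, 40000.0, 80000.0, 160000.0, 643041.0])
acc = np.array([0.14, 0.35, 0.70, 0.85, 0.93, 0.97])

params, _ = curve_fit(sigmoid, np.log10(tokens), acc,
                      p0=[0.125, 1.0, 4.5, 2.0], maxfev=10000)

# Smallest token count at which the fitted curve reaches 95% accuracy
grid = np.linspace(np.log10(tokens[0]), np.log10(tokens[-1]), 1000)
above = grid[sigmoid(grid, *params) >= 0.95]
if above.size:
    print(f"~{10 ** above[0]:,.0f} tokens for >=95% expected accuracy")
```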

### Embedding Comparison

Compares our cross-entropy approach against text-embedding nearest-neighbor classification using three models from the MTEB leaderboard:

| Model | Parameters |
|-|-|
| nomic-ai/nomic-embed-text-v1.5 | 137M |
| BAAI/bge-m3 | 568M |
| Qwen/Qwen3-Embedding-4B | 4.0B |

**Prerequisites:**
```bash
pip install sentence-transformers
```

**Usage:**
```bash
python code/embedding_comparison.py # Run all 3 models
python code/embedding_comparison.py --model nomic-ai/nomic-embed-text-v1.5 # Single model
python code/embedding_comparison.py --figures-only # Generate figures from cached results
```

Results are cached in `data/embedding_results/` so subsequent runs with `--figures-only` skip embedding computation.
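
The nearest-neighbor step reduces to a cosine-similarity argmax. The sketch below uses random placeholder vectors in place of real sentence-transformers embeddings:

```python
import numpy as np

def nearest_neighbor_author(query_vec, ref_vecs, ref_authors):
    """Assign the author of the most cosine-similar reference embedding.

    In the real pipeline the vectors come from a sentence-transformers
    model (e.g. nomic-embed-text-v1.5); here they are random placeholders.
    """
    q = query_vec / np.linalg.norm(query_vec)
    r = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    return ref_authors[int(np.argmax(r @ q))]

rng = np.random.default_rng(0)
refs = rng.normal(size=(4, 8))  # one placeholder embedding per author
authors = ["austen", "baum", "dickens", "twain"]

# A query vector near the 'dickens' reference should be labeled 'dickens'
query = refs[2] + 0.01 * rng.normal(size=8)
print(nearest_neighbor_author(query, refs, authors))  # dickens
```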

## Data

@@ -233,6 +297,10 @@ from llm_stylometry.visualization import (
generate_oz_losses_figure # Figure 5: Oz analysis
)

# Additional standalone analyses (run as scripts)
# Figure 6: Sigmoid fit — python code/fit_sigmoid.py
# Figure 7: T-test ntokens — python code/generate_figures.py --figure 7

# Fairness-based loss thresholding (for variant comparisons)
from llm_stylometry.analysis.fairness import (
compute_fairness_threshold, # Compute fairness threshold
214 changes: 214 additions & 0 deletions check_ntokens_status.sh
@@ -0,0 +1,214 @@
#!/bin/bash

# Check ntokens sweep training status on remote GPU server
#
# Usage: ./check_ntokens_status.sh --cluster tensor02

set -e

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

print_info() { echo -e "${BLUE}[INFO]${NC} $1"; }
print_success() { echo -e "${GREEN}[SUCCESS]${NC} $1"; }
print_warning() { echo -e "${YELLOW}[WARNING]${NC} $1"; }
print_error() { echo -e "${RED}[ERROR]${NC} $1"; }

CLUSTER=""

while [[ $# -gt 0 ]]; do
    case $1 in
        --cluster)
            CLUSTER="$2"
            shift 2
            ;;
        -h|--help)
            echo "Usage: $0 --cluster NAME"
            echo "Check ntokens sweep training status on remote GPU server"
            exit 0
            ;;
        *)
            print_error "Unknown option: $1"
            exit 1
            ;;
    esac
done

if [ -z "$CLUSTER" ]; then
    print_error "Cluster must be specified with --cluster flag"
    exit 1
fi

CRED_FILE=".ssh/credentials_${CLUSTER}.json"
if [ -f "$CRED_FILE" ]; then
    SERVER_ADDRESS=$(python3 -c "import json; print(json.load(open('$CRED_FILE'))['server'])" 2>/dev/null)
    USERNAME=$(python3 -c "import json; print(json.load(open('$CRED_FILE'))['username'])" 2>/dev/null)
    PASSWORD=$(python3 -c "import json; print(json.load(open('$CRED_FILE'))['password'])" 2>/dev/null)

    if [ -z "$SERVER_ADDRESS" ] || [ -z "$USERNAME" ] || [ -z "$PASSWORD" ]; then
        print_error "Failed to read credentials from $CRED_FILE"
        exit 1
    fi
    USE_SSHPASS=true
else
    print_warning "No credentials file at $CRED_FILE"
    read -p "Enter server address: " SERVER_ADDRESS
    read -p "Enter username: " USERNAME
    USE_SSHPASS=false
fi

if [ "$USE_SSHPASS" = true ]; then
    if ! command -v sshpass &> /dev/null; then
        print_error "sshpass required: brew install hudochenkov/sshpass/sshpass"
        exit 1
    fi
    SSH_CMD="sshpass -p '$PASSWORD' ssh -o StrictHostKeyChecking=no"
else
    SSH_CMD="ssh"
fi

print_info "Checking ntokens sweep status on $CLUSTER..."
echo ""

eval "$SSH_CMD \"$USERNAME@$SERVER_ADDRESS\" 'bash -s'" << 'ENDSSH'
#!/bin/bash

cd ~/llm-stylometry || { echo "ERROR: Project directory not found"; exit 1; }

if ! command -v conda &> /dev/null; then
    echo "ERROR: conda not found"
    exit 1
fi

eval "$(conda shell.bash hook)" 2>/dev/null || true
conda activate llm-stylometry 2>/dev/null || true

python3 << 'ENDPYTHON'
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
from collections import defaultdict

models_dir = Path("models")
if not models_dir.exists():
    print("ERROR: models/ directory not found")
    exit(1)

# Token levels we expect
TOKEN_LEVELS = [2500, 5000, 10000, 20000, 40000, 45000, 50000, 55000,
                60000, 65000, 70000, 75000, 80000, 128608, 160000,
                257216, 385825, 514433, 643041]
AUTHORS = ["austen", "baum", "dickens", "fitzgerald", "melville",
           "thompson", "twain", "wells"]
SEEDS = list(range(10))

print("=" * 70)
print("NTOKENS SWEEP TRAINING STATUS")
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 70)

total_expected = len(TOKEN_LEVELS) * len(AUTHORS) * len(SEEDS)
total_complete = 0
total_in_progress = 0
total_missing = 0

for n_tokens in TOKEN_LEVELS:
    complete = 0
    in_progress = 0
    missing = 0
    max_epoch = 0
    min_loss = float('inf')

    for author in AUTHORS:
        for seed in SEEDS:
            if n_tokens == 643041:
                # Legacy baseline naming (no ntokens in name)
                model_name = f"{author}_tokenizer=gpt2_seed={seed}"
            else:
                model_name = f"{author}_tokenizer=gpt2_ntokens={n_tokens}_seed={seed}"

            model_dir = models_dir / model_name
            loss_file = model_dir / "loss_logs.csv"

            if not model_dir.exists():
                missing += 1
                continue

            if not loss_file.exists():
                missing += 1
                continue

            try:
                df = pd.read_csv(loss_file)
                if df.empty:
                    missing += 1
                    continue

                epoch = df["epochs_completed"].max()
                train_loss = df[(df["epochs_completed"] == epoch) &
                                (df["loss_dataset"] == "train")]["loss_value"]

                if not train_loss.empty:
                    loss = train_loss.iloc[0]
                    if loss <= 3.0 and epoch >= 500:
                        complete += 1
                    else:
                        in_progress += 1
                        max_epoch = max(max_epoch, epoch)
                        min_loss = min(min_loss, loss)
                else:
                    in_progress += 1
            except Exception:
                missing += 1

    total = len(AUTHORS) * len(SEEDS)
    total_complete += complete
    total_in_progress += in_progress
    total_missing += missing

    status = "✓" if complete == total else "..." if in_progress > 0 else "✗"
    print(f"\n{n_tokens:>7,} tokens: {complete:>2}/{total} complete", end="")
    if in_progress > 0:
        loss_str = f", best loss: {min_loss:.3f}" if min_loss < float('inf') else ""
        print(f", {in_progress} in progress (max epoch: {max_epoch}{loss_str})", end="")
    if missing > 0:
        print(f", {missing} missing", end="")
    print(f" [{status}]")

print("\n" + "-" * 70)
print(f"Total: {total_complete}/{total_expected} complete, "
      f"{total_in_progress} in progress, {total_missing} missing")

# Check screen session
import subprocess
result = subprocess.run(["screen", "-list"], capture_output=True, text=True)
if "ntokens_training" in result.stdout:
    print("\n✓ Screen session 'ntokens_training' is active")
else:
    print("\n✗ No active 'ntokens_training' screen session")

# Check latest log
import glob
logs = sorted(glob.glob("logs/ntokens_training_*.log"))
if logs:
    latest = logs[-1]
    print(f"Latest log: {latest}")
    # Show last 3 lines
    with open(latest) as f:
        lines = f.readlines()
    for line in lines[-3:]:
        print(f"    {line.rstrip()}")
ENDPYTHON
ENDSSH

if [ $? -eq 0 ]; then
    echo ""
    print_success "Status check complete!"
else
    print_error "Failed to check status"
    exit 1
fi
8 changes: 3 additions & 5 deletions code/book_stats.py
@@ -1,10 +1,8 @@
-import pandas as pd
-import requests
-from tqdm import tqdm
-from scipy import stats
-from tokenizer_utils import get_tokenizer
 from constants import AUTHORS, CLEANED_DATA_DIR, DATA_DIR
+import pandas as pd
 
+from tokenizer_utils import get_tokenizer
+from tqdm import tqdm
 
 tokenizer = get_tokenizer("gpt2")
 tokenizer.model_max_length = 1e8