Merged
Commits
45 commits
080a733
add minimal model-training shims
harrisonstropkay-blip Mar 10, 2026
1ea20d4
data consolidation --include_ntokens
harrisonstropkay-blip Mar 11, 2026
4fa7715
gitignore Harrison's infra artifacts
harrisonstropkay-blip Mar 12, 2026
f28dbe0
20%, 40%, 60%, and 80% runs
harrisonstropkay-blip Mar 12, 2026
55e73d4
compressed results
harrisonstropkay-blip Mar 12, 2026
952f856
ntoken t-tests and results pkl
harrisonstropkay-blip Mar 13, 2026
aac3d1e
cache t-test results; add legend to fig; compute t-test stats
harrisonstropkay-blip Mar 13, 2026
c4bc3ca
catch empty df error
harrisonstropkay-blip Mar 15, 2026
2d91a8b
7 more runs
harrisonstropkay-blip Mar 19, 2026
e54be3e
results pkl
harrisonstropkay-blip Mar 19, 2026
4054577
t test cache
harrisonstropkay-blip Mar 19, 2026
d7c14d3
results: 12 runs total for ntokens
harrisonstropkay-blip Mar 19, 2026
9d4f3df
runs
harrisonstropkay-blip Mar 23, 2026
b4ba638
code
harrisonstropkay-blip Mar 23, 2026
f5ee9a9
data
harrisonstropkay-blip Mar 23, 2026
b5e355f
Add dataset-size analysis, sigmoid fit, embedding comparison, and pap…
jeremymanning Mar 24, 2026
ca28ff1
Add remote scripts for ntokens dataset-size sweep
jeremymanning Mar 24, 2026
2bd5d50
Update session notes, speckit artifacts, and constitution
jeremymanning Mar 24, 2026
ab0cba4
Update CLI, stats script, and README for new analyses
jeremymanning Mar 24, 2026
7ee439b
Update session notes with final status
jeremymanning Mar 24, 2026
76ea45c
Update paper text with embedding results (nomic 81%, bge-m3 76.2%)
jeremymanning Mar 24, 2026
b5885f4
Final session notes with Qwen resume instructions
jeremymanning Mar 25, 2026
9f3051d
Finalize response letter and paper text
jeremymanning Mar 25, 2026
27932d5
Audit and highlight all unverified claims in paper files
jeremymanning Mar 25, 2026
e5c65a5
Fix response letter paragraph spacing, add deps to requirements-dev.txt
jeremymanning Mar 25, 2026
c44f4a4
Apply black formatting and ruff fixes across entire codebase
jeremymanning Mar 25, 2026
5750684
Update session notes with comprehensive status
jeremymanning Mar 25, 2026
26e91a6
Add paper compile script (main + supplement + response letter)
jeremymanning Mar 25, 2026
a0e71da
Finalize all embedding results and update paper
jeremymanning Mar 25, 2026
f91fae5
Complete embedding integration into paper text
jeremymanning Mar 26, 2026
b31319f
Polish paper text: add embedding figure, fix redundancy, fix suppleme…
jeremymanning Mar 26, 2026
fb97e74
Fix all text overflow issues (3 overfull hbox warnings)
jeremymanning Mar 26, 2026
1cb50d2
Regenerate oz_losses figure — fix empty bottom panels
jeremymanning Mar 26, 2026
55791eb
Fix author block, add diff support, reorder supplement figures
jeremymanning Mar 26, 2026
ff47266
Strip narrative from supplement, integrate into main text
jeremymanning Mar 26, 2026
73c5f28
Fix supplement figure ordering — embedding figs now appear before table
jeremymanning Mar 26, 2026
b95d838
Add supplement diff, fix compile.sh cd issue
jeremymanning Mar 26, 2026
993e537
Expand cross-domain discussion with verified citations
jeremymanning Mar 27, 2026
ebc335b
Verify response letter quotes and fix section name mismatches
jeremymanning Mar 27, 2026
4bf6697
Proofread fixes + add speckit to .gitignore
jeremymanning Mar 27, 2026
8778b96
Additional proofread fixes from agent review
jeremymanning Mar 27, 2026
ec53d25
Fill in all 15 page number references in response letter
jeremymanning Mar 27, 2026
cf36173
Add .omc/ and .claude/ to .gitignore
jeremymanning Mar 27, 2026
d1443ad
paper updates
jeremymanning Mar 27, 2026
fd7eeef
Switch from parquet back to pkl.gz for GitHub compatibility
jeremymanning Mar 28, 2026
22 changes: 22 additions & 0 deletions .gitignore
@@ -49,3 +49,25 @@ model_weights_*.tar.gz

# HuggingFace models (uploaded to HuggingFace, not needed in git)
models_hf/

# Harrison's infra artifacts
runai_notes.md
start_gpu_workspace.sh

# Speckit artifacts
.specify/
specs/

# Oh My Claude Code artifacts
.omc/

# Claude Code artifacts
.claude/

# Embedding comparison results (large cached embeddings)
data/embedding_results/

# Cached data (regenerable)
data/model_results_ntokens.parquet
data/t_test_ntokens_cache/
notes/paper-review-plan.md
43 changes: 43 additions & 0 deletions .specify/memory/constitution.md
@@ -0,0 +1,43 @@
# LLM Stylometry Constitution

## Core Principles

### I. Scientific Accuracy
All analyses must produce correct, verifiable results. Every statistical claim must be backed by reproducible computation. No results may be manually adjusted or cherry-picked. When code produces unexpected results, investigate the cause before proceeding — do not paper over anomalies.

### II. Replicability
Every experiment must be fully reproducible from the repository alone. This means: (a) all data processing steps are scripted, not manual; (b) random seeds are fixed and documented; (c) environment requirements (Python version, package versions) are pinned; (d) pre-computed results include the exact commands used to generate them. A new researcher should be able to clone the repo and reproduce every figure and table.

### III. Robust Documentation
Code, analyses, and results must be documented at three levels: (a) inline comments for non-obvious logic; (b) docstrings for all public functions with parameter descriptions; (c) README and paper text that explain the *why* behind methodological choices. Documentation must be updated whenever code changes — stale docs are worse than no docs.

### IV. Data Purity
Training data must be strictly separated from evaluation data. When using language models, training from scratch is preferred over fine-tuning pre-trained models to guarantee that held-out texts were never seen during pre-training. All data provenance must be traceable (e.g., Project Gutenberg IDs).

### V. Statistical Rigor
All quantitative claims must include appropriate uncertainty estimates (confidence intervals, p-values, or bootstrap ranges). Multiple random seeds (minimum 10) must be used to estimate variability. Effect sizes must accompany significance tests. When fitting models to data, report goodness-of-fit metrics and visualize residuals.
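
A percentile bootstrap is one way to produce the required uncertainty estimates. The sketch below uses made-up accuracy numbers (not results from this repository) to illustrate the idea:

```python
import numpy as np

def bootstrap_ci(values, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)  # fixed seed, per Principle II
    values = np.asarray(values, dtype=float)
    # Resample with replacement and take the mean of each resample
    means = rng.choice(values, size=(n_boot, values.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Accuracies from 10 seeds (illustrative numbers only)
accs = [0.94, 0.97, 0.95, 0.96, 0.93, 0.98, 0.95, 0.96, 0.94, 0.97]
lo, hi = bootstrap_ci(accs)
print(f"mean = {np.mean(accs):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```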

### VI. Backward Compatibility
New analyses must not break existing functionality. The dual codebase (`code/` for paper scripts, `llm_stylometry/` for the package) must remain consistent. Legacy model naming conventions must continue to work alongside new conventions (e.g., models without `_ntokens=` in their name default to the full token budget).
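
The naming convention could be handled by a small parser along these lines (a sketch — `parse_model_name` is a hypothetical helper; 643041 is the full token budget used by the sweep scripts):

```python
import re

FULL_TOKEN_BUDGET = 643041  # full-corpus budget, per the sweep scripts

def parse_model_name(name: str) -> dict:
    """Parse names like 'austen_tokenizer=gpt2_ntokens=128608_seed=3'.

    Legacy names omit '_ntokens=' and default to the full token budget.
    """
    m = re.match(
        r"(?P<author>[a-z]+)_tokenizer=(?P<tokenizer>\w+?)"
        r"(?:_ntokens=(?P<ntokens>\d+))?_seed=(?P<seed>\d+)$",
        name,
    )
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    ntokens = int(m["ntokens"]) if m["ntokens"] else FULL_TOKEN_BUDGET
    return {"author": m["author"], "tokenizer": m["tokenizer"],
            "ntokens": ntokens, "seed": int(m["seed"])}

print(parse_model_name("austen_tokenizer=gpt2_seed=0")["ntokens"])  # 643041
print(parse_model_name("baum_tokenizer=gpt2_ntokens=128608_seed=3")["ntokens"])  # 128608
```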

## Quality Gates

- All tests must pass before merging (`pytest tests/ -v`)
- Code must be formatted (`black .`) and linted (`ruff check .`)
- New analyses require at least one test verifying correctness
- Figures must be generated programmatically, never manually edited
- Pickle files should include the generation command in comments or companion scripts
- Sensitive data (credentials, API keys) must never be committed

## Development Workflow

- Work on feature branches; merge to main via pull request
- Commit frequently with descriptive messages during development
- Pre-computed results (`.pkl`, `.pkl.gz`) should be regenerable from raw data + code
- When pandas version constraints exist for serialized data, prefer format-stable alternatives (CSV, Parquet) or document the constraint prominently
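
The format-stability point can be illustrated with a CSV round trip (a minimal sketch with toy data — CSV has no pandas-version pickle dependency):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"author": ["austen", "twain"], "accuracy": [0.96, 0.91]})

# Round-trip through CSV: readable by any pandas version, unlike pickle
path = os.path.join(tempfile.mkdtemp(), "results.csv")
df.to_csv(path, index=False)
restored = pd.read_csv(path)
assert restored.equals(df)  # same values and dtypes recovered
```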

## Governance

This constitution governs all code and analysis in the llm-stylometry repository. It supersedes informal conventions. Amendments require documentation of the rationale and must not weaken scientific rigor guarantees.

**Version**: 1.0.0 | **Ratified**: 2026-03-24 | **Last Amended**: 2026-03-24
72 changes: 70 additions & 2 deletions README.md
@@ -19,8 +19,11 @@ llm-stylometry/
├── tests/ # Test suite
├── run_llm_stylometry.sh # Main CLI wrapper
├── remote_train.sh # GPU cluster training
├── remote_train_ntokens.sh # Dataset-size sweep on GPU cluster
├── check_remote_status.sh # Monitor remote training
├── check_ntokens_status.sh # Monitor dataset-size sweep
├── sync_models.sh # Download trained models
└── sync_ntokens.sh # Download dataset-size sweep results
```

See folder-specific README files for detailed documentation.
@@ -153,6 +156,18 @@ Training 320 models (baseline + 3 variants) requires a CUDA GPU. See `models/REA
./run_llm_stylometry.sh -t -r # Resume from checkpoints
```

**Dataset-size experiments:**

Pre-computed results are available in `data/model_results_ntokens.pkl.gz`, so retraining is not required to generate figures or run analyses. To retrain locally at specific token levels:
```bash
N_TRAIN_TOKENS=128608 python code/main.py # ~20% of full corpus
N_TRAIN_TOKENS=257216 python code/main.py # ~40%
N_TRAIN_TOKENS=385825 python code/main.py # ~60%
N_TRAIN_TOKENS=514433 python code/main.py # ~80%
```

The full sweep uses 19 token levels from ~33k to ~643k. See `code/constants.py` for the complete list.

**Remote training:**

Requires GPU cluster with SSH access. Create `.ssh/credentials_mycluster.json`:
@@ -168,7 +183,56 @@ Then from local machine:
./sync_models.sh --cluster mycluster -a # Download when complete
```

**Remote dataset-size sweep:**
```bash
./remote_train_ntokens.sh --cluster mycluster # Train all 19 token levels
./remote_train_ntokens.sh --cluster mycluster --tokens 128608,257216 # Specific levels only
./remote_train_ntokens.sh --cluster mycluster -r # Resume from checkpoints
./check_ntokens_status.sh --cluster mycluster # Monitor sweep progress
./sync_ntokens.sh --cluster mycluster # Download results (configs + logs, not weights)
```

All remote scripts train in detached screen sessions on the GPU server. See script help (`-h`) for full options.

## Additional Analyses

### Sigmoid Fit Analysis

Fits a sigmoid curve to classification accuracy as a function of log training tokens, reporting the minimum dataset size needed for >=95% expected accuracy:

```bash
python code/fit_sigmoid.py
```

**Output:**
- `paper/figs/source/accuracy_vs_tokens_sigmoid.pdf` (Figure 6)
- `data/sigmoid_fit_results.json` (fit parameters and threshold)

Uses pre-computed results from `data/model_results_ntokens.pkl.gz` (no retraining needed).
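
The shape of such a fit can be sketched with `scipy.optimize.curve_fit` (illustrative only — the parameterization and accuracy values below are made up; the real analysis lives in `code/fit_sigmoid.py`):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_tokens, lower, upper, midpoint, slope):
    """Accuracy as a logistic function of log10(training tokens)."""
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (log_tokens - midpoint)))

# Illustrative accuracies (made up), rising with dataset size
tokens = np.array([2500.0, 10000.0, 40000.0, 80000.0, 160000.0, 643041.0])
acc = np.array([0.14, 0.35, 0.70, 0.85, 0.93, 0.97])

params, _ = curve_fit(sigmoid, np.log10(tokens), acc,
                      p0=[0.125, 1.0, 4.5, 2.0], maxfev=10000)

# Smallest token count at which the fitted curve reaches 95% accuracy
grid = np.linspace(np.log10(tokens[0]), np.log10(tokens[-1]), 1000)
above = grid[sigmoid(grid, *params) >= 0.95]
if above.size:
    print(f"~{10 ** above[0]:,.0f} tokens for >=95% expected accuracy")
```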

### Embedding Comparison

Compares our cross-entropy approach against text-embedding nearest-neighbor classification using three models from the MTEB leaderboard:

| Model | Parameters |
|-|-|
| nomic-ai/nomic-embed-text-v1.5 | 137M |
| BAAI/bge-m3 | 568M |
| Qwen/Qwen3-Embedding-4B | 4.0B |

**Prerequisites:**
```bash
pip install sentence-transformers
```

**Usage:**
```bash
python code/embedding_comparison.py # Run all 3 models
python code/embedding_comparison.py --model nomic-ai/nomic-embed-text-v1.5 # Single model
python code/embedding_comparison.py --figures-only # Generate figures from cached results
```

Results are cached in `data/embedding_results/` so subsequent runs with `--figures-only` skip embedding computation.
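
The nearest-neighbor step reduces to a cosine-similarity argmax. The sketch below uses random placeholder vectors in place of real sentence-transformers embeddings:

```python
import numpy as np

def nearest_neighbor_author(query_vec, ref_vecs, ref_authors):
    """Assign the author of the most cosine-similar reference embedding.

    In the real pipeline the vectors come from a sentence-transformers
    model (e.g. nomic-embed-text-v1.5); here they are random placeholders.
    """
    q = query_vec / np.linalg.norm(query_vec)
    r = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    return ref_authors[int(np.argmax(r @ q))]

rng = np.random.default_rng(0)
refs = rng.normal(size=(4, 8))  # one placeholder embedding per author
authors = ["austen", "baum", "dickens", "twain"]

# A query vector near the 'dickens' reference should be labeled 'dickens'
query = refs[2] + 0.01 * rng.normal(size=8)
print(nearest_neighbor_author(query, refs, authors))  # dickens
```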

## Data

@@ -233,6 +297,10 @@ from llm_stylometry.visualization import (
generate_oz_losses_figure # Figure 5: Oz analysis
)

# Additional standalone analyses (run as scripts)
# Figure 6: Sigmoid fit — python code/fit_sigmoid.py
# Figure 7: T-test ntokens — python code/generate_figures.py --figure 7

# Fairness-based loss thresholding (for variant comparisons)
from llm_stylometry.analysis.fairness import (
compute_fairness_threshold, # Compute fairness threshold
214 changes: 214 additions & 0 deletions check_ntokens_status.sh
@@ -0,0 +1,214 @@
#!/bin/bash

# Check ntokens sweep training status on remote GPU server
#
# Usage: ./check_ntokens_status.sh --cluster tensor02

set -e

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

print_info() { echo -e "${BLUE}[INFO]${NC} $1"; }
print_success() { echo -e "${GREEN}[SUCCESS]${NC} $1"; }
print_warning() { echo -e "${YELLOW}[WARNING]${NC} $1"; }
print_error() { echo -e "${RED}[ERROR]${NC} $1"; }

CLUSTER=""

while [[ $# -gt 0 ]]; do
    case $1 in
        --cluster)
            CLUSTER="$2"
            shift 2
            ;;
        -h|--help)
            echo "Usage: $0 --cluster NAME"
            echo "Check ntokens sweep training status on remote GPU server"
            exit 0
            ;;
        *)
            print_error "Unknown option: $1"
            exit 1
            ;;
    esac
done

if [ -z "$CLUSTER" ]; then
    print_error "Cluster must be specified with --cluster flag"
    exit 1
fi

CRED_FILE=".ssh/credentials_${CLUSTER}.json"
if [ -f "$CRED_FILE" ]; then
    SERVER_ADDRESS=$(python3 -c "import json; print(json.load(open('$CRED_FILE'))['server'])" 2>/dev/null)
    USERNAME=$(python3 -c "import json; print(json.load(open('$CRED_FILE'))['username'])" 2>/dev/null)
    PASSWORD=$(python3 -c "import json; print(json.load(open('$CRED_FILE'))['password'])" 2>/dev/null)

    if [ -z "$SERVER_ADDRESS" ] || [ -z "$USERNAME" ] || [ -z "$PASSWORD" ]; then
        print_error "Failed to read credentials from $CRED_FILE"
        exit 1
    fi
    USE_SSHPASS=true
else
    print_warning "No credentials file at $CRED_FILE"
    read -p "Enter server address: " SERVER_ADDRESS
    read -p "Enter username: " USERNAME
    USE_SSHPASS=false
fi

if [ "$USE_SSHPASS" = true ]; then
    if ! command -v sshpass &> /dev/null; then
        print_error "sshpass required: brew install hudochenkov/sshpass/sshpass"
        exit 1
    fi
    SSH_CMD="sshpass -p '$PASSWORD' ssh -o StrictHostKeyChecking=no"
else
    SSH_CMD="ssh"
fi

print_info "Checking ntokens sweep status on $CLUSTER..."
echo ""

eval "$SSH_CMD \"$USERNAME@$SERVER_ADDRESS\" 'bash -s'" << 'ENDSSH'
#!/bin/bash

cd ~/llm-stylometry || { echo "ERROR: Project directory not found"; exit 1; }

if ! command -v conda &> /dev/null; then
    echo "ERROR: conda not found"
    exit 1
fi

eval "$(conda shell.bash hook)" 2>/dev/null || true
conda activate llm-stylometry 2>/dev/null || true

python3 << 'ENDPYTHON'
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
from collections import defaultdict

models_dir = Path("models")
if not models_dir.exists():
    print("ERROR: models/ directory not found")
    exit(1)

# Token levels we expect
TOKEN_LEVELS = [2500, 5000, 10000, 20000, 40000, 45000, 50000, 55000,
                60000, 65000, 70000, 75000, 80000, 128608, 160000,
                257216, 385825, 514433, 643041]
AUTHORS = ["austen", "baum", "dickens", "fitzgerald", "melville",
           "thompson", "twain", "wells"]
SEEDS = list(range(10))

print("=" * 70)
print("NTOKENS SWEEP TRAINING STATUS")
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 70)

total_expected = len(TOKEN_LEVELS) * len(AUTHORS) * len(SEEDS)
total_complete = 0
total_in_progress = 0
total_missing = 0

for n_tokens in TOKEN_LEVELS:
    complete = 0
    in_progress = 0
    missing = 0
    max_epoch = 0
    min_loss = float('inf')

    for author in AUTHORS:
        for seed in SEEDS:
            if n_tokens == 643041:
                # Legacy baseline naming (no ntokens in name)
                model_name = f"{author}_tokenizer=gpt2_seed={seed}"
            else:
                model_name = f"{author}_tokenizer=gpt2_ntokens={n_tokens}_seed={seed}"

            model_dir = models_dir / model_name
            loss_file = model_dir / "loss_logs.csv"

            if not model_dir.exists():
                missing += 1
                continue

            if not loss_file.exists():
                missing += 1
                continue

            try:
                df = pd.read_csv(loss_file)
                if df.empty:
                    missing += 1
                    continue

                epoch = df["epochs_completed"].max()
                train_loss = df[(df["epochs_completed"] == epoch) &
                                (df["loss_dataset"] == "train")]["loss_value"]

                if not train_loss.empty:
                    loss = train_loss.iloc[0]
                    if loss <= 3.0 and epoch >= 500:
                        complete += 1
                    else:
                        in_progress += 1
                        max_epoch = max(max_epoch, epoch)
                        min_loss = min(min_loss, loss)
                else:
                    in_progress += 1
            except Exception:
                missing += 1

    total = len(AUTHORS) * len(SEEDS)
    total_complete += complete
    total_in_progress += in_progress
    total_missing += missing

    status = "✓" if complete == total else "..." if in_progress > 0 else "✗"
    print(f"\n{n_tokens:>7,} tokens: {complete:>2}/{total} complete", end="")
    if in_progress > 0:
        loss_str = f", best loss: {min_loss:.3f}" if min_loss < float('inf') else ""
        print(f", {in_progress} in progress (max epoch: {max_epoch}{loss_str})", end="")
    if missing > 0:
        print(f", {missing} missing", end="")
    print(f" [{status}]")

print("\n" + "-" * 70)
print(f"Total: {total_complete}/{total_expected} complete, "
      f"{total_in_progress} in progress, {total_missing} missing")

# Check screen session
import subprocess
result = subprocess.run(["screen", "-list"], capture_output=True, text=True)
if "ntokens_training" in result.stdout:
    print("\n✓ Screen session 'ntokens_training' is active")
else:
    print("\n✗ No active 'ntokens_training' screen session")

# Check latest log
import glob
logs = sorted(glob.glob("logs/ntokens_training_*.log"))
if logs:
    latest = logs[-1]
    print(f"Latest log: {latest}")
    # Show last 3 lines
    with open(latest) as f:
        lines = f.readlines()
    for line in lines[-3:]:
        print(f"    {line.rstrip()}")
ENDPYTHON
ENDSSH

if [ $? -eq 0 ]; then
    echo ""
    print_success "Status check complete!"
else
    print_error "Failed to check status"
    exit 1
fi
8 changes: 3 additions & 5 deletions code/book_stats.py
@@ -1,10 +1,8 @@
-import pandas as pd
-import requests
-from tqdm import tqdm
-from scipy import stats
-from tokenizer_utils import get_tokenizer
 from constants import AUTHORS, CLEANED_DATA_DIR, DATA_DIR
+import pandas as pd
 
+from tokenizer_utils import get_tokenizer
+from tqdm import tqdm
 
 tokenizer = get_tokenizer("gpt2")
 tokenizer.model_max_length = 1e8