
Paper revision: dataset-size analysis, embedding comparison, response letter#52

Merged
jeremymanning merged 45 commits into main from 001-paper-revision-analyses
Mar 28, 2026

Conversation

@jeremymanning
Member

Summary

Addresses reviewer comments for Computational Linguistics resubmission (Issue #50). Adds two new analyses, updates the paper text, and drafts a point-by-point response letter.

New analyses

  • Dataset-size sweep: 1,520 models trained at 19 token levels (2,500–643,041). Sigmoid fit to accuracy vs log₁₀ tokens (R²=0.979). Estimated ≥95% accuracy threshold: ~51,000 tokens/author.
  • Embedding comparison: 3 MTEB-leaderboard models (nomic 137M, bge-m3 568M, Qwen3-4B 4.0B) evaluated via chunk-level nearest-neighbor leave-one-out (LOO) attribution. Best embedding accuracy: 81.0%, versus 100% for our fine-tuned models. Accuracy decreased as embedding model size increased.
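The sigmoid fit in the dataset-size sweep can be sketched as below. This is a minimal illustration, not the repo's code: the toy token counts and accuracies, the 4-parameter logistic, and the initial guesses are all assumptions; only the overall recipe (fit accuracy vs log₁₀ tokens, then invert the curve at 95%) follows the description above.

```python
# Sketch of the dataset-size sigmoid fit. Data values are invented for
# illustration; the real sweep used 19 token levels (2,500-643,041).
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, L, k, x0, b):
    """Generic 4-parameter logistic: accuracy as a function of log10(tokens)."""
    return b + L / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical sweep results: tokens per author and mean attribution accuracy.
tokens = np.array([2500, 5000, 10000, 20000, 40000, 80000, 160000, 640000])
acc = np.array([0.18, 0.30, 0.55, 0.75, 0.90, 0.97, 0.99, 1.00])

x = np.log10(tokens)
popt, _ = curve_fit(sigmoid, x, acc, p0=[1.0, 1.0, 4.5, 0.0], maxfev=10000)

# Invert the fitted curve to find where accuracy crosses 95%.
L, k, x0, b = popt
x95 = x0 - np.log(L / (0.95 - b) - 1.0) / k
print(f"estimated >=95% threshold: ~{10**x95:,.0f} tokens/author")
```

The same inversion generalizes to any target accuracy below the fitted upper asymptote; above it, the logistic never reaches the target and the log argument goes negative.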

Paper updates

  • New Methods subsections (corpus size analysis, embedding comparison)
  • New Results subsections with figures (Fig. 6: sigmoid + t-test panels, Fig. 7: embedding comparison bar chart)
  • Expanded Discussion: Huang et al. (2025) data purity argument, benchmark infeasibility, cross-domain robustness with new citations
  • Supplementary Materials: embedding purity/confusion figures + comparison table
  • Regenerated oz_losses figure (previously had empty bottom panels)

Response letter

  • 11-page point-by-point response to editor + 3 reviewers
  • Full verbatim reviewer text with interleaved bold responses
  • Inline manuscript quotes and figures
  • All page references filled in

Infrastructure

  • run_llm_stylometry.sh: new figure flags 6/7, sentence-transformers dep
  • 3 new remote scripts for ntokens sweep (tested on tensor02)
  • compile.sh: builds main + supplement + response + latexdiff
  • Black formatting applied across codebase
  • 15 new tests (sigmoid fit + embedding comparison)

Test plan

  • pytest tests/test_sigmoid_fit.py — 7 tests pass
  • pytest tests/test_embedding_comparison.py — 7 tests pass
  • pytest tests/test_dataset_size_support.py — 1 test passes
  • All paper documents compile (./paper/compile.sh all)
  • Fact-checked all numerical claims against computed results
  • PI visual review of figures

Closes #50
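The chunk-level nearest-neighbor LOO evaluation described above can be sketched as follows. The cosine metric, array layout, and toy data are assumptions for illustration; the only committed detail is that each chunk is attributed to the author of its nearest neighbor, excluding itself.

```python
# Minimal sketch of chunk-level leave-one-out nearest-neighbor attribution.
# Embedding values, labels, and the cosine metric are illustrative assumptions.
import numpy as np

def loo_nn_accuracy(emb, labels):
    """Attribute each chunk to the author of its nearest neighbor
    (cosine similarity), never matching a chunk to itself."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)   # leave-one-out: exclude self-matches
    nearest = sim.argmax(axis=1)
    return float((labels[nearest] == labels).mean())

# Toy data: 2 "authors", 4 chunks each, in well-separated regions.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (4, 8)) + 1.0,
                 rng.normal(0, 0.1, (4, 8)) - 1.0])
labels = np.array([0] * 4 + [1] * 4)
print(loo_nn_accuracy(emb, labels))  # well-separated clusters -> 1.0
```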

harrisonstropkay-blip and others added 30 commits March 10, 2026 00:19
…er updates

New analyses for Computational Linguistics revision:
- Sigmoid fit to accuracy vs log10(tokens): R²=0.979, ≥95% threshold ≈51K tokens
- Embedding comparison pipeline (3 MTEB models: nomic, bge-m3, Qwen3-4B)
- Per-book checkpoint/resume support for embedding runs
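The per-book checkpoint/resume support mentioned above might work along these lines; the file layout, JSON format, and function names are assumptions, not the repo's implementation.

```python
# Sketch of per-book checkpoint/resume: skip any book whose result file
# already exists, so an interrupted embedding run can restart cheaply.
# Paths and the compute callback are illustrative assumptions.
import json
from pathlib import Path

def run_with_resume(books, out_dir, compute_embedding_result):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for book in books:
        ckpt = out / f"{book}.json"
        if ckpt.exists():                # already done: resume past it
            continue
        result = compute_embedding_result(book)
        ckpt.write_text(json.dumps(result))
```

Writing one small file per book (rather than one big results file) is what makes resumption trivial: the set of finished books is just the set of files on disk.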

Paper updates:
- New methods subsections (data requirements, embedding comparison)
- New results subsections with figure references (placeholders for embedding results)
- Expanded Discussion: Huang et al. (2025) comparison, benchmark feasibility argument
- MTEB citation added to bibliography
- Response letter draft (paper/admin/response_letter.tex)

Code changes:
- Converted model_results_ntokens.pkl.gz → Parquet (format-stable)
- Removed brittle pd.__version__ assertions from 3 files
- Extracted __main__ from visualization library to standalone script
- New figures: accuracy_vs_tokens_sigmoid.pdf, t_test_ntokens.pdf, n_tokens.pdf
- Replaced old grid/avg ntokens figures with single-panel designs

Tests: 14 new tests (7 sigmoid, 7 embedding) all passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three new scripts (existing scripts unchanged):
- remote_train_ntokens.sh: launch sweep on GPU cluster
- check_ntokens_status.sh: check training progress per token level
- sync_ntokens.sh: download configs/loss logs (not weights) from cluster

Tested on tensor02.dartmouth.edu (8xA6000): connection OK, status check works.
Credentials files created for tensor01 and tensor02 (gitignored).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Session notes with detailed progress tracking
- Constitution with scientific rigor principles
- Spec, plan, tasks for paper revision (67 tasks across 8 phases)
- tensor02 tested: ntokens scripts working, all 1520 models present
- bge-m3 embedding results: 76.2% accuracy (completed)
- Qwen3-4B running locally

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- run_llm_stylometry.sh: figure flags 6/7, sentence-transformers/pyarrow deps, auto-run
- generate_figures.py: dispatch for figures 6 (sigmoid) and 7 (t-test ntokens)
- run_stats.sh: ntokens stats, sigmoid fit results, embedding comparison summaries
- README.md: document dataset-size experiments, sigmoid fit, embedding comparison,
  remote ntokens scripts with usage examples

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Results section: filled in nomic and bge-m3 accuracy, Qwen placeholder remains
- Discussion: substantive text on why embeddings underperform (content vs style)
- Supplement: added embedding appendix with table, purity/confusion figures
- Response letter: filled in embedding summary, fixed section references

Only remaining PLACEHOLDERs are for Qwen3-4B accuracy (running locally).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Response letter: full verbatim reviewer comments with point-by-point responses (8 pages)
- Paper results: filled nomic 81% and bge-m3 76.2% accuracy, Qwen placeholder remains
- Paper discussion: substantive embedding interpretation (content vs style conflation)
- Supplement: embedding appendix with table, purity/confusion figures, interpretation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All claims depending on Qwen3-4B results are now either:
- Highlighted yellow (\colorbox{yellow}{...}) in response letter
- Marked with % NOTE/TODO/VERIFY comments in main.tex and supplement.tex
- Explicit [PLACEHOLDER] or TBD markers

Inventory: 3 numbers to fill, ~7 interpretive claims to verify, 14 page refs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Response letter: \parskip set to \baselineskip (single blank line between paragraphs)
- requirements-dev.txt: added sentence-transformers and pyarrow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reformatted 80 Python files with black (line length 88).
Applied ruff auto-fixes where applicable.
No functional changes — formatting only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3-Embedding-4B complete: 70.2% (59/84) — worst of 3 models.
Inverse size-accuracy relationship: 81.0% (137M) > 76.2% (568M) > 70.2% (4B).

- Filled all PLACEHOLDERs in main.tex, supplement.tex, response_letter.tex
- Removed all yellow highlights (all claims now verified)
- Updated supplement interpretation with full 3-model findings
- Regenerated all 3 embedding figures with complete data
- All paper documents compile cleanly (31 + 10 + 11 pages)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Results section: expanded with inverse size-accuracy pattern (81% > 76% > 70%),
per-author findings (Qwen fails on Fitzgerald 0/8, Thompson 38.5%),
and cross-reference to supplementary table.

Discussion: added inverse pattern interpretation (larger models may amplify
content similarity at expense of stylistic distinction), Dickens magnet
observation, contrast with author-specific training.

Supplement: updated interpretation with full 3-model findings.
Removed all stale TODO/VERIFIED comments.
Fixed cross-document table reference (tab:embedding-comparison -> Supp Table 1).
All documents compile with zero undefined references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jeremymanning and others added 15 commits March 26, 2026 10:31
…nt bug

- Added embedding comparison figure (Fig. embedding-comparison) to main paper
- Fixed supplement Supp. Fig. 6: wrong file (content_only -> pos)
- Tightened results paragraph: removed redundant methodology recap
- Tightened discussion paragraph: removed repeated accuracy numbers
- All documents compile cleanly (31 + 10 + 11 pages)
- Zero actual undefined references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Line 36: Split long email addresses across 3 lines with \small
- Line 877: Added \small to Austen/Twain corpus table
- Line 897: Added \small to Fitzgerald/Wells corpus table
- Supplement: Fixed POS t-stats figure (was using content_only.pdf)

Zero overfull hbox warnings after fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously 19KB with empty bottom panels (Contested, Non-Oz Baum, Non-Oz
Thompson). Now 192KB with all 6 panels populated correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed \author{} missing closing brace + added \date{}
- Added embedding figure macros (\embeddingpurity, \embeddingconfusion)
- Main text references Supp. Figs 9 and 10 for embedding details
- Supplement: moved embedding figures before table
- Added old.tex (from main branch) for latexdiff
- Updated compile.sh with diff target
- All 4 documents compile (main, supplement, response, diff)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Removed all narrative text from supplement embedding section
  (methods description, notable patterns discussion)
- Added MTEB ranking point to main text results paragraph
- Supplement now contains only figures, tables, and captions
- Added \clearpage before embedding table to enforce ordering
- Figures appear before tables in all supplement sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Used \captionof{figure} instead of figure* floats to force exact placement.
Added caption package. Embedding purity (Fig 9) and confusion (Fig 10) now
appear on page 8, before the embedding table (Table 4) on page 9.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Added old_supplement.tex from main branch for latexdiff
- compile.sh now generates both main and supplement diffs
- Fixed cd issue: cd back to SCRIPT_DIR before compile_diff in 'all' case
- All 5 documents compile: main, supplement, response, diff, diff_supplement

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added 3 new citations (verified via web search):
- Stamatatos (2009): JASIST 60(3), 538-556 — survey of attribution methods
- Stamatatos (2018): JASIST 69(3), 461-473 — topic masking for cross-topic AA
- Fincke & Boschee (2024): arXiv:2408.05192 — cross-genre data selection

Expanded Discussion cross-domain paragraph with:
- Attribution degrades across genres/topics (Stam09, BarlStam20)
- Topic masking helps cross-topic (Stam18)
- Multi-genre pooling alone doesn't help (FincBosc24)
- Oz analysis is preliminary evidence, not controlled evaluation
- Cross-genre/register evaluation identified as future work

Updated response letter to accurately describe the new discussion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed quote 2: added full sentence ending ("because the only text...")
- Updated section references to match renamed sections:
  "Training data requirements" -> "Corpus size analysis"
  "Comparison with text embedding methods" -> "Comparisons between
   predictive comparison and text embedding matching"
- All 3 inline paper quotes verified verbatim against main.tex

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Grammar:
- "GPT-2 implicitly learn" -> "learns" (subject-verb agreement)
Redundancy:
- Removed duplicate "characterize the data requirements" phrase

Also added .specify/ and specs/ to .gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Line 641: comma splice fixed with semicolons (Blogs50; CCAT50; Guardian)
- Line 652: "These would not" -> "These analyses would not" (clarify antecedent)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Page numbers remain yellow-highlighted for easy updating if pages shift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parquet file was 140MB (over GitHub's 100MB limit). Re-serialized the pkl.gz
from parquet under numpy 1.x (98MB, under the limit); the original pkl.gz
required numpy 2.x, so the new file also restores numpy 1.x compatibility.

Updated all references across 15 files (Python, shell, README, gitignore).
All 8 tests pass with the new pkl.gz.
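A conversion along the lines described in this commit might look like the sketch below. The round-trip pattern (read parquet, dump a gzipped pickle, reload it) is the point; the paths and helper names are assumptions, and `read_parquet` requires pyarrow as noted in the dependency changes.

```python
# Sketch: re-serialize a parquet table as gzipped pickle so the artifact
# loads under numpy 1.x and stays under GitHub's 100MB limit.
# File paths and helper names are illustrative assumptions.
import gzip
import pickle
import pandas as pd

def parquet_to_pklgz(parquet_path, pklgz_path):
    df = pd.read_parquet(parquet_path)   # requires pyarrow
    with gzip.open(pklgz_path, "wb") as f:
        # protocol 4 keeps the file loadable on older Python/pickle stacks
        pickle.dump(df, f, protocol=4)

def load_pklgz(pklgz_path):
    with gzip.open(pklgz_path, "rb") as f:
        return pickle.load(f)
```

Note that a pickle written by a numpy 2.x process can embed numpy 2.x internals, which is why re-serializing under the numpy version the rest of the pipeline uses matters here.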

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jeremymanning jeremymanning merged commit 1411b89 into main Mar 28, 2026
3 checks passed