
Paper revision: dataset-size analysis, embedding comparison, response letter#52

Merged
jeremymanning merged 45 commits into main from 001-paper-revision-analyses
Mar 28, 2026

Conversation

@jeremymanning
Member

Summary

Addresses reviewer comments for Computational Linguistics resubmission (Issue #50). Adds two new analyses, updates the paper text, and drafts a point-by-point response letter.

New analyses

  • Dataset-size sweep: 1,520 models trained at 19 token levels (2,500–643,041). Sigmoid fit to accuracy vs log₁₀ tokens (R²=0.979). Estimated ≥95% accuracy threshold: ~51,000 tokens/author.
  • Embedding comparison: 3 MTEB-leaderboard models (nomic 137M, bge-m3 568M, Qwen3-4B 4.0B) evaluated via chunk-level nearest-neighbor leave-one-out (LOO) attribution. Best embedding accuracy: 81.0%, versus 100% for our fine-tuned models. Accuracy decreased as embedding model size increased.
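The sigmoid fit in the dataset-size sweep can be sketched as below. This is a minimal illustration, not the repo's code: the toy token counts and accuracies, the 4-parameter logistic, and the initial guesses are all assumptions; only the overall recipe (fit accuracy vs log₁₀ tokens, then invert the curve at 95%) follows the description above.

```python
# Sketch of the dataset-size sigmoid fit. Data values are invented for
# illustration; the real sweep used 19 token levels (2,500-643,041).
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, L, k, x0, b):
    """Generic 4-parameter logistic: accuracy as a function of log10(tokens)."""
    return b + L / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical sweep results: tokens per author and mean attribution accuracy.
tokens = np.array([2500, 5000, 10000, 20000, 40000, 80000, 160000, 640000])
acc = np.array([0.18, 0.30, 0.55, 0.75, 0.90, 0.97, 0.99, 1.00])

x = np.log10(tokens)
popt, _ = curve_fit(sigmoid, x, acc, p0=[1.0, 1.0, 4.5, 0.0], maxfev=10000)

# Invert the fitted curve to find where accuracy crosses 95%.
L, k, x0, b = popt
x95 = x0 - np.log(L / (0.95 - b) - 1.0) / k
print(f"estimated >=95% threshold: ~{10**x95:,.0f} tokens/author")
```

The same inversion generalizes to any target accuracy below the fitted upper asymptote; above it, the logistic never reaches the target and the log argument goes negative.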

Paper updates

  • New Methods subsections (corpus size analysis, embedding comparison)
  • New Results subsections with figures (Fig. 6: sigmoid + t-test panels, Fig. 7: embedding comparison bar chart)
  • Expanded Discussion: Huang et al. (2025) data purity argument, benchmark infeasibility, cross-domain robustness with new citations
  • Supplementary Materials: embedding purity/confusion figures + comparison table
  • Regenerated oz_losses figure (previously had empty bottom panels)

Response letter

  • 11-page point-by-point response to editor + 3 reviewers
  • Full verbatim reviewer text with interleaved bold responses
  • Inline manuscript quotes and figures
  • All page references filled in

Infrastructure

  • run_llm_stylometry.sh: new figure flags 6/7, sentence-transformers dep
  • 3 new remote scripts for ntokens sweep (tested on tensor02)
  • compile.sh: builds main + supplement + response + latexdiff
  • Black formatting applied across codebase
  • 15 new tests (sigmoid fit + embedding comparison)

Test plan

  • pytest tests/test_sigmoid_fit.py — 7 tests pass
  • pytest tests/test_embedding_comparison.py — 7 tests pass
  • pytest tests/test_dataset_size_support.py — 1 test passes
  • All paper documents compile (./paper/compile.sh all)
  • Fact-checked all numerical claims against computed results
  • PI visual review of figures

Closes #50
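The chunk-level nearest-neighbor LOO evaluation described above can be sketched as follows. The cosine metric, array layout, and toy data are assumptions for illustration; the only committed detail is that each chunk is attributed to the author of its nearest neighbor, excluding itself.

```python
# Minimal sketch of chunk-level leave-one-out nearest-neighbor attribution.
# Embedding values, labels, and the cosine metric are illustrative assumptions.
import numpy as np

def loo_nn_accuracy(emb, labels):
    """Attribute each chunk to the author of its nearest neighbor
    (cosine similarity), never matching a chunk to itself."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)   # leave-one-out: exclude self-matches
    nearest = sim.argmax(axis=1)
    return float((labels[nearest] == labels).mean())

# Toy data: 2 "authors", 4 chunks each, in well-separated regions.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (4, 8)) + 1.0,
                 rng.normal(0, 0.1, (4, 8)) - 1.0])
labels = np.array([0] * 4 + [1] * 4)
print(loo_nn_accuracy(emb, labels))  # well-separated clusters -> 1.0
```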

harrisonstropkay-blip and others added 30 commits March 10, 2026 00:19
…er updates

New analyses for Computational Linguistics revision:
- Sigmoid fit to accuracy vs log10(tokens): R²=0.979, ≥95% threshold ≈51K tokens
- Embedding comparison pipeline (3 MTEB models: nomic, bge-m3, Qwen3-4B)
- Per-book checkpoint/resume support for embedding runs
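The per-book checkpoint/resume support mentioned above might work along these lines; the file layout, JSON format, and function names are assumptions, not the repo's implementation.

```python
# Sketch of per-book checkpoint/resume: skip any book whose result file
# already exists, so an interrupted embedding run can restart cheaply.
# Paths and the compute callback are illustrative assumptions.
import json
from pathlib import Path

def run_with_resume(books, out_dir, compute_embedding_result):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for book in books:
        ckpt = out / f"{book}.json"
        if ckpt.exists():                # already done: resume past it
            continue
        result = compute_embedding_result(book)
        ckpt.write_text(json.dumps(result))
```

Writing one small file per book (rather than one big results file) is what makes resumption trivial: the set of finished books is just the set of files on disk.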

Paper updates:
- New methods subsections (data requirements, embedding comparison)
- New results subsections with figure references (placeholders for embedding results)
- Expanded Discussion: Huang et al. (2025) comparison, benchmark feasibility argument
- MTEB citation added to bibliography
- Response letter draft (paper/admin/response_letter.tex)

Code changes:
- Converted model_results_ntokens.pkl.gz → Parquet (format-stable)
- Removed brittle pd.__version__ assertions from 3 files
- Extracted __main__ from visualization library to standalone script
- New figures: accuracy_vs_tokens_sigmoid.pdf, t_test_ntokens.pdf, n_tokens.pdf
- Replaced old grid/avg ntokens figures with single-panel designs

Tests: 14 new tests (7 sigmoid, 7 embedding) all passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three new scripts (existing scripts unchanged):
- remote_train_ntokens.sh: launch sweep on GPU cluster
- check_ntokens_status.sh: check training progress per token level
- sync_ntokens.sh: download configs/loss logs (not weights) from cluster

Tested on tensor02.dartmouth.edu (8xA6000): connection OK, status check works.
Credentials files created for tensor01 and tensor02 (gitignored).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Session notes with detailed progress tracking
- Constitution with scientific rigor principles
- Spec, plan, tasks for paper revision (67 tasks across 8 phases)
- tensor02 tested: ntokens scripts working, all 1520 models present
- bge-m3 embedding results: 76.2% accuracy (completed)
- Qwen3-4B running locally

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- run_llm_stylometry.sh: figure flags 6/7, sentence-transformers/pyarrow deps, auto-run
- generate_figures.py: dispatch for figures 6 (sigmoid) and 7 (t-test ntokens)
- run_stats.sh: ntokens stats, sigmoid fit results, embedding comparison summaries
- README.md: document dataset-size experiments, sigmoid fit, embedding comparison,
  remote ntokens scripts with usage examples

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Results section: filled in nomic and bge-m3 accuracy, Qwen placeholder remains
- Discussion: substantive text on why embeddings underperform (content vs style)
- Supplement: added embedding appendix with table, purity/confusion figures
- Response letter: filled in embedding summary, fixed section references

Only remaining PLACEHOLDERs are for Qwen3-4B accuracy (running locally).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Response letter: full verbatim reviewer comments with point-by-point responses (8 pages)
- Paper results: filled nomic 81% and bge-m3 76.2% accuracy, Qwen placeholder remains
- Paper discussion: substantive embedding interpretation (content vs style conflation)
- Supplement: embedding appendix with table, purity/confusion figures, interpretation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All claims depending on Qwen3-4B results are now either:
- Highlighted yellow (\colorbox{yellow}{...}) in response letter
- Marked with % NOTE/TODO/VERIFY comments in main.tex and supplement.tex
- Explicit [PLACEHOLDER] or TBD markers

Inventory: 3 numbers to fill, ~7 interpretive claims to verify, 14 page refs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Response letter: \parskip set to \baselineskip (single blank line between paragraphs)
- requirements-dev.txt: added sentence-transformers and pyarrow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reformatted 80 Python files with black (line length 88).
Applied ruff auto-fixes where applicable.
No functional changes — formatting only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3-Embedding-4B complete: 70.2% (59/84) — worst of 3 models.
Inverse size-accuracy relationship: 81.0% (137M) > 76.2% (568M) > 70.2% (4B).

- Filled all PLACEHOLDERs in main.tex, supplement.tex, response_letter.tex
- Removed all yellow highlights (all claims now verified)
- Updated supplement interpretation with full 3-model findings
- Regenerated all 3 embedding figures with complete data
- All paper documents compile cleanly (31 + 10 + 11 pages)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Results section: expanded with inverse size-accuracy pattern (81% > 76% > 70%),
per-author findings (Qwen fails on Fitzgerald 0/8, Thompson 38.5%),
and cross-reference to supplementary table.

Discussion: added inverse pattern interpretation (larger models may amplify
content similarity at expense of stylistic distinction), Dickens magnet
observation, contrast with author-specific training.

Supplement: updated interpretation with full 3-model findings.
Removed all stale TODO/VERIFIED comments.
Fixed cross-document table reference (tab:embedding-comparison -> Supp Table 1).
All documents compile with zero undefined references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jeremymanning and others added 15 commits March 26, 2026 10:31
…nt bug

- Added embedding comparison figure (Fig. embedding-comparison) to main paper
- Fixed supplement Supp. Fig. 6: wrong file (content_only -> pos)
- Tightened results paragraph: removed redundant methodology recap
- Tightened discussion paragraph: removed repeated accuracy numbers
- All documents compile cleanly (31 + 10 + 11 pages)
- Zero actual undefined references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Line 36: Split long email addresses across 3 lines with \small
- Line 877: Added \small to Austen/Twain corpus table
- Line 897: Added \small to Fitzgerald/Wells corpus table
- Supplement: Fixed POS t-stats figure (was using content_only.pdf)

Zero overfull hbox warnings after fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously 19KB with empty bottom panels (Contested, Non-Oz Baum, Non-Oz
Thompson). Now 192KB with all 6 panels populated correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed \author{} missing closing brace + added \date{}
- Added embedding figure macros (\embeddingpurity, \embeddingconfusion)
- Main text references Supp. Figs 9 and 10 for embedding details
- Supplement: moved embedding figures before table
- Added old.tex (from main branch) for latexdiff
- Updated compile.sh with diff target
- All 4 documents compile (main, supplement, response, diff)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Removed all narrative text from supplement embedding section
  (methods description, notable patterns discussion)
- Added MTEB ranking point to main text results paragraph
- Supplement now contains only figures, tables, and captions
- Added \clearpage before embedding table to enforce ordering
- Figures appear before tables in all supplement sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Used \captionof{figure} instead of figure* floats to force exact placement.
Added caption package. Embedding purity (Fig 9) and confusion (Fig 10) now
appear on page 8, before the embedding table (Table 4) on page 9.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Added old_supplement.tex from main branch for latexdiff
- compile.sh now generates both main and supplement diffs
- Fixed cd issue: cd back to SCRIPT_DIR before compile_diff in 'all' case
- All 5 documents compile: main, supplement, response, diff, diff_supplement

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added 3 new citations (verified via web search):
- Stamatatos (2009): JASIST 60(3), 538-556 — survey of attribution methods
- Stamatatos (2018): JASIST 69(3), 461-473 — topic masking for cross-topic AA
- Fincke & Boschee (2024): arXiv:2408.05192 — cross-genre data selection

Expanded Discussion cross-domain paragraph with:
- Attribution degrades across genres/topics (Stam09, BarlStam20)
- Topic masking helps cross-topic (Stam18)
- Multi-genre pooling alone doesn't help (FincBosc24)
- Oz analysis is preliminary evidence, not controlled evaluation
- Cross-genre/register evaluation identified as future work

Updated response letter to accurately describe the new discussion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed quote 2: added full sentence ending ("because the only text...")
- Updated section references to match renamed sections:
  "Training data requirements" -> "Corpus size analysis"
  "Comparison with text embedding methods" -> "Comparisons between
   predictive comparison and text embedding matching"
- All 3 inline paper quotes verified verbatim against main.tex

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Grammar:
- "GPT-2 implicitly learn" -> "learns" (subject-verb agreement)
Redundancy:
- Removed duplicate "characterize the data requirements" phrase

Also added .specify/ and specs/ to .gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Line 641: comma splice fixed with semicolons (Blogs50; CCAT50; Guardian)
- Line 652: "These would not" -> "These analyses would not" (clarify antecedent)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Page numbers remain yellow-highlighted for easy updating if pages shift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parquet file was 140MB (over GitHub's 100MB limit). Re-serialized the pkl.gz
from parquet under numpy 1.x (98MB, under the limit); the original pkl.gz
required numpy 2.x, so the new file also restores numpy 1.x compatibility.

Updated all references across 15 files (Python, shell, README, gitignore).
All 8 tests pass with the new pkl.gz.
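A conversion along the lines described in this commit might look like the sketch below. The round-trip pattern (read parquet, dump a gzipped pickle, reload it) is the point; the paths and helper names are assumptions, and `read_parquet` requires pyarrow as noted in the dependency changes.

```python
# Sketch: re-serialize a parquet table as gzipped pickle so the artifact
# loads under numpy 1.x and stays under GitHub's 100MB limit.
# File paths and helper names are illustrative assumptions.
import gzip
import pickle
import pandas as pd

def parquet_to_pklgz(parquet_path, pklgz_path):
    df = pd.read_parquet(parquet_path)   # requires pyarrow
    with gzip.open(pklgz_path, "wb") as f:
        # protocol 4 keeps the file loadable on older Python/pickle stacks
        pickle.dump(df, f, protocol=4)

def load_pklgz(pklgz_path):
    with gzip.open(pklgz_path, "rb") as f:
        return pickle.load(f)
```

Note that a pickle written by a numpy 2.x process can embed numpy 2.x internals, which is why re-serializing under the numpy version the rest of the pipeline uses matters here.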

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jeremymanning jeremymanning merged commit 1411b89 into main Mar 28, 2026
3 checks passed