This repository runs a bounded, stage-based pipeline for OSF preprints using DynamoDB as the single source of truth.
Pipeline stages:
- `sync`: ingest preprints from OSF
- `pdf`: download/convert primary files
- `grobid`: generate TEI from PDFs
- `extract`: parse TEI and write references
- `enrich`: fill missing reference DOIs
- `flora`: FLoRA lookup + screening
- `author`: author/email candidate extraction
All stages run as normal Python commands and exit. Scheduling is external (cron or GitHub Actions).
The `flora` stage checks whether originals have replications cited in the FLoRA database (the FORRT Library of Replication Attempts).
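For orientation, a minimal lookup against the FLoRA endpoint might look like the sketch below; the query-parameter name and response shape are assumptions here, and the `flora` stage itself (which also caches results) is authoritative.

```python
# Sketch only: asking FLoRA whether an original has known replication attempts.
# The "doi" query-parameter name and the JSON response shape are assumptions.
import requests

FLORA_LOOKUP_URL = "https://rep-api.forrt.org/v1/original-lookup"  # see config/runtime.toml

resp = requests.get(FLORA_LOOKUP_URL, params={"doi": "<DOI>"}, timeout=30)
resp.raise_for_status()
print(resp.json())  # replication records for the original, if any
```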
Setup:

- Create a virtual environment and install Python dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Install LibreOffice (`soffice`) locally if you need DOCX -> PDF conversion in the `pdf` stage.
- Configure `.env`:

  ```bash
  cp .env.example .env
  ```

- Review committed runtime rules in `config/runtime.toml` (for example `ingest.anchor_date` and the FLoRA endpoint).
- Start local infrastructure services (optional if you use AWS DynamoDB and/or a remote GROBID):

  ```bash
  docker compose up -d dynamodb-local grobid
  ```

- Initialize DynamoDB tables:

  ```bash
  python -c "from osf_sync.db import init_db; init_db(); print('Dynamo tables ready')"
  ```

- Run pipeline stages:
```bash
python -m osf_sync.pipeline run --stage sync --limit 1000
python -m osf_sync.pipeline run --stage pdf --limit 100
python -m osf_sync.pipeline run --stage grobid --limit 50
python -m osf_sync.pipeline run --stage extract --limit 200
python -m osf_sync.pipeline run --stage enrich --limit 300
python -m osf_sync.pipeline run --stage flora --limit-lookup 200 --limit-screen 500
```

Single stage:

```bash
python -m osf_sync.pipeline run --stage <sync|pdf|grobid|extract|enrich|flora|author> [options]
```

Full bounded run:

```bash
python -m osf_sync.pipeline run-all \
  --sync-limit 1000 --pdf-limit 100 --grobid-limit 50 --extract-limit 200 --enrich-limit 300
```

`run-all` includes the `author` stage by default; use `--skip-author` to disable it for a run.
By default, `run-all` keeps local PDF/TEI files during `author`; use `--cleanup-author-files` to allow cleanup.
By default, `author` updates DynamoDB only (no local CSV output). Use `--write-debug-csv` (and optionally `--out`) for local debug snapshots.
Ad-hoc window sync:

```bash
python -m osf_sync.pipeline sync-from-date --start 2025-07-01
```

One-off preprint:

```bash
python -m osf_sync.pipeline fetch-one --id <OSF_ID>
# or
python -m osf_sync.pipeline fetch-one --doi <DOI_OR_URL>
```

Author-cluster randomisation (standalone, not in `run-all`):

```bash
python -m osf_sync.pipeline author-randomize \
  --network-state-key trial:author_network_state
```

Optionally add `--authors-csv <path>` to use an enriched author CSV if available.
Status: this workflow is not yet validated end-to-end in production and should be treated as experimental.
This command processes only unassigned preprints.
If no prior trial network exists, it initializes one from those preprints; otherwise it loads the latest network from DynamoDB and augments it.
Allocations, graph state, and run metadata are stored in DynamoDB trial tables plus `sync_state`.
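Conceptually, loading the persisted network state resembles the sketch below; the partition-key attribute name is an assumption for illustration, not the real trial-table schema.

```python
# Sketch: fetching the stored author-network state from the sync_state table.
# The partition-key attribute name ("key") is assumed for illustration.
import boto3

sync_state = boto3.resource("dynamodb").Table("sync_state")
item = sync_state.get_item(Key={"key": "trial:author_network_state"}).get("Item")
if item is None:
    print("no prior trial network; author-randomize would initialize one")
else:
    print("existing network found; author-randomize would augment it")
```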
Use `--dry-run` to preview candidate processing and allocation counts without writing to DynamoDB:
```bash
python -m osf_sync.pipeline author-randomize --dry-run
```

`python -m osf_sync.cli ...` is now a thin alias to the same pipeline CLI.
Common options:

- `--limit`: max items for the stage.
- `--max-seconds`: stop the stage after N seconds.
- `--dry-run`: compute/select work without executing mutations.
- `--debug`: enable verbose logging.
- `--owner` and `--lease-seconds` (queue stages): override DynamoDB claim ownership/lease duration.
- `--skip-author` (`run-all`): skip author extraction when needed.
- `--cleanup-author-files` (`run-all`): allow author-stage file deletion (off by default).
- `--write-debug-csv` (`author` stage): write a local debug CSV snapshot (`--out` overrides the default path).
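For example, a time-boxed dry run with an explicit claim owner (`worker-1` here is an arbitrary value) combines several of these:

```bash
python -m osf_sync.pipeline run --stage pdf --limit 100 --max-seconds 600 \
  --owner worker-1 --lease-seconds 900 --dry-run
```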
Example `.env` configuration:

```
# local Docker GROBID:
GROBID_URL=http://localhost:8070
# remote GROBID example:
# GROBID_URL=https://grobid.example.org
GROBID_INCLUDE_RAW_CITATIONS=true
DYNAMO_LOCAL_URL=http://localhost:8000
AWS_REGION=eu-north-1
AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
DDB_TABLE_PREPRINTS=preprints
DDB_TABLE_REFERENCES=preprint_references
DDB_TABLE_EXCLUDED_PREPRINTS=excluded_preprints
DDB_TABLE_SYNCSTATE=sync_state
DDB_TABLE_API_CACHE=api_cache
OPENALEX_EMAIL=<PERSONAL_EMAIL_ID>
PDF_DEST_ROOT=./data/preprints
LOG_LEVEL=INFO
OSF_INGEST_SKIP_EXISTING=false
API_CACHE_TTL_MONTHS=6
FLORA_CSV_PATH=./data/flora.csv
PIPELINE_CLAIM_LEASE_SECONDS=1800
```
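As a rough illustration of how these variables interact (not the repository's actual `osf_sync.db` code), local vs. AWS DynamoDB selection could look like:

```python
# Sketch: use DynamoDB Local when DYNAMO_LOCAL_URL is set, otherwise AWS.
# This mirrors the intent of the variables above, not osf_sync.db itself.
import os
import boto3

def dynamo_resource():
    local_url = os.getenv("DYNAMO_LOCAL_URL")
    region = os.getenv("AWS_REGION", "eu-north-1")
    if local_url:
        # DynamoDB Local accepts any credentials, but boto3 still needs a region.
        return boto3.resource("dynamodb", endpoint_url=local_url, region_name=region)
    return boto3.resource("dynamodb", region_name=region)

preprints = dynamo_resource().Table(os.getenv("DDB_TABLE_PREPRINTS", "preprints"))
```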
These non-secret operational rules are committed in git:

```toml
[ingest]
anchor_date = "2026-02-20" # ISO date/timestamp; empty disables date-window filter
window_months = 6
[flora]
original_lookup_url = "https://rep-api.forrt.org/v1/original-lookup"
cache_ttl_hours = 48
```
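A sketch of reading these rules with the Python 3.11+ standard library follows; the pipeline may load them differently.

```python
# Sketch: loading config/runtime.toml with stdlib tomllib (Python 3.11+).
import tomllib
from pathlib import Path

rules = tomllib.loads(Path("config/runtime.toml").read_text())
anchor_date = rules["ingest"]["anchor_date"]      # empty disables the date-window filter
window_months = rules["ingest"]["window_months"]
lookup_url = rules["flora"]["original_lookup_url"]
```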
Use either:

- Cron/systemd timers on a VM, or
- GitHub Actions `schedule` workflows.
Recommended pattern:
- Run each stage independently on a cadence with bounded limits (see the crontab sketch below).
- Allow overlap; claim/lease fields in DynamoDB prevent duplicate processing.
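For example, staggered crontab entries (cadences, paths, and limits here are illustrative) keep each stage bounded:

```
# Each stage processes up to its limit and exits; DynamoDB leases make overlap safe.
*/30 * * * * cd /srv/osf-sync && .venv/bin/python -m osf_sync.pipeline run --stage sync --limit 1000
5 * * * *    cd /srv/osf-sync && .venv/bin/python -m osf_sync.pipeline run --stage pdf --limit 100
35 * * * *   cd /srv/osf-sync && .venv/bin/python -m osf_sync.pipeline run --stage grobid --limit 50
```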
Stage handoff via queue fields:

- `sync` sets `queue_pdf=pending` when eligible.
- `pdf` marks `queue_pdf=done`, `queue_grobid=pending`.
- `grobid` marks `queue_grobid=done`, `queue_extract=pending`.
- `extract` marks `queue_extract=done`.
Queue stages use claim/lease metadata (`claim_*_owner`, `claim_*_until`) and error-tracking fields (`last_error_*`, `retry_count_*`).
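The claim itself is a conditional write. The sketch below shows the general shape for the `pdf` stage; the item key attribute is assumed for illustration, and the real fields follow the `claim_*_owner` / `claim_*_until` convention above.

```python
# Sketch of a lease-style claim via a DynamoDB conditional update.
import time
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("preprints")

def try_claim(preprint_id: str, owner: str, lease_seconds: int = 1800) -> bool:
    """Claim a work item only if it is unclaimed or its previous lease expired."""
    now = int(time.time())
    try:
        table.update_item(
            Key={"id": preprint_id},  # key attribute name assumed for illustration
            UpdateExpression="SET claim_pdf_owner = :o, claim_pdf_until = :u",
            ConditionExpression="attribute_not_exists(claim_pdf_until) OR claim_pdf_until < :now",
            ExpressionAttributeValues={":o": owner, ":u": now + lease_seconds, ":now": now},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another worker holds a live lease
        raise
```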
Use the module entrypoint directly for DOI-matching experiments:

```bash
python -m osf_sync.augmentation.doi_multi_method_lookup --from-db --limit 400 --output doi_multi_method.csv
```