This repository runs a bounded, stage-based pipeline for OSF preprints using DynamoDB as the single source of truth.
Pipeline stages:
- `sync`: ingest preprints from OSF
- `pdf`: download/convert primary files
- `grobid`: generate TEI from PDFs
- `extract`: parse TEI and write references
- `enrich`: fill missing reference DOIs
- `flora`: FLoRA lookup + screening
- `author`: author/email candidate extraction
All stages run as normal Python commands and exit. Scheduling is external (cron or GitHub Actions).
The `flora` stage checks whether originals have replications cited in the FLoRA database (the FORRT Library of Replication Attempts).
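For orientation, a minimal lookup against the FLoRA endpoint might look like the sketch below; the query-parameter name and response shape are assumptions here, and the `flora` stage itself (which also caches results) is authoritative.

```python
# Sketch only: asking FLoRA whether an original has known replication attempts.
# The "doi" query-parameter name and the JSON response shape are assumptions.
import requests

FLORA_LOOKUP_URL = "https://rep-api.forrt.org/v1/original-lookup"  # see config/runtime.toml

resp = requests.get(FLORA_LOOKUP_URL, params={"doi": "<DOI>"}, timeout=30)
resp.raise_for_status()
print(resp.json())  # replication records for the original, if any
```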
Setup:

- Create a virtual environment and install Python dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Install LibreOffice (`soffice`) locally if you need DOCX -> PDF conversion in the `pdf` stage.
- Configure `.env`:

  ```bash
  cp .env.example .env
  ```

- Review committed runtime rules in `config/runtime.toml` (for example `ingest.anchor_date` and the FLoRA endpoint).
- Start local infrastructure services (optional if you use AWS DynamoDB and/or a remote GROBID):

  ```bash
  docker compose up -d dynamodb-local grobid
  ```

- Initialize DynamoDB tables:

  ```bash
  python -c "from osf_sync.db import init_db; init_db(); print('Dynamo tables ready')"
  ```

- Run pipeline stages:
```bash
python -m osf_sync.pipeline run --stage sync --limit 1000
python -m osf_sync.pipeline run --stage pdf --limit 100
python -m osf_sync.pipeline run --stage grobid --limit 50
python -m osf_sync.pipeline run --stage extract --limit 200
python -m osf_sync.pipeline run --stage enrich --limit 300
python -m osf_sync.pipeline run --stage flora --limit-lookup 200 --limit-screen 500
```

Single stage:

```bash
python -m osf_sync.pipeline run --stage <sync|pdf|grobid|extract|enrich|flora|author> [options]
```

Full bounded run:

```bash
python -m osf_sync.pipeline run-all \
  --sync-limit 1000 --pdf-limit 100 --grobid-limit 50 --extract-limit 200 --enrich-limit 300
```

`run-all` includes the `author` stage by default; use `--skip-author` to disable it for a run.
By default, `run-all` keeps local PDF/TEI files during `author`; use `--cleanup-author-files` to allow cleanup.
By default, `author` updates DynamoDB only (no local CSV output). Use `--write-debug-csv` (and optionally `--out`) for local debug snapshots.
Ad-hoc window sync:

```bash
python -m osf_sync.pipeline sync-from-date --start 2025-07-01
```

One-off preprint:

```bash
python -m osf_sync.pipeline fetch-one --id <OSF_ID>
# or
python -m osf_sync.pipeline fetch-one --doi <DOI_OR_URL>
```

Author-cluster randomisation (standalone, not in `run-all`):

```bash
python -m osf_sync.pipeline author-randomize \
  --network-state-key trial:author_network_state
```

Optionally add `--authors-csv <path>` to use an enriched author CSV if available.
Status: this workflow is not yet validated end-to-end in production and should be treated as experimental.
This command processes only unassigned preprints.
If no prior trial network exists, it initializes one from those preprints; otherwise it loads the latest network from DynamoDB and augments it.
Allocations, graph state, and run metadata are stored in DynamoDB trial tables plus `sync_state`.
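Conceptually, loading the persisted network state resembles the sketch below; the partition-key attribute name is an assumption for illustration, not the real trial-table schema.

```python
# Sketch: fetching the stored author-network state from the sync_state table.
# The partition-key attribute name ("key") is assumed for illustration.
import boto3

sync_state = boto3.resource("dynamodb").Table("sync_state")
item = sync_state.get_item(Key={"key": "trial:author_network_state"}).get("Item")
if item is None:
    print("no prior trial network; author-randomize would initialize one")
else:
    print("existing network found; author-randomize would augment it")
```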
Use `--dry-run` to preview candidate processing and allocation counts without writing to DynamoDB:
```bash
python -m osf_sync.pipeline author-randomize --dry-run
```

`python -m osf_sync.cli ...` is now a thin alias to the same pipeline CLI.
Common options:

- `--limit`: max items for the stage.
- `--max-seconds`: stop the stage after N seconds.
- `--dry-run`: compute/select work without executing mutations.
- `--debug`: enable verbose logging.
- `--owner` and `--lease-seconds` (queue stages): override DynamoDB claim ownership/lease duration.
- `--skip-author` (`run-all`): skip author extraction when needed.
- `--cleanup-author-files` (`run-all`): allow author-stage file deletion (off by default).
- `--write-debug-csv` (`author` stage): write a local debug CSV snapshot (`--out` overrides the default path).
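For example, a time-boxed dry run with an explicit claim owner (`worker-1` here is an arbitrary value) combines several of these:

```bash
python -m osf_sync.pipeline run --stage pdf --limit 100 --max-seconds 600 \
  --owner worker-1 --lease-seconds 900 --dry-run
```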
Example `.env` configuration:

```
# local Docker GROBID:
GROBID_URL=http://localhost:8070
# remote GROBID example:
# GROBID_URL=https://grobid.example.org
GROBID_INCLUDE_RAW_CITATIONS=true
DYNAMO_LOCAL_URL=http://localhost:8000
AWS_REGION=eu-north-1
AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
DDB_TABLE_PREPRINTS=preprints
DDB_TABLE_REFERENCES=preprint_references
DDB_TABLE_EXCLUDED_PREPRINTS=excluded_preprints
DDB_TABLE_SYNCSTATE=sync_state
DDB_TABLE_API_CACHE=api_cache
OPENALEX_EMAIL=<PERSONAL_EMAIL_ID>
PDF_DEST_ROOT=./data/preprints
LOG_LEVEL=INFO
OSF_INGEST_SKIP_EXISTING=false
API_CACHE_TTL_MONTHS=6
FLORA_CSV_PATH=./data/flora.csv
PIPELINE_CLAIM_LEASE_SECONDS=1800
```
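As a rough illustration of how these variables interact (not the repository's actual `osf_sync.db` code), local vs. AWS DynamoDB selection could look like:

```python
# Sketch: use DynamoDB Local when DYNAMO_LOCAL_URL is set, otherwise AWS.
# This mirrors the intent of the variables above, not osf_sync.db itself.
import os
import boto3

def dynamo_resource():
    local_url = os.getenv("DYNAMO_LOCAL_URL")
    region = os.getenv("AWS_REGION", "eu-north-1")
    if local_url:
        # DynamoDB Local accepts any credentials, but boto3 still needs a region.
        return boto3.resource("dynamodb", endpoint_url=local_url, region_name=region)
    return boto3.resource("dynamodb", region_name=region)

preprints = dynamo_resource().Table(os.getenv("DDB_TABLE_PREPRINTS", "preprints"))
```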
These non-secret operational rules are committed in git:

```toml
[ingest]
anchor_date = "2026-02-20" # ISO date/timestamp; empty disables date-window filter
window_months = 6
[flora]
original_lookup_url = "https://rep-api.forrt.org/v1/original-lookup"
cache_ttl_hours = 48
```
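A sketch of reading these rules with the Python 3.11+ standard library follows; the pipeline may load them differently.

```python
# Sketch: loading config/runtime.toml with stdlib tomllib (Python 3.11+).
import tomllib
from pathlib import Path

rules = tomllib.loads(Path("config/runtime.toml").read_text())
anchor_date = rules["ingest"]["anchor_date"]      # empty disables the date-window filter
window_months = rules["ingest"]["window_months"]
lookup_url = rules["flora"]["original_lookup_url"]
```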
Use either:

- Cron/systemd timers on a VM, or
- GitHub Actions `schedule` workflows.
Recommended pattern:
- Run each stage independently on a cadence with bounded limits (see the crontab sketch below).
- Allow overlap; claim/lease fields in DynamoDB prevent duplicate processing.
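For example, staggered crontab entries (cadences, paths, and limits here are illustrative) keep each stage bounded:

```
# Each stage processes up to its limit and exits; DynamoDB leases make overlap safe.
*/30 * * * * cd /srv/osf-sync && .venv/bin/python -m osf_sync.pipeline run --stage sync --limit 1000
5 * * * *    cd /srv/osf-sync && .venv/bin/python -m osf_sync.pipeline run --stage pdf --limit 100
35 * * * *   cd /srv/osf-sync && .venv/bin/python -m osf_sync.pipeline run --stage grobid --limit 50
```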
Stage handoff via queue fields:

- `sync` sets `queue_pdf=pending` when eligible.
- `pdf` marks `queue_pdf=done`, `queue_grobid=pending`.
- `grobid` marks `queue_grobid=done`, `queue_extract=pending`.
- `extract` marks `queue_extract=done`.
Queue stages use claim/lease metadata (`claim_*_owner`, `claim_*_until`) and error-tracking fields (`last_error_*`, `retry_count_*`).
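The claim itself is a conditional write. The sketch below shows the general shape for the `pdf` stage; the item key attribute is assumed for illustration, and the real fields follow the `claim_*_owner` / `claim_*_until` convention above.

```python
# Sketch of a lease-style claim via a DynamoDB conditional update.
import time
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("preprints")

def try_claim(preprint_id: str, owner: str, lease_seconds: int = 1800) -> bool:
    """Claim a work item only if it is unclaimed or its previous lease expired."""
    now = int(time.time())
    try:
        table.update_item(
            Key={"id": preprint_id},  # key attribute name assumed for illustration
            UpdateExpression="SET claim_pdf_owner = :o, claim_pdf_until = :u",
            ConditionExpression="attribute_not_exists(claim_pdf_until) OR claim_pdf_until < :now",
            ExpressionAttributeValues={":o": owner, ":u": now + lease_seconds, ":now": now},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another worker holds a live lease
        raise
```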
Use the module entrypoint directly for DOI-matching experiments:

```bash
python -m osf_sync.augmentation.doi_multi_method_lookup --from-db --limit 400 --output doi_multi_method.csv
```