From ac7c0aa75ce04fcbe53cf0989b81499f634c6843 Mon Sep 17 00:00:00 2001
From: Jess Rapson <57367002+rapsoj@users.noreply.github.com>
Date: Tue, 19 May 2026 10:59:49 +0100
Subject: [PATCH] Revise README for clarity and updated project details

Refactor README to enhance clarity and structure, including updates to project goals, pipeline status, and repository structure.
---
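Reviewer note (not part of the commit): the new Insight Stage section says that each fact must carry a supporting quote and that facts failing substring verification are discarded. The sketch below illustrates one way such a check could look. It is an assumption-laden illustration, not the code in `bioscancast/insight/`: `CandidateFact`, `verify_facts`, and the whitespace/case normalization are hypothetical and do not reflect the real `InsightRecord` schema.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateFact:
    """Hypothetical stand-in for an extracted fact (not the real InsightRecord)."""
    claim: str
    supporting_quote: str
    source_url: str


def _normalize(text: str) -> str:
    # Collapse runs of whitespace and lowercase, so that line wraps in the
    # chunk text do not cause a genuine quote to be rejected (an assumption).
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_facts(facts: list[CandidateFact], chunk_text: str) -> list[CandidateFact]:
    """Keep only facts whose supporting quote occurs in the source chunk."""
    haystack = _normalize(chunk_text)
    kept = []
    for fact in facts:
        quote = _normalize(fact.supporting_quote)
        # An empty quote is treated as unverifiable and dropped.
        if quote and quote in haystack:
            kept.append(fact)
    return kept


chunk = "As of 3 May, 70 human cases\nwere confirmed in the US."
facts = [
    CandidateFact("70 confirmed cases", "70 human cases were confirmed", "https://example.org/a"),
    CandidateFact("90 confirmed cases", "90 human cases were confirmed", "https://example.org/a"),
]
verified = verify_facts(facts, chunk)
```

Here only the first fact survives, because the second quote does not occur in the chunk. A stricter variant could skip normalization and require an exact match.
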
 README.md | 775 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 448 insertions(+), 327 deletions(-)
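Reviewer note (not part of the commit): the Search Stage section lists "URL normalization + deduplication" without saying what counts as the same URL. The sketch below shows one plausible set of rules using only the standard library. The rules themselves (dropping `www.`, fragments, trailing slashes, and common tracking parameters) are assumptions for illustration and may differ from what `url_normalization.py` actually does. `normalize_url` and `dedupe_urls` are hypothetical names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the real pipeline may use a different list.
TRACKING_KEYS = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Return a canonical form of the URL for comparison purposes."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # hostname is already lowercased by urlsplit
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters and sort the rest so parameter order is irrelevant.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded because it does not change the fetched document.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


def dedupe_urls(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each URL, compared by canonical form."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


canonical = normalize_url("HTTPS://www.CDC.gov/bird-flu/?utm_source=x&b=2&a=1#top")
unique = dedupe_urls([
    "https://www.who.int/news/",
    "https://who.int/news?utm_medium=email",
    "https://who.int/data",
])
```

With these rules the first two WHO URLs collapse into one entry and the original spelling of the first is kept. Ports and userinfo are ignored in this sketch, which is acceptable for illustration but would need handling in production code.
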

diff --git a/README.md b/README.md
index 6376e50..df4dce3 100644
--- a/README.md
+++ b/README.md
@@ -1,564 +1,685 @@
 # BioScanCast
 
-BioScanCast is an open source pipeline that uses large language models and automated web retrieval to produce forecasts for biosecurity related questions.
+BioScanCast is an open source pipeline for biosecurity forecasting using LLMs and automated web retrieval.
 
-The system gathers information from the internet, filters relevant sources, extracts structured insights, and produces probabilistic forecasts with confidence scores. The project also evaluates model forecasts against human expert forecasts.
+The system retrieves web sources, filters them for credibility and relevance, and extracts structured information. It is intended to support probabilistic forecasting and evaluation against human forecasters.
 
-This repository contains the full pipeline implementation, benchmarking framework, and tooling required to reproduce experiments.
+Current repository contents include:
+
+* modular pipeline stages
+* shared schemas and LLM abstractions
+* retrieval and extraction tooling
+* benchmarking and evaluation infrastructure
+* smoke-test and operational scripts
+
+The forecasting stage is not yet implemented.
 
 ---
 
 # Project Goals
 
 1. Build an open source forecasting system for biosecurity questions.
-2. Benchmark model forecasts against human expert forecasters.
-3. Provide a fully reproducible research pipeline suitable for publication.
-4. Produce accessible outputs including technical documentation and public explanations.
+2. Benchmark model forecasts against human forecasters.
+3. Provide a reproducible research pipeline suitable for publication.
+4. Produce accessible technical and public-facing outputs.
 
 ---
 
-# High Level Pipeline
-
-The system follows a modular pipeline with five stages.
+# Pipeline Status
 
-1. Search stage
-   Collect candidate sources from the internet.
+Implemented:
 
-2. Filtering stage
-   Identify credible and relevant sources.
+```text
+Search -> Filtering -> Extraction -> Insight
+```
 
-3. Extraction stage
-   Retrieve and clean content from selected sources.
+Planned:
 
-4. Insight stage
-   Extract structured information such as events and timelines.
+```text
+Forecasting
+```
 
-5. Forecasting stage
-   Use structured information to generate forecasts and confidence scores.
+Current capabilities include:
 
-Each stage is implemented as an independent module so developers can work on components without affecting the rest of the system.
+* LLM query decomposition
+* web/news retrieval via Tavily
+* heuristic + optional LLM filtering
+* HTML/PDF extraction and chunking
+* hybrid BM25 + embedding retrieval
+* structured fact extraction with provenance tracking
 
 ---
 
-# Repository Structure
+# Pipeline Overview
 
-```
-bioscancast/
-│
-├── bioscancast/
-│   ├── pipeline/
-│   ├── stages/
-│   ├── schemas/
-│   ├── llm/
-│   ├── retrieval/
-│   ├── evaluation/
-│   ├── datasets/
-│   └── utils/
-│
-├── configs/
-├── data/
-├── scripts/
-├── notebooks/
-├── tests/
-└── docs/
-```
+## 1. Search Stage
 
-The sections below describe the purpose of each directory.
+Collect candidate internet sources.
 
----
+Features:
 
-# Core Package
+* LLM query decomposition
+* Tavily retrieval backend
+* source tier scoring
+* dashboard injection
+* URL normalization + deduplication
 
-The `bioscancast/` directory contains the main application code.
+Output:
 
+```python
+List[SearchResult]
 ```
-bioscancast/
-```
-
-This package implements the forecasting pipeline and supporting modules.
 
 ---
 
-# Pipeline
-
-```
-bioscancast/pipeline/
-```
+## 2. Filtering Stage
 
-Responsible for coordinating execution across stages.
-
-Files:
+Identify credible and relevant sources.
 
-**orchestrator.py**
-Controls pipeline flow. Calls each stage sequentially.
+Features:
 
-**pipeline_runner.py**
-Entry point for running the full pipeline.
+* heuristic relevance scoring
+* source credibility scoring
+* duplicate removal
+* optional LLM review
+* extraction-priority assignment
 
-**pipeline_types.py**
-Shared data types used to pass outputs between stages.
+Output:
 
-Developers modifying pipeline order or execution logic should work here.
+```python
+List[FilteredDocument]
+```
 
 ---
 
-# Pipeline Stages
+## 3. Extraction Stage
 
-```
-bioscancast/stages/
-```
+Fetch and normalize source content.
 
-Each stage of the forecasting pipeline is implemented in its own folder.
+Features:
 
-Stages should remain independent and communicate only through defined schemas.
+* HTML/PDF fetching
+* HTML/PDF parsing
+* table extraction
+* chunk normalization
+* metadata extraction
 
----
-
-## Search Stage
+Output:
 
-```
-bioscancast/stages/search_stage/
+```python
+List[Document]
 ```
 
-Purpose:
+---
 
-Generate candidate sources relevant to a forecasting question.
+## 4. Insight Stage
 
-Tasks:
+Convert extracted text into structured facts.
 
-• LLM-based query decomposition (5-8 sub-queries per question)
-• Search execution via Tavily API (swappable backend)
-• Dashboard injection for known pathogens (CDC, WHO, etc.)
-• URL normalization and deduplication
-• Source tier scoring and aggregator flagging
+Features:
 
-Outputs:
+* BM25 retrieval
+* embedding retrieval
+* hybrid reranking
+* structured fact extraction
+* provenance tracking
+* hallucination filtering
+* cross-document deduplication
 
-```
-List[SearchResult]
-```
+Output:
 
-Known limitations:
+```python
+List[InsightRecord]
+```
 
-• English-only for v1. Francophone Africa, Lusophone, and Spanish-speaking
-  regions are systematically under-covered. This must be addressed before
-  live Fall deployment if time allows.
-• Dashboard lookup is hardcoded (v1) and will need iteration after the first
-  benchmark run.
+Design principle:
 
----
+```text
+one chunk -> one extraction call
+```
 
-## Filtering Stage
+Each fact must include:
 
-```
-bioscancast/stages/filtering_stage/
-```
+* supporting quote
+* source chunk
+* source URL
 
-Purpose:
+Facts whose supporting quote fails substring verification against the source chunk are discarded.
 
-Identify credible and relevant sources.
+Current limitations:
 
-Expected tasks:
+* the smoke test runs on synthetic documents rather than extraction-stage outputs
+* no temporal reasoning layer
+* no forecasting integration
 
-• LLM relevance classification
-• source credibility checks
-• removal of duplicate or low quality URLs
+---
 
-Expected outputs:
+## 5. Forecasting Stage
 
-```
-List[FilteredURL]
-```
+Planned but not implemented.
 
-Example modules:
+Intended responsibilities:
 
-url_ranker.py
-source_validator.py
-relevance_model.py
+* probabilistic forecasting
+* calibration
+* confidence estimation
+* forecast evaluation
 
 ---
 
-## Extraction Stage
+# Repository Structure
 
-```
-bioscancast/stages/extraction_stage/
+```text
+bioscancast/
+├── bioscancast/
+│   ├── datasets/
+│   ├── extraction/
+│   ├── filtering/
+│   ├── insight/
+│   ├── llm/
+│   ├── schemas/
+│   ├── stages/
+│   │   ├── search_stage/
+│   │   └── eval_stage/
+│   └── tests/
+├── configs/
+├── data/
+├── docs/
+├── notebooks/
+├── scripts/
+├── tests/
+├── pyproject.toml
+└── requirements.txt
 ```
 
-Purpose:
+---
 
-Retrieve and normalize content from selected sources.
+# Core Modules
 
-Expected tasks:
+| Module                 | Purpose                                    |
+| ---------------------- | ------------------------------------------ |
+| `datasets/`            | curated source registries and source tiers |
+| `extraction/`          | fetching, parsing, chunking                |
+| `filtering/`           | source filtering and ranking               |
+| `insight/`             | retrieval and fact extraction              |
+| `llm/`                 | model abstractions                         |
+| `schemas/`             | shared structured contracts                |
+| `stages/search_stage/` | retrieval stage                            |
+| `stages/eval_stage/`   | evaluation tooling                         |
 
-• scrape HTML pages
-• download PDFs
-• parse documents
-• clean text
+---
 
-Expected outputs:
+# Stage Details
 
-```
-List[Document]
+## Search Stage
+
+```text
+bioscancast/stages/search_stage/
 ```
 
-Example modules:
+Implemented modules:
 
-scraper.py
-html_parser.py
-pdf_parser.py
-text_cleaner.py
+| File                         | Purpose                    |
+| ---------------------------- | -------------------------- |
+| `pipeline.py`                | orchestration              |
+| `query_decomposition.py`     | LLM sub-query generation   |
+| `tier_resolution.py`         | source credibility scoring |
+| `dashboard_lookup.py`        | dashboard injection        |
+| `url_normalization.py`       | canonicalization + dedup   |
+| `backends/tavily_backend.py` | Tavily backend             |
 
-Note on PDF table extraction (Docling refiner):
+Current features:
 
-The extraction stage uses an in-tree PDF parser (PyMuPDF + pdfplumber) as the
-default and a Docling TableFormer post-pass to refine table sections when an
-in-tree result looks broken or when the source URL is on a curated allowlist
-of publishers whose tables are known to be hard (CDC MMWR, certain WHO
-situation reports).
+* 5–8 LLM-generated subqueries
+* backend abstraction via `SearchBackend`
+* source tier + freshness scoring
+* aggregator-domain flagging
+* non-content URL filtering
 
-The first PDF that triggers the refiner downloads the Docling layout and
-TableFormer models (~40 MB) to the HuggingFace cache (`~/.cache/huggingface/`)
-and holds them in memory (~1.5 GB) for the lifetime of the process. The
-feature is toggled with `ExtractionConfig.enable_docling_refiner` — when
-disabled, no Docling imports occur and behaviour matches the pre-refiner
-pipeline exactly.
+Known limitations:
+
+* English-only retrieval
+* hardcoded dashboard mappings
+* Francophone, Lusophone, and Spanish-speaking regions are under-covered as a result
 
 ---
 
-## Insight Stage
+## Filtering Stage
 
-```
-bioscancast/stages/insight_stage/
+```text
+bioscancast/filtering/
 ```
 
-Purpose:
+Implemented modules:
 
-Extract structured information from text.
+| File               | Purpose                        |
+| ------------------ | ------------------------------ |
+| `pipeline.py`      | orchestration                  |
+| `heuristics.py`    | heuristic scoring              |
+| `llm_filter.py`    | LLM adjudication               |
+| `reranker.py`      | borderline reranking           |
+| `deduplication.py` | duplicate handling             |
+| `postprocess.py`   | extraction-priority assignment |
 
-Expected tasks:
+Current features:
 
-• event extraction
-• timeline construction
-• key insight identification
+* heuristic relevance scoring
+* source credibility scoring
+* optional LLM review
+* domain caps
+* extraction-mode assignment
 
-Expected outputs:
+---
 
-```
-DataFrame[InsightRecord]
+## Extraction Stage
+
+```text
+bioscancast/extraction/
 ```
 
-Example modules:
+Implemented modules:
 
-information_extractor.py
-event_parser.py
-timeline_builder.py
-dataframe_builder.py
+| File                     | Purpose                   |
+| ------------------------ | ------------------------- |
+| `pipeline.py`            | orchestration             |
+| `fetcher.py`             | network retrieval         |
+| `chunking.py`            | chunk normalization       |
+| `parsers/html_parser.py` | HTML extraction           |
+| `parsers/pdf_parser.py`  | PDF extraction            |
+| `docling_refiner.py`     | optional table refinement |
 
----
+Current features:
 
-## Forecasting Stage
+* browser-fingerprinted fetching via `curl_cffi`
+* BeautifulSoup + trafilatura HTML parsing
+* PyMuPDF PDF parsing
+* pdfplumber table fallback
+* chunk normalization
+* metadata extraction
+* document-level provenance tracking
 
-```
-bioscancast/stages/forecasting_stage/
-```
+### PDF Table Extraction (Docling Refiner)
 
-Purpose:
+The default PDF pipeline uses PyMuPDF + pdfplumber with an optional Docling TableFormer refinement pass.
 
-Produce probabilistic forecasts based on extracted insights.
+The first refinement run downloads Docling models (~40 MB) into:
 
-Expected tasks:
+```text
+~/.cache/huggingface/
+```
 
-• generate model prompts
-• apply reasoning models
-• calibrate probabilities
-• produce confidence scores
+Models remain resident in memory (~1.5 GB) for the process lifetime.
 
-Expected outputs:
+Controlled via:
 
+```python
+ExtractionConfig.enable_docling_refiner
 ```
-ForecastOutput
-```
 
-Example modules:
+When disabled, no Docling imports occur.
+
+Current limitations:
 
-forecaster.py
-prompt_templates.py
-confidence_calibration.py
+* OCR not implemented
+* scanned PDFs return `requires_ocr`
+* no persistent document store
+* extraction is currently in-memory only
 
 ---
 
-# Schemas
+## Insight Stage
 
-```
-bioscancast/schemas/
+```text
+bioscancast/insight/
 ```
 
-Defines structured data objects shared between pipeline stages.
+Implemented modules:
 
-Examples:
+| File                            | Purpose             |
+| ------------------------------- | ------------------- |
+| `pipeline.py`                   | orchestration       |
+| `retrieval/bm25.py`             | lexical retrieval   |
+| `retrieval/embeddings.py`       | embedding retrieval |
+| `retrieval/hybrid.py`           | hybrid reranking    |
+| `extraction/chunk_extractor.py` | fact extraction     |
 
-search_result.py
-document.py
-extracted_event.py
-forecast_output.py
+Current features:
 
-All stages should communicate using these schemas. Do not pass raw dictionaries between stages.
+* BM25 retrieval
+* embedding similarity retrieval
+* hybrid scoring
+* keyword reranking
+* chunk-level extraction
+* quote-based hallucination guards
+* provenance linking
+* cross-document deduplication
 
 ---
 
-# LLM Integration
+# Evaluation
 
+```text
+bioscancast/stages/eval_stage/
 ```
-bioscancast/llm/
-```
-
-Provides abstraction layers for language models.
 
-Supported providers may include:
+Implemented modules:
 
-• OpenAI
-• Anthropic
-• Local models
+| File               | Purpose                   |
+| ------------------ | ------------------------- |
+| `evaluator.py`     | orchestration             |
+| `scoring.py`       | forecast scoring          |
+| `calibration.py`   | calibration metrics       |
+| `compare.py`       | model vs human comparison |
+| `visualisation.py` | plots and reporting       |
 
-Example files:
+Repository datasets:
 
-llm_client.py
-openai_client.py
-anthropic_client.py
-
-Stages should call these clients rather than directly interacting with APIs.
+```text
+bioscancast_forecasts.csv
+bioscancast_questions.csv
+```
 
 ---
 
-# Retrieval Utilities
+# Schemas
 
+```text
+bioscancast/schemas/
 ```
-bioscancast/retrieval/
-```
 
-Tools for document retrieval and embedding.
+Shared stage contracts.
+
+Key schemas:
+
+| File                | Purpose                      |
+| ------------------- | ---------------------------- |
+| `document.py`       | extracted documents + chunks |
+| `insight_record.py` | extracted facts              |
 
-Examples:
+Additional filtering models live in:
 
-document_store.py
-embedding_model.py
-chunking.py
+```text
+bioscancast/filtering/models.py
+```
+
+including:
+
+* `ForecastQuestion`
+* `SearchResult`
+* `FilteredDocument`
 
-Used by extraction and insight stages.
+Stages should communicate through schemas rather than raw dictionaries.
 
 ---
 
-# Evaluation
+# LLM Integration
 
+```text
+bioscancast/llm/
 ```
-bioscancast/evaluation/
-```
 
-Contains benchmarking and evaluation logic.
+Current files:
+
+| File               | Purpose                            |
+| ------------------ | ---------------------------------- |
+| `base.py`          | shared protocol + token accounting |
+| `client.py`        | legacy/simple OpenAI wrapper       |
+| `openai_client.py` | structured extraction client       |
+| `fake_client.py`   | testing client                     |
 
-Examples:
+The repository currently contains two partially overlapping interfaces:
 
-benchmark_loader.py
-scoring.py
-brier_score.py
-calibration_metrics.py
-human_comparison.py
+```text
+bioscancast/llm/base.py
+bioscancast/llm/client.py
+```
 
-Used to compare model forecasts against human forecasts.
+These should eventually be unified.
 
 ---
 
 # Datasets
 
-```
+```text
 bioscancast/datasets/
 ```
 
-Contains definitions for forecasting datasets and curated source lists.
-
-Examples:
+Curated source definitions and credibility tiers.
 
-forecast_questions.py
-biosecurity_sources.py
+| File                     | Purpose                  |
+| ------------------------ | ------------------------ |
+| `biosecurity_sources.py` | curated source registry  |
+| `source_tiers.py`        | source credibility tiers |
 
 ---
 
-# Utilities
+# Scripts
 
-```
-bioscancast/utils/
+```text
+scripts/
 ```
 
-General purpose helpers used throughout the codebase.
+Operational and smoke-test utilities.
 
-Examples:
+| Script                | Purpose                     |
+| --------------------- | --------------------------- |
+| `run_search_stage.py` | run search stage            |
+| `run_filtering.py`    | run filtering stage         |
+| `run_extraction.py`   | run extraction stage        |
+| `run_insight.py`      | run insight smoke test      |
+| `eval_docling.py`     | Docling evaluation          |
+| `eval_hybrid_pdf.py`  | PDF extraction benchmarking |
 
-logging utilities
-configuration loading
-rate limiting
-caching
+Scripts are intended for operational workflows rather than reusable library APIs.
 
 ---
 
-# Configurations
-
-```
-configs/
-```
+# Running the Pipeline
 
-Configuration files for models, scraping behavior, and pipeline parameters.
+## Environment Setup
 
-Examples:
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
 
-model configuration
-API settings
-scraping limits
-LLM prompt settings
+Additional packages for the OpenAI and Tavily clients and for `.env` loading:
 
-These files allow experimentation without modifying code.
+```bash
+pip install openai tavily-python python-dotenv
+```
 
 ---
 
-# Data
+## Environment Variables
 
-```
-data/
+Create `.env`:
+
+```bash
+OPENAI_API_KEY=sk-...
+TAVILY_API_KEY=tvly-...
 ```
 
-Stores intermediate and benchmark data.
+---
 
-Subdirectories:
+## Search Stage
 
-raw
-original scraped data
+```bash
+python scripts/run_search_stage.py \
+  "Will H5N1 cause more than 100 human cases in the US by December 2026?" \
+  --pathogen h5n1 \
+  --region "United States"
+```
 
-processed
-cleaned datasets
+Optional JSON output:
 
-benchmarks
-forecasting evaluation datasets
+```bash
+python scripts/run_search_stage.py \
+  "How many mpox cases will be reported globally by June 2026?" \
+  --pathogen mpox \
+  --output data/search_results.json
+```
 
 ---
 
-# Scripts
+## Filtering Stage
 
-```
-scripts/
+```bash
+python scripts/run_filtering.py
 ```
 
-Command line tools used to run experiments and pipelines.
+Current limitation:
 
-Examples:
+* uses hardcoded sample inputs rather than automatic search-stage ingestion
 
-run_pipeline.py
-Runs the full forecasting pipeline.
+Output:
 
-run_benchmark.py
-Evaluates model forecasts against benchmark datasets.
+```text
+data/filtered_results.json
+```
 
-scrape_sources.py
-Bulk scraping utility for collecting documents.
+---
 
-evaluate_forecasts.py
-Computes evaluation metrics.
+## Extraction Stage
 
-Scripts are intended for operational tasks rather than reusable code.
+Smoke-test mode:
 
----
+```bash
+python scripts/run_extraction.py
+```
 
-# Notebooks
+Using filtered-document JSON:
 
+```bash
+python scripts/run_extraction.py \
+  --input data/filtered_results.json
 ```
-notebooks/
+
+Output:
+
+```text
+data/extraction_results.json
 ```
 
-Used for exploratory analysis and experimentation.
+---
 
-Examples:
+## Insight Stage
 
-model experiments
-prompt exploration
-benchmark analysis
+```bash
+python scripts/run_insight.py
+```
 
-Notebook code should not be required for the main pipeline.
+Current limitation:
+
+* uses synthetic documents rather than extraction outputs
 
 ---
 
-# Tests
+# Closest Current End-to-End Flow
 
-```
-tests/
+```bash
+python scripts/run_search_stage.py \
+  "Will mpox cases increase in Uganda in 2026?" \
+  --pathogen mpox \
+  --region Uganda \
+  --output data/search_results.json && \
+python scripts/run_filtering.py && \
+python scripts/run_extraction.py \
+  --input data/filtered_results.json
 ```
 
-Unit and integration tests for pipeline components.
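+This chain is not fully connected: `run_filtering.py` currently uses hardcoded sample inputs, so the search output is written but not consumed. `bioscancast/main.py` contains pseudocode only and is not yet a runnable orchestrator.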
 
-Examples:
+---
 
-stage level tests
-pipeline execution tests
-schema validation tests
+# Tests
 
-Each pipeline stage should include test coverage.
+```text
+bioscancast/tests/
+```
 
----
+Includes:
 
-# Documentation
+* extraction tests
+* retrieval tests
+* pipeline tests
+* schema validation
+* search-stage integration tests
 
-```
-docs/
+Run all tests:
+
+```bash
+pytest
 ```
 
-Project documentation and architecture notes.
+Run selected tests:
 
-Examples:
+```bash
+pytest bioscancast/tests/test_extraction_pipeline.py
+pytest bioscancast/tests/test_insight_pipeline.py
+```
 
-system architecture
-pipeline design
-benchmark methodology
-API documentation
+Live fetch tests are marked:
 
-These documents support research publication and developer onboarding.
+```python
+@pytest.mark.live
+```
 
----
+and skipped by default.
 
-# Development Principles
+Run with:
 
-1. Pipeline stages must remain modular.
-2. Data passed between stages must use schemas.
-3. Stages should not import logic from other stages.
-4. Experimental code should live in notebooks or scripts.
-5. Reproducibility is a core requirement.
+```bash
+pytest --live
+```
 
 ---
 
 # Dependencies
 
-Python dependencies are listed in `requirements.txt`. A few notes worth
-flagging for contributors working on the extraction stage:
+Important dependencies:
 
-- **curl_cffi** is used for all HTTP fetches in `bioscancast/extraction/fetcher.py`. It wraps libcurl with browser TLS-fingerprint impersonation (Chrome by default), which lets us reach Cloudflare-fronted news sources (e.g. Reuters) that reject the default Python TLS stack used by `httpx`/`requests` with HTTP 401/403. The impersonation profile is configurable via `ExtractionConfig.impersonate`. See [issue #18](https://github.com/algorithmicgovernance/BioScanCast/issues/18) for background.
-- **Live network tests** are marked `@pytest.mark.live` and skipped by default. Run them with `pytest --live` to verify the fetcher still works against real CDN-fronted endpoints.
+| Dependency   | Usage                               |
+| ------------ | ----------------------------------- |
+| `curl_cffi`  | browser-fingerprinted HTTP fetching |
+| `rank_bm25`  | lexical retrieval                   |
+| `PyMuPDF`    | primary PDF parsing                 |
+| `pdfplumber` | fallback PDF table extraction       |
 
----
+`curl_cffi` is used in:
 
-# Running the Pipeline
+```text
+bioscancast/extraction/fetcher.py
+```
 
-Example:
+The impersonation profile is configurable via:
 
-```
-python scripts/run_pipeline.py
+```python
+ExtractionConfig.impersonate
 ```
 
-Example benchmark run:
+---
 
-```
-python scripts/run_benchmark.py
-```
+# Development Principles
+
+1. Keep pipeline stages modular.
+2. Use schemas between stages.
+3. Prefer structured interfaces over raw dictionaries.
+4. Keep experimental workflows in scripts or notebooks.
+5. Prioritize reproducibility.
+6. Treat provenance and auditability as first-class concerns.
 
 ---
 
-# Contributing
+# Known Architectural Gaps
 
-Developers should work within a single pipeline stage whenever possible. Changes that affect data contracts or schemas should be discussed before merging.
+Major missing components:
 
-All contributions should include tests.
\ No newline at end of file
+1. unified end-to-end orchestrator
+2. extraction → insight integration
+3. OCR support
+4. forecasting stage implementation
+5. persistent storage/vector DB layer
+6. unified LLM abstraction
+7. multilingual retrieval