From ac7c0aa75ce04fcbe53cf0989b81499f634c6843 Mon Sep 17 00:00:00 2001
From: Jess Rapson <57367002+rapsoj@users.noreply.github.com>
Date: Tue, 19 May 2026 10:59:49 +0100
Subject: [PATCH] Revise README for clarity and updated project details

Refactor README to enhance clarity and structure, including updates to project goals, pipeline status, and repository structure.
---
 README.md | 775 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 448 insertions(+), 327 deletions(-)

diff --git a/README.md b/README.md
index 6376e50..df4dce3 100644
--- a/README.md
+++ b/README.md
@@ -1,564 +1,685 @@
 # BioScanCast
-BioScanCast is an open source pipeline that uses large language models and automated web retrieval to produce forecasts for biosecurity related questions.
+BioScanCast is an open source pipeline for biosecurity forecasting using LLMs and automated web retrieval.
-The system gathers information from the internet, filters relevant sources, extracts structured insights, and produces probabilistic forecasts with confidence scores. The project also evaluates model forecasts against human expert forecasts.
+The system retrieves internet sources, filters relevant documents, extracts structured information, and is intended to support probabilistic forecasting and evaluation against human forecasters.
-This repository contains the full pipeline implementation, benchmarking framework, and tooling required to reproduce experiments.
+Current repository contents include:
+
+* modular pipeline stages
+* shared schemas and LLM abstractions
+* retrieval and extraction tooling
+* benchmarking and evaluation infrastructure
+* smoke-test and operational scripts
+
+The forecasting stage is not yet implemented.
 ---
 # Project Goals
 1. Build an open source forecasting system for biosecurity questions.
-2. Benchmark model forecasts against human expert forecasters.
-3. Provide a fully reproducible research pipeline suitable for publication.
-4. Produce accessible outputs including technical documentation and public explanations.
+2. Benchmark model forecasts against human forecasters.
+3. Provide a reproducible research pipeline suitable for publication.
+4. Produce accessible technical and public-facing outputs.
 ---
-# High Level Pipeline
-
-The system follows a modular pipeline with five stages.
+# Pipeline Status
-1. Search stage
- Collect candidate sources from the internet.
+Implemented:
-2. Filtering stage
- Identify credible and relevant sources.
+```text
+Search -> Filtering -> Extraction -> Insight
+```
-3. Extraction stage
- Retrieve and clean content from selected sources.
+Planned:
-4. Insight stage
- Extract structured information such as events and timelines.
+```text
+Forecasting
+```
-5. Forecasting stage
- Use structured information to generate forecasts and confidence scores.
+Current capabilities include:
-Each stage is implemented as an independent module so developers can work on components without affecting the rest of the system.
+* LLM query decomposition
+* web/news retrieval via Tavily
+* heuristic + optional LLM filtering
+* HTML/PDF extraction and chunking
+* hybrid BM25 + embedding retrieval
+* structured fact extraction with provenance tracking
 ---
-# Repository Structure
+# Pipeline Overview
-```
-bioscancast/
-│
-├── bioscancast/
-│   ├── pipeline/
-│   ├── stages/
-│   ├── schemas/
-│   ├── llm/
-│   ├── retrieval/
-│   ├── evaluation/
-│   ├── datasets/
-│   └── utils/
-│
-├── configs/
-├── data/
-├── scripts/
-├── notebooks/
-├── tests/
-└── docs/
-```
+## 1. Search Stage
-The sections below describe the purpose of each directory.
+Collect candidate internet sources.
----
+Features:
-# Core Package
+* LLM query decomposition
+* Tavily retrieval backend
+* source tier scoring
+* dashboard injection
+* URL normalization + deduplication
-The `bioscancast/` directory contains the main application code.
+Output:
+```python
+List[SearchResult]
 ```
-bioscancast/
-```
-
-This package implements the forecasting pipeline and supporting modules.
 ---
-# Pipeline
-
-```
-bioscancast/pipeline/
-```
+## 2. Filtering Stage
-Responsible for coordinating execution across stages.
-
-Files:
+Identify credible and relevant sources.
-**orchestrator.py**
-Controls pipeline flow. Calls each stage sequentially.
+Features:
-**pipeline_runner.py**
-Entry point for running the full pipeline.
+* heuristic relevance scoring
+* source credibility scoring
+* duplicate removal
+* optional LLM review
+* extraction-priority assignment
-**pipeline_types.py**
-Shared data types used to pass outputs between stages.
+Output:
-Developers modifying pipeline order or execution logic should work here.
+```python
+List[FilteredDocument]
+```
 ---
-# Pipeline Stages
+## 3. Extraction Stage
-```
-bioscancast/stages/
-```
+Fetch and normalize source content.
-Each stage of the forecasting pipeline is implemented in its own folder.
+Features:
-Stages should remain independent and communicate only through defined schemas.
+* HTML/PDF fetching
+* HTML/PDF parsing
+* table extraction
+* chunk normalization
+* metadata extraction
----
-
-## Search Stage
+Output:
-```
-bioscancast/stages/search_stage/
+```python
+List[Document]
 ```
-Purpose:
+---
-Generate candidate sources relevant to a forecasting question.
+## 4. Insight Stage
-Tasks:
+Convert extracted text into structured facts.
-• LLM-based query decomposition (5-8 sub-queries per question)
-• Search execution via Tavily API (swappable backend)
-• Dashboard injection for known pathogens (CDC, WHO, etc.)
-• URL normalization and deduplication
-• Source tier scoring and aggregator flagging
+Features:
-Outputs:
+* BM25 retrieval
+* embedding retrieval
+* hybrid reranking
+* structured fact extraction
+* provenance tracking
+* hallucination filtering
+* cross-document deduplication
-```
-List[SearchResult]
-```
+Output:
-Known limitations:
+```python
+List[InsightRecord]
+```
-• English-only for v1. Francophone Africa, Lusophone, and Spanish-speaking
- regions are systematically under-covered. This must be addressed before
- live Fall deployment if time allows.
-• Dashboard lookup is hardcoded (v1) and will need iteration after the first
- benchmark run.
+Design principle:
----
+```text
+one chunk -> one extraction call
+```
-## Filtering Stage
+Each fact must include:
-```
-bioscancast/stages/filtering_stage/
-```
+* supporting quote
+* source chunk
+* source URL
-Purpose:
+Facts failing substring verification are discarded.
-Identify credible and relevant sources.
+Current limitations:
-Expected tasks:
+* disconnected from extraction outputs in smoke tests
+* no temporal reasoning layer
+* no forecasting integration
-• LLM relevance classification
-• source credibility checks
-• removal of duplicate or low quality URLs
+---
-Expected outputs:
+## 5. Forecasting Stage
-```
-List[FilteredURL]
-```
+Planned but not implemented.
-Example modules:
+Intended responsibilities:
-url_ranker.py
-source_validator.py
-relevance_model.py
+* probabilistic forecasting
+* calibration
+* confidence estimation
+* forecast evaluation
 ---
-## Extraction Stage
+# Repository Structure
-```
-bioscancast/stages/extraction_stage/
+```text
+bioscancast/
+├── bioscancast/
+│   ├── datasets/
+│   ├── extraction/
+│   ├── filtering/
+│   ├── insight/
+│   ├── llm/
+│   ├── schemas/
+│   ├── stages/
+│   │   ├── search_stage/
+│   │   └── eval_stage/
+│   └── tests/
+├── configs/
+├── data/
+├── docs/
+├── notebooks/
+├── scripts/
+├── tests/
+├── pyproject.toml
+└── requirements.txt
 ```
-Purpose:
+---
-Retrieve and normalize content from selected sources.
+# Core Modules
-Expected tasks:
+| Module | Purpose |
+| ---------------------- | ------------------------------------------ |
+| `datasets/` | curated source registries and source tiers |
+| `extraction/` | fetching, parsing, chunking |
+| `filtering/` | source filtering and ranking |
+| `insight/` | retrieval and fact extraction |
+| `llm/` | model abstractions |
+| `schemas/` | shared structured contracts |
+| `stages/search_stage/` | retrieval stage |
+| `stages/eval_stage/` | evaluation tooling |
-• scrape HTML pages
-• download PDFs
-• parse documents
-• clean text
+---
-Expected outputs:
+# Stage Details
-```
-List[Document]
+## Search Stage
+
+```text
+bioscancast/stages/search_stage/
 ```
-Example modules:
+Implemented modules:
-scraper.py
-html_parser.py
-pdf_parser.py
-text_cleaner.py
+| File | Purpose |
+| ---------------------------- | -------------------------- |
+| `pipeline.py` | orchestration |
+| `query_decomposition.py` | LLM sub-query generation |
+| `tier_resolution.py` | source credibility scoring |
+| `dashboard_lookup.py` | dashboard injection |
+| `url_normalization.py` | canonicalization + dedup |
+| `backends/tavily_backend.py` | Tavily backend |
-Note on PDF table extraction (Docling refiner):
+Current features:
-The extraction stage uses an in-tree PDF parser (PyMuPDF + pdfplumber) as the
-default and a Docling TableFormer post-pass to refine table sections when an
-in-tree result looks broken or when the source URL is on a curated allowlist
-of publishers whose tables are known to be hard (CDC MMWR, certain WHO
-situation reports).
+* 5–8 LLM-generated subqueries
+* backend abstraction via `SearchBackend`
+* source tier + freshness scoring
+* aggregator-domain flagging
+* non-content URL filtering
-The first PDF that triggers the refiner downloads the Docling layout and
-TableFormer models (~40 MB) to the HuggingFace cache (`~/.cache/huggingface/`)
-and holds them in memory (~1.5 GB) for the lifetime of the process. The
-feature is toggled with `ExtractionConfig.enable_docling_refiner` — when
-disabled, no Docling imports occur and behaviour matches the pre-refiner
-pipeline exactly.
+Known limitations:
+
+* English-only retrieval
+* hardcoded dashboard mappings
+* no multilingual retrieval
 ---
-## Insight Stage
+## Filtering Stage
-```
-bioscancast/stages/insight_stage/
+```text
+bioscancast/filtering/
 ```
-Purpose:
+Implemented modules:
-Extract structured information from text.
+| File | Purpose |
+| ------------------ | ------------------------------ |
+| `pipeline.py` | orchestration |
+| `heuristics.py` | heuristic scoring |
+| `llm_filter.py` | LLM adjudication |
+| `reranker.py` | borderline reranking |
+| `deduplication.py` | duplicate handling |
+| `postprocess.py` | extraction-priority assignment |
-Expected tasks:
+Current features:
-• event extraction
-• timeline construction
-• key insight identification
+* heuristic relevance scoring
+* source credibility scoring
+* optional LLM review
+* domain caps
+* extraction-mode assignment
-Expected outputs:
+---
-```
-DataFrame[InsightRecord]
+## Extraction Stage
+
+```text
+bioscancast/extraction/
 ```
-Example modules:
+Implemented modules:
-information_extractor.py
-event_parser.py
-timeline_builder.py
-dataframe_builder.py
+| File | Purpose |
+| ------------------------ | ------------------------- |
+| `pipeline.py` | orchestration |
+| `fetcher.py` | network retrieval |
+| `chunking.py` | chunk normalization |
+| `parsers/html_parser.py` | HTML extraction |
+| `parsers/pdf_parser.py` | PDF extraction |
+| `docling_refiner.py` | optional table refinement |
----
+Current features:
-## Forecasting Stage
+* browser-fingerprinted fetching via `curl_cffi`
+* BeautifulSoup + trafilatura HTML parsing
+* PyMuPDF PDF parsing
+* pdfplumber table fallback
+* chunk normalization
+* metadata extraction
+* document-level provenance tracking
-```
-bioscancast/stages/forecasting_stage/
-```
+### PDF Table Extraction (Docling Refiner)
-Purpose:
+The default PDF pipeline uses PyMuPDF + pdfplumber with an optional Docling TableFormer refinement pass.
-Produce probabilistic forecasts based on extracted insights.
+The first refinement run downloads Docling models (~40 MB) into:
-Expected tasks:
+```text
+~/.cache/huggingface/
+```
-• generate model prompts
-• apply reasoning models
-• calibrate probabilities
-• produce confidence scores
+Models remain resident in memory (~1.5 GB) for the process lifetime.
-Expected outputs:
+Controlled via:
+```python
+ExtractionConfig.enable_docling_refiner
 ```
-ForecastOutput
-```
-Example modules:
+When disabled, no Docling imports occur.
+
+Current limitations:
-forecaster.py
-prompt_templates.py
-confidence_calibration.py
+* OCR not implemented
+* scanned PDFs return `requires_ocr`
+* no persistent document store
+* extraction is currently in-memory only
 ---
-# Schemas
+## Insight Stage
-```
-bioscancast/schemas/
+```text
+bioscancast/insight/
 ```
-Defines structured data objects shared between pipeline stages.
+Implemented modules:
-Examples:
+| File | Purpose |
+| ------------------------------- | ------------------- |
+| `pipeline.py` | orchestration |
+| `retrieval/bm25.py` | lexical retrieval |
+| `retrieval/embeddings.py` | embedding retrieval |
+| `retrieval/hybrid.py` | hybrid reranking |
+| `extraction/chunk_extractor.py` | fact extraction |
-search_result.py
-document.py
-extracted_event.py
-forecast_output.py
+Current features:
-All stages should communicate using these schemas. Do not pass raw dictionaries between stages.
+* BM25 retrieval
+* embedding similarity retrieval
+* hybrid scoring
+* keyword reranking
+* chunk-level extraction
+* quote-based hallucination guards
+* provenance linking
+* cross-document deduplication
 ---
-# LLM Integration
+# Evaluation
+```text
+bioscancast/stages/eval_stage/
 ```
-bioscancast/llm/
-```
-
-Provides abstraction layers for language models.
-Supported providers may include:
+Implemented modules:
-• OpenAI
-• Anthropic
-• Local models
+| File | Purpose |
+| ------------------ | ------------------------- |
+| `evaluator.py` | orchestration |
+| `scoring.py` | forecast scoring |
+| `calibration.py` | calibration metrics |
+| `compare.py` | model vs human comparison |
+| `visualisation.py` | plots and reporting |
-Example files:
+Repository datasets:
-llm_client.py
-openai_client.py
-anthropic_client.py
-
-Stages should call these clients rather than directly interacting with APIs.
+```text
+bioscancast_forecasts.csv
+bioscancast_questions.csv
+```
 ---
-# Retrieval Utilities
+# Schemas
+```text
+bioscancast/schemas/
 ```
-bioscancast/retrieval/
-```
-Tools for document retrieval and embedding.
+Shared stage contracts.
+
+Key schemas:
+
+| File | Purpose |
+| ------------------- | ---------------------------- |
+| `document.py` | extracted documents + chunks |
+| `insight_record.py` | extracted facts |
-Examples:
+Additional filtering models live in:
-document_store.py
-embedding_model.py
-chunking.py
+```text
+bioscancast/filtering/models.py
+```
+
+including:
+
+* `ForecastQuestion`
+* `SearchResult`
+* `FilteredDocument`
-Used by extraction and insight stages.
+Stages should communicate through schemas rather than raw dictionaries.
 ---
-# Evaluation
+# LLM Integration
+```text
+bioscancast/llm/
 ```
-bioscancast/evaluation/
-```
-Contains benchmarking and evaluation logic.
+Current files:
+
+| File | Purpose |
+| ------------------ | ---------------------------------- |
+| `base.py` | shared protocol + token accounting |
+| `client.py` | legacy/simple OpenAI wrapper |
+| `openai_client.py` | structured extraction client |
+| `fake_client.py` | testing client |
-Examples:
+The repository currently contains two partially overlapping interfaces:
-benchmark_loader.py
-scoring.py
-brier_score.py
-calibration_metrics.py
-human_comparison.py
+```text
+bioscancast/llm/base.py
+bioscancast/llm/client.py
+```
-Used to compare model forecasts against human forecasts.
+These should eventually be unified.
 ---
 # Datasets
-```
+```text
 bioscancast/datasets/
 ```
-Contains definitions for forecasting datasets and curated source lists.
-
-Examples:
+Curated source definitions and credibility tiers.
-forecast_questions.py
-biosecurity_sources.py
+| File | Purpose |
+| ------------------------ | ------------------------ |
+| `biosecurity_sources.py` | curated source registry |
+| `source_tiers.py` | source credibility tiers |
 ---
-# Utilities
+# Scripts
-```
-bioscancast/utils/
+```text
+scripts/
 ```
-General purpose helpers used throughout the codebase.
+Operational and smoke-test utilities.
-Examples:
+| Script | Purpose |
+| --------------------- | --------------------------- |
+| `run_search_stage.py` | run search stage |
+| `run_filtering.py` | run filtering stage |
+| `run_extraction.py` | run extraction stage |
+| `run_insight.py` | run insight smoke test |
+| `eval_docling.py` | Docling evaluation |
+| `eval_hybrid_pdf.py` | PDF extraction benchmarking |
-logging utilities
-configuration loading
-rate limiting
-caching
+Scripts are intended for operational workflows rather than reusable library APIs.
 ---
-# Configurations
-
-```
-configs/
-```
+# Running the Pipeline
-Configuration files for models, scraping behavior, and pipeline parameters.
+## Environment Setup
-Examples:
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
-model configuration
-API settings
-scraping limits
-LLM prompt settings
+Additional packages:
-These files allow experimentation without modifying code.
+```bash
+pip install openai tavily-python python-dotenv
+```
 ---
-# Data
+## Environment Variables
-```
-data/
+Create `.env`:
+
+```bash
+OPENAI_API_KEY=sk-...
+TAVILY_API_KEY=tvly-...
 ```
-Stores intermediate and benchmark data.
+---
-Subdirectories:
+## Search Stage
-raw
-original scraped data
+```bash
+python scripts/run_search_stage.py \
+ "Will H5N1 cause more than 100 human cases in the US by December 2026?" \
+ --pathogen h5n1 \
+ --region "United States"
+```
-processed
-cleaned datasets
+Optional JSON output:
-benchmarks
-forecasting evaluation datasets
+```bash
+python scripts/run_search_stage.py \
+ "How many mpox cases will be reported globally by June 2026?" \
+ --pathogen mpox \
+ --output data/search_results.json
+```
 ---
-# Scripts
+## Filtering Stage
-```
-scripts/
+```bash
+python scripts/run_filtering.py
 ```
-Command line tools used to run experiments and pipelines.
+Current limitation:
-Examples:
+* uses hardcoded sample inputs rather than automatic search-stage ingestion
-run_pipeline.py
-Runs the full forecasting pipeline.
+Output:
-run_benchmark.py
-Evaluates model forecasts against benchmark datasets.
+```text
+data/filtered_results.json
+```
-scrape_sources.py
-Bulk scraping utility for collecting documents.
+---
-evaluate_forecasts.py
-Computes evaluation metrics.
+## Extraction Stage
-Scripts are intended for operational tasks rather than reusable code.
+Smoke-test mode:
----
+```bash
+python scripts/run_extraction.py
+```
-# Notebooks
+Using filtered-document JSON:
+```bash
+python scripts/run_extraction.py \
+ --input data/filtered_results.json
 ```
-notebooks/
+
+Output:
+
+```text
+data/extraction_results.json
 ```
-Used for exploratory analysis and experimentation.
+---
-Examples:
+## Insight Stage
-model experiments
-prompt exploration
-benchmark analysis
+```bash
+python scripts/run_insight.py
+```
-Notebook code should not be required for the main pipeline.
+Current limitation:
+
+* uses synthetic documents rather than extraction outputs
 ---
-# Tests
+# Closest Current End-to-End Flow
-```
-tests/
+```bash
+python scripts/run_search_stage.py \
+ "Will mpox cases increase in Uganda in 2026?" \
+ --pathogen mpox \
+ --region Uganda \
+ --output data/search_results.json && \
+python scripts/run_filtering.py && \
+python scripts/run_extraction.py \
+ --input data/filtered_results.json
 ```
-Unit and integration tests for pipeline components.
+`bioscancast/main.py` currently contains pseudocode only and is not yet a runnable orchestrator.
-Examples:
+---
-stage level tests
-pipeline execution tests
-schema validation tests
+# Tests
-Each pipeline stage should include test coverage.
+```text
+bioscancast/tests/
+```
----
+Includes:
-# Documentation
+* extraction tests
+* retrieval tests
+* pipeline tests
+* schema validation
+* search-stage integration tests
-```
-docs/
+Run all tests:
+
+```bash
+pytest
 ```
-Project documentation and architecture notes.
+Run selected tests:
-Examples:
+```bash
+pytest bioscancast/tests/test_extraction_pipeline.py
+pytest bioscancast/tests/test_insight_pipeline.py
+```
-system architecture
-pipeline design
-benchmark methodology
-API documentation
+Live fetch tests are marked:
-These documents support research publication and developer onboarding.
+```python
+@pytest.mark.live
+```
----
+and skipped by default.
-# Development Principles
+Run with:
-1. Pipeline stages must remain modular.
-2. Data passed between stages must use schemas.
-3. Stages should not import logic from other stages.
-4. Experimental code should live in notebooks or scripts.
-5. Reproducibility is a core requirement.
+```bash
+pytest --live
+```
 ---
 # Dependencies
-Python dependencies are listed in `requirements.txt`. A few notes worth
-flagging for contributors working on the extraction stage:
+Important dependencies:
-- **curl_cffi** is used for all HTTP fetches in `bioscancast/extraction/fetcher.py`. It wraps libcurl with browser TLS-fingerprint impersonation (Chrome by default), which lets us reach Cloudflare-fronted news sources (e.g. Reuters) that reject the default Python TLS stack used by `httpx`/`requests` with HTTP 401/403. The impersonation profile is configurable via `ExtractionConfig.impersonate`. See [issue #18](https://github.com/algorithmicgovernance/BioScanCast/issues/18) for background.
-- **Live network tests** are marked `@pytest.mark.live` and skipped by default. Run them with `pytest --live` to verify the fetcher still works against real CDN-fronted endpoints.
+| Dependency | Usage |
+| ------------ | ----------------------------------- |
+| `curl_cffi` | browser-fingerprinted HTTP fetching |
+| `rank_bm25` | lexical retrieval |
+| `PyMuPDF` | primary PDF parsing |
+| `pdfplumber` | fallback PDF table extraction |
----
+`curl_cffi` is used in:
-# Running the Pipeline
+```text
+bioscancast/extraction/fetcher.py
+```
-Example:
+The impersonation profile is configurable via:
-```
-python scripts/run_pipeline.py
+```python
+ExtractionConfig.impersonate
 ```
-Example benchmark run:
+---
-```
-python scripts/run_benchmark.py
-```
+# Development Principles
+
+1. Keep pipeline stages modular.
+2. Use schemas between stages.
+3. Prefer structured interfaces over raw dictionaries.
+4. Keep experimental workflows in scripts or notebooks.
+5. Prioritize reproducibility.
+6. Treat provenance and auditability as first-class concerns.
 ---
-# Contributing
+# Known Architectural Gaps
-Developers should work within a single pipeline stage whenever possible. Changes that affect data contracts or schemas should be discussed before merging.
+Major missing components:
-All contributions should include tests.
\ No newline at end of file
+1. unified end-to-end orchestrator
+2. extraction → insight integration
+3. OCR support
+4. forecasting stage implementation
+5. persistent storage/vector DB layer
+6. unified LLM abstraction
+7. multilingual retrieval
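
The revised README's Insight Stage section says that each fact must carry a supporting quote, a source chunk, and a source URL, and that "facts failing substring verification are discarded". A minimal sketch of such a check follows. It is not taken from the repository: the `Fact` dataclass, the `verify_facts` function, and the whitespace- and case-insensitive matching rule are all assumptions made for illustration, and the real `InsightRecord` schema may look different.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fact:
    """Hypothetical stand-in for an insight record; not the repository's schema."""
    claim: str
    supporting_quote: str
    source_chunk: str
    source_url: str


def _normalize(text: str) -> str:
    # Collapse runs of whitespace and lowercase the text.
    # This matching rule is an assumption; the project may compare more strictly.
    return " ".join(text.split()).lower()


def verify_facts(facts: List[Fact]) -> List[Fact]:
    """Keep only facts whose quote appears verbatim (after normalization) in their chunk."""
    kept = []
    for fact in facts:
        quote = _normalize(fact.supporting_quote)
        # An empty quote cannot support a claim, so it is rejected outright.
        if quote and quote in _normalize(fact.source_chunk):
            kept.append(fact)
    return kept
```

Because the check is pure string containment against the chunk the fact was extracted from, it needs no second model call, which fits the "one chunk -> one extraction call" design principle stated in the README.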