Extracting Training Data for Automated Workflow Generation

Creating structured question-answer training datasets from BRC resources and publications to train LLMs for bioinformatics workflow generation

About This Project

This is a project from the NIAID BRC AI Codeathon 2025, taking place November 12-14, 2025 at Argonne National Laboratory.

Event Website: https://niaid-brc-codeathons.github.io/

Project Details: https://niaid-brc-codeathons.github.io/projects/workflow-training-data-extraction/

Codeathon Goals

The NIAID Bioinformatics Resource Centers (BRCs) invite researchers, data scientists, and developers to a three-day AI Codeathon focused on improving Findability, Accessibility, Interoperability, and Reusability (FAIR-ness) of BRC data and tools using artificial intelligence (AI) and large language models (LLMs).

Getting Started

Sophia PDF Analyzer

AI-powered PDF analysis tool using ALCF's Sophia inference endpoint to extract text, images, and generate scientific insights from research papers.

Quick Start

Install dependencies:
```
pip install -r requirements-sophia.txt
```

Configure authentication:

# Option 1: Environment variable
export SOPHIA_ACCESS_TOKEN="your-globus-access-token"

# Option 2: Config file
cp config/sophia.template.json config/sophia.json
# Edit config/sophia.json and add your token

Get a Globus access token:

# Download auth helper (if not already available)
wget https://raw.githubusercontent.com/argonne-lcf/inference-endpoints/refs/heads/main/inference_auth_token.py

# Authenticate
python scripts/inference_auth_token.py authenticate
export SOPHIA_ACCESS_TOKEN=$(python scripts/inference_auth_token.py get_access_token)

Run the analyzer:

python scripts/sophia_pdf_analyzer.py your_paper.pdf

Documentation

Quick Start Guide: docs/SOPHIA_QUICKSTART.md
Full Documentation: docs/SOPHIA_INTEGRATION.md
PDF Extraction Comparison: docs/PDF_EXTRACTION_COMPARISON.md
Llama-4-Scout Testing: docs/LLAMA4_SCOUT_TESTING.md

Features

PDF Content Extraction: Automatic text and image extraction from multi-page PDFs
AI Analysis: Document summarization and key findings identification
Question Generation: Automatic generation of research questions
Multimodal Support: Text analysis and vision capabilities
Flexible Authentication: Environment variables or config file
Structured Output: JSON output with complete analysis results
Method Comparison: Compare PyMuPDF vs Sophia direct extraction

Comparing Extraction Methods

Compare two PDF extraction approaches:

# Compare PyMuPDF local extraction vs Sophia direct PDF processing
python scripts/compare_pdf_extraction.py paper.pdf --output comparison.json

See PDF Extraction Comparison for detailed analysis of both methods.

Testing Llama-4-Scout Multimodal Model

The Llama-4-Scout-17B-16E-Instruct model is now available with native multimodal capabilities:

✅ Text + Image processing
✅ 10 million token context window
✅ Document extraction and chart analysis
❓ Potential direct PDF support (needs testing)

Test if it supports PDF processing:

# Test Llama-4-Scout's PDF capabilities
python scripts/test_llama4_scout_pdf.py paper.pdf

See Llama-4-Scout Testing Guide for detailed testing instructions and model capabilities.

Team

Team members: Add your team information here.

License

To be determined by the team.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.devcontainer		.devcontainer
data		data
docs		docs
lib/python		lib/python
outbreak_data		outbreak_data
scripts		scripts
templates		templates
txts		txts
.gitignore		.gitignore
CDC_MMWR_Scraper.py		CDC_MMWR_Scraper.py
EuropeanCDC_Fire_ScraperTHISWORKS.py		EuropeanCDC_Fire_ScraperTHISWORKS.py
FireCrawl_Script_Scrape_Symptoms.py		FireCrawl_Script_Scrape_Symptoms.py
HealthMap_ScraperUPDATED.py		HealthMap_ScraperUPDATED.py
ProMED_Config_template.py		ProMED_Config_template.py
ProMED_ScrapeScreen.py		ProMED_ScrapeScreen.py
ProMED_Scraper.py		ProMED_Scraper.py
ProMED_ScraperTHISWORKS.py		ProMED_ScraperTHISWORKS.py
README.md		README.md
README_OUTBREAK_ORCHESTRATION.md		README_OUTBREAK_ORCHESTRATION.md
biothreat_detection.ipynb		biothreat_detection.ipynb
cdc_mmwr_report.json		cdc_mmwr_report.json
data_gatherer_agent.py		data_gatherer_agent.py
data_gathering_plan.json		data_gathering_plan.json
data_gathering_plan.md		data_gathering_plan.md
data_repository_writer.py		data_repository_writer.py
devils_advocate_analysis.md		devils_advocate_analysis.md
devils_advocate_analyzer.py		devils_advocate_analyzer.py
final_outbreak_validation_report.md		final_outbreak_validation_report.md
firecrawl_response_formatter.py		firecrawl_response_formatter.py
firecrawl_validation_agent.py		firecrawl_validation_agent.py
hypothesis_validation_agent.py		hypothesis_validation_agent.py
mmwcs_india_report.json		mmwcs_india_report.json
outbreak_analysis_orchestrator.py		outbreak_analysis_orchestrator.py
outbreak_flagger_argo.py		outbreak_flagger_argo.py
pipeline_summary.md		pipeline_summary.md
potential_outbreaks.md		potential_outbreaks.md
promed_outbreak_summary.json		promed_outbreak_summary.json
prompts.txt		prompts.txt
requirements-sophia.txt		requirements-sophia.txt
requirements.txt		requirements.txt
test_firecrawl_fix.py		test_firecrawl_fix.py
use_firecrawl_mmwcs_india.py		use_firecrawl_mmwcs_india.py
use_firecrawl_other_mmwcs.py		use_firecrawl_other_mmwcs.py
validation_results_temp_1.json		validation_results_temp_1.json
validation_results_temp_2.json		validation_results_temp_2.json
validation_summary.json		validation_summary.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Extracting Training Data for Automated Workflow Generation

About This Project

Codeathon Goals

Getting Started

Sophia PDF Analyzer

Quick Start

Documentation

Features

Comparing Extraction Methods

Testing Llama-4-Scout Multimodal Model

Team

License

About

Uh oh!

Releases

Packages

Contributors 7

Uh oh!

Languages

NIAID-BRC-Codeathons/workflow-training-data-extraction

Folders and files

Latest commit

History

Repository files navigation

Extracting Training Data for Automated Workflow Generation

About This Project

Codeathon Goals

Getting Started

Sophia PDF Analyzer

Quick Start

Documentation

Features

Comparing Extraction Methods

Testing Llama-4-Scout Multimodal Model

Team

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 7

Uh oh!

Languages

Packages