Creating structured question-answer training datasets from BRC resources and publications to train LLMs for bioinformatics workflow generation
This is a project from the NIAID BRC AI Codeathon 2025, taking place November 12-14, 2025 at Argonne National Laboratory.
Event Website: https://niaid-brc-codeathons.github.io/
Project Details: https://niaid-brc-codeathons.github.io/projects/workflow-training-data-extraction/
The NIAID Bioinformatics Resource Centers (BRCs) invite researchers, data scientists, and developers to a three-day AI Codeathon focused on improving Findability, Accessibility, Interoperability, and Reusability (FAIR-ness) of BRC data and tools using artificial intelligence (AI) and large language models (LLMs).
AI-powered PDF analysis tool using ALCF's Sophia inference endpoint to extract text, images, and generate scientific insights from research papers.
-
Install dependencies:
pip install -r requirements-sophia.txt
-
Configure authentication:
# Option 1: Environment variable export SOPHIA_ACCESS_TOKEN="your-globus-access-token" # Option 2: Config file cp config/sophia.template.json config/sophia.json # Edit config/sophia.json and add your token
-
Get a Globus access token:
# Download auth helper (if not already available) wget https://raw.githubusercontent.com/argonne-lcf/inference-endpoints/refs/heads/main/inference_auth_token.py # Authenticate python scripts/inference_auth_token.py authenticate export SOPHIA_ACCESS_TOKEN=$(python scripts/inference_auth_token.py get_access_token)
-
Run the analyzer:
python scripts/sophia_pdf_analyzer.py your_paper.pdf
- Quick Start Guide: docs/SOPHIA_QUICKSTART.md
- Full Documentation: docs/SOPHIA_INTEGRATION.md
- PDF Extraction Comparison: docs/PDF_EXTRACTION_COMPARISON.md
- Llama-4-Scout Testing: docs/LLAMA4_SCOUT_TESTING.md
- PDF Content Extraction: Automatic text and image extraction from multi-page PDFs
- AI Analysis: Document summarization and key findings identification
- Question Generation: Automatic generation of research questions
- Multimodal Support: Text analysis and vision capabilities
- Flexible Authentication: Environment variables or config file
- Structured Output: JSON output with complete analysis results
- Method Comparison: Compare PyMuPDF vs Sophia direct extraction
Compare two PDF extraction approaches:
# Compare PyMuPDF local extraction vs Sophia direct PDF processing
python scripts/compare_pdf_extraction.py paper.pdf --output comparison.jsonSee PDF Extraction Comparison for detailed analysis of both methods.
The Llama-4-Scout-17B-16E-Instruct model is now available with native multimodal capabilities:
- ✅ Text + Image processing
- ✅ 10 million token context window
- ✅ Document extraction and chart analysis
- ❓ Potential direct PDF support (needs testing)
Test if it supports PDF processing:
# Test Llama-4-Scout's PDF capabilities
python scripts/test_llama4_scout_pdf.py paper.pdfSee Llama-4-Scout Testing Guide for detailed testing instructions and model capabilities.
Team members: Add your team information here.
To be determined by the team.