Advanced academic identifier extraction and validation system with comprehensive assessment capabilities.
- Reference Extraction: Extract academic references from DeepSearch results in various formats
- Multi-Phase Identifier Extraction: Extract DOI, PMID, and PMC identifiers from academic URLs using URL patterns, web scraping, and PDF text analysis
- AI-Powered Topic Validation: LLM-based relevance assessment to ensure extracted papers match your research domain (e.g., astrocyte biology)
- Comprehensive Validation Pipeline: Multi-layered validation using format checking, NCBI API verification, and metapub integration
- Detailed Reporting & Visualization: Interactive HTML reports with charts, statistics, and actionable recommendations
- Manual Review Guidance: Systematic sampling strategies and pause-point assessments for quality control
- Unified LLM API: Support for OpenAI, Anthropic, and 100+ other providers via LiteLLM
- Multiple Citation Formats: Handle numbered citations ([1]), author-year (Smith et al., 2024), and plain URLs (see the illustrative snippet after this list)
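As a rough illustration of those input formats, the strings below are made up for this README and are not output from any particular DeepSearch run:

```python
# Made-up reference strings showing the three citation styles handled
raw_references = [
    "[1] https://pubmed.ncbi.nlm.nih.gov/37674083/",                   # numbered citation
    "Smith et al., 2024, https://doi.org/10.1038/s41586-023-06502-w",  # author-year
    "https://pmc.ncbi.nlm.nih.gov/articles/PMC11239014/",              # plain URL
]
```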
```bash
# Clone the repository
git clone https://github.com/dosumis/lit_agent.git
cd lit_agent

# Install with uv (recommended)
uv sync --dev

# Or with pip
pip install -e ".[dev]"
```

```bash
# Copy example environment file
cp .env.example .env
# Edit .env and add your API keys
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
# For academic identifier validation (required for validation features)
NCBI_EMAIL=your_email@domain.com # Should be registered with NCBI
NCBI_API_KEY=your_ncbi_key  # Optional but recommended for higher rate limits
```

Take a DeepSearch-style bibliography (URLs, optionally with `source_id`) and return CSL-JSON keyed by the original reference numbers:
```python
from lit_agent.identifiers import resolve_bibliography

bibliography = [
    {"source_id": "1", "url": "https://pubmed.ncbi.nlm.nih.gov/37674083/"},
    {"source_id": "2", "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC11239014/"},
    {"source_id": "3", "url": "https://doi.org/10.1038/s41586-023-06502-w"},
]

result = resolve_bibliography(
    bibliography,
    validate=True,          # NCBI/metapub validation + metadata fetch
    scrape=False,           # Enable if you want web/PDF scraping for failures
    pdf=False,
    topic_validation=False,
)

print(result.citations["1"]["PMID"])        # "37674083"
print(result.citations["2"]["PMCID"])       # "PMC11239014"
print(result.citations["3"]["DOI"])         # "10.1038/s41586-023-06502-w"
print(result.citations["1"]["resolution"])  # methods, confidence, validation, errors
```

Each citation is CSL-JSON-compatible with a custom `resolution` block:
- `id` is the original `source_id` (or a 1-based string if absent)
- URL, identifiers (DOI/PMID/PMCID), and optional metadata (`title`, `author`, `container-title`, `issued`, etc.)
- `resolution`: `confidence`, `methods`, `validation` statuses, `errors`, `source_url`, optional `canonical_id` (an illustrative entry follows this list)
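For illustration, a resolved entry might look roughly like the following. All field values are invented for this sketch, and the exact shape of the `resolution` block may differ:

```python
# Illustrative only: every value below is invented for this sketch
example_citation = {
    "id": "1",
    "URL": "https://pubmed.ncbi.nlm.nih.gov/12345678/",
    "PMID": "12345678",
    "DOI": "10.1234/example-doi",
    "title": "An Example Paper",
    "author": [{"family": "Doe", "given": "J."}],
    "container-title": "Example Journal",
    "issued": {"date-parts": [[2024]]},
    "resolution": {
        "confidence": 0.95,
        "methods": ["url_pattern", "ncbi_api"],
        "validation": {"ncbi": "ok"},
        "errors": [],
        "source_url": "https://pubmed.ncbi.nlm.nih.gov/12345678/",
        "canonical_id": "PMID:12345678",
    },
}
```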
Render to compact text with citeproc-py (optional dependency):
```bash
uv add --dev citeproc-py
```

```python
from lit_agent.identifiers import render_bibliography_to_strings

rendered, meta = render_bibliography_to_strings(result, style="vancouver")
for line in rendered:
    print(line)  # e.g., "[1] Doe et al. 2024 Example Paper 10.1038/s41586-023-06502-w"
```

If citeproc-py is not installed, the helper falls back to a minimal compact formatter.
Extract DOI, PMID, and PMC identifiers from academic URLs with comprehensive validation:
```python
from lit_agent.identifiers import extract_identifiers_from_bibliography

# Basic extraction from URLs
urls = [
    "https://pubmed.ncbi.nlm.nih.gov/37674083/",
    "https://www.nature.com/articles/s41586-023-06812-z",
    "https://pmc.ncbi.nlm.nih.gov/articles/PMC11239014/",
]

result = extract_identifiers_from_bibliography(
    urls=urls,
    use_web_scraping=True,      # Enable Phase 2 web scraping
    use_api_validation=True,    # Enable NCBI API validation
    use_topic_validation=True,  # Enable LLM topic validation
)

print(f"Found {len(result.identifiers)} identifiers")
print(f"Success rate: {result.success_rate:.1%}")
```

Run a complete validation assessment with detailed reporting and visualizations:
```python
from lit_agent.identifiers.validation_demo import run_validation_assessment_demo

# Run comprehensive validation assessment
report = run_validation_assessment_demo(
    urls=None,  # Uses default astrocyte biology test URLs
    use_topic_validation=True,
    output_dir="validation_reports",
    report_name="my_assessment",
)

# Check validation quality score
print(f"Validation Quality Score: {report['quality_score']}/100")
```

This generates the following outputs (a sketch for loading the JSON report follows the list):
- JSON Report: Complete validation statistics and metadata
- Text Summary: Human-readable assessment with recommendations
- CSV Export: Detailed paper information for spreadsheet analysis
- Interactive HTML: Visual dashboard with charts and insights
- Visualizations: 6 different chart types analyzing validation performance
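The JSON report can also be inspected programmatically. A minimal sketch, assuming the report was written to the `output_dir` used above and that the on-disk JSON mirrors the returned report dict (filenames include a timestamp, so we glob for the latest one):

```python
import glob
import json

# Assumption: reports land in the output_dir used above; pick the newest JSON file
paths = sorted(glob.glob("validation_reports/*.json"))
with open(paths[-1]) as fh:
    report_data = json.load(fh)

print(report_data.get("quality_score"))
```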
Validate that extracted papers are relevant to your research topic:
```python
from lit_agent.identifiers.topic_validator import TopicValidator

validator = TopicValidator()

# Validate a single identifier for astrocyte biology relevance
identifier = result.identifiers[0]
validation_result = validator.validate_identifier(identifier)

print(f"Relevant: {validation_result['is_relevant']}")
print(f"Confidence: {validation_result['confidence']}%")
print(f"Reasoning: {validation_result['reasoning']}")
```

The system provides systematic guidance for manual review:
```python
from lit_agent.identifiers.reporting import ValidationReporter

# Generate paper classifications for manual review
reporter = ValidationReporter()
report = reporter.generate_validation_report(results, "manual_review")

# Papers needing manual review
classifications = report["paper_classifications"]
needs_review = classifications["needs_manual_review"]
low_confidence = classifications["low_confidence_relevant"]

print(f"Papers requiring manual review: {len(needs_review)}")
print(f"Low confidence papers: {len(low_confidence)}")
```

Use the unified LLM API for custom analyses:
```python
from lit_agent.agent_connection import create_agent_from_env

# Create agents from environment variables
agent = create_agent_from_env("anthropic")
response = agent.query("Analyze this paper abstract for astrocyte biology relevance...")
```

Run validation assessments directly from the command line:
```bash
# Run demo with default astrocyte biology URLs
uv run python -m lit_agent.identifiers.validation_demo

# Or run with Python directly
python src/lit_agent/identifiers/validation_demo.py

# Check the generated reports
ls demo_reports/
# validation_demo_20241105_143022.json
# validation_demo_20241105_143022_summary.txt
# validation_demo_20241105_143022_papers.csv
# validation_demo_20241105_143022.html
```

The demo script provides:
- Real-time Progress: Live updates on extraction and validation progress
- Immediate Results: Success rates, identifier counts, and confidence distributions
- Topic Analysis: Relevance assessment for astrocyte biology research
- Actionable Recommendations: Specific suggestions for quality improvement
- Interactive Reports: HTML dashboard with embedded visualizations
1. NCBI API Rate Limiting
```bash
# Error: HTTPSConnectionPool... Read timed out
# Solution: Add NCBI API key and email to .env
NCBI_EMAIL=your_email@domain.com
NCBI_API_KEY=your_ncbi_key
```

2. LLM API Errors
```bash
# Error: No API key provided
# Solution: Verify your .env file has the correct keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

3. Missing Dependencies
```bash
# Error: No module named 'matplotlib'
# Solution: Install visualization dependencies
uv sync --dev
# or
pip install matplotlib beautifulsoup4 pypdf lxml
```

4. Low Validation Quality Scores
- Check Topic Validation: Ensure your research domain matches the built-in astrocyte biology validation
- Review URLs: Verify input URLs are from academic sources
- API Connectivity: Confirm NCBI API access is working
- Manual Review: Use the paper classifications to identify systematic issues
For Large URL Lists:
- Enable caching for topic validation results
- Use batch processing for NCBI API calls
- Consider running validation in parallel chunks (see the sketch after this list)
- Monitor API rate limits and adjust delays
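A minimal sketch of chunked, parallel extraction, assuming each chunk can be processed independently with `extract_identifiers_from_bibliography`; the worker count and chunk size are illustrative and should stay low to respect NCBI rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

from lit_agent.identifiers import extract_identifiers_from_bibliography


def chunked(items, size):
    """Yield successive fixed-size chunks from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_chunk(url_chunk):
    # Each chunk is extracted and validated independently; results are merged below.
    return extract_identifiers_from_bibliography(
        urls=url_chunk,
        use_web_scraping=True,
        use_api_validation=True,
        use_topic_validation=False,  # run topic validation later on the merged results if needed
    )


urls = []  # your large URL list goes here
with ThreadPoolExecutor(max_workers=2) as pool:  # keep low to avoid NCBI rate limiting
    chunk_results = list(pool.map(process_chunk, chunked(urls, 25)))

total = sum(len(r.identifiers) for r in chunk_results)
print(f"Identifiers found across chunks: {total}")
```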
For Custom Research Domains:
- Modify the topic validation prompts in `topic_validator.py`
- Update keyword lists for your specific field
- Adjust confidence thresholds based on domain expertise (see the filtering sketch below)
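For example, a simple post-hoc threshold filter on the topic validation output. This sketch reuses `result` from the extraction example earlier and relies only on the documented `is_relevant` and `confidence` fields; the 70% cutoff is an arbitrary illustration to tune for your domain:

```python
from lit_agent.identifiers.topic_validator import TopicValidator

CONFIDENCE_THRESHOLD = 70  # illustrative cutoff; adjust for your field

validator = TopicValidator()
accepted, needs_review = [], []
for identifier in result.identifiers:
    outcome = validator.validate_identifier(identifier)
    if outcome["is_relevant"] and outcome["confidence"] >= CONFIDENCE_THRESHOLD:
        accepted.append(identifier)
    else:
        needs_review.append(identifier)

print(f"Accepted: {len(accepted)}, flagged for review: {len(needs_review)}")
```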
```bash
# Install development dependencies
uv sync --dev
```
### Testing
```bash
# Run all tests (code quality checks are currently paused; only tests are required)
uv run pytest
# Run only unit tests (fast)
uv run pytest -m unit
# Run integration tests (requires API keys)
uv run pytest -m integration
# Run with coverage
uv run pytest --cov
```
This project follows strict Test-Driven Development with real integration testing (a minimal marker sketch follows the list below):
- Unit Tests: Fast, isolated tests with mocks
- Integration Tests: Real API calls when keys available, mock fallback with warnings
- No Mocks for Integration: Real API testing is prioritized, mocks only as fallback
- Coverage tracking is optional for now
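A minimal sketch of how such tests can be marked; this is illustrative only, and the project's actual test layout, fixtures, and return types may differ:

```python
import os
import warnings

import pytest


@pytest.mark.unit
def test_doi_format_check():
    # Fast, isolated check; no network access required.
    assert "10.1038/s41586-023-06502-w".startswith("10.")


@pytest.mark.integration
def test_anthropic_hello_world():
    # Real API call when a key is available; mock fallback with a warning otherwise.
    if os.getenv("ANTHROPIC_API_KEY"):
        from lit_agent.agent_connection import create_agent_from_env

        agent = create_agent_from_env("anthropic")
        response = agent.query("Say 'Hello, World!'")  # assumes query() returns text
    else:
        warnings.warn("ANTHROPIC_API_KEY not found - falling back to mock test.")
        response = "Hello, World! (mock)"
    assert "Hello" in response
```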
Example integration test output:

```
# With API keys: Real API calls
--- Anthropic Hello World Response (REAL API) ---
The first recorded use of "Hello, World!" to demonstrate a programming language...

# Without API keys: Mock fallback with warning
UserWarning: ANTHROPIC_API_KEY not found - falling back to mock test.
--- Anthropic Hello World Response (MOCK) ---
```

- Phase 1 - URL Pattern Extraction: Fast extraction using regex patterns for known journal URLs
- Phase 2 - Web Scraping: BeautifulSoup-based scraping for meta tags and JSON-LD data
- Phase 3 - PDF Text Analysis: LLM-powered extraction from PDF content when available
- Format Validation: Verify identifier formats (DOI, PMID, PMC patterns); an illustrative regex sketch follows this list
- NCBI API Validation: Real-time verification against PubMed database with metadata retrieval
- Metapub Integration: Cross-validation using metapub library
- Topic Validation: LLM-based assessment of paper relevance to research domains
- Comprehensive Statistics: Success rates, processing times, confidence distributions
- Interactive Visualizations: 6 chart types including confidence histograms, method comparisons, and topic analysis
- Quality Scoring: Data-driven assessment with actionable recommendations
- Manual Review Guidance: Stratified sampling strategies based on confidence scores
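As an illustration of the format-checking layer, the kinds of patterns involved might look like the following. These regexes are a rough sketch, not the library's actual implementation:

```python
import re

# Rough, illustrative format checks (not the library's actual patterns)
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_RE = re.compile(r"^\d{1,8}$")
PMCID_RE = re.compile(r"^PMC\d+$")


def looks_like(identifier: str) -> str | None:
    """Return a best-guess identifier type based on format alone."""
    if DOI_RE.match(identifier):
        return "DOI"
    if PMCID_RE.match(identifier):
        return "PMCID"
    if PMID_RE.match(identifier):
        return "PMID"
    return None


print(looks_like("10.1038/s41586-023-06502-w"))  # DOI
print(looks_like("PMC11239014"))                 # PMCID
print(looks_like("37674083"))                    # PMID
```

Format checks alone only confirm that an identifier is well-formed; the NCBI API and metapub layers confirm it actually resolves to a real record.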
The system provides systematic checkpoints for quality control:
- Validation Quality Score: 0-100 rating based on relevance, confidence, and success rates
- Automated Recommendations: Specific suggestions for improving extraction quality
- Paper Classifications: Systematic categorization for manual review prioritization
- Statistical Robustness: Confidence intervals and sample size recommendations
- LiteLLM Integration: Unified API for 100+ LLM providers
- Environment-based Configuration: API keys via dotenv
- Modular Design: Abstract base classes with concrete implementations
- Error Handling: Comprehensive error handling with meaningful messages
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Write tests first (TDD approach)
- Implement the feature
- Ensure all tests pass (`uv run pytest`)
- (Optional) Run additional checks later if re-enabled
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.