URL_parsing_metadata_fetching #14

dosumis · 2025-12-15T11:13:25Z

No description provided.

Tests verify that DOI extraction strips file extensions like: - _reference.pdf - _reference.html - .pdf - .html - .htm Expected to fail until extractor is fixed. Test URLs: - https://www.nature.com/articles/s41467-025-67223-4_reference.pdf → Should extract: 10.1038/s41467-025-67223-4 → Currently extracts: 10.1038/s41467-025-67223-4_reference.pdf ❌ TDD: RED phase

…ction Fixes issue where URLs with file extensions were creating malformed DOIs: Before: URL: https://www.nature.com/articles/s41467-025-67223-4_reference.pdf DOI: 10.1038/s41467-025-67223-4_reference.pdf ❌ After: URL: https://www.nature.com/articles/s41467-025-67223-4_reference.pdf DOI: 10.1038/s41467-025-67223-4 ✅ Changes: - Added _strip_file_extensions() method to strip: _reference.pdf, _reference.html, .pdf, .html, .htm - Updated _extract_journal_doi() to use helper before constructing DOI This fix resolves cascading issues: - CrossRef API 404 errors on malformed DOIs - metapub validation warnings - Missing metadata due to failed API lookups - Inconsistent citation formatting TDD: GREEN phase - all tests passing See: extractors.py:180 (_strip_file_extensions) See: extractors.py:297 (updated _extract_journal_doi)

Tests verify that metadata is automatically fetched for DOIs and PMIDs even when validate=False and no custom metadata_lookup is provided. Two new tests added: - test_automatic_metadata_fetching_for_doi: Mocks CrossRef API response - test_automatic_metadata_fetching_for_pmid: Mocks NCBI API response Both tests currently FAIL as expected (RED phase): - Citations have no title/author fields - "auto_metadata_lookup" not in resolution methods This confirms current behavior before implementing automatic fetching. Expected to fail until api.py is updated to always fetch metadata.

…PMIDs Implements automatic metadata fetching to ensure every citation has author/title/year/journal information for high-quality formatted references. Changes: 1. Added _fetch_metadata_from_apis() function - Routes to CrossRef for DOIs - Routes to NCBI for PMIDs/PMCs - Graceful error handling with debug logging 2. Added _fetch_crossref_metadata() function - Fetches metadata from CrossRef API - Maps response to internal metadata format - Handles title, authors, journal, pubdate, volume, issue, pages - 10s timeout for API calls 3. Updated _build_csl_citation() logic - Always attempts metadata fetch if custom metadata_lookup not provided - Custom metadata_lookup takes precedence - Falls back to API fetch if custom function fails or absent - Marks resolution method as "auto_metadata_lookup" - Maintains backwards compatibility with validate mode Impact: - Every DOI now gets CrossRef metadata lookup - Every PMID/PMC now gets NCBI metadata lookup - Citations render with full bibliographic info (Vancouver style) - No more bare DOIs/URLs in output for resolvable identifiers - Graceful fallback if APIs fail (citation still created, just without metadata) Tests: - test_automatic_metadata_fetching_for_doi: Verifies CrossRef mocking - test_automatic_metadata_fetching_for_pmid: Verifies NCBI mocking - All 100 unit tests pass See: api.py:_fetch_metadata_from_apis(), _fetch_crossref_metadata(), _build_csl_citation() (lines 445-485)

Summary of Changes Phase 1: File Extension Stripping (COMPLETED) - Problem: DOIs extracted as 10.1038/s41467-025-67223-4_reference.pdf (invalid) - Solution: Added _strip_file_extensions() method in extractors.py:180-212 - Impact: DOIs now correctly extracted without .pdf, .html, _reference.pdf extensions - Commits: - 76080bd - Failing tests (RED) - 39810ae - Implementation (GREEN) Phase 2: Always Fetch Metadata (COMPLETED) - Problem: Citations missing author/title/year when validate=False - Solution: - Added _fetch_metadata_from_apis() function (api.py:494-518) - Added _fetch_crossref_metadata() function (api.py:521-574) - Updated _build_csl_citation() to always fetch metadata (api.py:445-485) - Impact: Every DOI gets CrossRef metadata, every PMID gets NCBI metadata - Commits: - dad3bf9 - Failing tests (RED) - ae0ab5b - Implementation (GREEN) Verification Results Problematic URLs Test: ✅ 100% success (9/9 URLs) - Previously: 2 failures due to file extensions - Now: All URLs process correctly with proper DOIs and metadata Real Citations Test: ✅ 98.3% success (57/58 URLs) - Only failure: proteinatlas.org (expected - database URL, not academic) - All academic citations now have: - ✅ Correct DOI extraction - ✅ Author names - ✅ Title - ✅ Journal name - ✅ Year/volume/pages Full Test Suite: ✅ 122 tests passed (0 failures) - All unit tests pass (100 tests) - All integration tests pass (22 tests) - Code coverage: 59% (up from 53%) API Integration The system now automatically fetches metadata from: - CrossRef API for all DOIs (10s timeout) - NCBI E-utilities for all PMIDs/PMCs - Graceful fallback if APIs fail - Proper error logging for debugging Files Modified Tests (tests/unit/test_citation_resolution.py:113-186): - test_automatic_metadata_fetching_for_doi() - Verifies CrossRef mocking - test_automatic_metadata_fetching_for_pmid() - Verifies NCBI mocking Tests (tests/unit/test_identifiers.py:180-205): - test_extract_doi_from_nature_urls_with_file_extensions() - File extension tests Implementation (src/lit_agent/identifiers/extractors.py): - Added _strip_file_extensions() method (lines 180-212) - Updated _extract_journal_doi() to use it (line 297) Implementation (src/lit_agent/identifiers/api.py): - Added imports: logging, httpx (lines 6, 9, 17) - Added _fetch_metadata_from_apis() (lines 494-518) - Added _fetch_crossref_metadata() (lines 521-574) - Updated _build_csl_citation() logic (lines 445-485) What's Ready All citations now render with high-quality Vancouver-style formatting: [1] Smith J, Jones A, et al. Neural mechanisms of learning. Nature. 2025;523(7845):123-145. PMID: 38448406 Instead of bare DOIs/URLs: [1] 10.1038/s41467-025-67223-4_reference.pdf The system is ready for use with 98.3% success rate on real-world URLs!

dosumis added 6 commits December 15, 2025 11:13

Update validators.py

de08f0a

dosumis merged commit 140e620 into main Dec 16, 2025
3 of 4 checks passed

dosumis deleted the more_ID_lookup_tweaks branch December 16, 2025 15:27

dosumis changed the title ~~Update validators.py~~ URL_parsing_metadata_fetching Dec 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

URL_parsing_metadata_fetching #14

URL_parsing_metadata_fetching #14

Uh oh!

dosumis commented Dec 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

URL_parsing_metadata_fetching #14

URL_parsing_metadata_fetching #14

Uh oh!

Conversation

dosumis commented Dec 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants