URL_parsing_metadata_fetching #14
Merged
Tests verify that DOI extraction strips file extensions like:
- _reference.pdf
- _reference.html
- .pdf
- .html
- .htm
Expected to fail until extractor is fixed.
Test URLs:
- https://www.nature.com/articles/s41467-025-67223-4_reference.pdf
  → Should extract: 10.1038/s41467-025-67223-4
  → Currently extracts: 10.1038/s41467-025-67223-4_reference.pdf ❌
TDD: RED phase
…ction
Fixes issue where URLs with file extensions were creating malformed DOIs:
Before:
URL: https://www.nature.com/articles/s41467-025-67223-4_reference.pdf
DOI: 10.1038/s41467-025-67223-4_reference.pdf ❌
After:
URL: https://www.nature.com/articles/s41467-025-67223-4_reference.pdf
DOI: 10.1038/s41467-025-67223-4 ✅
Changes:
- Added _strip_file_extensions() method to strip: _reference.pdf, _reference.html, .pdf, .html, .htm
- Updated _extract_journal_doi() to use helper before constructing DOI
This fix resolves cascading issues:
- CrossRef API 404 errors on malformed DOIs
- metapub validation warnings
- Missing metadata due to failed API lookups
- Inconsistent citation formatting
TDD: GREEN phase - all tests passing
See: extractors.py:180 (_strip_file_extensions)
See: extractors.py:297 (updated _extract_journal_doi)
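To illustrate the suffix stripping described in this commit, here is a minimal sketch. It is not the code in extractors.py: the function names `strip_file_extensions` and `nature_doi_from_url` are hypothetical, and the only assumptions taken from the text are the suffix list and the `10.1038/` prefix seen in the example DOI.

```python
from typing import Optional
from urllib.parse import urlparse

# Suffixes named in the commit message. The longer "_reference.*" forms come
# first so they are removed whole instead of leaving a dangling "_reference".
_SUFFIXES = ("_reference.pdf", "_reference.html", ".pdf", ".html", ".htm")


def strip_file_extensions(article_id: str) -> str:
    """Remove one trailing file-extension suffix, if present (case-insensitive)."""
    lowered = article_id.lower()
    for suffix in _SUFFIXES:
        if lowered.endswith(suffix):
            return article_id[: -len(suffix)]
    return article_id


def nature_doi_from_url(url: str) -> Optional[str]:
    """Hypothetical helper: build a 10.1038 DOI from a nature.com article URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "nature.com" and not host.endswith(".nature.com"):
        return None
    prefix = "/articles/"
    if not parsed.path.startswith(prefix):
        return None
    # Strip the suffix before the DOI is assembled, as the commit describes.
    article_id = strip_file_extensions(parsed.path[len(prefix):].strip("/"))
    return f"10.1038/{article_id}" if article_id else None
```

Stripping before the DOI is constructed keeps the malformed value from ever reaching downstream API lookups.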
Tests verify that metadata is automatically fetched for DOIs and PMIDs even when validate=False and no custom metadata_lookup is provided.
Two new tests added:
- test_automatic_metadata_fetching_for_doi: Mocks CrossRef API response
- test_automatic_metadata_fetching_for_pmid: Mocks NCBI API response
Both tests currently FAIL as expected (RED phase):
- Citations have no title/author fields
- "auto_metadata_lookup" not in resolution methods
This confirms current behavior before implementing automatic fetching.
Expected to fail until api.py is updated to always fetch metadata.
…PMIDs
Implements automatic metadata fetching to ensure every citation has
author/title/year/journal information for high-quality formatted references.
Changes:
1. Added _fetch_metadata_from_apis() function
- Routes to CrossRef for DOIs
- Routes to NCBI for PMIDs/PMCs
- Graceful error handling with debug logging
2. Added _fetch_crossref_metadata() function
- Fetches metadata from CrossRef API
- Maps response to internal metadata format
- Handles title, authors, journal, pubdate, volume, issue, pages
- 10s timeout for API calls
3. Updated _build_csl_citation() logic
- Always attempts metadata fetch if custom metadata_lookup not provided
- Custom metadata_lookup takes precedence
- Falls back to API fetch if custom function fails or absent
- Marks resolution method as "auto_metadata_lookup"
- Maintains backwards compatibility with validate mode
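The precedence rules in item 3 can be sketched as follows. This is an illustration under stated assumptions, not the actual `_build_csl_citation()` body: `resolve_metadata`, its parameters, and the `"custom_metadata_lookup"` label are hypothetical; only the `"auto_metadata_lookup"` label comes from the commit message.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)

Lookup = Callable[[str], Optional[dict]]


def resolve_metadata(
    identifier: str,
    custom_lookup: Optional[Lookup],
    api_fetch: Lookup,
) -> Tuple[Optional[dict], Optional[str]]:
    """Return (metadata, resolution_method); (None, None) if nothing resolved."""
    # 1. A caller-supplied lookup takes precedence when it yields a result.
    if custom_lookup is not None:
        try:
            metadata = custom_lookup(identifier)
        except Exception as exc:  # a failing custom lookup must not abort the citation
            logger.debug("custom lookup failed for %s: %s", identifier, exc)
            metadata = None
        if metadata:
            return metadata, "custom_metadata_lookup"  # hypothetical label

    # 2. Otherwise (absent, failed, or empty) fall back to the API fetch.
    try:
        metadata = api_fetch(identifier)
    except Exception as exc:  # graceful fallback: citation is still created
        logger.debug("API fetch failed for %s: %s", identifier, exc)
        metadata = None
    if metadata:
        return metadata, "auto_metadata_lookup"
    return None, None
```

Passing the fetcher in as a parameter keeps the sketch testable without network access, which mirrors how the tests in this PR mock the CrossRef and NCBI calls.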
Impact:
- Every DOI now gets CrossRef metadata lookup
- Every PMID/PMC now gets NCBI metadata lookup
- Citations render with full bibliographic info (Vancouver style)
- No more bare DOIs/URLs in output for resolvable identifiers
- Graceful fallback if APIs fail (citation still created, just without metadata)
Tests:
- test_automatic_metadata_fetching_for_doi: Verifies CrossRef mocking
- test_automatic_metadata_fetching_for_pmid: Verifies NCBI mocking
- All 100 unit tests pass
See: api.py:_fetch_metadata_from_apis(), _fetch_crossref_metadata(),
_build_csl_citation() (lines 445-485)
Summary of Changes
Phase 1: File Extension Stripping (COMPLETED)
- Problem: DOIs extracted as 10.1038/s41467-025-67223-4_reference.pdf (invalid)
- Solution: Added _strip_file_extensions() method in extractors.py:180-212
- Impact: DOIs now correctly extracted without .pdf, .html, _reference.pdf extensions
- Commits:
- 76080bd - Failing tests (RED)
- 39810ae - Implementation (GREEN)
Phase 2: Always Fetch Metadata (COMPLETED)
- Problem: Citations missing author/title/year when validate=False
- Solution:
- Added _fetch_metadata_from_apis() function (api.py:494-518)
- Added _fetch_crossref_metadata() function (api.py:521-574)
- Updated _build_csl_citation() to always fetch metadata (api.py:445-485)
- Impact: Every DOI gets CrossRef metadata, every PMID gets NCBI metadata
- Commits:
- dad3bf9 - Failing tests (RED)
- ae0ab5b - Implementation (GREEN)
Verification Results
Problematic URLs Test: ✅ 100% success (9/9 URLs)
- Previously: 2 failures due to file extensions
- Now: All URLs process correctly with proper DOIs and metadata
Real Citations Test: ✅ 98.3% success (57/58 URLs)
- Only failure: proteinatlas.org (expected - database URL, not academic)
- All academic citations now have:
- ✅ Correct DOI extraction
- ✅ Author names
- ✅ Title
- ✅ Journal name
- ✅ Year/volume/pages
Full Test Suite: ✅ 122 tests passed (0 failures)
- All unit tests pass (100 tests)
- All integration tests pass (22 tests)
- Code coverage: 59% (up from 53%)
API Integration
The system now automatically fetches metadata from:
- CrossRef API for all DOIs (10s timeout)
- NCBI E-utilities for all PMIDs/PMCs
- Graceful fallback if APIs fail
- Proper error logging for debugging
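As a rough illustration of the response mapping on the CrossRef side, the sketch below converts the `message` object of a CrossRef works response into a flat metadata dict. The output key names and the function name `map_crossref_message` are assumptions made for this example, not the internal format used in api.py, and the HTTP call itself (made with httpx and a 10s timeout in the PR) is deliberately left out so the block runs offline.

```python
from typing import Any, Optional


def _first(values: Any) -> Optional[str]:
    """CrossRef returns title and container-title as lists; take the first entry."""
    if isinstance(values, list) and values:
        return str(values[0])
    return None


def map_crossref_message(message: dict) -> dict:
    """Map a CrossRef works 'message' object to a flat dict (hypothetical keys)."""
    authors = []
    for person in message.get("author", []):
        family = person.get("family")
        if family:  # skip organisational or incomplete author entries
            authors.append({"family": family, "given": person.get("given") or ""})

    # 'issued' holds {"date-parts": [[year, month, day]]}; month and day are optional.
    date_parts = (message.get("issued") or {}).get("date-parts") or [[]]
    year = date_parts[0][0] if date_parts[0] else None

    return {
        "title": _first(message.get("title")),
        "authors": authors,
        "journal": _first(message.get("container-title")),
        "year": year,
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "pages": message.get("page"),
    }
```

Every field is read with `.get()` so that a sparse CrossRef record produces `None` values rather than an exception, in line with the graceful-fallback behaviour listed above.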
Files Modified
Tests (tests/unit/test_citation_resolution.py:113-186):
- test_automatic_metadata_fetching_for_doi() - Verifies CrossRef mocking
- test_automatic_metadata_fetching_for_pmid() - Verifies NCBI mocking
Tests (tests/unit/test_identifiers.py:180-205):
- test_extract_doi_from_nature_urls_with_file_extensions() - File extension tests
Implementation (src/lit_agent/identifiers/extractors.py):
- Added _strip_file_extensions() method (lines 180-212)
- Updated _extract_journal_doi() to use it (line 297)
Implementation (src/lit_agent/identifiers/api.py):
- Added imports: logging, httpx (lines 6, 9, 17)
- Added _fetch_metadata_from_apis() (lines 494-518)
- Added _fetch_crossref_metadata() (lines 521-574)
- Updated _build_csl_citation() logic (lines 445-485)
What's Ready
All citations now render with high-quality Vancouver-style formatting:
[1] Smith J, Jones A, et al. Neural mechanisms of learning. Nature. 2025;523(7845):123-145. PMID: 38448406
Instead of bare DOIs/URLs:
[1] 10.1038/s41467-025-67223-4_reference.pdf
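For readers who want to see how fetched metadata turns into a line like the first example, here is a small formatter sketch. It is not the project's renderer: `format_vancouver`, the metadata keys, and the six-author cutoff before "et al" are assumptions for illustration, and the trailing identifier (PMID or DOI) is omitted.

```python
def _initials(given: str) -> str:
    """'John Paul' -> 'JP'; hyphenated given names are treated as separate parts."""
    return "".join(part[0].upper() for part in given.replace("-", " ").split())


def format_vancouver(number: int, meta: dict, max_authors: int = 6) -> str:
    """Render one numbered reference in a simplified Vancouver style."""
    authors = meta.get("authors") or []
    names = [
        f"{a['family']} {_initials(a.get('given', ''))}".strip()
        for a in authors[:max_authors]
    ]
    if len(authors) > max_authors:
        names.append("et al")

    segments = []
    if names:
        segments.append(", ".join(names))
    for key in ("title", "journal"):
        if meta.get(key):
            segments.append(str(meta[key]).rstrip("."))

    # Year;Volume(Issue):Pages, with each part included only when present.
    locator = str(meta.get("year") or "")
    if meta.get("volume"):
        locator += f";{meta['volume']}"
        if meta.get("issue"):
            locator += f"({meta['issue']})"
    if meta.get("pages"):
        locator += f":{meta['pages']}"
    locator = locator.lstrip(";:")
    if locator:
        segments.append(locator)

    body = ". ".join(segments)
    return f"[{number}] {body}." if body else f"[{number}]"
```

When no metadata is available the function returns only the reference number, so a caller would still need to append the bare identifier in that case.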
The system is ready for use, with a 98.3% success rate (57/58) on the real-citations test.