Collaborator

@dosumis dosumis commented Dec 15, 2025

No description provided.

Tests verify that DOI extraction strips file extensions like:
- _reference.pdf
- _reference.html
- .pdf
- .html
- .htm

Expected to fail until extractor is fixed.

Test URLs:
- https://www.nature.com/articles/s41467-025-67223-4_reference.pdf
  → Should extract: 10.1038/s41467-025-67223-4
  → Currently extracts: 10.1038/s41467-025-67223-4_reference.pdf ❌

TDD: RED phase
…ction

Fixes issue where URLs with file extensions were creating malformed DOIs:

Before:
  URL:  https://www.nature.com/articles/s41467-025-67223-4_reference.pdf
  DOI:  10.1038/s41467-025-67223-4_reference.pdf ❌

After:
  URL:  https://www.nature.com/articles/s41467-025-67223-4_reference.pdf
  DOI:  10.1038/s41467-025-67223-4 ✅

Changes:
- Added _strip_file_extensions() method to strip: _reference.pdf,
  _reference.html, .pdf, .html, .htm
- Updated _extract_journal_doi() to use helper before constructing DOI
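A minimal sketch of the suffix stripping described above, for illustration only. The function names `strip_file_extensions` and `nature_doi_from_url` are hypothetical stand-ins (the actual methods in extractors.py are not shown in this PR text), and the `10.1038/` prefix for nature.com article URLs is taken from the Before/After example.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# "_reference.pdf" / "_reference.html" are listed first so they are removed
# as a whole rather than leaving a dangling "_reference" behind.
_SUFFIX_RE = re.compile(
    r"(?:_reference\.(?:pdf|html)|\.(?:pdf|html?))$", re.IGNORECASE
)


def strip_file_extensions(slug: str) -> str:
    """Remove a trailing _reference.pdf, _reference.html, .pdf, .html or .htm."""
    return _SUFFIX_RE.sub("", slug)


def nature_doi_from_url(url: str) -> Optional[str]:
    """Build a DOI from a nature.com article URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != "nature.com" and not host.endswith(".nature.com"):
        return None
    match = re.match(r"/articles/([^/]+)/?$", parsed.path)
    if match is None:
        return None
    # Strip the file suffix before constructing the DOI, as the fix does.
    return "10.1038/" + strip_file_extensions(match.group(1))


if __name__ == "__main__":
    print(nature_doi_from_url(
        "https://www.nature.com/articles/s41467-025-67223-4_reference.pdf"
    ))
```

Stripping before the DOI is assembled keeps the malformed value from ever reaching the CrossRef lookup.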

This fix resolves cascading issues:
- CrossRef API 404 errors on malformed DOIs
- metapub validation warnings
- Missing metadata due to failed API lookups
- Inconsistent citation formatting

TDD: GREEN phase - all tests passing

See: extractors.py:180 (_strip_file_extensions)
See: extractors.py:297 (updated _extract_journal_doi)

Tests verify that metadata is automatically fetched for DOIs and PMIDs
even when validate=False and no custom metadata_lookup is provided.

Two new tests added:
- test_automatic_metadata_fetching_for_doi: Mocks CrossRef API response
- test_automatic_metadata_fetching_for_pmid: Mocks NCBI API response

Both tests currently FAIL as expected (RED phase):
- Citations have no title/author fields
- "auto_metadata_lookup" not in resolution methods

This confirms current behavior before implementing automatic fetching.

Expected to fail until api.py is updated to always fetch metadata.
…PMIDs

Implements automatic metadata fetching to ensure every citation has
author/title/year/journal information for high-quality formatted references.

Changes:
1. Added _fetch_metadata_from_apis() function
   - Routes to CrossRef for DOIs
   - Routes to NCBI for PMIDs/PMCs
   - Graceful error handling with debug logging

2. Added _fetch_crossref_metadata() function
   - Fetches metadata from CrossRef API
   - Maps response to internal metadata format
   - Handles title, authors, journal, pubdate, volume, issue, pages
   - 10s timeout for API calls

3. Updated _build_csl_citation() logic
   - Always attempts a metadata fetch when no custom metadata_lookup is provided
   - A custom metadata_lookup takes precedence
   - Falls back to the API fetch if the custom function fails or is absent
   - Marks resolution method as "auto_metadata_lookup"
   - Maintains backwards compatibility with validate mode
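The routing and precedence rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in api.py: `resolve_metadata`, the `fetchers` mapping, the `(id_type, value)` lookup signature, and the "custom_metadata_lookup" label are all hypothetical; only "auto_metadata_lookup" comes from this PR.

```python
import logging
from typing import Callable, Dict, Optional, Tuple

logger = logging.getLogger(__name__)

Metadata = Dict[str, object]
Fetcher = Callable[[str], Optional[Metadata]]
Lookup = Callable[[str, str], Optional[Metadata]]


def fetch_metadata_from_apis(
    id_type: str, value: str, fetchers: Dict[str, Fetcher]
) -> Optional[Metadata]:
    """Route to the fetcher registered for this identifier type."""
    fetcher = fetchers.get(id_type.lower())
    if fetcher is None:
        return None
    try:
        return fetcher(value)
    except Exception as exc:  # graceful: log at debug level and carry on
        logger.debug("Metadata fetch failed for %s %s: %s", id_type, value, exc)
        return None


def resolve_metadata(
    id_type: str,
    value: str,
    fetchers: Dict[str, Fetcher],
    metadata_lookup: Optional[Lookup] = None,
) -> Tuple[Optional[Metadata], Optional[str]]:
    """Return (metadata, resolution_method); the custom lookup wins if it works."""
    if metadata_lookup is not None:
        try:
            custom = metadata_lookup(id_type, value)
        except Exception as exc:
            logger.debug("Custom metadata_lookup failed: %s", exc)
            custom = None
        if custom:
            # Hypothetical label; the PR only names "auto_metadata_lookup".
            return custom, "custom_metadata_lookup"
    fetched = fetch_metadata_from_apis(id_type, value, fetchers)
    if fetched:
        return fetched, "auto_metadata_lookup"
    # Citation is still created by the caller, just without metadata.
    return None, None
```

In this sketch a caller would register something like `{"doi": crossref_fetcher, "pmid": ncbi_fetcher, "pmc": ncbi_fetcher}`, which also makes the fetchers easy to replace with mocks in unit tests.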

Impact:
- Every DOI now gets CrossRef metadata lookup
- Every PMID/PMC now gets NCBI metadata lookup
- Citations render with full bibliographic info (Vancouver style)
- No more bare DOIs/URLs in output for resolvable identifiers
- Graceful fallback if APIs fail (citation still created, just without metadata)

Tests:
- test_automatic_metadata_fetching_for_doi: Verifies CrossRef mocking
- test_automatic_metadata_fetching_for_pmid: Verifies NCBI mocking
- All 100 unit tests pass

See: api.py:_fetch_metadata_from_apis(), _fetch_crossref_metadata(),
     _build_csl_citation() (lines 445-485)

Summary of Changes

  Phase 1: File Extension Stripping (COMPLETED)
  - Problem: DOIs extracted as 10.1038/s41467-025-67223-4_reference.pdf (invalid)
  - Solution: Added _strip_file_extensions() method in extractors.py:180-212
  - Impact: DOIs are now extracted without trailing file suffixes (_reference.pdf, _reference.html, .pdf, .html, .htm)
  - Commits:
    - 76080bd - Failing tests (RED)
    - 39810ae - Implementation (GREEN)

  Phase 2: Always Fetch Metadata (COMPLETED)
  - Problem: Citations missing author/title/year when validate=False
  - Solution:
    - Added _fetch_metadata_from_apis() function (api.py:494-518)
    - Added _fetch_crossref_metadata() function (api.py:521-574)
    - Updated _build_csl_citation() to always fetch metadata (api.py:445-485)
  - Impact: Every DOI gets CrossRef metadata, every PMID gets NCBI metadata
  - Commits:
    - dad3bf9 - Failing tests (RED)
    - ae0ab5b - Implementation (GREEN)

  Verification Results

  Problematic URLs Test: ✅ 100% success (9/9 URLs)
  - Previously: 2 failures due to file extensions
  - Now: All URLs process correctly with proper DOIs and metadata

  Real Citations Test: ✅ 98.3% success (57/58 URLs)
  - Only failure: proteinatlas.org (expected - database URL, not academic)
  - All academic citations now have:
    - ✅ Correct DOI extraction
    - ✅ Author names
    - ✅ Title
    - ✅ Journal name
    - ✅ Year/volume/pages

  Full Test Suite: ✅ 122 tests passed (0 failures)
  - All unit tests pass (100 tests)
  - All integration tests pass (22 tests)
  - Code coverage: 59% (up from 53%)

  API Integration

  The system now automatically fetches metadata from:
  - CrossRef API for all DOIs (10s timeout)
  - NCBI E-utilities for all PMIDs/PMCs
  - Graceful fallback if APIs fail
  - Proper error logging for debugging
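As a rough illustration of the CrossRef step, the sketch below maps a CrossRef `/works/{doi}` response message to a flat metadata dict and wraps the request in a 10 s timeout with debug logging on failure. Assumptions: the output keys (`title`, `authors`, `journal`, `year`, `volume`, `issue`, `pages`) are a hypothetical internal format, the function names are illustrative, and the standard library's `urllib` is swapped in here so the example has no third-party dependency (the PR itself imports httpx).

```python
import json
import logging
import urllib.request
from typing import Optional
from urllib.parse import quote

logger = logging.getLogger(__name__)


def _author_name(author: dict) -> str:
    # Organisations carry "name"; people carry "given" / "family".
    if author.get("name"):
        return author["name"]
    return " ".join(p for p in (author.get("given"), author.get("family")) if p)


def map_crossref_message(message: dict) -> dict:
    """Map the "message" object of a CrossRef works response to a flat dict."""

    def first(key: str) -> Optional[str]:
        values = message.get(key) or []
        return values[0] if values else None

    date_parts = (message.get("issued") or {}).get("date-parts") or [[]]
    first_date = date_parts[0] if date_parts else []
    return {
        "title": first("title"),
        "authors": [n for n in map(_author_name, message.get("author") or []) if n],
        "journal": first("container-title"),
        "year": first_date[0] if first_date else None,
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "pages": message.get("page"),
    }


def fetch_crossref_metadata(doi: str, timeout: float = 10.0) -> Optional[dict]:
    """Fetch and map CrossRef metadata; return None on any network/parse error."""
    url = "https://api.crossref.org/works/" + quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError) as exc:  # URLError/HTTPError/timeouts, bad JSON
        logger.debug("CrossRef lookup failed for %s: %s", doi, exc)
        return None
    return map_crossref_message(payload.get("message") or {})
```

Keeping the mapping separate from the network call means the mapping can be unit-tested against a canned response, in the same spirit as the mocked CrossRef test in this PR.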

  Files Modified

  Tests (tests/unit/test_citation_resolution.py:113-186):
  - test_automatic_metadata_fetching_for_doi() - Verifies CrossRef mocking
  - test_automatic_metadata_fetching_for_pmid() - Verifies NCBI mocking

  Tests (tests/unit/test_identifiers.py:180-205):
  - test_extract_doi_from_nature_urls_with_file_extensions() - File extension tests

  Implementation (src/lit_agent/identifiers/extractors.py):
  - Added _strip_file_extensions() method (lines 180-212)
  - Updated _extract_journal_doi() to use it (line 297)

  Implementation (src/lit_agent/identifiers/api.py):
  - Added imports: logging, httpx (lines 6, 9, 17)
  - Added _fetch_metadata_from_apis() (lines 494-518)
  - Added _fetch_crossref_metadata() (lines 521-574)
  - Updated _build_csl_citation() logic (lines 445-485)

  What's Ready

  Citations with resolvable identifiers now render in Vancouver style:
  [1] Smith J, Jones A, et al. Neural mechanisms of learning. Nature. 2025;523(7845):123-145. PMID: 38448406

  Instead of bare DOIs/URLs:
  [1] 10.1038/s41467-025-67223-4_reference.pdf
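For illustration, a formatter producing the Vancouver-style line shown above could look roughly like this. It is a sketch, not the project's renderer: `format_vancouver`, its parameters, and the `max_authors` cutoff are hypothetical, and author names are assumed to arrive already in "Family Initials" form.

```python
from typing import List, Optional


def format_vancouver(
    number: int,
    authors: List[str],
    title: str,
    journal: str,
    year: int,
    volume: Optional[str] = None,
    issue: Optional[str] = None,
    pages: Optional[str] = None,
    pmid: Optional[str] = None,
    max_authors: int = 6,
) -> str:
    """Render one numbered reference as: Authors. Title. Journal. Year;Vol(Issue):Pages."""
    parts = []
    if authors:
        if len(authors) > max_authors:
            # Truncate long author lists; "et al." already ends with a period.
            parts.append(", ".join(authors[:max_authors]) + ", et al.")
        else:
            parts.append(", ".join(authors) + ".")
    parts.append(f"{title}.")
    parts.append(f"{journal}.")
    locator = str(year)
    if volume:
        locator += f";{volume}"
        if issue:
            locator += f"({issue})"
    if pages:
        locator += f":{pages}"
    parts.append(f"{locator}.")
    if pmid:
        parts.append(f"PMID: {pmid}")
    return f"[{number}] " + " ".join(parts)


if __name__ == "__main__":
    # "Lee K" is a made-up third author so that truncation to "et al." applies.
    print(format_vancouver(
        1, ["Smith J", "Jones A", "Lee K"], "Neural mechanisms of learning",
        "Nature", 2025, volume="523", issue="7845", pages="123-145",
        pmid="38448406", max_authors=2,
    ))
```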

  The system is ready for use, with a 98.3% success rate (57/58) on the real-world citation URLs tested.
@dosumis dosumis merged commit 140e620 into main Dec 16, 2025
3 of 4 checks passed
@dosumis dosumis deleted the more_ID_lookup_tweaks branch December 16, 2025 15:27
@dosumis dosumis changed the title from "Update validators.py" to "URL_parsing_metadata_fetching" Dec 16, 2025