Conversation
- Add IdentifierRecord dataclass to babel_xrefs.py (resolves TODO)
- Add 89 tests across 3 files: test_downloader (26), test_babel_xrefs (31), test_nodenorm (23)
- Unit tests (71) use mocks and run without network; integration tests (18) use real downloads/APIs
- Add session-scoped fixtures in conftest.py for shared Parquet file downloads
- Parametrize integration tests over tests/data/valid_curies.txt for easy expansion
- Add `integration` and `slow` pytest markers to pyproject.toml
- Update CLAUDE.md and README.md with testing documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This pull request implements a basic version of babel-explorer in Python using the uv package manager. It's a tool for querying Babel intermediate files to understand why biological/chemical identifiers are considered equivalent. The implementation includes a downloader for large Parquet files with MD5 validation and resume support, NodeNorm API integration for label enrichment, DuckDB-based cross-reference querying, and a Click-based CLI.
Changes:
- Initial project structure with uv-based package management (pyproject.toml, Python 3.11+)
- Core functionality: BabelDownloader with streaming downloads and MD5 validation, NodeNorm API client with LRU caching, BabelXRefs for DuckDB-based Parquet queries
- CLI with three commands: xrefs, ids, and test-concord
- Comprehensive test suite with 80 tests split between unit tests (mocked) and integration tests (real network calls)
Reviewed changes
Copilot reviewed 15 out of 19 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Project configuration with dependencies (click, duckdb, requests, tqdm) and pytest markers |
| .python-version | Specifies Python 3.11 requirement |
| .gitignore | Excludes /data directory for downloaded files |
| README.md | User documentation with setup, usage examples, and testing instructions |
| CLAUDE.md | AI assistant guidance documentation (contains outdated wget reference) |
| src/babel_explorer/cli.py | Click-based CLI with xrefs, ids, and test-concord commands |
| src/babel_explorer/core/downloader.py | Streaming file downloader with MD5 validation and resume capability |
| src/babel_explorer/core/nodenorm.py | NodeNorm API client for identifier normalization |
| src/babel_explorer/core/babel_xrefs.py | DuckDB-based cross-reference query engine (has frozen dataclass bug) |
| tests/conftest.py | Session-scoped pytest fixtures for shared test resources |
| tests/constants.py | Shared test constants and CURIE loader utility |
| tests/data/valid_curies.txt | Parametrized test data (one CURIE) |
| tests/test_downloader.py | 26 tests for BabelDownloader (22 unit, 3 integration, 1 slow) |
| tests/test_nodenorm.py | 23 tests for NodeNorm (18 unit, 5 integration) |
| tests/test_babel_xrefs.py | 31 tests for BabelXRefs (22 unit, 8 integration, 1 slow) |
```python
        return result[curie]

    @functools.lru_cache(maxsize=None)
    def get_clique_identifiers(self, curie, **kwargs):
        result = self.normalize_curie(curie, **kwargs)
```
The normalize_curie method can raise a KeyError if the API response doesn't contain the requested CURIE as a key, which happens when the CURIE is not found or is invalid. Based on the test at line 295 in tests/test_nodenorm.py, the expected behavior is to return None for nonexistent CURIEs. Add error handling to catch the KeyError and return None instead.
```diff
     @functools.lru_cache(maxsize=None)
     def get_clique_identifiers(self, curie, **kwargs):
         result = self.normalize_curie(curie, **kwargs)
-        return result[curie]
+        try:
+            return result[curie]
+        except KeyError:
+            logging.debug(f"NodeNorm response did not contain CURIE {curie!r}; returning None")
+            return None
```

A second suggested variant guards on an empty result instead:

```python
    @functools.lru_cache(maxsize=None)
    def get_clique_identifiers(self, curie, **kwargs):
        result = self.normalize_curie(curie, **kwargs)
        if not result:
            return None
```
```python
        if local_path is None:
            # Default to using TMPDIR.
            # TODO: replace with a real temporary directory.
            tmpdir = os.environ.get("TMPDIR")
            if tmpdir:
                local_path = tmpdir

        # Make sure the local path is an existing directory or that we can create it.
        if not os.path.exists(local_path):
            os.makedirs(local_path, exist_ok=True)
            self.local_path = local_path
        elif os.path.exists(local_path) and os.path.isdir(local_path):
            self.local_path = local_path
        else:
            raise ValueError(f"Invalid local_path (must be an existing directory): '{local_path}'")
```
If local_path is None and TMPDIR is not set in the environment, local_path remains None. This will cause the os.path.exists(local_path) check on line 30 to fail with a TypeError. Add a fallback to a default temporary directory or raise a clear error message if no path can be determined.
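One way to implement the suggested fallback is to lean on `tempfile.gettempdir()`, which honors `TMPDIR` but always returns a usable path. The helper below is a hypothetical sketch of that fix (the real `BabelDownloader` constructor may structure this differently):

```python
import os
import tempfile


def resolve_local_path(local_path=None):
    # Hypothetical helper sketching the suggested fallback; not the
    # actual code in downloader.py.
    if local_path is None:
        # tempfile.gettempdir() honors TMPDIR but never returns None,
        # so local_path is always set after this line.
        local_path = os.environ.get("TMPDIR") or tempfile.gettempdir()

    # Same validation logic as the snippet under review, now safe
    # from a None path.
    if not os.path.exists(local_path):
        os.makedirs(local_path, exist_ok=True)
    elif not os.path.isdir(local_path):
        raise ValueError(f"Invalid local_path (must be an existing directory): '{local_path}'")
    return local_path
```

With this shape, the TypeError can no longer occur, and an explicit ValueError still covers the case where the path exists but is not a directory.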
```python
    for curie in curies:
        identifiers = nodenorm.get_clique_identifiers(curie)
        for identifier in identifiers:
            if identifier.label:
                print(f"{curie}\t{identifier.curie}\t{identifier.label}\t{identifier.biolink_type}")
            else:
                print(f"{curie}\t{identifier.curie}\t\t{identifier.biolink_type}")
```
The test_concord command can fail with an AttributeError if nodenorm.get_clique_identifiers returns None (which happens when the API response doesn't contain 'equivalent_identifiers'). Add a check to handle None return values before iterating.
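A minimal sketch of the suggested guard, with the loop extracted into a hypothetical `print_concord` helper (the names mirror the cli.py snippet above, but this is not the actual implementation):

```python
def print_concord(nodenorm, curies):
    # Hypothetical sketch of the guarded loop from the test-concord command.
    for curie in curies:
        identifiers = nodenorm.get_clique_identifiers(curie)
        if identifiers is None:
            # NodeNorm could not resolve this CURIE; report it and move on
            # instead of raising AttributeError on iteration.
            print(f"{curie}\t(no clique found)")
            continue
        for identifier in identifiers:
            label = identifier.label or ""
            print(f"{curie}\t{identifier.curie}\t{label}\t{identifier.biolink_type}")
```

Using `identifier.label or ""` also collapses the if/else branches in the original loop into one print statement.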
```toml
[project]
name = "babel-explorer"
version = "0.1.0"
description = "Add your description here"
```
The project description is a placeholder: "Add your description here". It should be updated to something meaningful, such as "Tool for querying and exploring Babel intermediate files" (which is used in the README).
```diff
-description = "Add your description here"
+description = "Tool for querying and exploring Babel intermediate files"
```
```python
    def _calculate_md5(self, file_path, chunk_size=1024*1024):
        """
        Calculate MD5 checksum of a file.

        Args:
            file_path: Path to the file to checksum
            chunk_size: Size of chunks to read (default 1MB)

        Returns:
            str: Hexadecimal MD5 checksum
        """
        md5_hash = hashlib.md5()
        with open(file_path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                md5_hash.update(chunk)
        return md5_hash.hexdigest()
```
MD5 is used for checksum validation but is cryptographically weak. While MD5 is acceptable for detecting accidental file corruption, it's vulnerable to intentional tampering. For better security, consider upgrading to SHA-256. If MD5 must be used for compatibility with existing .md5 files, document this limitation.
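If the server ever publishes SHA-256 checksums, the upgrade is a near drop-in change, since `hashlib` exposes the same chunked-update interface for both algorithms. A sketch of the equivalent SHA-256 helper (the function name is hypothetical, modeled on `_calculate_md5` above):

```python
import hashlib


def calculate_sha256(file_path, chunk_size=1024 * 1024):
    """Chunked SHA-256 checksum of a file; a possible drop-in
    replacement for the MD5 helper if stronger validation is needed."""
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read in chunks so large Parquet files never load fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()
```

Until then, MD5 remains the only option compatible with the existing `.md5` files on the server, and that limitation is worth a comment in the code.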
```python
        self.obj_biolink_type = obj_biolink_type

    def __str__(self):
        return f"""LabeledCrossReference(subj="{self.subj}", pred="{self.pred}", obj="{self.obj}", subj_label="{self.subj_label}", obj_label="{self.obj_label}", subj_label="{self.subj_label}", obj_label="{self.obj_label}")"""
```
The `__str__` method has duplicated fields: the string contains "subj_label" and "obj_label" twice each. It should include "subj_biolink_type" and "obj_biolink_type" instead of the duplicate labels.
```diff
-        return f"""LabeledCrossReference(subj="{self.subj}", pred="{self.pred}", obj="{self.obj}", subj_label="{self.subj_label}", obj_label="{self.obj_label}", subj_label="{self.subj_label}", obj_label="{self.obj_label}")"""
+        return f"""LabeledCrossReference(subj="{self.subj}", pred="{self.pred}", obj="{self.obj}", subj_label="{self.subj_label}", subj_biolink_type="{self.subj_biolink_type}", obj_label="{self.obj_label}", obj_biolink_type="{self.obj_biolink_type}")"""
```
```python
        self.subj_label = subj_label
        self.subj_biolink_type = subj_biolink_type
        self.obj_label = obj_label
        self.obj_biolink_type = obj_biolink_type
```
LabeledCrossReference extends the frozen dataclass CrossReference and attempts to add new attributes by assigning them in `__init__`. This violates the frozen=True constraint and will raise a FrozenInstanceError at runtime. The class should be defined as a frozen dataclass with all fields declared, or CrossReference should not be frozen.
```diff
-        self.subj_label = subj_label
-        self.subj_biolink_type = subj_biolink_type
-        self.obj_label = obj_label
-        self.obj_biolink_type = obj_biolink_type
+        object.__setattr__(self, "subj_label", subj_label)
+        object.__setattr__(self, "subj_biolink_type", subj_biolink_type)
+        object.__setattr__(self, "obj_label", obj_label)
+        object.__setattr__(self, "obj_biolink_type", obj_biolink_type)
```
```python
        self.nodenorm_url = nodenorm_url

    @functools.lru_cache(maxsize=None)
    def get_identifier(self, curie):
```
Missing type hint for the curie parameter. Other similar methods in the codebase have type hints (e.g., normalize_curie has "curie: str"). Add type hint for consistency.
```diff
-    def get_identifier(self, curie):
+    def get_identifier(self, curie: str):
```
```markdown
- Downloads Babel intermediate files from a remote server using `wget`
- Caches files locally in configurable directory (default: `data/2025nov19/`)
- Uses `@functools.lru_cache` to avoid re-downloading
- **Important**: Requires `wget` to be installed on the system
```
The CLAUDE.md documentation states that BabelDownloader "Downloads Babel intermediate files from a remote server using wget" and "Requires wget to be installed on the system" (lines 75 and 78), but the actual implementation in downloader.py uses Python's requests library with streaming downloads, not wget. The documentation should be updated to accurately reflect the implementation.
```diff
-- Downloads Babel intermediate files from a remote server using `wget`
+- Downloads Babel intermediate files from a remote HTTP(S) server using Python's `requests` library (streaming downloads)
 - Caches files locally in configurable directory (default: `data/2025nov19/`)
 - Uses `@functools.lru_cache` to avoid re-downloading
-- **Important**: Requires `wget` to be installed on the system
+- **Important**: Requires network access but no external tools like `wget`
```
WIP