BioETL is a robust, scalable data engineering framework designed to acquire, normalize, and process bioactivity data from major public repositories (ChEMBL, PubChem, UniProt, etc.) into a unified, analysis-ready Delta Lake warehouse.
- Medallion Architecture: Structured data flow (Bronze -> Silver -> Gold) ensuring data quality and traceability.
- Delta Lake Core: ACID transactions, schema enforcement, and time travel capabilities.
- Resilience: Built-in circuit breakers, exponential backoff retries, and dead-letter queues (Quarantine).
- Local-First Design: In-memory locking, local file storage -- no external services required (ADR-010).
- Deterministic Writes: Reproducible outputs and deterministic retries (ADR-014).
- Run Control Plane: Immutable run manifests and append-only ledgers for provenance, replay analysis, and artifact linkage (ADR-044).
- Observability by Design: Metrics, tracing, and logging ports (ADR-017).
- Unified HTTP Client: Standardized rate limiting, retry, and telemetry (ADR-032).
- Strict Governance: Comprehensive rules for schema evolution, data contracts, and operational procedures.
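To make the resilience behaviour concrete, here is a minimal exponential-backoff retry loop using the documented defaults (3 attempts, multiplier 2.0). This is an illustrative sketch only — the project's actual client also wires in circuit breakers, telemetry, and quarantine routing; the function and variable names below are invented for the example:

```python
import time

def retry_with_backoff(func, max_attempts=3, multiplier=2.0, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on exceptions with exponential backoff.

    The delay before retry N is base_delay * multiplier ** (N - 1):
    1.0s, 2.0s, 4.0s, ... for up to max_attempts total calls.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface the error (or route to quarantine)
            sleep(base_delay * multiplier ** (attempt - 1))

# Example: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []  # inject a fake sleep so the example records delays instead of waiting
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Injecting the `sleep` callable is also what makes retries deterministic and testable, in the spirit of ADR-014.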
BioETL follows Hexagonal Architecture (Ports & Adapters) with Domain-Driven Design patterns:
```
┌─────────────────────────────────────────────────────────────┐
│ INTERFACES (CLI)                                            │
├─────────────────────────────────────────────────────────────┤
│ COMPOSITION (DI)                                            │
│ bootstrap_pipeline_runner() → Factories                     │
├─────────────────────────────────────────────────────────────┤
│ APPLICATION                                                 │
│ PipelineRunner → Executor → BaseTransformer                 │
├─────────────────────────────────────────────────────────────┤
│ DOMAIN (DDD)                                                │
│ Ports │ Aggregates │ Value Objects │ Entities │ Schemas     │
├─────────────────────────────────────────────────────────────┤
│ INFRASTRUCTURE                                              │
│ ChEMBL │ PubChem │ UniProt │ Delta Lake │ Observability     │
└─────────────────────────────────────────────────────────────┘
```
Data Flow: External API -> Bronze (JSONL+zstd) -> Silver (Delta Lake) -> Gold (Analytics)
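The Bronze-to-Silver hop above can be sketched as appending raw provider payloads as JSON lines, then parsing and typing them on read. This toy version uses gzip from the standard library as a stand-in for zstd, and the record fields are invented for illustration:

```python
import gzip
import json
import pathlib
import tempfile

# One raw record as it might arrive from a provider API (hypothetical fields).
raw = {"molecule_chembl_id": "CHEMBL25", "standard_value": "50", "standard_units": "nM"}

bronze_dir = pathlib.Path(tempfile.mkdtemp()) / "bronze"
bronze_dir.mkdir(parents=True)

# Bronze: append the payload verbatim as one JSON line. (The real layer uses
# zstd compression; gzip keeps this sketch stdlib-only.)
bronze_file = bronze_dir / "chembl_activity.jsonl.gz"
with gzip.open(bronze_file, "at", encoding="utf-8") as f:
    f.write(json.dumps(raw, sort_keys=True) + "\n")

# Silver: parse and type each record; malformed rows would be quarantined instead.
with gzip.open(bronze_file, "rt", encoding="utf-8") as f:
    silver = [{**r, "standard_value": float(r["standard_value"])} for r in map(json.loads, f)]

print(silver[0]["standard_value"])  # 50.0
```

The point is the separation of concerns: Bronze preserves payloads byte-for-byte for replay, while typing and validation happen only at the Silver boundary.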
The domain layer implements Domain-Driven Design patterns:
| Component | Description |
|---|---|
| Ports | Protocol interfaces for dependency inversion (domain/ports/) |
| Aggregates | Domain aggregates with invariant protection (domain/aggregates/) |
| Value Objects | Immutable domain primitives (domain/value_objects/) |
| Entities | Domain entities per provider (domain/entities/) |
| Schemas | Pandera DataFrameModel schemas for dataframe validation (domain/schemas/) |
| Provider | Entity Types | Status | Rate Limit |
|---|---|---|---|
| ChEMBL | Activity, Assay, Molecule, Target, Target Component, Protein Class, Cell Line, Compound Record, Publication, Publication Term/Similarity, Subcellular Fraction, Tissue | Production | 3 req/sec |
| PubChem | Compound | Production | 5 req/sec |
| UniProt | Protein | Production | 10 req/sec (100 req/sec with API key) |
| UniProt ID Mapping | ID Mapping | Production | Local job / no external rate limit |
| PubMed | Publication | Production | 3 req/sec (10 req/sec with API key) |
| CrossRef | Publication | Production | Polite pool |
| OpenAlex | Publication | Production | ~10 req/sec |
| Semantic Scholar | Publication | Production | 0.1 req/sec (1 req/sec with API key) |
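Client-side pacing for limits like those above can be approximated with a minimum-interval limiter. This is an illustrative sketch, not the project's unified HTTP client (ADR-032), which layers retries and telemetry on top; the class name is invented:

```python
class MinIntervalRateLimiter:
    """Enforce a minimum gap between requests: rate 3 req/sec -> gap of 1/3 s.

    The clock and sleep callables are injected so pacing is deterministic in tests.
    """
    def __init__(self, rate_per_sec: float, clock, sleep):
        self.interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def acquire(self) -> None:
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval

# Simulated clock: all "requests" arrive at t=0, so the limiter must pace them.
t = {"now": 0.0}
def fake_sleep(seconds):
    t["now"] += seconds

limiter = MinIntervalRateLimiter(rate_per_sec=4.0, clock=lambda: t["now"], sleep=fake_sleep)
for _ in range(5):
    limiter.acquire()
print(t["now"])  # 1.0 -- four 0.25s gaps for five requests
```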
| Document | Description |
|---|---|
| API Reference | Full API documentation with mkdocstrings |
| Architecture Decisions | 45 ADRs explaining design choices |
| Ubiquitous Language | Domain terminology and canonical naming |
| RULES.md | Canonical active governance and requirements |
| Project Map | Primary navigator for active project docs |
| Tools Hub | Current tool entry points and placement rules |
| CLI Reference | Command-line interface documentation |
| Run Manifest Contract | Published control-plane manifest and ledger schema |
| Operations Runbooks | Incident response and procedures |
| Archive Index | Historical context only; not normative |
Start with Project Map, RULES.md, and Tools Hub for current guidance. Materials under `docs/99-archive/` are preserved for traceability, but active docs in `docs/00-05` remain the source of truth.
| Path | Role | Orientation |
|---|---|---|
| `src/bioetl/` | Runtime source tree organized by the five-layer architecture | Source Map |
| `configs/` | Provider, entity, composite, contract, and quality configuration assets | `configs/README.md` |
| `tests/` | Unit, integration, e2e, smoke, contract, security, performance, and architecture verification | `tests/` mirrors source concerns by scope and policy surface |
| `docs/` | Published documentation tree: canonical active docs plus selected extended mirrors | Start at Project Map |
| `docs/reports/` | Repo-only curated evidence and report artifacts (not published in MkDocs) | `docs/reports/index.md` |
| `reports/` | Generated or working analysis outputs before curation | `reports/README.md` |
| `scripts/` | Canonical tooling by domain plus a small set of compatibility wrappers at repo root | `scripts/README.md` |
The current top-level layout is intentionally stable. Structural improvements should usually target a specific family or navigation seam rather than trigger a repo-wide reorganization wave.
- Python: Version 3.11 or higher.
- Make: For running automation commands.
- uv: Recommended package manager (install).
- Docker: Optional, only for `docker-compose` extras such as Neo4j and monitoring; not required for the Local-Only runtime.
- Node.js: Optional, for Mermaid diagram rendering and related docs tooling.
Use the maintained Make targets for local bootstrap:
```shell
git clone https://github.com/SatoryKono/BioactivityDataAcquisition2.git
cd BioactivityDataAcquisition2
make install
make test-deps
make setup-plugins
make setup-skills
```

Notes:

- `make install` uses `uv sync --extra dev --extra tracing` when `uv` is available; otherwise it creates `.venv` and installs the editable package with dev extras.
- Documentation site commands such as `make docs-build` require the separate `docs` extra: `uv sync --extra dev --extra tracing --extra docs` or `pip install -e ".[dev,tracing,docs]"`.
- `make setup-plugins` configures local pytest/pre-commit tooling.
- `make setup-skills` syncs repository-local Codex `skills` and their paired `agents` into `$CODEX_HOME` (default `~/.codex`).
- If you use Codex or GitHub Copilot MCP, run `uv run python -m scripts.dev setup-mcp` after install. If you activated the OS-appropriate environment instead of using `uv`, `python -m scripts.dev setup-mcp` is also valid.
- `scripts/dev/dev_setup.sh` is currently a legacy placeholder and is not the supported onboarding path.
If you use the same checkout from both Windows PowerShell and WSL, keep the
virtual environments separate. A Linux .venv is not valid in PowerShell, and
a Windows .venv is not valid in WSL.
Use:

```shell
# Windows PowerShell
.\scripts\dev\setup_env_windows.ps1

# WSL/Linux
bash scripts/dev/setup_env_wsl.sh
```

This creates:

```
.venv-win            # Windows PowerShell
$HOME/.venvs/bioetl  # WSL/Linux by default
```

Then use the OS-specific wrappers:

```shell
# Windows PowerShell
.\scripts\dev\run_pytest.ps1 tests\ --timeout=120 -n 4 --lf
.\scripts\dev\run_mypy.ps1

# WSL/Linux
bash scripts/dev/run_pytest.sh tests/ --timeout=120 -n 4 --lf
bash scripts/dev/run_mypy.sh
```
- Clone and Install: Initialize the virtual environment and install project dependencies.

  ```shell
  git clone https://github.com/SatoryKono/BioactivityDataAcquisition2.git
  cd BioactivityDataAcquisition2

  # Preferred manual path
  uv sync --extra dev --extra tracing

  # Add --extra docs if you need MkDocs/site builds
  uv sync --extra dev --extra tracing --extra docs

  # Fallback without uv
  python3 -m venv .venv
  . .venv/bin/activate
  pip install -e ".[dev,tracing,docs]"
  ```
- Configure Environment (optional): Copy the example configuration if you need API keys for providers.

  ```shell
  cp .env.example .env
  ```

  Note: Secrets follow the pattern `BIOETL_{PROVIDER}_{KEY}`. For local development, the defaults are usually sufficient.

Environment Variables:
| Variable | Description | Default |
|---|---|---|
| **Core** | | |
| `BIOETL_ENV` | Environment (`dev`/`staging`/`prod`) | `dev` |
| `BIOETL_DATA_DIR` | Base directory for Bronze/Silver/Gold data | `data` |
| `BIOETL_DEBUG` | Enable debug features | `false` |
| `BIOETL_TEST_MODE` | Use fixtures instead of real APIs | `false` |
| **Pipeline** | | |
| `BIOETL_PIPELINE__BATCH_SIZE` | Records per batch write (1–10000) | `100` |
| `BIOETL_PIPELINE__CHECKPOINT_INTERVAL` | Save checkpoint every N records (≥100) | `1000` |
| `BIOETL_PIPELINE__MAX_CONCURRENT_BATCHES` | Max concurrent batch writes (1–16) | `4` |
| `BIOETL_PIPELINE__HEARTBEAT_INTERVAL` | Lock heartbeat interval in seconds (5–60) | `30` |
| **Provider API Keys** | | |
| `BIOETL_UNIPROT_API_KEY` | UniProt API key (higher rate limits) | — |
| `BIOETL_PUBMED_API_KEY` | NCBI E-utilities API key | — |
| `BIOETL_PUBMED_EMAIL` | Email for NCBI tool identification | — |
| `BIOETL_OPENALEX_EMAIL` | Email for OpenAlex polite pool | — |
| `BIOETL_SEMANTICSCHOLAR_API_KEY` | Semantic Scholar API key | — |
| `BIOETL_CROSSREF_EMAIL` | Email for Crossref polite pool | — |
| **Security** | | |
| `BIOETL_PII_SALT_CURRENT` | Salt for PII hashing (≥32 chars, required in prod) | — |
| `BIOETL_PII_SALT_NEXT` | Next salt for rotation | — |
| `BIOETL_SALT_ROTATION_ACTIVE` | Whether salt rotation is active | `false` |
| **Observability** | | |
| `BIOETL_LOG_LEVEL` | Logging level (`DEBUG`/`INFO`/`WARNING`/`ERROR`/`CRITICAL`) | `INFO` |
| `BIOETL_LOG_FORMAT` | Log format (`json`/`text`) | `json` |
| `BIOETL_LOG_FILE` | Log file path | `logs/bioetl.log` |
| `BIOETL_METRICS_ENABLED` | Enable Prometheus metrics | `true` |
| `BIOETL_METRICS_PORT` | Prometheus HTTP server port | `8000` |
| `BIOETL_OBSERVABILITY__TRACING_ENABLED` | Enable OpenTelemetry tracing | `false` |
| `BIOETL_OBSERVABILITY__DQ_MONITOR_ENABLED` | Enable data quality monitoring | `false` |
| **Data Quality** | | |
| `BIOETL_DQ_SOFT_THRESHOLD` | Warning error rate threshold | `0.05` |
| `BIOETL_DQ_HARD_THRESHOLD` | Fail batch error rate threshold | `0.20` |
| **Resilience** | | |
| `BIOETL_CB_FAILURE_THRESHOLD` | Consecutive errors to open circuit breaker | `5` |
| `BIOETL_CB_RECOVERY_TIMEOUT` | Circuit breaker recovery timeout (seconds) | `300` |
| `BIOETL_RETRY_MAX_ATTEMPTS` | Maximum retry attempts | `3` |
| `BIOETL_RETRY_MULTIPLIER` | Exponential backoff multiplier | `2.0` |
| **Delta Lake** | | |
| `BIOETL_DELTA_VACUUM_RETENTION` | VACUUM retention (days) | `7` |
| `BIOETL_DELTA_FORENSIC_RETENTION` | Forensic retention (days) | `7` |
| **Quarantine** | | |
| `BIOETL_QUARANTINE_RETENTION_DAYS` | Quarantine record retention (days) | `30` |
| `BIOETL_QUARANTINE_PAYLOAD_MAX_SIZE` | Max payload size (bytes) | `65536` |

See `.env.example` for the full list with comments.
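The double underscore in names like `BIOETL_PIPELINE__BATCH_SIZE` marks nesting. A stdlib-only sketch of that folding convention (the project itself reads these variables through its own settings layer, so this is illustrative only):

```python
def load_settings(environ: dict, prefix: str = "BIOETL_") -> dict:
    """Fold BIOETL_* variables into a nested dict.

    A double underscore denotes nesting, so BIOETL_PIPELINE__BATCH_SIZE
    becomes settings["pipeline"]["batch_size"].
    """
    settings: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated environment variables
        parts = key[len(prefix):].lower().split("__")
        node = settings
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return settings

env = {
    "BIOETL_ENV": "dev",
    "BIOETL_PIPELINE__BATCH_SIZE": "100",
    "BIOETL_PIPELINE__CHECKPOINT_INTERVAL": "1000",
    "PATH": "/usr/bin",  # ignored: no BIOETL_ prefix
}
print(load_settings(env))
# {'env': 'dev', 'pipeline': {'batch_size': '100', 'checkpoint_interval': '1000'}}
```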
- Verify Installation: Run tests to ensure everything works.

  ```shell
  make lint && make test
  ```

Note: BioETL uses local file storage by default (the `data/` directory). No Docker or external services are required. See Local Storage Layout and ADR-010 for details.
Use the OS-appropriate bootstrap path first:
```shell
# Windows PowerShell
.\scripts\dev\setup_env_windows.ps1
.\.venv-win\Scripts\Activate.ps1

# WSL/Linux
bash scripts/dev/setup_env_wsl.sh
source "${BIOETL_WSL_VENV_DIR:-$HOME/.venvs/bioetl}/bin/activate"
```

Then run the ETL pipeline using the CLI:
```shell
# Run incremental update for ChEMBL
python -m bioetl run --pipeline chembl_activity --run-type incremental

# Run backfill with resume capability
python -m bioetl run --pipeline chembl_activity --run-type backfill --resume

# Inspect quarantined records
python -m bioetl quarantine inspect --pipeline chembl_activity --limit 10

# List checkpoints
python -m bioetl checkpoint list --pipeline chembl_activity
```

If you do not want to activate the environment, call the interpreter directly:
```shell
# Windows PowerShell
.\.venv-win\Scripts\python.exe -m bioetl run --pipeline chembl_publication --limit 50000

# WSL/Linux
"${BIOETL_WSL_VENV_DIR:-$HOME/.venvs/bioetl}/bin/python" -m bioetl run --pipeline chembl_publication --limit 50000
```

- Do not store domain datasets or reference data files in the repository root.
- Keep machine-consumed reference datasets under semantic paths in `data/` (for example, `data/input/reference/`).
- Keep optional human-facing spreadsheet copies under `docs/04-reference/schemas/` when they are needed for documentation.
- The unified publication classifier's canonical format is CSV at `data/input/reference/unified_classification.csv`; optional spreadsheet copies are non-canonical and MAY be stored in docs as needed.
Local diagnostic files (for example, `git_commit_*.txt`, `*_gitshow_err.txt`, `log_test.txt`) must not be stored in the repository root and are not committed to Git.

- Save temporary diagnostic dumps to `tmp/`.
- Save logs from local runs to `logs/`.
- For ad-hoc commands, use explicit redirection (`> logs/<name>.log 2>&1` or `> tmp/<name>.txt 2>&1`).
To configure the core MCP servers for both VS Code Copilot and Codex CLI:
```shell
./scripts/dev/setup_copilot_codex_mcp.sh
```

Windows PowerShell:

```shell
.\scripts\dev\setup_copilot_codex_mcp.ps1
```

What this script does:
- Writes workspace MCP config for Copilot at `.vscode/mcp.json`.
- Registers `memory`, `filesystem`, `sequential-thinking`, `fetch`, `pdf`, `github`, `docker`, `docker-docs`, `context7`, `paper-search`, `dockerhub`, `prometheus`, `grafana`, `brave-search`, and `openaiDeveloperDocs` in Codex CLI.
- Uses Docker-backed wrappers for `docker`, `docker-docs`, `context7`, `paper-search`, `dockerhub`, `prometheus`, `grafana`, and `brave-search`.
- Uses local defaults when not overridden: `PROMETHEUS_URL=http://host.docker.internal:9090` and `GRAFANA_URL=http://host.docker.internal:3000`.
- Grafana auth prefers `GRAFANA_SERVICE_ACCOUNT_TOKEN`; otherwise it can use `GRAFANA_USERNAME`/`GRAFANA_PASSWORD`.
- Template variables for these MCP servers live in `.env.example`.
- Does not store real tokens in repository files.
Common MCP environment variables:
```shell
GITHUB_PERSONAL_ACCESS_TOKEN=
PROMETHEUS_URL=http://host.docker.internal:9090
GRAFANA_URL=http://host.docker.internal:3000
GRAFANA_SERVICE_ACCOUNT_TOKEN=
BRAVE_API_KEY=
DOCKERHUB_USERNAME=
HUB_PAT_TOKEN=
```

Before using GitHub MCP tools, set a token in your shell:

```shell
export GITHUB_PERSONAL_ACCESS_TOKEN="<your_pat>"
```

On Windows, the project wrapper `.claude/github-mcp-wrapper.ps1` can auto-read the token from `gh auth token` when available.
Cursor uses the same workspace tasks as VS Code. This repository includes two Codex tasks:
- `BioETL: Codex interactive (WSL)` — starts interactive Codex in WSL.
- `BioETL: Codex exec full-auto (WSL)` — prompts for a task string and runs `codex exec --full-auto`.

How to run:

- Open Command Palette (`Ctrl+Shift+P`).
- Run `Tasks: Run Task`.
- Pick one of the `BioETL: Codex ...` tasks.
For one-click IDE launch, use Run and Debug configurations:
- `BioETL: Codex interactive (WSL)`
- `BioETL: Codex exec full-auto (WSL)`

How to run:

- Open `Run and Debug` (`Ctrl+Shift+D`).
- Select one of the `BioETL: Codex ...` configurations.
- Press `F5`.
The project uses pytest for testing, split into Unit, Integration, and Architecture tests.
- Setup Plugins (pytest + pre-commit):

  ```shell
  make setup-plugins
  ```

  This command validates required pytest plugins and installs pre-commit hooks.
- Quick Check (with dependencies auto-synced and coverage):

  ```shell
  bash scripts/dev/run_pytest.sh
  ```

  Windows PowerShell:

  ```shell
  .\scripts\dev\run_pytest.ps1
  ```

  The helpers assume you already bootstrapped the OS-appropriate environment with `make install`/`make setup-plugins` or `scripts/dev/setup_env_windows.ps1`/`scripts/dev/setup_env_wsl.sh`. By default they run `pytest` with `--cov=src/bioetl --cov-report=term -q --maxfail=1`. `bash scripts/dev/run_pytest.sh` also calls `bash scripts/ops/setup_plugins.sh --pytest-only` before execution, so it can self-heal missing pytest plugins in WSL/Linux. `.\scripts\dev\run_pytest.ps1` does not perform that bootstrap step and expects `.venv-win` to be prepared already.

  If you prefer to run the command manually, activate the OS-appropriate virtual environment first to avoid `--cov` argument errors:

  ```shell
  source "${BIOETL_WSL_VENV_DIR:-$HOME/.venvs/bioetl}/bin/activate"
  # dev already includes pytest, pytest-cov, pytest-xdist, pytest-timeout, VCR, etc.
  pip install -e ".[dev,tracing]"
  python -m pytest tests --cov=src/bioetl --cov-report=term
  ```

  Windows PowerShell:

  ```shell
  .\.venv-win\Scripts\Activate.ps1
  python -m pytest tests\ --cov=src/bioetl --cov-report=term
  ```

  With `uv`, the equivalent is:

  ```shell
  uv sync --extra dev --extra tracing
  uv run python -m pytest tests --cov=src/bioetl --cov-report=term
  ```

  To include tracing and pre-commit plugin setup:

  ```shell
  uv sync --extra dev --extra tests --extra tracing
  uv run python -m pre_commit install --install-hooks
  ```

  If `pytest` reports missing required plugins (`pytest-asyncio`, `pytest-cov`, `pytest-xdist`, `pytest-timeout`, `pytest-vcr`), re-run the sync:

  ```shell
  uv sync --extra dev --extra tests --extra tracing
  ```

  The `bash scripts/dev/run_pytest.sh` script checks for these plugins and installs any missing ones automatically.
- Run All Tests:

  ```shell
  make test
  ```

- Run Unit Tests Only (fast, no I/O):

  ```shell
  make test-unit
  ```

- Run Integration Tests (uses VCR.py cassettes, no network required):

  ```shell
  make test-integration
  ```

- Run Architecture Tests:

  ```shell
  make test-architecture
  ```

- Sync project skills into Codex:

  ```shell
  make setup-skills
  ```

  This syncs local project skills from `.codex/skills` into `$CODEX_HOME/skills` (default `~/.codex/skills`) and also keeps the paired `.codex/agents` tree aligned in `$CODEX_HOME/agents`.

- Sync only project agents into Codex:

  ```shell
  make setup-agents
  ```
Strict quality standards are enforced using ruff, mypy, and other tools.
- Linting & Formatting:

  ```shell
  make lint      # Check only
  make lint-fix  # Auto-fix and format
  ```

- Type Checking:

  ```shell
  make typecheck  # Strict mypy
  ```

- Complexity Check:

  ```shell
  make complexity
  ```
Build and serve local documentation:
```shell
make docs-serve
```

Access the docs at http://localhost:8000.
```
.
├── configs/                     # YAML pipeline configurations
├── docs/                        # Documentation (Architecture, Guides, Runbooks)
│   ├── 02-architecture/         # Layer docs, diagrams, ADRs (43 decisions)
│   ├── 00-project/
│   │   ├── glossary.md          # Ubiquitous Language glossary
│   │   └── RULES.md             # Project governance (v5.24)
│   └── ...
├── src/
│   └── bioetl/
│       ├── domain/              # Pure business logic (DDD), NO I/O
│       │   ├── ports/           # Protocol interfaces (Ports)
│       │   ├── aggregates/      # DDD Aggregates with invariants
│       │   ├── value_objects/   # Immutable domain primitives
│       │   ├── entities/        # Domain entities per provider
│       │   ├── schemas/         # Pydantic/Pandera validation schemas
│       │   └── exceptions/      # Classified exceptions (Critical/Recoverable/DQ)
│       ├── application/         # Pipeline orchestration & services
│       │   ├── core/            # PipelineRunner, Executor, BaseTransformer
│       │   ├── pipelines/       # ChEMBL, PubChem, UniProt, PubMed, CrossRef, OpenAlex, Semantic Scholar (+ common utilities)
│       │   └── services/        # Application services (lifecycle, vacuum, cleanup)
│       ├── composition/         # Composition Root (public seams, bootstrap, factories)
│       │   ├── bootstrap/       # Runtime and CLI bootstrap assembly
│       │   ├── factories/       # Pipeline, storage, data source, service factories
│       │   ├── providers/       # Provider registry and loading lifecycle
│       │   ├── runtime_builders/ # Leaf builders for runner inputs and observability
│       │   ├── services/        # Thin re-exports for metadata/versioning helpers
│       │   ├── entrypoints.py   # Stable broad public seam
│       │   ├── execution_api.py # Narrow execution API
│       │   ├── services_api.py  # Narrow services API
│       │   ├── resources_api.py # Narrow checkpoint/quarantine API
│       │   ├── composite_api.py # Composite runtime facade
│       │   └── observability_api.py # Observability facade
│       ├── infrastructure/      # Adapters (API clients, Delta Lake, Storage)
│       │   ├── adapters/        # HTTP clients with unified resilience
│       │   ├── storage/         # Bronze/Silver/Gold writers
│       │   ├── locking/         # In-memory locks (MemoryLock)
│       │   └── observability/   # Metrics, tracing, logging
│       └── interfaces/          # External interfaces
│           ├── cli/             # Click CLI commands
│           └── orchestration/   # Reserved (empty; signal handlers removed 2025-12-31, shutdown logic in application/core/shutdown.py)
├── tests/                       # Unit, Integration, Architecture & E2E tests
├── scripts/                     # Utility scripts (lint_terminology.py, etc.)
├── Makefile                     # Automation commands
└── pyproject.toml               # Dependencies & Tool configuration
```
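The Critical/Recoverable/DQ classification noted under `domain/exceptions/` above can be sketched as a small hierarchy that lets the runner dispatch on error class rather than on provider-specific types. The class and function names here are hypothetical:

```python
class BioETLError(Exception):
    """Illustrative base; the real hierarchy lives in domain/exceptions/."""

class CriticalError(BioETLError):
    """Abort the run: configuration errors, schema contract violations."""

class RecoverableError(BioETLError):
    """Retry with backoff: timeouts, rate-limit responses."""

class DataQualityError(BioETLError):
    """Quarantine the offending record and continue the batch."""

def handle(exc: BioETLError) -> str:
    # Dispatch on classification, not on the concrete provider error.
    if isinstance(exc, CriticalError):
        return "abort"
    if isinstance(exc, RecoverableError):
        return "retry"
    return "quarantine"

print(handle(RecoverableError("HTTP 429")), handle(DataQualityError("bad units")))
# retry quarantine
```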
Repository root is protected by `scripts/repo/audit_root_cleanliness.py` (pre-commit + CI job `root-hygiene`). Only approved top-level entries are allowed.
Core allowed root entries:
- Source and tests: `src/`, `tests/`
- Documentation and references: `docs/`, `README.md`, `CHANGELOG.md`
- Build/configuration: `pyproject.toml`, `uv.lock`, `Makefile`, `.pre-commit-config.yaml`, `.github/`
- Operational/project assets: `configs/`, `scripts/`, `assets/`, `data/`, `reports/`, `grafana/`
- Explicit exceptions listed in `.github/root-allowlist.txt`
Where to place artifacts:
- Test artifacts and run reports → `reports/`
- Logs and diagnostic dumps → `reports/` (or a nested folder by run date/provider)
- Coverage artifacts (`coverage.xml`, `htmlcov/`, `.coverage*`) → keep out of git; generate locally/CI only
- Reference datasets and static lookup files → `docs/` (documentation reference) or `data/` (runtime/local data)
BioETL uses a strictly Local-Only runtime model defined by ADR-010. Active workflows use filesystem-backed checkpoints, local storage, and in-memory locking. Distributed deployment, Redis locking, and Docker-based runtime orchestration are not supported entry points for current development or operations.
Please review our Security Policy for:
- Threat model and trust boundaries
- Secret management guidelines
- Data validation architecture
- Vulnerability reporting process
Please read RULES.md before contributing.
- Ensure all tests pass: `make test`
- Check types and linting: `make lint`
- Follow the RFC 2119 keywords in requirements.
This project is licensed under the MIT License.