Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) #2073

Merged
matentzn merged 14 commits into main from issueXDATA-282
Feb 23, 2026

@matentzn matentzn (Collaborator) commented Feb 10, 2026

Following @JacquesVergine's feedback, the drug list, disease list, and KG publication pipelines are now fully decoupled. Each can be run independently without cross-project dependencies.

What changed

New Kedro pipelines in core_entities:

  • publish_drug_list_hf — publishes the EC drug list to HuggingFace (passthrough)
  • publish_disease_list_hf — publishes the disease list to HuggingFace (merges the full Mondo disease list with EC curated data, drops internal columns)

HF disease list is now richer: Instead of just the curated EC diseases, the HF disease list left-joins ALL Mondo diseases with the EC enrichment data (specialties, prevalence, categories, etc.). EC values take precedence where both lists have data for the same disease.
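A minimal sketch of the merge described above, written in plain Python rather than whatever dataframe library the pipeline actually uses. The function name, the `id` join key, and the `internal_columns` parameter are hypothetical; the real node code is not shown in this PR. It keeps every Mondo row (left join), lets a non-null EC value override the Mondo value for the same column, and drops internal columns.

```python
from typing import Any

Row = dict[str, Any]


def merge_disease_lists(
    mondo_rows: list[Row],
    ec_rows: list[Row],
    key: str = "id",  # hypothetical join column
    internal_columns: tuple[str, ...] = (),  # hypothetical columns to drop
) -> list[Row]:
    """Left-join EC enrichment data onto the full Mondo list."""
    ec_by_key = {row[key]: row for row in ec_rows}

    # Output columns in first-seen order, skipping internal ones.
    columns: list[str] = []
    for row in [*mondo_rows, *ec_rows]:
        for col in row:
            if col not in columns and col not in internal_columns:
                columns.append(col)

    merged: list[Row] = []
    for mondo in mondo_rows:  # every Mondo disease is kept
        ec = ec_by_key.get(mondo[key], {})  # empty if not EC-curated
        out: Row = {}
        for col in columns:
            ec_value = ec.get(col)
            # EC takes precedence only when it actually has a value.
            out[col] = ec_value if ec_value is not None else mondo.get(col)
        merged.append(out)
    return merged
```

EC rows with no matching Mondo entry are not emitted, which is what a left join from the Mondo side implies.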

Separate GHA workflow (core_entitities_publish_release_hf.yaml): HF publication is a manual-dispatch workflow, independent of the existing GCS/BQ release workflow. This means HF releases don't happen automatically on every internal release — you trigger them explicitly when ready.

Cleaned up matrix pipeline: Removed drug/disease list nodes from data_publication in the matrix pipeline. It now only handles KG publication.

Open question

The separate HF workflow means you manually trigger it after a GCS/BQ release. An alternative would be to couple HF publication to non-patch releases (v1.1, v1.2, etc.) automatically. Happy to wire that up if preferred — @JacquesVergine what do you think?
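For the automatic alternative, the release workflow would need a rule for what counts as a non-patch release. A sketch, assuming versions look like `v1.2` or `v1.2.9` and that a release is non-patch when its patch component is absent or zero; `is_non_patch_release` is a hypothetical helper, not something this PR adds:

```python
import re

# Matches v<major>.<minor> with an optional .<patch> component.
_VERSION = re.compile(r"v(\d+)\.(\d+)(?:\.(\d+))?")


def is_non_patch_release(version: str) -> bool:
    """Return True for minor/major releases such as v1.1 or v2.0.0."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"Unrecognised release version: {version!r}")
    patch = int(match.group(3) or 0)  # a missing patch counts as 0
    return patch == 0
```

Under this assumption `v1.2.9` would be skipped, while `v1.2` and `v1.2.0` would trigger HF publication.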

How to run

export HF_TOKEN="hf_***"

# From pipelines/core_entities/
# Publish drug list to HF
PIPELINE_NAME=drug_list RELEASE_VERSION=v1.2.9 \
  uv run kedro run -e cloud --pipeline=publish_drug_list_hf

# Publish disease list to HF
PIPELINE_NAME=disease_list RELEASE_VERSION=v1.1.3 \
  uv run kedro run -e cloud --pipeline=publish_disease_list_hf
  • I have tested the pipeline with the code above locally

Add drug/disease list publication to the data_publication pipeline,
reading versioned parquet files from GCS and writing to everycure/drug-list
and everycure/disease-list on HF Hub. The disease list drops six internal
columns before publishing. HF publication is triggered automatically from
the core_entities release CI on minor/major releases (patches are skipped).
@matentzn matentzn added the enhancement label Feb 10, 2026
Update public data releases page to refer to 'datasets' and broaden examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and Docs column, adjust kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.
Comment thread pipelines/matrix/conf/base/data_publication/catalog.yml Outdated
Comment thread pipelines/matrix/conf/base/data_publication/catalog.yml Outdated
This was requested by @JacquesVergine: instead of having a single data publication pipeline, you have individual pipelines for KG, drug and disease list!
The HuggingFace disease list publication previously only included
curated diseases. It now left-joins the full Mondo disease list with
the EC release list, so all Mondo diseases are present with EC
enrichment data (specialties, prevalence, categories, etc.) where
available. For overlapping columns, EC values take precedence.
@matentzn matentzn marked this pull request as ready for review February 17, 2026 14:19
@matentzn matentzn requested a review from a team as a code owner February 17, 2026 14:19
@matentzn matentzn changed the title WIP: Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) Feb 17, 2026
Added clarification on the exclusion of certain columns from the public HF release due to their experimental nature.
Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.
Comment thread .github/workflows/core_entitities_publish_release_hf.yaml Outdated
@JacquesVergine JacquesVergine removed the request for review from leelancashire February 18, 2026 11:52
@JacquesVergine (Collaborator)

Regarding your open question, I think I would go with the manually triggered one. We can automate it easily in the GitHub Action itself.

@JacquesVergine JacquesVergine (Collaborator) left a comment

Looks good to me, but wait for May to have a look at the phrasing before merging

Comment thread .github/workflows/core_entitities_publish_release_hf.yaml Outdated
Comment thread docs/src/pipeline/data/drug_disease_lists.md Outdated
Comment thread docs/src/releases/public_data_releases.md Outdated
…#2080)

* Add Git tag support to HFIterableDataset for version-coupled releases

HFIterableDataset now accepts an optional `tag` parameter. After
push and verification, it creates a Git tag on the HuggingFace Hub
repo pinned to the exact commit SHA of the upload. This couples
internal EC release versions (e.g. v1.1.0) with HF dataset versions,
so users can load a specific release with
`load_dataset("everycure/disease-list", revision="v1.1.0")`.

* Move the HF pipeline into the kedro cloud space (from base)
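The tagging step in the HFIterableDataset commit above could look roughly like the sketch below. It assumes `api` is a `huggingface_hub.HfApi` instance (its `create_tag` method accepts `repo_id`, `tag`, `revision`, and `repo_type`) and that the commit SHA of the upload is already known; `tag_hf_release` is a hypothetical helper and not the code from #2080.

```python
def tag_hf_release(api, repo_id: str, commit_sha: str, tag: str) -> str:
    """Pin `tag` to the exact commit of an upload on a HF dataset repo.

    `api` is expected to behave like huggingface_hub.HfApi.
    """
    api.create_tag(
        repo_id=repo_id,
        tag=tag,  # e.g. the internal EC release version
        revision=commit_sha,  # exact commit, so later pushes do not move the tag
        repo_type="dataset",
    )
    return tag
```

Consumers can then pin that release with `load_dataset("everycure/disease-list", revision="v1.1.0")`, as the commit message describes.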
@matentzn matentzn merged commit 204cfd0 into main Feb 23, 2026
22 checks passed
@matentzn matentzn deleted the issueXDATA-282 branch February 23, 2026 17:02
eKathleenCarter pushed a commit that referenced this pull request Feb 24, 2026

* Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282)

Add drug/disease list publication to the data_publication pipeline,
reading versioned parquet files from GCS and writing to everycure/drug-list
and everycure/disease-list on HF Hub. The disease list drops six internal
columns before publishing. HF publication is triggered automatically from
the core_entities release CI on minor/major releases (patches are skipped).

* Add drug/disease datasets and clarify releases

Update public data releases page to refer to 'datasets' and broaden examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and Docs column, adjust kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.

* Refactor hf pipeline into separate publication pipelines

This was requested by @JacquesVergine: instead of having a single data publication pipeline, you have individual pipelines for KG, drug and disease list!

* Add publish to huggingface core entities release pipeline

* HF disease list now merges full Mondo list with EC curated data (#2079)

The HuggingFace disease list publication previously only included
curated diseases. It now left-joins the full Mondo disease list with
the EC release list, so all Mondo diseases are present with EC
enrichment data (specialties, prevalence, categories, etc.) where
available. For overlapping columns, EC values take precedence.

* Remove now unnecessary drop_disease_hf_columns method

* Clarify exclusion of columns in HF dataset release

Added clarification on the exclusion of certain columns from the public HF release due to their experimental nature.

* Simplify data_publication pipeline tests

Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.

* Add Git tag support to HFIterableDataset for version-coupled releases (#2080)

* Add Git tag support to HFIterableDataset for version-coupled releases

HFIterableDataset now accepts an optional `tag` parameter. After
push and verification, it creates a Git tag on the HuggingFace Hub
repo pinned to the exact commit SHA of the upload. This couples
internal EC release versions (e.g. v1.1.0) with HF dataset versions,
so users can load a specific release with
`load_dataset("everycure/disease-list", revision="v1.1.0")`.

* Move the HF pipeline into the kedro cloud space (from base)

* Rephrase hugging face disease list docs

* Add approximate number of diseases in disease list release
eKathleenCarter added a commit that referenced this pull request Mar 11, 2026
* Modify outputs from connectivity metrics pipeline

* clean up

* Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) (#2073)

* Update catalog.yml (#2088)

* Update litellm targetRevision and image tag to v1.81.12-stable (#2087)

* Implement Ontology inclusion metric

* Update pipelines/matrix/src/matrix/pipelines/integration/connectivity_metrics.py

Co-authored-by: Jacques Vergine <jacques.vergine35@gmail.com>

---------

Co-authored-by: Nico Matentzoglu <nicolas.matentzoglu@gmail.com>
Co-authored-by: Nelson Alfonso <45660392+Dashing-Nelson@users.noreply.github.com>
Co-authored-by: Jacques Vergine <jacques.vergine35@gmail.com>