Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) by matentzn · Pull Request #2073 · everycure-org/matrix

matentzn · 2026-02-10T13:17:48Z

Following @JacquesVergine's feedback, the drug list, disease list, and KG publication pipelines are now fully decoupled. Each can be run independently without cross-project dependencies.

What changed

New Kedro pipelines in core_entities:

publish_drug_list_hf — publishes the EC drug list to HuggingFace (passthrough)
publish_disease_list_hf — publishes the disease list to HuggingFace (merges the full Mondo disease list with EC curated data, drops internal columns)

HF disease list is now richer: Instead of just the curated EC diseases, the HF disease list left-joins ALL Mondo diseases with the EC enrichment data (specialties, prevalence, categories, etc.). EC values take precedence where both lists have data for the same disease.

Separate GHA workflow (core_entitities_publish_release_hf.yaml): HF publication is a manual-dispatch workflow, independent of the existing GCS/BQ release workflow. This means HF releases don't happen automatically on every internal release — you trigger them explicitly when ready.

Cleaned up matrix pipeline: Removed drug/disease list nodes from data_publication in the matrix pipeline. It now only handles KG publication.
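To make the disease-list merge described above concrete (all Mondo diseases kept, EC enrichment attached where available, EC values winning on overlapping columns), here is a minimal sketch in plain Python. It is not the pipeline's actual node: the function name, the `id` key, the column names, and the sample rows are all hypothetical, and lists of dictionaries stand in for whatever DataFrame type the real pipeline uses.

```python
from typing import Any

Row = dict[str, Any]


def merge_mondo_with_ec(mondo_rows: list[Row], ec_rows: list[Row], key: str = "id") -> list[Row]:
    """Left-join Mondo rows with EC rows; EC values win on overlapping columns."""
    ec_by_id = {row[key]: row for row in ec_rows}
    # Every non-key column seen in the EC list, in first-seen order, so that
    # Mondo rows without an EC match still carry those columns (as None).
    ec_columns = [c for c in dict.fromkeys(col for row in ec_rows for col in row) if c != key]

    merged = []
    for mondo in mondo_rows:
        ec = ec_by_id.get(mondo[key], {})
        out = dict(mondo)  # start from the Mondo row: this is the "left" side
        for col in ec_columns:
            ec_value = ec.get(col)
            if ec_value is not None:
                out[col] = ec_value  # EC takes precedence when it has a value
            else:
                out.setdefault(col, None)  # keep the Mondo value, or fill with None
        merged.append(out)
    return merged


# Hypothetical sample data: identifiers and columns are placeholders.
mondo = [
    {"id": "MONDO:X1", "name": "disease A (Mondo label)"},
    {"id": "MONDO:X2", "name": "disease B (Mondo label)"},
]
ec = [{"id": "MONDO:X1", "name": "disease A (EC label)", "specialty": "example"}]

result = merge_mondo_with_ec(mondo, ec)
```

In this sketch the row count always equals the Mondo row count, which is the property that distinguishes the left join from publishing only the curated EC diseases.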

Open question

The separate HF workflow means you manually trigger it after a GCS/BQ release. An alternative would be to couple HF publication to non-patch releases (v1.1, v1.2, etc.) automatically. Happy to wire that up if preferred — @JacquesVergine what do you think?
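If the automatic option were chosen, the release workflow would need a rule for what counts as a non-patch release. The sketch below is one possible rule, not code from this PR: it assumes versions look like `vMAJOR.MINOR` or `vMAJOR.MINOR.PATCH`, and that a release is a patch release exactly when the third component is present and non-zero. The function name is hypothetical.

```python
import re

# Assumed version format: "v1.2" or "v1.2.3" (an assumption, not taken from the repo).
_VERSION_RE = re.compile(r"^v(\d+)\.(\d+)(?:\.(\d+))?$")


def is_non_patch_release(version: str) -> bool:
    """Return True for minor/major releases, False for patch releases."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"Unrecognised release version: {version!r}")
    patch = match.group(3)
    # No patch component, or a patch component of 0, is treated as non-patch.
    return patch is None or int(patch) == 0
```

Under these assumptions `v1.2` and `v2.0.0` would trigger HF publication, while `v1.2.9` would be skipped.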

How to run

export HF_TOKEN="hf_***"

# From pipelines/core_entities/
# Publish drug list to HF
PIPELINE_NAME=drug_list RELEASE_VERSION=v1.2.9 \
  uv run kedro run -e cloud --pipeline=publish_drug_list_hf

# Publish disease list to HF
PIPELINE_NAME=disease_list RELEASE_VERSION=v1.1.3 \
  uv run kedro run -e cloud --pipeline=publish_disease_list_hf

I have tested the pipeline locally with the commands above.

Add drug/disease list publication to the data_publication pipeline, reading versioned parquet files from GCS and writing to everycure/drug-list and everycure/disease-list on HF Hub. The disease list drops six internal columns before publishing. HF publication is triggered automatically from the core_entities release CI on minor/major releases (patches are skipped).

Update public data releases page to refer to 'datasets' and broaden examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and Docs column, adjust kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.


This was requested by @JacquesVergine: instead of having a single data publication pipeline, there are now individual pipelines for the KG, the drug list, and the disease list.

The HuggingFace disease list publication previously only included curated diseases. It now left-joins the full Mondo disease list with the EC release list, so all Mondo diseases are present with EC enrichment data (specialties, prevalence, categories, etc.) where available. For overlapping columns, EC values take precedence.

Added clarification on the exclusion of certain columns from the public HF release due to their experimental nature.

Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.

JacquesVergine · 2026-02-18T11:54:35Z

Regarding your open question, I think I would go with the manually triggered one. We can automate it easily in the GitHub Action itself.

JacquesVergine

Looks good to me, but wait for May to have a look at the phrasing before merging

…#2080)

* Add Git tag support to HFIterableDataset for version-coupled releases

HFIterableDataset now accepts an optional `tag` parameter. After push and verification, it creates a Git tag on the HuggingFace Hub repo pinned to the exact commit SHA of the upload. This couples internal EC release versions (e.g. v1.1.0) with HF dataset versions, so users can load a specific release with `load_dataset("everycure/disease-list", revision="v1.1.0")`.

* Move the HF pipeline into the kedro cloud space (from base)
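The tagging step in the commit message above (a Git tag on the Hub repo, pinned to the commit SHA of the upload) can be sketched as follows. This is not the `HFIterableDataset` implementation: `tag_release`, `TagApi`, and `RecordingApi` are hypothetical names, and the SHA is a placeholder. The one real API relied on is `huggingface_hub.HfApi.create_tag`, which accepts `repo_id`, `tag`, `revision`, and `repo_type`; the client is passed in so the sketch runs without network access.

```python
from typing import Protocol


class TagApi(Protocol):
    """The slice of huggingface_hub.HfApi that this sketch relies on."""

    def create_tag(self, repo_id: str, *, tag: str, revision: str, repo_type: str) -> None: ...


def tag_release(api: TagApi, repo_id: str, version: str, commit_sha: str) -> None:
    """Create a Git tag named `version` on a dataset repo, pinned to `commit_sha`."""
    api.create_tag(repo_id, tag=version, revision=commit_sha, repo_type="dataset")


class RecordingApi:
    """Stand-in client that records calls instead of contacting the Hub."""

    def __init__(self) -> None:
        self.calls: list[dict[str, str]] = []

    def create_tag(self, repo_id: str, *, tag: str, revision: str, repo_type: str) -> None:
        self.calls.append(
            {"repo_id": repo_id, "tag": tag, "revision": revision, "repo_type": repo_type}
        )


api = RecordingApi()
# "0123abc" is a placeholder, not a real commit SHA.
tag_release(api, "everycure/disease-list", "v1.1.0", "0123abc")
```

With `huggingface_hub` installed and a valid token, passing `HfApi()` instead of `RecordingApi()` would create the tag on the Hub. Pinning `revision` to the upload's commit SHA rather than to the branch head means the tag still points at the intended data if another commit lands in between.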



matentzn added the enhancement label Feb 10, 2026

matentzn added 2 commits February 10, 2026 15:40

Merge branch 'main' into issueXDATA-282

62d471c

matentzn commented Feb 10, 2026


Comment thread pipelines/matrix/conf/base/data_publication/catalog.yml Outdated

matentzn commented Feb 10, 2026


Comment thread pipelines/matrix/conf/base/data_publication/catalog.yml Outdated

matentzn added 2 commits February 17, 2026 12:21

Refactor hf pipeline into separate publication pipelines

5b4b3e3

This was requested by @JacquesVergine: instead of having a single data publication pipeline, there are now individual pipelines for the KG, the drug list, and the disease list.

Add publish to huggingface core entities release pipeline

db028c3

matentzn mentioned this pull request Feb 17, 2026

Publish merged version of disease list #2079

Merged

matentzn added 3 commits February 17, 2026 14:01

Merge branch 'main' into issueXDATA-282

8a01fdc

Remove now unnecessary drop_disease_hf_columns method

d8f200b

matentzn marked this pull request as ready for review February 17, 2026 14:19

matentzn requested a review from a team as a code owner February 17, 2026 14:19

matentzn requested a review from leelancashire February 17, 2026 14:19

matentzn temporarily deployed to dev February 17, 2026 14:19 — with GitHub Actions Inactive

matentzn requested a review from JacquesVergine February 17, 2026 14:19

matentzn changed the title ~~WIP: Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282)~~ Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) Feb 17, 2026

matentzn temporarily deployed to dev February 17, 2026 14:19 — with GitHub Actions Inactive

Clarify exclusion of columns in HF dataset release

a19adcb

Added clarification on the exclusion of certain columns from the public HF release due to their experimental nature.

matentzn temporarily deployed to dev February 17, 2026 14:21 — with GitHub Actions Inactive

matentzn temporarily deployed to dev February 17, 2026 14:22 — with GitHub Actions Inactive

matentzn temporarily deployed to dev February 17, 2026 17:50 — with GitHub Actions Inactive

matentzn temporarily deployed to dev February 17, 2026 17:51 — with GitHub Actions Inactive

matentzn requested a review from may-lim February 17, 2026 18:36

matentzn commented Feb 17, 2026


Comment thread .github/workflows/core_entitities_publish_release_hf.yaml Outdated

JacquesVergine removed the request for review from leelancashire February 18, 2026 11:52

JacquesVergine approved these changes Feb 18, 2026


Comment thread .github/workflows/core_entitities_publish_release_hf.yaml Outdated

Comment thread docs/src/pipeline/data/drug_disease_lists.md Outdated

Comment thread docs/src/releases/public_data_releases.md Outdated

matentzn temporarily deployed to dev February 23, 2026 15:54 — with GitHub Actions Inactive

Rephrase hugging face disease list docs

487d7dc

matentzn temporarily deployed to dev February 23, 2026 16:34 — with GitHub Actions Inactive

Add approximate number of diseases in disease list release

31cc314

matentzn had a problem deploying to dev February 23, 2026 16:36 — with GitHub Actions Failure

matentzn had a problem deploying to dev February 23, 2026 16:36 — with GitHub Actions Error

Merge branch 'main' into issueXDATA-282

daff241

matentzn temporarily deployed to dev February 23, 2026 16:37 — with GitHub Actions Inactive

matentzn temporarily deployed to dev February 23, 2026 16:38 — with GitHub Actions Inactive

matentzn merged commit 204cfd0 into main Feb 23, 2026
22 checks passed

matentzn deleted the issueXDATA-282 branch February 23, 2026 17:02

matentzn merged 14 commits into main from issueXDATA-282


3 participants
