Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) #2073
Merged
Add drug/disease list publication to the data_publication pipeline, reading versioned parquet files from GCS and writing to everycure/drug-list and everycure/disease-list on HF Hub. The disease list drops six internal columns before publishing. HF publication is triggered automatically from the core_entities release CI on minor/major releases (patches are skipped).
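The "minor/major only, patches skipped" rule could be expressed as a small version check. This is only a sketch: the function name `should_publish_to_hf` is hypothetical, and it assumes release versions look like `v1.1.0` and that a release counts as a patch exactly when its last component is non-zero. The actual CI condition is not shown in this PR conversation.

```python
import re

# Assumed version format: optional "v" prefix, then MAJOR.MINOR.PATCH.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def should_publish_to_hf(version: str) -> bool:
    """Return True for minor/major releases and False for patch releases."""
    match = _SEMVER.match(version.strip())
    if match is None:
        raise ValueError(f"Not a semantic version: {version!r}")
    _major, _minor, patch = (int(part) for part in match.groups())
    # Assumption: a non-zero patch component marks a patch release.
    return patch == 0
```

Under these assumptions, `should_publish_to_hf("v1.1.0")` is true and `should_publish_to_hf("v1.1.3")` is false.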
Update public data releases page to refer to 'datasets' and broaden examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and Docs column, adjust kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.
matentzn
commented
Feb 10, 2026
This was requested by @JacquesVergine: instead of having a single data publication pipeline, there are now individual pipelines for the KG, the drug list, and the disease list!
The HuggingFace disease list publication previously only included curated diseases. It now left-joins the full Mondo disease list with the EC release list, so all Mondo diseases are present with EC enrichment data (specialties, prevalence, categories, etc.) where available. For overlapping columns, EC values take precedence.
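The merge semantics described above can be illustrated with a dependency-free sketch. It is not the pipeline's code (which presumably works on dataframes); the function name `merge_disease_lists`, the `id` key, and the sample rows are hypothetical. It also assumes that "EC values take precedence" means an EC value wins only when it is not missing, with the Mondo value kept otherwise.

```python
def merge_disease_lists(mondo_rows, ec_rows, key="id"):
    """Left-join Mondo rows with EC rows; non-missing EC values win."""
    ec_by_key = {row[key]: row for row in ec_rows}
    # Every column that appears in the EC list, so Mondo-only diseases
    # still get those columns (filled with None).
    ec_columns = {column for row in ec_rows for column in row}

    merged = []
    for mondo in mondo_rows:
        ec = ec_by_key.get(mondo[key], {})
        row = dict(mondo)  # start from the Mondo record (left side)
        for column in ec_columns:
            ec_value = ec.get(column)
            if ec_value is not None:
                row[column] = ec_value  # EC overrides Mondo
            else:
                row.setdefault(column, None)  # keep Mondo value, or None
        merged.append(row)
    return merged


# Hypothetical sample data: one curated disease, one Mondo-only disease.
mondo = [
    {"id": "MONDO:1", "name": "disease a"},
    {"id": "MONDO:2", "name": "disease b"},
]
ec = [{"id": "MONDO:1", "name": "Disease A (EC)", "prevalence": "rare"}]
result = merge_disease_lists(mondo, ec)
```

Here every Mondo disease appears exactly once in `result`; only the curated one carries EC enrichment, and the other has `None` in the EC-only column.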
Added clarification on the exclusion of certain columns from the public HF release due to their experimental nature.
Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.
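A composition-level check like the one kept here could look roughly as follows. This is a sketch, not the repository's test: `missing_node_names` is a hypothetical helper, and the stand-in object only mimics the `nodes` and `name` attributes that a Kedro `Pipeline` and its nodes expose, so the example runs without Kedro installed.

```python
from types import SimpleNamespace


def missing_node_names(pipeline, expected_names):
    """Return the expected node names that are absent from the pipeline."""
    actual = {node.name for node in pipeline.nodes}
    return sorted(set(expected_names) - actual)


# Stand-in for a real pipeline object (assumption: only .nodes/.name are used).
fake_pipeline = SimpleNamespace(
    nodes=[
        SimpleNamespace(name="publish_kg_edges_node"),
        SimpleNamespace(name="publish_kg_nodes_node"),
    ]
)

expected = ["publish_kg_edges_node", "publish_kg_nodes_node"]
assert missing_node_names(fake_pipeline, expected) == []
```

Checking names rather than node behavior keeps the test stable when the internals of a node are refactored.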
matentzn
commented
Feb 17, 2026
Collaborator
Regarding your open question, I think I would go with the manually triggered one. We can automate it easily in the GitHub Action itself.
JacquesVergine
approved these changes
Feb 18, 2026
Collaborator
JacquesVergine
left a comment
Looks good to me, but wait for May to have a look at the phrasing before merging
…#2080)
* Add Git tag support to HFIterableDataset for version-coupled releases
  HFIterableDataset now accepts an optional `tag` parameter. After push and verification, it creates a Git tag on the HuggingFace Hub repo pinned to the exact commit SHA of the upload. This couples internal EC release versions (e.g. v1.1.0) with HF dataset versions, so users can load a specific release with `load_dataset("everycure/disease-list", revision="v1.1.0")`.
* Move the HF pipeline into the kedro cloud space (from base)
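The tagging step could be sketched as below. This is not the `HFIterableDataset` implementation, which is not shown in this conversation; `tag_upload` and `_RecordingApi` are hypothetical names. The sketch assumes the object passed as `api` behaves like `huggingface_hub.HfApi`, whose `create_tag` method accepts `repo_id`, `tag`, `revision`, and `repo_type`. A recording stand-in is used so the example runs without network access or credentials.

```python
def tag_upload(api, repo_id: str, commit_sha: str, tag: str) -> None:
    """Create a Git tag on a Hub dataset repo, pinned to one upload commit."""
    api.create_tag(
        repo_id=repo_id,
        tag=tag,
        revision=commit_sha,  # pin the tag to the exact commit of the upload
        repo_type="dataset",
    )


class _RecordingApi:
    """Dry-run stand-in that records calls instead of contacting the Hub."""

    def __init__(self):
        self.calls = []

    def create_tag(self, **kwargs):
        self.calls.append(kwargs)


dry_run = _RecordingApi()
# "abc123" is a placeholder, not a real commit SHA.
tag_upload(dry_run, "everycure/disease-list", commit_sha="abc123", tag="v1.1.0")

# Against the real Hub (requires a token with write access):
#   from huggingface_hub import HfApi
#   tag_upload(HfApi(), "everycure/disease-list", commit_sha=..., tag="v1.1.0")
# Consumers can then pin the release:
#   from datasets import load_dataset
#   ds = load_dataset("everycure/disease-list", revision="v1.1.0")
```

Pinning the tag to a commit SHA rather than to the branch head means the tag keeps pointing at the same data even if later uploads land on the default branch.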
eKathleenCarter
pushed a commit
that referenced
this pull request
Feb 24, 2026
)
* Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282)
  Add drug/disease list publication to the data_publication pipeline, reading versioned parquet files from GCS and writing to everycure/drug-list and everycure/disease-list on HF Hub. The disease list drops six internal columns before publishing. HF publication is triggered automatically from the core_entities release CI on minor/major releases (patches are skipped).
* Add drug/disease datasets and clarify releases
  Update public data releases page to refer to 'datasets' and broaden examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and Docs column, adjust kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.
* Refactor hf pipeline into separate publication pipelines
  This was requested by @JacquesVergine: instead of having a single data publication pipeline, you have individual pipelines for KG, drug and disease list!
* Add publish to huggingface core entities release pipeline
* HF disease list now merges full Mondo list with EC curated data (#2079)
  The HuggingFace disease list publication previously only included curated diseases. It now left-joins the full Mondo disease list with the EC release list, so all Mondo diseases are present with EC enrichment data (specialties, prevalence, categories, etc.) where available. For overlapping columns, EC values take precedence.
* Remove now unnecessary drop_disease_hf_columns method
* Clarify exclusion of columns in HF dataset release
  Added clarification on the exclusion of certain columns from the public HF release due to their experimental nature.
* Simplify data_publication pipeline tests
  Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.
* Add Git tag support to HFIterableDataset for version-coupled releases (#2080)
  * Add Git tag support to HFIterableDataset for version-coupled releases
    HFIterableDataset now accepts an optional `tag` parameter. After push and verification, it creates a Git tag on the HuggingFace Hub repo pinned to the exact commit SHA of the upload. This couples internal EC release versions (e.g. v1.1.0) with HF dataset versions, so users can load a specific release with `load_dataset("everycure/disease-list", revision="v1.1.0")`.
  * Move the HF pipeline into the kedro cloud space (from base)
* Rephrase hugging face disease list docs
* Add approximate number of diseases in disease list release
eKathleenCarter
added a commit
that referenced
this pull request
Mar 11, 2026
* Modify outputs from connectivity metrics pipeline
* clean up
* Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) (#2073)
  * Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282)
    Add drug/disease list publication to the data_publication pipeline, reading versioned parquet files from GCS and writing to everycure/drug-list and everycure/disease-list on HF Hub. The disease list drops six internal columns before publishing. HF publication is triggered automatically from the core_entities release CI on minor/major releases (patches are skipped).
  * Add drug/disease datasets and clarify releases
    Update public data releases page to refer to 'datasets' and broaden examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and Docs column, adjust kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.
  * Refactor hf pipeline into separate publication pipelines
    This was requested by @JacquesVergine: instead of having a single data publication pipeline, you have individual pipelines for KG, drug and disease list!
  * Add publish to huggingface core entities release pipeline
  * HF disease list now merges full Mondo list with EC curated data (#2079)
    The HuggingFace disease list publication previously only included curated diseases. It now left-joins the full Mondo disease list with the EC release list, so all Mondo diseases are present with EC enrichment data (specialties, prevalence, categories, etc.) where available. For overlapping columns, EC values take precedence.
  * Remove now unnecessary drop_disease_hf_columns method
  * Clarify exclusion of columns in HF dataset release
    Added clarification on the exclusion of certain columns from the public HF release due to their experimental nature.
  * Simplify data_publication pipeline tests
    Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.
  * Add Git tag support to HFIterableDataset for version-coupled releases (#2080)
    * Add Git tag support to HFIterableDataset for version-coupled releases
      HFIterableDataset now accepts an optional `tag` parameter. After push and verification, it creates a Git tag on the HuggingFace Hub repo pinned to the exact commit SHA of the upload. This couples internal EC release versions (e.g. v1.1.0) with HF dataset versions, so users can load a specific release with `load_dataset("everycure/disease-list", revision="v1.1.0")`.
    * Move the HF pipeline into the kedro cloud space (from base)
  * Rephrase hugging face disease list docs
  * Add approximate number of diseases in disease list release
* Update catalog.yml (#2088)
* Update litellm targetRevision and image tag to v1.81.12-stable (#2087)
* Implement Ontology inclusion metric
* Update pipelines/matrix/src/matrix/pipelines/integration/connectivity_metrics.py
  Co-authored-by: Jacques Vergine <jacques.vergine35@gmail.com>
---------
Co-authored-by: Nico Matentzoglu <nicolas.matentzoglu@gmail.com>
Co-authored-by: Nelson Alfonso <45660392+Dashing-Nelson@users.noreply.github.com>
Co-authored-by: Jacques Vergine <jacques.vergine35@gmail.com>
Following @JacquesVergine's feedback, the drug list, disease list, and KG publication pipelines are now fully decoupled. Each can be run independently without cross-project dependencies.
What changed
* New Kedro pipelines in core_entities:
  * publish_drug_list_hf: publishes the EC drug list to HuggingFace (passthrough)
  * publish_disease_list_hf: publishes the disease list to HuggingFace (merges the full Mondo disease list with EC curated data, drops internal columns)
* HF disease list is now richer: instead of just the curated EC diseases, the HF disease list left-joins ALL Mondo diseases with the EC enrichment data (specialties, prevalence, categories, etc.). EC values take precedence where both lists have data for the same disease.
* Separate GHA workflow (core_entitities_publish_release_hf.yaml): HF publication is a manual-dispatch workflow, independent of the existing GCS/BQ release workflow. This means HF releases don't happen automatically on every internal release; you trigger them explicitly when ready.
* Cleaned up matrix pipeline: removed drug/disease list nodes from data_publication in the matrix pipeline. It now only handles KG publication.
Open question
The separate HF workflow means you manually trigger it after a GCS/BQ release. An alternative would be to couple HF publication to non-patch releases (v1.1, v1.2, etc.) automatically. Happy to wire that up if preferred. @JacquesVergine what do you think?
How to run