HeartBioPortal DataHub is a version-controlled collection of cardiovascular omics datasets. Each dataset includes standardised metadata and provenance information so that analyses can be reproduced and referenced.
git clone <repo-url>
cd DataHub
pip install -r requirements.txt
make validate
# or using docker
docker compose up validation

This repository uses Git LFS for storing large binary datasets. Install Git LFS before cloning:
git lfs install

Datasets are organised under public/ for open data or private/ for embargoed submissions. A typical dataset directory contains:
<dataset>/
metadata.json # descriptive metadata
provenance.json # processing provenance
data files...
The JSON schemas that describe these files live under schemas/ and are also rendered in the documentation.
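As an illustration of the metadata contract, a dataset's metadata.json can be checked programmatically. The sketch below is stdlib-only and the required field names (`name`, `description`, `version`) are assumptions for illustration; the schemas under schemas/ are the authoritative definitions.

```python
import json
from pathlib import Path

# Hypothetical required top-level fields; the real schema lives under schemas/.
REQUIRED_FIELDS = {"name", "description", "version"}

def check_metadata(dataset_dir: str) -> list[str]:
    """Return a list of problems found in <dataset>/metadata.json (empty if OK)."""
    path = Path(dataset_dir) / "metadata.json"
    if not path.is_file():
        return [f"missing {path}"]
    try:
        meta = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"invalid JSON in {path}: {exc}"]
    if not isinstance(meta, dict):
        return [f"{path}: top level must be a JSON object"]
    missing = REQUIRED_FIELDS - meta.keys()
    return [f"{path}: missing field '{field}'" for field in sorted(missing)]
```

An empty list means the file passed these basic checks; anything else describes what to fix before submitting.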
You can list available datasets using:
tools/list_datasets.py

Use the helper script to check a dataset before opening a pull request:

tools/hbp-validate public/example_fh_vcf

or run all tests with make validate.
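The dataset listing can be approximated in a few lines. This sketch (the directory-walking logic and output format are assumptions, not the actual tools/list_datasets.py implementation) treats any directory containing a metadata.json as a dataset:

```python
from pathlib import Path

def list_datasets(root: str = "public") -> list[str]:
    """Return dataset directories under `root` that contain a metadata.json.

    Illustrative only: the real tools/list_datasets.py may apply
    additional filters or read fields from the metadata itself.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        str(p.parent.relative_to(base))
        for p in base.rglob("metadata.json")
    )
```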
We welcome new datasets and improvements. See CONTRIBUTING.md for a walkthrough of the submission process and consult the files in the docs/ directory for more details.
See tools/large_dataset_processor.py for an example using Dask to analyse VCF files over 500 GB. Run pip install -r requirements.txt and execute:
python tools/large_dataset_processor.py <your.vcf>
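The Dask script processes the file in parallel partitions rather than loading it into memory. The same chunked idea can be sketched with a stdlib-only generator (the batch size and the record-counting aggregation are illustrative, not the actual tools/large_dataset_processor.py logic):

```python
from itertools import islice
from typing import Iterator

def vcf_batches(path: str, batch_size: int = 100_000) -> Iterator[list[str]]:
    """Yield lists of data lines from a VCF, skipping '#' header lines.

    Only one batch is held in memory at a time, so files far larger
    than RAM can be streamed through an aggregation.
    """
    with open(path) as fh:
        records = (line for line in fh if not line.startswith("#"))
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                return
            yield batch

def count_records(path: str) -> int:
    """Example aggregation: total variant records, computed batch by batch."""
    return sum(len(batch) for batch in vcf_batches(path))
```

Dask generalises this pattern by running such per-partition work across multiple cores or machines.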