
HeartBioPortal DataHub

HeartBioPortal DataHub is a version-controlled collection of cardiovascular omics datasets and part of HeartBioPortal's data integration pipeline. Each dataset includes standardised metadata and provenance information so that analyses can be reproduced and referenced.

Quick Start

git lfs install    # required before cloning; see Git Large File Storage below
git clone <repo-url>
cd DataHub
pip install -r requirements.txt
make validate
# or run the validation suite with Docker
docker compose up validation

Git Large File Storage

This repository stores its large binary datasets with Git LFS. Install it before cloning, otherwise you will check out lightweight pointer files instead of the data:

git lfs install

Dataset Layout

Datasets are organised under public/ for open data or private/ for embargoed submissions. A typical dataset directory contains:

<dataset>/
  metadata.json      # descriptive metadata
  provenance.json    # processing provenance
  data files...

The JSON schemas that describe these files live under schemas/ and are also rendered in the documentation.
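
Because the schemas are plain JSON Schema documents, a dataset can also be checked programmatically. The snippet below is a minimal sketch using the jsonschema package; the schema file name schemas/metadata.schema.json is an assumption, so adjust the paths to what the schemas/ directory actually contains.

import json
from jsonschema import validate  # pip install jsonschema

# Hypothetical file names; adjust to the actual contents of schemas/.
with open("schemas/metadata.schema.json") as fh:
    schema = json.load(fh)
with open("public/example_fh_vcf/metadata.json") as fh:
    metadata = json.load(fh)

validate(instance=metadata, schema=schema)  # raises ValidationError on failure
print("metadata.json conforms to the schema")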

You can list available datasets using:

tools/list_datasets.py
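
The script is the supported interface, but conceptually a listing just walks the layout above: every directory with a metadata.json is a dataset. A minimal, hypothetical equivalent (the "title" field is an assumption; use whatever fields the metadata schema defines):

from pathlib import Path
import json

# Treat every directory under public/ that carries a metadata.json as a dataset.
for meta_path in sorted(Path("public").glob("*/metadata.json")):
    meta = json.loads(meta_path.read_text())
    print(meta_path.parent.name, "-", meta.get("title", "<untitled>"))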

Validation

Use the helper script to check a dataset before opening a pull request:

tools/hbp-validate public/example_fh_vcf

or run all tests with make validate.

Contributing

We welcome new datasets and improvements. See CONTRIBUTING.md for a walkthrough of the submission process and consult the files in the docs/ directory for more details.

Processing Very Large Datasets

tools/large_dataset_processor.py shows how to analyse VCF files over 500 GB with Dask. After installing the requirements (pip install -r requirements.txt), execute:

python tools/large_dataset_processor.py <your.vcf>
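
For orientation, the core pattern is to stream the file in blocks rather than read it whole. The following is a minimal sketch of that idea with dask.bag, not the actual contents of tools/large_dataset_processor.py:

import dask.bag as db

# Read the VCF in parallel 64 MB blocks and drop the '#'-prefixed header lines.
lines = db.read_text("your.vcf", blocksize="64MB")
records = lines.filter(lambda line: not line.startswith("#"))

# Example aggregation: count variant records per chromosome (VCF column 1).
per_chrom = records.map(lambda line: line.split("\t", 1)[0]).frequencies()
print(dict(per_chrom.compute()))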
