Odyssey is a toolkit for building clinical foundation models from Electronic Health Records (EHR). It uses EHR-Mamba3 — a Mamba-3 SSM backbone enriched with clinical embeddings (token types, timestamps, ages, visit structure) — and a MEDS 0.4+ pipeline that converts raw MIMIC-IV data to the model-ready format.
EHR-Mamba3 wraps mamba_ssm.MambaLMHeadModel with ssm_cfg={"layer": "Mamba3"} and injects six EHR-specific embedding streams via a caching bridge (CachedEHREmbeddings) that replaces the backbone's standard token embedding layer.
```
MIMIC-IV CSVs
  ↓ scripts/meds/extract_mimic_iv.py
MEDS parquet (subject_id · time · code · numeric_value)
  ↓ meds-transforms 0.6.1 (filter + normalize)
  ↓ scripts/meds/meds_to_odyssey.py
Odyssey parquet (event_tokens · type_tokens · time_tokens · age_tokens · …)
  ↓
pretrain.py → Mamba3Pretrain (next-token prediction)
finetune.py → Mamba3Finetune (single- or multi-task classification)
```
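To make the tokenization stage of this pipeline concrete, here is a hypothetical sketch of turning MEDS rows into aligned odyssey token streams. The hour/year bucketing below is an assumption for illustration; the real rules live in scripts/meds/meds_to_odyssey.py:

```python
# Hypothetical sketch: per subject, sort MEDS events chronologically and emit
# aligned token streams. Bucketing (hours for time, integer years for age) is
# an illustration, not the real meds_to_odyssey.py logic.
from collections import defaultdict

HOURS_PER_YEAR = 24 * 365

def tokenize(meds_rows, birth_hour):
    """meds_rows: (subject_id, time_in_hours, code); birth_hour: {subject_id: t}."""
    by_subject = defaultdict(list)
    for subject_id, t, code in meds_rows:
        by_subject[subject_id].append((t, code))
    sequences = {}
    for subject_id, events in by_subject.items():
        events.sort()  # chronological order within the subject
        sequences[subject_id] = {
            "event_tokens": [code for _, code in events],
            "time_tokens": [t for t, _ in events],
            "age_tokens": [(t - birth_hour[subject_id]) // HOURS_PER_YEAR
                           for t, _ in events],
        }
    return sequences

rows = [(1, 500_000, "LAB//GLUCOSE"), (1, 400_000, "ICD10//I10")]
print(tokenize(rows, {1: 0}))
```

Each stream stays position-aligned with event_tokens, which is what lets the model sum per-stream embeddings at every sequence position.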
Python ≥ 3.12 and uv are required.
```shell
git clone https://github.com/VectorInstitute/odyssey.git
cd odyssey
uv sync --dev
```

On a CUDA-capable GPU host, also install mamba-ssm (requires nvcc):
```shell
# Install the torch wheel matching your CUDA version first, e.g. CUDA 12.8:
uv pip install torch --index-url https://download.pytorch.org/whl/cu128
# Then build mamba-ssm from source:
uv sync --extra cuda --no-build-isolation
```

PhysioNet credentials are required. Download MIMIC-IV using wget or the physionet client:
```shell
wget -r -N -c -np --user <physionet_user> --ask-password \
  https://physionet.org/files/mimiciv/3.1/ \
  -P data/
```

Then run the end-to-end pipeline:

```shell
bash scripts/meds/run_pipeline.sh \
  --mimic_dir data/physionet.org/files/mimiciv/3.1 \
  --output_dir data/pipeline_output \
  --max_seq_len 2048
```

This runs three steps:
| Step | Script | Input → Output |
|---|---|---|
| Extract | scripts/meds/extract_mimic_iv.py | MIMIC-IV CSVs → MEDS parquet shards |
| Transform | scripts/meds/pipeline.yaml (meds-transforms 0.6.1) | Filter subjects/codes, normalize numeric values |
| Tokenize | scripts/meds/meds_to_odyssey.py | MEDS → odyssey token sequence parquets |
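The "normalize numeric values" part of the Transform step can be illustrated with a per-code z-score. This is a sketch of the general idea only; the actual statistics and configuration are defined by meds-transforms in pipeline.yaml:

```python
# Illustrative per-code z-score normalization of numeric_value. Not the
# actual meds-transforms implementation; codes/values are invented.
from collections import defaultdict
from statistics import mean, pstdev

def normalize_by_code(rows):
    """rows: (code, numeric_value) pairs -> z-scored values per code."""
    groups = defaultdict(list)
    for code, value in rows:
        groups[code].append(value)
    # Fall back to std=1.0 for codes with zero variance (e.g. a single value).
    stats = {c: (mean(vs), pstdev(vs) or 1.0) for c, vs in groups.items()}
    return [(c, (v - stats[c][0]) / stats[c][1]) for c, v in rows]

rows = [("HR", 60.0), ("HR", 100.0), ("LAB//GLUCOSE", 5.0)]
print(normalize_by_code(rows))
# [('HR', -1.0), ('HR', 1.0), ('LAB//GLUCOSE', 0.0)]
```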
The output at data/pipeline_output/odyssey_tokenized/ contains train/, tuning/, and held_out/ splits ready for training.
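Pretraining optimizes next-token prediction over these sequences: schematically, inputs and targets are the same sequence shifted by one position. The snippet below illustrates that construction only; the real batching and padding live in PretrainDatasetDecoder:

```python
# Next-token prediction, schematically: the model sees tokens[0..t] and is
# trained to predict tokens[t+1]. "[VS]"/"[VE]" are placeholder visit
# markers invented for this example.
def shifted_pairs(tokens):
    inputs, targets = tokens[:-1], tokens[1:]
    return inputs, targets

seq = ["[VS]", "ICD10//I10", "LAB//GLUCOSE", "[VE]"]
inputs, targets = shifted_pairs(seq)
print(inputs)   # ['[VS]', 'ICD10//I10', 'LAB//GLUCOSE']
print(targets)  # ['ICD10//I10', 'LAB//GLUCOSE', '[VE]']
```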
```shell
python pretrain.py \
  --data_dir data/pipeline_output/odyssey_tokenized \
  --sequence_file train.parquet \
  --id_file subject_ids.json \
  --vocab_dir data/vocab \
  --config_path odyssey/models/configs/ehr_mamba3.yaml \
  --batch_size 32 \
  --max_len 2048
```

To finetune from a pretrained checkpoint:

```shell
python finetune.py \
  --pretrain_model_path checkpoints/pretrain.ckpt \
  --data_dir data/pipeline_output/odyssey_tokenized \
  --vocab_dir data/vocab \
  --task mortality \
  --num_labels 2
```

Default Mamba-3 hyperparameters (odyssey/models/configs/ehr_mamba3.yaml):
```yaml
model:
  embedding_size: 768
  num_hidden_layers: 32
  state_size: 128   # d_state per SSM block
  headdim: 64       # head dimension
  is_mimo: true     # Multi-Input Multi-Output mode
  mimo_rank: 4
  chunk_size: 256
```

Repository layout:

```
odyssey/
├── data/
│   ├── dataset.py        # PretrainDatasetDecoder, FinetuneDatasetDecoder, …
│   └── tokenizer.py      # ConceptTokenizer
├── evals/
│   ├── evaluation.py     # calculate_metrics (AUROC, F1, …)
│   └── prediction.py     # Forecast (autoregressive token generation)
├── models/
│   ├── embeddings.py     # CachedEHREmbeddings, TimeEmbeddingLayer, VisitEmbedding
│   ├── ehr_mamba3/
│   │   └── model.py      # Mamba3Pretrain, Mamba3Finetune
│   └── configs/
│       └── ehr_mamba3.yaml
└── utils/
scripts/
└── meds/
    ├── extract_mimic_iv.py   # MIMIC-IV → MEDS
    ├── meds_to_odyssey.py    # MEDS → odyssey format
    ├── pipeline.yaml         # meds-transforms pipeline
    └── run_pipeline.sh       # end-to-end runner
pretrain.py
finetune.py
```
Issues and pull requests are welcome. Please open an issue before starting large changes.
If you use Odyssey or EHR-Mamba3 in your research, please cite:
```bibtex
@misc{fallahpour2024ehrmamba,
  title         = {EHRMamba: Towards Generalizable and Scalable Foundation Models for Electronic Health Records},
  author        = {Adibvafa Fallahpour and Mahshid Alinoori and Arash Afkanpour and Amrit Krishnan},
  year          = {2024},
  eprint        = {2405.14567},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}
```