feat: add PTBXLDataset and PTBXLMultilabelClassification task #2

Open
anuragd-UIUC wants to merge 10 commits into jtwells2:master from anuragd-UIUC:feature_anuragd2/ptbxl-dataset-and-task

Conversation

@anuragd-UIUC
Collaborator

  • pyhealth/datasets/ptbxl.py: BaseSignalDataset subclass for PTB-XL v1.0.3
  • pyhealth/tasks/ptbxl_multilabel_classification.py: 5-class superdiagnostic task
  • examples/ptbxl_superdiagnostic_sparcnet.ipynb: ablation study (SparcNet vs BiLSTM)
  • tests/core/test_ptbxl.py: unit tests
  • cs598_project/: CS-598 course project pipeline notebook

Results: SparcNet ROC-AUC 0.9278, BiLSTMECG ROC-AUC 0.9155 on PTB-XL test set
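
For illustration, the target of a 5-class superdiagnostic multilabel task can be pictured as a multi-hot vector, where a recording may carry several classes at once. The sketch below is not taken from this PR's code: `encode_multihot` is a hypothetical helper, and the class order is assumed to be the one stated in the later commit notes ([NORM, MI, STTC, CD, HYP]).

```python
# Class order as stated in this PR's later commit notes (assumed here).
SUPERDIAG_CLASSES = ["NORM", "MI", "STTC", "CD", "HYP"]


def encode_multihot(labels, classes=SUPERDIAG_CLASSES):
    """Hypothetical helper: return a 0/1 vector with one slot per class."""
    present = set(labels)
    unknown = present - set(classes)
    if unknown:
        # Fail loudly rather than silently dropping an unexpected label.
        raise ValueError(f"unknown superclass labels: {sorted(unknown)}")
    return [1 if name in present else 0 for name in classes]


# A recording annotated with both MI and CD sets two positions to 1.
example = encode_multihot(["MI", "CD"])
print(example)  # [0, 1, 0, 1, 0]
```

Because the task is multilabel rather than multiclass, more than one position can be 1, and a recording with no superclass maps to all zeros.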

Collaborator

sl4mmy left a comment

Left a few questions inline regarding the config file & dataset implementation, but otherwise this looks good to me. 👍

Comment thread pyhealth/datasets/configs/ptbxl.yaml Outdated
@@ -0,0 +1,13 @@
version: "1.0.0"
Collaborator

This looks fine, but we should confirm with @jtwells2 (I didn't see a config file in his recent commits).

Owner

Going to manually deconflict this. I'd rather not use the YAML file at all.

Comment thread pyhealth/datasets/ptbxl.py Outdated
@@ -1,208 +1,277 @@
import pandas as pd
from pathlib import Path
"""PTB-XL ECG Dataset for PyHealth.
Collaborator

How does this relate to what @jtwells2 added in commit e813ff8? Do we need to merge the two dataset implementations, or do you just need to rebase on his latest changes?

Collaborator Author

I think I created a branch and didn’t run git pull before pushing my changes. I will update the branch with the latest changes.

anuragd-UIUC force-pushed the feature_anuragd2/ptbxl-dataset-and-task branch from 99c7a58 to d5e5ea6 on April 13, 2026 05:39
Collaborator Author

anuragd-UIUC left a comment

I was using a stubbed file earlier; I have now rebased the ptbxl.py file.

Owner

I had to make this change to test PTBXLDataset. Pull the latest version; it should be there.

Owner

Examples need to go into the examples folder, anything else should be removed from this PR.

Comment thread cs598_project/PTBXLImplementation.md Outdated
Owner

Examples need to go into the examples folder, anything else should be removed from this PR.

Comment thread cs598_project/se_resnet_ecg.py Outdated
Owner

Examples need to go into the examples folder, anything else should be removed from this PR.

Owner

This change has already been made; please remove it.

Owner

Please remove this from the PR; we don't need it after the current PTBXL updates.

Comment thread pyhealth/models/__init__.py
Comment thread pyhealth/models/bilstm_ecg.py Outdated
Comment thread tests/core/test_ptbxl.py
Owner

As discussed, I will cover the dataset testing. This file needs to become a test for the multilabel classification task instead.

…n design

- SNOMED_TO_SUPERDIAG: use jtwells2's 46-code clinically correct mapping
  (from pyhealth/tasks/ptbxl_multilabel_classification.py)
- Signal: full 10s at 100 Hz (decimate 500->100 Hz per jtwells2's signal[:, ::5])
  → shape (12, 1000) instead of old (12, 1250) windowed slices
- Schema: 'labels' key (plural) matching jtwells2's PTBXLMultilabelClassification output
- Samples: 21,767 (1 per recording) vs old 152,859 (7 windows per recording)
- SUPERDIAG_CLASSES ordering: [NORM, MI, STTC, CD, HYP] per jtwells2
- Cache: cinc_100hz/ instead of old windows/ directory
- Fix: set dataset.refresh_cache=True to overwrite stale BaseSignalDataset cache
- Update all downstream cells: split, DataLoaders, SparcNet, BiLSTMECG
- Both models trained 5 epochs on CPU; pipeline fully validated end-to-end
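
The decimation step in the notes above can be sketched as follows. This is a minimal illustration on synthetic data, not the PR's actual loader: the array names and constants are assumptions, and only the `signal[:, ::5]` stride itself comes from the commit notes.

```python
import numpy as np

N_LEADS = 12           # standard 12-lead ECG
SRC_HZ, DST_HZ = 500, 100
DURATION_S = 10
STEP = SRC_HZ // DST_HZ  # 5: keep every 5th sample

# Synthetic stand-in for one 10 s recording sampled at 500 Hz: (12, 5000).
rng = np.random.default_rng(0)
signal_500 = rng.standard_normal((N_LEADS, SRC_HZ * DURATION_S))

# Stride along the time axis only; the lead axis is left untouched.
signal_100 = signal_500[:, ::STEP]

print(signal_100.shape)  # (12, 1000)
```

Note that plain striding applies no anti-aliasing filter before dropping samples. If aliasing turns out to matter, `scipy.signal.decimate` is one filtered alternative, though whether it is needed here is not something the notes address.
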
anuragd-UIUC force-pushed the feature_anuragd2/ptbxl-dataset-and-task branch from 033fed6 to 6ed6a48 on April 19, 2026 09:22

3 participants