[FIX] Fix three robustness gaps in OpenMLDataset.__repr__, get_data(), and create_dataset#1713

Open
phantom-712 wants to merge 4 commits into openml:main from phantom-712:main

Conversation

@phantom-712

Metadata

  • Reference Issue: Fixes [FIX] : Three dataset robustness gaps in OpenMLDataset and create_dataset #1711
  • New Tests Added: No
  • Documentation Updated: Yes (docstring for default_target_attribute in create_dataset)
  • Change Log Entry: "Fix KeyError in OpenMLDataset.__repr__ on partial qualities, bare KeyError in get_data() on an invalid target, and misleading str annotation on create_dataset's default_target_attribute"

Details

What does this PR fix?

Three related robustness gaps in openml/datasets/dataset.py and openml/datasets/functions.py, all verified on openml==0.16.0.

Bug 1 - OpenMLDataset.__repr__ crashes with KeyError on partial or NaN qualities

_get_repr_body_fields accessed _qualities with direct dict indexing. For newly uploaded datasets where the server has only partially computed qualities, repr() crashed with a KeyError when the NumberOfFeatures or NumberOfInstances keys were absent or NaN. Fixed by replacing direct indexing with .get() and a pd.isna() guard; missing or NaN quality keys are now silently omitted from the repr output.
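The defensive lookup can be sketched as follows. This is a minimal standalone illustration of the pattern, not the actual openml source; the helper name and field labels are hypothetical:

```python
import pandas as pd


def repr_quality_fields(qualities):
    """Build (label, value) pairs for repr, skipping absent or NaN qualities."""
    fields = []
    for key, label in [("NumberOfFeatures", "# of features"),
                       ("NumberOfInstances", "# of instances")]:
        value = (qualities or {}).get(key)  # .get() instead of qualities[key]
        if value is not None and not pd.isna(value):  # NaN guard
            fields.append((label, int(value)))
    return fields
```

With only partial qualities available, the missing keys simply drop out of the output instead of raising a KeyError.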

Bug 2 - get_data() raises a bare pandas KeyError on invalid target

get_data() called data.drop(columns=[target_name]) without first checking whether target_name existed in data.columns. When the column was absent (typo, or silently removed by include_row_id=False / include_ignore_attribute=False), pandas raised a raw KeyError with no OpenML context. Fixed by adding an explicit pre-drop check that raises a ValueError with a clear message listing available columns, and a separate message when the column was filtered out by an OpenML flag.
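The pre-drop check looks roughly like this. A simplified sketch, not the actual openml implementation; `drop_target` and `filtered_columns` are hypothetical names standing in for the internal logic:

```python
import pandas as pd


def drop_target(data: pd.DataFrame, target_name: str,
                filtered_columns=()) -> pd.DataFrame:
    """Drop the target column, raising an informative ValueError
    instead of pandas' bare KeyError when the column is missing."""
    if target_name not in data.columns:
        if target_name in filtered_columns:
            # Column existed but was removed by an OpenML flag such as
            # include_row_id=False or include_ignore_attribute=False.
            raise ValueError(
                f"Target column '{target_name}' was filtered out by an "
                "OpenML flag (include_row_id / include_ignore_attribute)."
            )
        raise ValueError(
            f"Target column '{target_name}' not found. "
            f"Available columns: {list(data.columns)}"
        )
    return data.drop(columns=[target_name])
```

The error now tells the user which columns are actually available, or why the requested column was removed.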

Bug 3 - create_dataset has a misleading str annotation on default_target_attribute

The REST API allows default_target_attribute to be None for unsupervised datasets. The Python function annotated it as str, causing mypy errors and giving users no indication that None was valid. An existing TODO in the code explicitly acknowledged this mismatch. Fixed by changing the annotation to str | None, updating the docstring to document None as valid for unsupervised use cases, and removing the TODO.
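The signature change amounts to the following. A hypothetical stub illustrating the annotation, not the real create_dataset (which takes many more parameters):

```python
from __future__ import annotations

from typing import Any


def create_dataset_stub(name: str,
                        default_target_attribute: str | None = None,
                        **kwargs: Any) -> dict:
    """Sketch of the corrected signature: None marks an unsupervised dataset."""
    payload: dict = {"name": name}
    if default_target_attribute is not None:
        payload["default_target_attribute"] = default_target_attribute
    return payload
```

With `str | None`, mypy accepts `default_target_attribute=None` for unsupervised datasets without a cast or ignore comment.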

Scope

All changes are contained in two files only:

  • openml/datasets/dataset.py - Bugs 1 and 2
  • openml/datasets/functions.py - Bug 3

No changes to the REST API surface, no new public methods, no new files.

Reproduction

Bug 1:

```python
import openml

ds = openml.datasets.get_dataset(61, download_data=False, download_qualities=True)
ds._qualities = {"SomeOtherQuality": 1.0}
repr(ds)  # KeyError before fix, clean output after
```

Bug 2:

```python
import openml

ds = openml.datasets.get_dataset(61, download_data=True)
ds.get_data(target="nonexistent_column")  # bare pandas KeyError before, clear ValueError after
```

Bug 3:

```python
# mypy flagged create_dataset(..., default_target_attribute=None) as a type error before fix
```

