[FIX] Fix three robustness gaps in OpenMLDataset.__repr__, get_data(), and create_dataset#1713

Open
phantom-712 wants to merge 4 commits into openml:main from phantom-712:main

Conversation

@phantom-712

Metadata

  • Reference Issue: Fixes [FIX] : Three dataset robustness gaps in OpenMLDataset and create_dataset #1711
  • New Tests Added: No
  • Documentation Updated: Yes (docstring for default_target_attribute in create_dataset)
  • Change Log Entry: "Fix KeyError in OpenMLDataset.__repr__ on partial qualities, bare KeyError in get_data() on an invalid target, and misleading str annotation on create_dataset's default_target_attribute"

Details

What does this PR fix?

Three related robustness gaps in openml/datasets/dataset.py and openml/datasets/functions.py, all verified on openml==0.16.0.

Bug 1 - OpenMLDataset.__repr__ crashes with KeyError on partial or NaN qualities

_get_repr_body_fields accessed _qualities with direct dict indexing. For newly uploaded datasets where the server has only partially computed qualities, repr() crashed with a KeyError when the NumberOfFeatures or NumberOfInstances keys were absent or NaN. Fixed by replacing direct indexing with .get() and a pd.isna() guard; missing or NaN quality keys are now silently omitted from the repr output.
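The defensive lookup can be sketched as follows. This is a minimal standalone illustration of the pattern, not the actual openml source; the helper name and field labels are hypothetical:

```python
import pandas as pd


def repr_quality_fields(qualities):
    """Build (label, value) pairs for repr, skipping absent or NaN qualities."""
    fields = []
    for key, label in [("NumberOfFeatures", "# of features"),
                       ("NumberOfInstances", "# of instances")]:
        value = (qualities or {}).get(key)  # .get() instead of qualities[key]
        if value is not None and not pd.isna(value):  # NaN guard
            fields.append((label, int(value)))
    return fields
```

With only partial qualities available, the missing keys simply drop out of the output instead of raising a KeyError.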

Bug 2 - get_data() raises a bare pandas KeyError on invalid target

get_data() called data.drop(columns=[target_name]) without first checking whether target_name existed in data.columns. When the column was absent (typo, or silently removed by include_row_id=False / include_ignore_attribute=False), pandas raised a raw KeyError with no OpenML context. Fixed by adding an explicit pre-drop check that raises a ValueError with a clear message listing available columns, and a separate message when the column was filtered out by an OpenML flag.
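The pre-drop check looks roughly like this. A simplified sketch, not the actual openml implementation; `drop_target` and `filtered_columns` are hypothetical names standing in for the internal logic:

```python
import pandas as pd


def drop_target(data: pd.DataFrame, target_name: str,
                filtered_columns=()) -> pd.DataFrame:
    """Drop the target column, raising an informative ValueError
    instead of pandas' bare KeyError when the column is missing."""
    if target_name not in data.columns:
        if target_name in filtered_columns:
            # Column existed but was removed by an OpenML flag such as
            # include_row_id=False or include_ignore_attribute=False.
            raise ValueError(
                f"Target column '{target_name}' was filtered out by an "
                "OpenML flag (include_row_id / include_ignore_attribute)."
            )
        raise ValueError(
            f"Target column '{target_name}' not found. "
            f"Available columns: {list(data.columns)}"
        )
    return data.drop(columns=[target_name])
```

The error now tells the user which columns are actually available, or why the requested column was removed.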

Bug 3 - create_dataset has a misleading str annotation on default_target_attribute

The REST API allows default_target_attribute to be None for unsupervised datasets. The Python function annotated it as str, causing mypy errors and giving users no indication that None was valid. An existing TODO in the code explicitly acknowledged this mismatch. Fixed by changing the annotation to str | None, updating the docstring to document None as valid for unsupervised use cases, and removing the TODO.
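The signature change amounts to the following. A hypothetical stub illustrating the annotation, not the real create_dataset (which takes many more parameters):

```python
from __future__ import annotations

from typing import Any


def create_dataset_stub(name: str,
                        default_target_attribute: str | None = None,
                        **kwargs: Any) -> dict:
    """Sketch of the corrected signature: None marks an unsupervised dataset."""
    payload: dict = {"name": name}
    if default_target_attribute is not None:
        payload["default_target_attribute"] = default_target_attribute
    return payload
```

With `str | None`, mypy accepts `default_target_attribute=None` for unsupervised datasets without a cast or ignore comment.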

Scope

All changes are contained in two files only:

  • openml/datasets/dataset.py - Bugs 1 and 2
  • openml/datasets/functions.py - Bug 3

No changes to the REST API surface, no new public methods, no new files.

Reproduction

Bug 1:

```python
import openml

ds = openml.datasets.get_dataset(61, download_data=False, download_qualities=True)
ds._qualities = {"SomeOtherQuality": 1.0}
repr(ds)  # KeyError before fix, clean output after
```

Bug 2:

```python
import openml

ds = openml.datasets.get_dataset(61, download_data=True)
ds.get_data(target="nonexistent_column")  # bare pandas KeyError before, clear ValueError after
```

Bug 3:

```python
# mypy flagged create_dataset(..., default_target_attribute=None) as a type error before fix
```

