[FIX] Fix three robustness gaps in OpenMLDataset.__repr__, get_data(), and create_dataset #1713
Open
phantom-712 wants to merge 4 commits into openml:main from
Metadata
OpenMLDataset and create_dataset #1711 default_target_attribute in create_dataset)
Details
What does this PR fix?
Three related robustness gaps in openml/datasets/dataset.py and openml/datasets/functions.py, all verified on openml==0.16.0.
Bug 1 - OpenMLDataset.__repr__ crashes with KeyError on partial or NaN qualities
_get_repr_body_fields accessed _qualities with direct dict indexing. For newly uploaded datasets where the server has only partially computed qualities, repr() crashed with a KeyError when the NumberOfFeatures or NumberOfInstances keys were absent or NaN. Fixed by replacing direct indexing with .get() and a pd.isna() guard; missing or NaN quality keys are now silently omitted from the repr output.
Bug 2 - get_data() raises a bare pandas KeyError on invalid target
get_data() called data.drop(columns=[target_name]) without first checking whether target_name existed in data.columns. When the column was absent (because of a typo, or because it was silently removed by include_row_id=False / include_ignore_attribute=False), pandas raised a raw KeyError with no OpenML context. Fixed by adding an explicit pre-drop check that raises a ValueError with a clear message listing the available columns, and a separate message when the column was filtered out by an OpenML flag.
Bug 3 - create_dataset has a misleading str annotation on default_target_attribute
The REST API allows default_target_attribute to be None for unsupervised datasets. The Python function annotated it as str, causing mypy errors and giving users no indication that None was valid. An existing TODO in the code explicitly acknowledged this mismatch. Fixed by changing the annotation to str | None, updating the docstring to document None as valid for unsupervised use cases, and removing the TODO.
Scope
All changes are contained in two files only:
openml/datasets/dataset.py - Bugs 1 and 2
openml/datasets/functions.py - Bug 3
No changes to the REST API surface, no new public methods, no new files.
Reproduction
Bug 1:
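A minimal sketch of the failure mode and the guarded lookup, using a plain dict in place of the dataset's qualities. The helper name repr_quality_fields and the field labels are hypothetical and not taken from the openml code base; math.isnan stands in for the pd.isna() guard so the snippet needs only the standard library.

```python
import math


def repr_quality_fields(qualities):
    """Hypothetical helper: collect only the qualities that are safe to show."""
    fields = {}
    wanted = (
        ("# of features", "NumberOfFeatures"),
        ("# of instances", "NumberOfInstances"),
    )
    for label, key in wanted:
        # .get() returns None for a missing key instead of raising KeyError
        value = (qualities or {}).get(key)
        # Skip missing and NaN values (stand-in for a pd.isna() check)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        fields[label] = int(value)
    return fields


# Qualities as they might look while the server is still computing them
partial = {"NumberOfInstances": float("nan")}

try:
    partial["NumberOfFeatures"]  # direct indexing fails on the missing key
except KeyError as exc:
    print("direct indexing raised KeyError:", exc)

print(repr_quality_fields(partial))
print(repr_quality_fields({"NumberOfFeatures": 5.0, "NumberOfInstances": 150.0}))
```

With the guard in place, an incomplete qualities dict yields fewer fields rather than an exception.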
Bug 2:
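A sketch of the kind of pre-drop check described for get_data(), written against a plain list of column names so it runs without pandas or openml. The function name check_target_column, its filtered_out parameter, and the message wording are assumptions made for illustration; the actual messages in the PR may differ.

```python
def check_target_column(target_name, columns, filtered_out=()):
    """Hypothetical pre-drop check: fail early with context instead of a raw KeyError."""
    if target_name in columns:
        return
    if target_name in filtered_out:
        # The column exists in the dataset but was removed by a flag
        raise ValueError(
            f"Target {target_name!r} was removed because include_row_id=False "
            "or include_ignore_attribute=False excluded it."
        )
    # The column is unknown, most likely a typo
    raise ValueError(
        f"Target {target_name!r} not found. Available columns: {list(columns)}"
    )


columns = ["sepal_length", "sepal_width", "species"]

check_target_column("species", columns)  # passes silently

try:
    check_target_column("specie", columns)  # typo in the target name
except ValueError as exc:
    print(exc)

try:
    check_target_column("row_id", columns, filtered_out=["row_id"])
except ValueError as exc:
    print(exc)
```

Running the check before the drop means the caller sees which columns are available, or learns that a flag removed the target, rather than a bare pandas KeyError.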
Bug 3:
# mypy flagged create_dataset(..., default_target_attribute=None) as a type error before fix
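The effect of the annotation change can be illustrated with a small stand-in function. describe_target is hypothetical and is not part of openml; Optional[str] is used here as the equivalent spelling of str | None so the snippet also runs on older Python versions.

```python
from typing import Optional


def describe_target(default_target_attribute: Optional[str] = None) -> str:
    """Hypothetical stand-in: accepts None to signal an unsupervised dataset."""
    if default_target_attribute is None:
        return "unsupervised (no default target)"
    return f"supervised, target={default_target_attribute}"


# With the parameter annotated as plain `str`, a type checker would reject
# the first call; with Optional[str] both calls type-check.
print(describe_target(None))
print(describe_target("species"))
```

The runtime behavior is the same under either annotation; the change only affects what static checkers and readers of the signature are told.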