add DataSetInfo and some guards around loading errors #637

Open
jshook wants to merge 11 commits into main from aws_dl_fix

Contributor

@jshook jshook commented Feb 20, 2026

This PR makes dataset loading more robust in the face of file corruption errors.
It also routes dataset access through DataSetInfo, which caches the attached DataSet locally and exposes it via getDataSet(). This will vastly speed up testing flows which currently load datasets into memory unnecessarily, before the actual data is accessed.
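
For reviewers who want a picture of the access pattern described above, here is a minimal sketch of a lazily loading, caching handle. It is not the code in this PR: `DataSetSketch`, `DataSetInfoSketch`, and the `Supplier`-based loader are hypothetical names chosen for illustration, and only the idea of deferring the load until `getDataSet()` is called comes from the description.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical stand-in for the project's DataSet; it only holds raw vectors. */
final class DataSetSketch {
    final float[][] vectors;

    DataSetSketch(float[][] vectors) {
        this.vectors = vectors;
    }
}

/** Hypothetical handle: cheap metadata up front, expensive data on first use. */
final class DataSetInfoSketch {
    private final String name;
    private final Supplier<DataSetSketch> loader; // assumed to be the expensive step
    private DataSetSketch cached;                 // stays null until first access

    DataSetInfoSketch(String name, Supplier<DataSetSketch> loader) {
        this.name = Objects.requireNonNull(name);
        this.loader = Objects.requireNonNull(loader);
    }

    /** Metadata access never triggers a load. */
    String getName() {
        return name;
    }

    /** Loads on the first call and returns the same cached instance afterwards. */
    synchronized DataSetSketch getDataSet() {
        if (cached == null) {
            cached = loader.get();
        }
        return cached;
    }
}

final class LazyLoadDemo {
    public static void main(String[] args) {
        int[] loads = {0}; // counts how often the loader actually runs
        DataSetInfoSketch info = new DataSetInfoSketch("demo", () -> {
            loads[0]++;
            return new DataSetSketch(new float[][] {{1f, 2f}, {3f, 4f}});
        });
        System.out.println(info.getName() + " loads before access: " + loads[0]);
        info.getDataSet();
        info.getDataSet();
        System.out.println(info.getName() + " loads after two accesses: " + loads[0]);
    }
}
```

A test that only inspects the handle's metadata never pays the loading cost, which is the speedup the description is after.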

[UPDATE]

Due to a request for fixing some longer-standing issues, the scope of this PR has increased moderately:

  • DataSet loader formats which cannot provide metadata (VSF, ...) on their own now pull those details from dataset_metadata.yml. If no such entry is available, an error is thrown.

  • Additionally, the contract type (DataSetProperties) is now a layer of requirements added onto the DataSetInfo type, including indicators for dataset preprocessing aspects (zero vectors, dupes) and normalization.

  • Surefire tests were disabled in examples. They are now enabled, and a unit test was fixed up.

  • DataSetProperties is a carrier type in the new DataSetInfo contract

After this change, any dataset which is loaded where the VSF can't be resolved from some definitive source will cause an error to be thrown, as it should.
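
The resolution order described above (value from the dataset file first, then dataset_metadata.yml, otherwise fail) can be sketched roughly as follows. This is an illustration under assumptions, not the loader code in this PR: `VsfResolverSketch` and `resolveVsf` are hypothetical, the VSF is treated as an opaque string, and the YAML file is represented by an already-parsed map from dataset name to VSF value.

```java
import java.util.Map;

/** Hypothetical resolver; the name and signature are illustrative only. */
final class VsfResolverSketch {

    /**
     * @param datasetName name used to look the dataset up in the metadata entries
     * @param embeddedVsf value read from the dataset file itself, or null when
     *                    the format cannot carry this metadata
     * @param yamlEntries assumed in-memory view of dataset_metadata.yml
     *                    (dataset name to VSF value)
     * @return the resolved VSF value
     * @throws IllegalStateException when no definitive source provides a VSF
     */
    static String resolveVsf(String datasetName, String embeddedVsf, Map<String, String> yamlEntries) {
        if (embeddedVsf != null) {
            return embeddedVsf; // the file itself is the definitive source
        }
        String fromYaml = yamlEntries.get(datasetName);
        if (fromYaml != null) {
            return fromYaml; // fall back to the metadata file
        }
        // No guessing and no silent default: fail loudly instead.
        throw new IllegalStateException(
                "No VSF found for dataset '" + datasetName + "' in the file or in dataset_metadata.yml");
    }
}

final class VsfResolverDemo {
    public static void main(String[] args) {
        Map<String, String> yamlEntries = Map.of("known-dataset", "COSINE");
        System.out.println(VsfResolverSketch.resolveVsf("known-dataset", null, yamlEntries));
        try {
            VsfResolverSketch.resolveVsf("unknown-dataset", null, yamlEntries);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The dataset names and the `COSINE` value in the demo are made up; the point is only that an unresolved VSF is an error rather than a default.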

Contributor

github-actions bot commented Feb 20, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

Contributor

@MarkWolters MarkWolters left a comment

LGTM

Collaborator

@tlwillke tlwillke left a comment

I tried it out and like it. Please review my two pieces of feedback. One is easy (HDF5 download notification) and should be addressed now. The other (similarity function) may require more discussion and a new issue.

@jshook jshook requested review from MarkWolters and tlwillke March 19, 2026 19:03

3 participants