add DataSetInfo and some guards around loading errors #637
Open
tlwillke
reviewed
Mar 12, 2026
...xamples/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetInfo.java
tlwillke
reviewed
Mar 13, 2026
...s/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderHDF5.java
tlwillke
reviewed
Mar 13, 2026
...s/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderHDF5.java
tlwillke
requested changes
Mar 13, 2026
Collaborator
tlwillke
left a comment
I tried it out and like it. Please review my two pieces of feedback. One is easy (HDF5 download notification) and should be addressed now. The other (similarity function) may require more discussion and a new issue.
This PR makes dataset loading more robust in the face of file corruption errors.
It also indirects dataset access through DataSetInfo, caching the attached DataSet locally and enabling access via getDataSet(). This will vastly speed up testing flows that currently load datasets into memory unnecessarily, before the actual data is accessed.
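A minimal sketch of the deferred-load-and-cache pattern described above, not the code in this PR. Only `DataSetInfo` and `getDataSet()` come from the description; the class name `LazyDataSetInfo`, the `Supplier`-based loader, and the generic payload type are assumptions made for illustration.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical stand-in for the DataSetInfo idea: cheap metadata is available
// immediately, while the expensive load is deferred until getDataSet() is called.
final class LazyDataSetInfo<T> {
    private final String name;
    private final Supplier<T> loader; // assumed: wraps the expensive read/parse step
    private T cached;                 // stays null until the first getDataSet() call

    LazyDataSetInfo(String name, Supplier<T> loader) {
        this.name = Objects.requireNonNull(name);
        this.loader = Objects.requireNonNull(loader);
    }

    // Reading the name never touches the underlying data.
    String getName() {
        return name;
    }

    // Loads on first access, then returns the same cached instance.
    // synchronized keeps the loader from running twice under concurrent access.
    synchronized T getDataSet() {
        if (cached == null) {
            cached = Objects.requireNonNull(loader.get(), "loader returned null");
        }
        return cached;
    }
}
```

With this shape, a test flow that only lists or filters datasets by name never pays the load cost; only the first `getDataSet()` call does.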
[UPDATE]
Due to a request for fixing some longer-standing issues, the scope of this PR has increased moderately:
DataSet loader formats which cannot provide metadata (VSF, ...) on their own now pull those details from dataset_metadata.yml. If no such metadata is available, an error should be thrown.
Additionally, the contract type (DataSetProperties) is now a layer of requirements added onto the DataSetInfo type, including indicators for dataset preprocessing aspects (zero vectors, dupes) and normalization.
Surefire tests were disabled in examples; they are now enabled, and a unit test was fixed up.
DataSetProperties is a carrier type in the new DataSetInfo contract
After this change, loading any dataset whose VSF can't be resolved from some definitive source will cause an error to be thrown, as it should.
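The fail-fast resolution described above could look roughly like the sketch below. It is not the PR's implementation: it assumes VSF refers to the vector similarity function (the term the review comment uses), and the `Similarity` enum, the `SimilarityResolver` class, and the in-memory `Map` standing in for the parsed contents of dataset_metadata.yml are all hypothetical.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical enum standing in for whatever similarity type the loaders use.
enum Similarity { COSINE, DOT_PRODUCT, EUCLIDEAN }

// Hypothetical resolver: prefer metadata carried by the file format itself,
// fall back to side-car metadata, and fail loudly if neither source has an answer.
final class SimilarityResolver {
    // Assumed: entries already parsed from dataset_metadata.yml, keyed by dataset name.
    private final Map<String, Similarity> sidecarMetadata;

    SimilarityResolver(Map<String, Similarity> sidecarMetadata) {
        this.sidecarMetadata = Map.copyOf(sidecarMetadata);
    }

    Similarity resolve(String datasetName, Optional<Similarity> fromFile) {
        if (fromFile.isPresent()) {
            return fromFile.get(); // the format supplied its own metadata
        }
        Similarity fallback = sidecarMetadata.get(datasetName);
        if (fallback == null) {
            // No definitive source: refuse to guess a default.
            throw new IllegalStateException(
                "No similarity function found for dataset '" + datasetName + "'");
        }
        return fallback;
    }
}
```

Throwing instead of silently defaulting means a dataset with missing metadata surfaces as a load error rather than as quietly wrong benchmark results.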