Test: Assert specific evaluation scores for sample data #1249
Conversation
Thanks for the contribution. I would prefer a "benchmark" unit test that specifies the model via config as explicitly as possible (for example, using a commit hash from Hugging Face, not just `revision=`). We already have this assertion in the unit test in the PR, which we could bump to. Please could you also include the AI assistance declaration from the PR template? (You can see an example here.)
Thanks for the guidance! I completely understand your point that separating the sanity checks from strict benchmarking is a better approach for the long-term maintenance of this package. I have now updated the test accordingly and also updated this PR's description. 👍
ethanwhite
left a comment
This looks really close to me. Just one recommendation for clarifying a code comment. @jveitchmichaelis - do you see anything else?
# Relaxed assertions (Sanity Check only)
# Allows model improvements without breaking tests
I'd change these comments to something like:
Check that precision and recall don't regress below reasonable baselines
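For illustration, a relaxed sanity check along those lines might look like the sketch below. The metric names, thresholds, and evaluation stub are hypothetical placeholders, not taken from the PR:

```python
def evaluate_sample_data():
    # Hypothetical stand-in for the package's evaluation call;
    # assumed to return a dict of metrics computed on the sample data.
    return {"box_precision": 0.82, "box_recall": 0.79}

def test_evaluate_sanity():
    results = evaluate_sample_data()
    # Check that precision and recall don't regress below reasonable baselines
    assert results["box_precision"] > 0.5
    assert results["box_recall"] > 0.5
```

Loose lower bounds let the model improve without breaking the test, while still catching outright regressions.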
Benchmark test to ensure the specific release version of the model
produces consistent results.
"""
# Load the model using a SPECIFIC revision (Commit SHA)
We can remove this comment too
Looks like we also need to rebase, and there is another comment I would suggest removing, as it doesn't add much. I think that in the long run we probably want to use tags or branches rather than opaque commit hashes, but that's not an issue right now.
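The trade-off between mutable references (branches, tags) and immutable commit SHAs can be made explicit with a small helper; this sketch is illustrative only and not part of the PR:

```python
import re

def is_commit_sha(revision: str) -> bool:
    # A full 40-character hex SHA pins an immutable snapshot of the
    # model repository; branch and tag names can move after release.
    return re.fullmatch(r"[0-9a-f]{40}", revision) is not None

print(is_commit_sha("a" * 40))  # True: opaque but immutable
print(is_commit_sha("main"))    # False: readable but mutable
```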
Description
Previously, `test_evaluate` only checked that execution completed without errors; it did not assert on the quality of the results. This PR adds assertions to ensure the model produces consistent precision and recall scores on the sample data.

Changes Made
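The consistency assertions described above could tolerate floating-point noise by using an absolute tolerance. The metric names and expected values below are placeholders, not the PR's actual numbers:

```python
from math import isclose

def test_evaluate_benchmark():
    # Placeholder for the real evaluation run on the sample data.
    results = {"box_precision": 0.803, "box_recall": 0.774}
    # Strict benchmark-style check against pinned known-good values
    assert isclose(results["box_precision"], 0.803, abs_tol=1e-3)
    assert isclose(results["box_recall"], 0.774, abs_tol=1e-3)
```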
Testing
Run `pytest tests/test_main.py::test_evaluate` locally.

Linked Issue
Closes #1233
AI-Assisted Development
I used AI to sanity-check the code logic and for an assisted review to identify potential missing resets and schema constraints.