
Test: Assert specific evaluation scores for sample data #1249

Open
musaqlain wants to merge 2 commits into weecology:main from musaqlain:test-model-accuracy

Conversation

@musaqlain (Contributor) commented Dec 25, 2025

Description

Previously, test_evaluate only checked that evaluation completed without errors; it did not assert anything about the quality of the results. This PR adds assertions to ensure the model produces consistent precision and recall scores on the sample data.

Changes Made

  • Relaxed Unit Test: I modified test_evaluate in tests/test_main.py to use relaxed lower bounds (> 0.7) so it acts as a sanity check for future models.
  • New Benchmark Test: I created tests/test_benchmark.py, which loads a specific model revision (commit SHA) using m.load_model(..., revision=...). This test asserts the exact precision/recall values (approx. 0.80 / 0.72) to ensure reproducibility for this specific release; a sketch follows below.
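
For illustration, here is a minimal sketch of the benchmark test. The commit SHA is a placeholder, and the model name, sample CSV, and exact load_model/evaluate signatures are assumptions that may vary between DeepForest versions:

import os

import numpy as np
from deepforest import get_data, main

# Placeholder only -- pin this to the actual Hugging Face commit SHA for the release
RELEASE_REVISION = "<commit-sha>"

def test_benchmark_release_model():
    # Load the model pinned to a specific revision so results are reproducible
    m = main.deepforest()
    m.load_model(model_name="weecology/deepforest-tree", revision=RELEASE_REVISION)

    # Evaluate on the sample annotations bundled with the package
    csv_file = get_data("OSBS_029.csv")
    results = m.evaluate(csv_file, root_dir=os.path.dirname(csv_file), iou_threshold=0.4)

    # Exact-score assertions for this specific release (approx. 0.80 / 0.72)
    assert np.isclose(results["box_precision"], 0.80, atol=0.01)
    assert np.isclose(results["box_recall"], 0.72, atol=0.01)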

Testing

  • Ran pytest tests/test_main.py::test_evaluate locally.
  • Result: Passed.

Linked Issue

Closes #1233

AI-Assisted Development

I used AI for sanity-checking code logic and for an assisted review to identify potentially missing resets and schema constraints.

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting

@jveitchmichaelis (Collaborator) commented Jan 1, 2026

Thanks for the contribution. I would prefer a "benchmark" unit test that specifies the model via config as explicitly as possible (for example, using a commit hash from Hugging Face rather than just revision="main", since that could also change), and then makes this check. Then we could add other tests for newer models, etc.

We already have this assertion in the unit test in the PR:

assert np.round(results["box_precision"], 2) > 0.5
assert np.round(results["box_recall"], 2) > 0.5

which we could bump to 0.7 if we want tighter checks. But I don't think this test is the right place for a "close" assertion as it will break if we ever release an improved model.
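
For concreteness, the tightened version would just bump those thresholds:

assert np.round(results["box_precision"], 2) > 0.7
assert np.round(results["box_recall"], 2) > 0.7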

Please could you also include the AI assistance declaration from the PR template? (you can see an example here)

@musaqlain (Contributor, Author)

Thanks for the guidance! I completely understand your point that separating the sanity checks from strict benchmarking is better for the long-term maintenance of this package. I have now updated the tests accordingly and also updated this PR's description. 👍

@ethanwhite (Member) left a comment


This looks really close to me. Just one recommendation for clarifying a code comment. @jveitchmichaelis - do you see anything else?

Comment thread tests/test_main.py
Comment on lines +518 to +519
# Relaxed assertions (Sanity Check only)
# Allows model improvements without breaking tests
@ethanwhite (Member)

I'd change these comments to something like:

Check that precision and recall don't regress below reasonable baselines

Comment thread tests/test_benchmark.py
Benchmark test to ensure the specific release version of the model
produces consistent results.
"""
# Load the model using a SPECIFIC revision (Commit SHA)
@jveitchmichaelis (Collaborator)

We can remove this comment too

@jveitchmichaelis (Collaborator) commented May 8, 2026

Looks like we also need to rebase, and there is another comment I would suggest removing since it doesn't add much. I think that in the long run we probably want to use tags or branches rather than opaque commit hashes, but that's not an issue right now.
