Speed up validation for datasets with same-size images by jveitchmichaelis · Pull Request #1373 · weecology/DeepForest

jveitchmichaelis · 2026-04-15T00:01:31Z

Description

Most of the time we train on datasets where all images are the same size. During validation we spend a lot of time opening images to check their size, so that we can confirm annotations are in bounds. If we know up front that the dataset doesn't contain images of varying size, a simple optimization is to take the size of the first image and assume it holds for the rest of the dataset.

This option defaults to False, which keeps the brute-force approach of opening every image. In cases where we know the dataset is good, or that the images are all the same size, enabling it can save quite a bit of time and disk thrashing.
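As a rough illustration of the idea (not the code in this PR), the check could look something like the sketch below. The names `find_out_of_bounds`, `read_size` and `assume_same_size` are hypothetical, and the annotation layout (dicts with `image_path`, `xmin`, `ymin`, `xmax`, `ymax`) is an assumption made for the example.

```python
from collections.abc import Callable, Iterable


def read_size_with_pillow(path: str) -> tuple[int, int]:
    """Return (width, height) for an image on disk."""
    from PIL import Image  # imported lazily so the sketch runs without Pillow

    # Image.open is lazy: the size is available without decoding pixel data,
    # but the file still has to be opened and its header parsed.
    with Image.open(path) as img:
        return img.size


def find_out_of_bounds(
    annotations: Iterable[dict],
    read_size: Callable[[str], tuple[int, int]] = read_size_with_pillow,
    assume_same_size: bool = False,  # hypothetical flag name
) -> list[dict]:
    """Return the annotations whose box falls outside its image."""
    sizes: dict[str, tuple[int, int]] = {}  # per-image cache for the slow path
    shared_size = None  # size of the first image, reused on the fast path
    bad = []
    for row in annotations:
        path = row["image_path"]
        if assume_same_size:
            if shared_size is None:
                shared_size = read_size(path)  # the only file we open
            width, height = shared_size
        else:
            if path not in sizes:
                sizes[path] = read_size(path)  # one open per distinct image
            width, height = sizes[path]
        in_bounds = (
            0 <= row["xmin"] <= row["xmax"] <= width
            and 0 <= row["ymin"] <= row["ymax"] <= height
        )
        if not in_bounds:
            bad.append(row)
    return bad
```

With `assume_same_size=True` only one file is opened regardless of dataset size. The trade-off is that a dataset which does contain an odd-sized image would be checked against the wrong bounds, which is why the slow path stays the default.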

Also cleaned up the keypoint validation checker to have a similar structure.

AI-Assisted Development

I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
I understand all the code I'm submitting
I have reviewed and validated all AI-generated code

AI tools used (if applicable):

Claude

codecov · 2026-04-15T00:36:46Z

Codecov Report

❌ Patch coverage is 69.38776% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.48%. Comparing base (408e150) to head (3c59532).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/deepforest/datasets/training.py | 68.08% | 15 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1373      +/-   ##
==========================================
- Coverage   86.87%   86.48%   -0.40%     
==========================================
  Files          24       24              
  Lines        3064     3205     +141     
==========================================
+ Hits         2662     2772     +110     
- Misses        402      433      +31

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `86.48% <69.38%> (-0.40%)` | ⬇️ |


bw4sz · 2026-04-15T22:01:08Z

I wonder if, instead of this, we should just make validate_coordinates() something a user can call from DeepForest.main. Opening every image every time we run seems really heavy. I think the underlying idea here is wrong, and while this fixes it, we should probably allow models to fail and then have docs that point users to validate_coordinates() to help them find that error?

jveitchmichaelis · 2026-04-15T23:34:29Z

Or have a CLI script which runs the sanity checks? But yes, exactly: the current behaviour is to open every image every time, which adds minutes to large training runs.
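A minimal sketch of what such a script might look like, using only the standard library. Everything here is hypothetical: the function names, the `--width`/`--height` flags, and the assumption that annotations live in a CSV with `xmin`, `ymin`, `xmax`, `ymax` columns and that all images share one known size.

```python
import argparse
import csv


def check_rows(rows, width, height):
    """Yield (line_number, reason) for each annotation that fails a check."""
    # Line 1 is the CSV header; this assumes one annotation per line.
    for line_number, row in enumerate(rows, start=2):
        xmin, ymin, xmax, ymax = (
            float(row[key]) for key in ("xmin", "ymin", "xmax", "ymax")
        )
        if min(xmin, ymin) < 0:
            yield line_number, "negative coordinate"
        elif xmin >= xmax or ymin >= ymax:
            yield line_number, "empty or inverted box"
        elif xmax > width or ymax > height:
            yield line_number, "box exceeds image bounds"


def main(argv=None):
    """Run the checks and return a shell-style exit code (0 means clean)."""
    parser = argparse.ArgumentParser(description="Sanity-check box annotations.")
    parser.add_argument("csv_path")
    parser.add_argument("--width", type=int, required=True)
    parser.add_argument("--height", type=int, required=True)
    args = parser.parse_args(argv)

    with open(args.csv_path, newline="") as handle:
        problems = list(check_rows(csv.DictReader(handle), args.width, args.height))
    for line_number, reason in problems:
        print(f"{args.csv_path}:{line_number}: {reason}")
    return 1 if problems else 0
```

To use it as a script, end the file with `raise SystemExit(main())`. Running the checks once, offline, would keep them out of the training loop entirely.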

ethanwhite · 2026-04-15T23:42:07Z

We originally added this code because we were getting a lot of error reports from users with buggy annotations, which are tricky to debug.

Related to #1285, which is probably close to ready and where I think @bw4sz is saying that we should keep the checks in place?

bw4sz · 2026-04-16T17:52:17Z

Closed in favor of a solution to issue #1374.

speed up validation for datasets with same-size images

3c59532

jveitchmichaelis mentioned this pull request Apr 15, 2026

WIP: Treeformer #1371

Closed

3 tasks

jveitchmichaelis marked this pull request as ready for review April 15, 2026 00:02

jveitchmichaelis mentioned this pull request Apr 15, 2026

optimize coordinate validation, handling both -ve and OOB error #1285

Open

2 tasks

bw4sz self-requested a review April 15, 2026 21:56

bw4sz closed this Apr 16, 2026

jveitchmichaelis wants to merge 1 commit into weecology:main from jveitchmichaelis:dataset-same-size-images