Speed up validation for datasets with same-size images #1373

Closed
jveitchmichaelis wants to merge 1 commit into weecology:main from jveitchmichaelis:dataset-same-size-images

@jveitchmichaelis
Collaborator

Description

Most of the time we train on datasets where all images are the same size. During validation we spend a lot of time opening images to check their size, so that we can confirm annotations are in bounds. A simple optimization, if we know up front that the dataset doesn't have varying-sized images, is to take the size of the first image and assume it holds for the rest of the dataset.

The option defaults to False, which keeps the brute-force approach of opening every image. When we know the dataset is good, or that the images are all the same size, enabling it can save quite a bit of time and disk thrashing.
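As a rough illustration of the idea (not the code in this PR), a bounds check with a same-size shortcut might look like the sketch below. The function names, the `assume_same_size` flag, and the tuple layout for boxes are all hypothetical; the actual option name and data structures in `src/deepforest/datasets/training.py` may differ. The size reader is passed in as a callable so the sketch can be exercised without touching disk.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical box layout for this sketch: (image_path, xmin, ymin, xmax, ymax)
Box = Tuple[str, float, float, float, float]


def read_image_size(path: str) -> Tuple[int, int]:
    """Return (width, height). Pillow opens images lazily, so pixels are not decoded."""
    from PIL import Image  # imported here so the sketch runs without Pillow if a custom reader is passed

    with Image.open(path) as img:
        return img.size


def find_out_of_bounds(
    boxes: Iterable[Box],
    assume_same_size: bool = False,  # hypothetical flag name; False is the brute-force path
    get_size: Callable[[str], Tuple[int, int]] = read_image_size,
) -> List[Box]:
    """Return the boxes that extend outside their image."""
    sizes = {}     # per-image cache used on the brute-force path
    shared = None  # size of the first image, reused when assume_same_size is True
    bad = []
    for box in boxes:
        path, xmin, ymin, xmax, ymax = box
        if assume_same_size:
            if shared is None:
                shared = get_size(path)  # only the first image is ever opened
            width, height = shared
        else:
            if path not in sizes:
                sizes[path] = get_size(path)  # one open per distinct image
            width, height = sizes[path]
        if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            bad.append(box)
    return bad
```

With the flag enabled, the number of file opens drops from one per distinct image to one in total. The trade-off is that an image with a different size would be checked against the wrong bounds.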

Also cleaned up the keypoint validation checker to have similar structure.

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):

Claude

@jveitchmichaelis jveitchmichaelis mentioned this pull request Apr 15, 2026
3 tasks
@jveitchmichaelis jveitchmichaelis marked this pull request as ready for review April 15, 2026 00:02
@codecov

codecov Bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 69.38776% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.48%. Comparing base (408e150) to head (3c59532).
⚠️ Report is 5 commits behind head on main.

Files with missing lines              Patch %   Lines
src/deepforest/datasets/training.py   68.08%    15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1373      +/-   ##
==========================================
- Coverage   86.87%   86.48%   -0.40%     
==========================================
  Files          24       24              
  Lines        3064     3205     +141     
==========================================
+ Hits         2662     2772     +110     
- Misses        402      433      +31     
Flag Coverage Δ
unittests 86.48% <69.38%> (-0.40%) ⬇️

@bw4sz bw4sz self-requested a review April 15, 2026 21:56
@bw4sz
Collaborator

bw4sz commented Apr 15, 2026

I wonder if, instead of this, we should just make validate_coordinates() something a user can call from DeepForest.main. It seems really heavy that every time we run, we open every image. I think the underlying idea here is wrong, and while this fixes it, we should probably allow models to fail and then have docs pointing users to validate_coordinates() to help them find that error?

@jveitchmichaelis
Collaborator Author

Or have a CLI script which runs the sanity checks? But yes, exactly: the current behaviour is to open every image every time, which adds minutes to large training runs.

@ethanwhite
Member

We originally added this code because we were getting a lot of error reports from users with buggy annotations, which are tricky to debug.

Related to #1285, which is probably close to ready and where I think @bw4sz is saying that we should keep the checks in place?

@bw4sz
Collaborator

bw4sz commented Apr 16, 2026

Closed in favor of a solution to issue #1374.

@bw4sz bw4sz closed this Apr 16, 2026