fix: Properly handle async job state with celery tasks #1114
mihow merged 6 commits into RolnickLab:main
Walkthrough
Added a `JobProgress.is_complete()` method and a guard in the Celery job status updater to prevent marking jobs as SUCCESS until all stages are complete; added tests for guard behavior and removed a planning doc describing the async status-handling proposal.
Pull request overview
This pull request fixes a race condition where jobs are incorrectly marked as SUCCESS before all asynchronous stages complete. The fix adds a guard to the Celery task_postrun signal handler that prevents premature SUCCESS status updates by checking if all job stages have truly finished processing.
Changes:
- Added `is_complete()` method to `JobProgress` to check if all stages have finished (progress >= 1.0 and status in final states)
- Updated `update_job_status()` signal handler to guard against setting SUCCESS unless all stages are complete
- Added comprehensive tests to verify the guard prevents premature SUCCESS while allowing FAILURE/REVOKED states through immediately
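For readers without the diff open, here is a minimal sketch of the kind of completeness check described above. It is not the code from `ami/jobs/models.py`: the `Stage` dataclass, the `FINAL_STATES` set, and the reduced `JobState` enum are hypothetical stand-ins, and treating a job with no stages as complete is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(str, Enum):
    # Hypothetical subset of states; the real enum in ami/jobs may differ.
    CREATED = "CREATED"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    REVOKED = "REVOKED"


# States after which a stage will not change again (assumed for this sketch).
FINAL_STATES = {JobState.SUCCESS, JobState.FAILURE, JobState.REVOKED}


@dataclass
class Stage:
    # Hypothetical stand-in for one stage of a job's progress.
    name: str
    progress: float = 0.0
    status: JobState = JobState.CREATED


@dataclass
class JobProgress:
    stages: list[Stage] = field(default_factory=list)

    def is_complete(self) -> bool:
        """Return True only if every stage is at 100% and in a final state."""
        # all() over an empty list is True, so a job with no stages counts
        # as complete here; that is an assumption, not confirmed behavior.
        return all(
            stage.progress >= 1.0 and stage.status in FINAL_STATES
            for stage in self.stages
        )
```

Both conditions are checked because either one alone can be misleading: a stage may report full progress before its status is updated, or carry a final status while its progress value is stale.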
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ami/jobs/models.py | Added is_complete() method to JobProgress class to check if all stages have finished processing with both progress and status checks |
| ami/jobs/tasks.py | Added guard in update_job_status() to prevent setting SUCCESS status unless is_complete() returns True; imported JobState for proper comparisons |
| ami/jobs/tests.py | Added two new test methods to verify the guard prevents premature SUCCESS status and allows FAILURE states through immediately |
Sweet! @carlos-irreverentlabs did you test this in the VISS environment? This looks very close to what was added in this plan https://github.com/RolnickLab/antenna/blob/main/.agents/planning/async-job-status-handling.md Did you refer to that, or is this just total super alignment?? I think we can delete the planning file in this PR as well.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I tested this locally
Yeah, this is a plan saved when we decided to defer the clean-up work in the initial PR. This PR is an implementation of the plan.
@mihow I've deleted the plan but I'd like to do a test of a non-ML job to make sure there are no adverse effects. I'll update the PR once I do that.
Excellent, thanks @carlosgjs!
mihow left a comment
This looks good! If we need to we can use the new async_api backend type on the Job model too!
Summary
This pull request improves the reliability of job status updates by ensuring that a job is only marked as SUCCESS when all its stages are truly complete. It adds a guard to prevent premature SUCCESS status in cases where asynchronous workers are still processing, and includes new tests to verify this behavior. Additionally, it allows FAILURE and REVOKED states to be set immediately, regardless of stage progress.
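The guard behavior described here can be sketched as a small decision function. This is an illustration under assumptions, not the handler from `ami/jobs/tasks.py`: the function name `next_job_status` and its parameters are hypothetical, and in the PR the equivalent logic runs inside the `update_job_status` handler attached to Celery's `task_postrun` signal.

```python
SUCCESS = "SUCCESS"
FAILURE = "FAILURE"
REVOKED = "REVOKED"


def next_job_status(current_status: str, task_state: str, stages_complete: bool) -> str:
    """Decide which status to store when the Celery task finishes.

    Hypothetical helper: `stages_complete` stands for the result of the
    job's is_complete() check at the time the signal fires.
    """
    if task_state == SUCCESS and not stages_complete:
        # The task function returned, but async workers are still
        # processing stages, so keep whatever status the job already has.
        return current_status
    # FAILURE, REVOKED, and a genuinely finished SUCCESS pass straight through.
    return task_state


# A task that returns while stages are still running leaves the job as-is.
print(next_job_status("STARTED", SUCCESS, stages_complete=False))
# A failure is recorded immediately, regardless of stage progress.
print(next_job_status("STARTED", FAILURE, stages_complete=False))
```

Only SUCCESS is gated because it is the one state that the task's return can report too early; failures and revocations are final as soon as they happen.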
Job completion logic improvements:
- Added a new `is_complete` method to the `JobProgress` class, which checks that all stages have both finished processing (progress >= 1.0) and reached a final state (SUCCESS, FAILURE, or REVOKED). This method is used to determine if a job is truly complete before setting its status to SUCCESS.
- Updated the `update_job_status` Celery signal handler to guard against setting the job status to SUCCESS unless `is_complete` returns True, preventing race conditions where the job could be marked as complete before all stages finish. FAILURE and REVOKED states bypass this guard and are set immediately.

Testing and validation:
- Added tests to verify that the guard prevents a premature SUCCESS status and lets FAILURE states through immediately.

Code maintenance:
- Imported `JobState` for proper state comparisons.

Related Issues
Closes #1084
Testing
Before the fix, the job is incorrectly set as successful, which makes the progress bar green and enables the Retry button:


After the fix: the job remains as pending, so the progress bar is yellow and the Cancel button is enabled:
Additional testing of populating a collection to ensure the fix doesn't affect non-ML jobs:
Verified in the debugger the additional check is not triggered in this case:
