Skip to content

fix: Properly handle async job state with celery tasks #1114

Merged
mihow merged 6 commits intoRolnickLab:mainfrom
uw-ssec:carlosg/jobstate
Feb 7, 2026
Merged

fix: Properly handle async job state with celery tasks #1114
mihow merged 6 commits intoRolnickLab:mainfrom
uw-ssec:carlosg/jobstate

Conversation

@carlosgjs
Copy link
Collaborator

@carlosgjs carlosgjs commented Feb 4, 2026

Summary

This pull request improves the reliability of job status updates by ensuring that a job is only marked as SUCCESS when all its stages are truly complete. It adds a guard to prevent premature SUCCESS status in cases where asynchronous workers are still processing, and includes new tests to verify this behavior. Additionally, it allows FAILURE and REVOKED states to be set immediately, regardless of stage progress.

Job completion logic improvements:

  • Added a new is_complete method to the Job model, which checks that all stages have both finished processing (progress >= 1.0) and reached a final state (SUCCESS, FAILURE, or REVOKED). This method is used to determine if a job is truly complete before setting its status to SUCCESS.

  • Updated the update_job_status Celery signal handler to guard against setting the job status to SUCCESS unless is_complete returns True, preventing race conditions where the job could be marked as complete before all stages finish. FAILURE and REVOKED states bypass this guard and are set immediately.

Testing and validation:

  • Added new tests to verify that the SUCCESS status is only set when all stages are complete, and that FAILURE and REVOKED states are set immediately even if stages are incomplete. These tests help ensure the new guard logic works as intended.

Code maintenance:

  • Updated imports in the Celery task handlers to include JobState for proper state comparisons.

Related Issues

Closes #1084

Testing

Before the fix, the job is incorrectly set as successful, which makes the progress bar green and enables the Retry button:
image
After the fix: the job remains as pending, so the progress bar is yellow and the Cancel button is enabled:
image

Additional testing of populating a collection to ensure the fix doesn't affect non-ML jobs:

image image

Verified in the debugger the additional check is not triggered in this case:
image

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Summary by CodeRabbit

  • Bug Fixes

    • Prevented jobs from being marked as successful before all stages finish; error/retry/revoke states still apply immediately.
    • Ensures final completion is based on full stage completion rather than task timing.
  • Tests

    • Added tests validating the completion guard and immediate error-state behavior.
  • Documentation

    • Removed an internal planning document about async job-status handling.

@netlify
Copy link

netlify bot commented Feb 4, 2026

👷 Deploy request for antenna-ssec pending review.

Visit the deploys page to approve it

Name Link
🔨 Latest commit b0d25f0

@netlify
Copy link

netlify bot commented Feb 4, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit b0d25f0
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/69856f257b2d0d0008481449

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

Added a JobProgress.is_complete() method and a guard in the Celery job status updater to prevent marking jobs as SUCCESS until all stages are complete; added tests for guard behavior and removed a planning doc describing the async status-handling proposal.

Changes

Cohort / File(s) Summary
Completion logic
ami/jobs/models.py
Added JobProgress.is_complete() which returns True only when there is at least one stage and every stage has progress >= 1.0 and a final state.
Status update guard
ami/jobs/tasks.py
Imported JobState and added a guard in update_job_status(): if incoming state is SUCCESS and job.progress.is_complete() is False, skip updating to SUCCESS; other states (FAILURE/REVOKED/RETRY) are unaffected.
Tests
ami/jobs/tests.py
Added two tests: one ensuring premature SUCCESS is prevented for incomplete jobs, and one ensuring FAILURE state updates immediately regardless of completeness.
Removed planning doc
.agents/planning/async-job-status-handling.md
Deleted planning/documentation file that described the async job-status handling design and implementation steps. Review for lost context or migrated notes.

Sequence Diagram(s)

(No sequence diagrams generated.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • mihow

Poem

🐇 I watched the stages hop and hum,
Not marking done till every drum—
Progress counted, stage by stage,
No premature finish on the page.
Hoppity cheers for the guarded run! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: fixing premature async job status handling by implementing a guard to prevent marking jobs as SUCCESS before all stages complete.
Linked Issues check ✅ Passed The PR successfully implements the core requirement from #1084: decoupling celery task state from job state by adding is_complete() logic and guarding SUCCESS status until all stages are truly complete.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #1084: is_complete() method in JobProgress, guard logic in update_job_status, corresponding tests, and deletion of the now-implemented planning document.
Description check ✅ Passed The PR description includes all required template sections: summary, list of changes, related issues, detailed description, testing instructions with screenshots, and checklist.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@carlosgjs carlosgjs marked this pull request as ready for review February 4, 2026 19:12
Copilot AI review requested due to automatic review settings February 4, 2026 19:12
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes a race condition where jobs are incorrectly marked as SUCCESS before all asynchronous stages complete. The fix adds a guard to the Celery task_postrun signal handler that prevents premature SUCCESS status updates by checking if all job stages have truly finished processing.

Changes:

  • Added is_complete() method to JobProgress to check if all stages have finished (progress >= 1.0 and status in final states)
  • Updated update_job_status() signal handler to guard against setting SUCCESS unless all stages are complete
  • Added comprehensive tests to verify the guard prevents premature SUCCESS while allowing FAILURE/REVOKED states through immediately

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
ami/jobs/models.py Added is_complete() method to JobProgress class to check if all stages have finished processing with both progress and status checks
ami/jobs/tasks.py Added guard in update_job_status() to prevent setting SUCCESS status unless is_complete() returns True; imported JobState for proper comparisons
ami/jobs/tests.py Added two new test methods to verify the guard prevents premature SUCCESS status and allows FAILURE states through immediately

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mihow
Copy link
Collaborator

mihow commented Feb 4, 2026

Sweet! @carlos-irreverentlabs did you test this in the VISS environment?

This looks very close to what was added in this plan https://github.com/RolnickLab/antenna/blob/main/.agents/planning/async-job-status-handling.md

Did you refer to that, or is this just total super alignment??

I think we can delete the planning file in this PR as well.

@carlosgjs carlosgjs requested a review from mihow February 5, 2026 01:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@carlosgjs
Copy link
Collaborator Author

Sweet! @carlos-irreverentlabs did you test this in the VISS environment?

I tested this locally

This looks very close to what was added in this plan https://github.com/RolnickLab/antenna/blob/main/.agents/planning/async-job-status-handling.md

Did you refer to that, or is this just total super alignment??

Yeah, this is a plan saved when we decided to defer the clean-up work in the initial PR. This PR is an implementation of the plan.

I think we can delete the planning file in this PR as well.
Good call, I'll remove it

@carlosgjs
Copy link
Collaborator Author

carlosgjs commented Feb 6, 2026

@mihow I've deleted the plan but I'd like to do a test of a non-ML job to make sure there are no adverse effects. I'll update the PR once I do that.

@carlosgjs
Copy link
Collaborator Author

@mihow I've deleted the plan but I'd like to do a test of a non-ML job to make sure there are no adverse effects. I'll update the PR once I do that.

@mihow - Added the additional testing to the PR description.

@mihow
Copy link
Collaborator

mihow commented Feb 6, 2026

Excellent thanks @carlosgjs !

Copy link
Collaborator

@mihow mihow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! If we need to we can use the new async_api backend type on the Job model too!

@mihow mihow merged commit fe0c337 into RolnickLab:main Feb 7, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PSv2: Decouple the celery task state from the job state

3 participants