
PSv2: Implement queue clean-up upon job completion#1113

Merged
mihow merged 7 commits into RolnickLab:main from uw-ssec:carlosg/cleannats
Feb 7, 2026

Conversation

@carlosgjs
Collaborator

@carlosgjs carlosgjs commented Feb 3, 2026

Summary

Performs clean-up of NATS and Redis resources used by async jobs

Related Issues

Closes #1083

Testing

NATS dashboard with job running:

image

After job finished:

image

Redis with job running:

127.0.0.1:6379> KEYS *job*
1) ":1:job:66:pending_images:results"
2) ":1:job:66:pending_images:process"
3) ":1:job:66:pending_images_total"

With job complete:

127.0.0.1:6379> KEYS *job*
(empty array)
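
The manual `KEYS *job*` check above can also be scripted. The following is a minimal sketch, not code from this PR: `leftover_job_keys` is a hypothetical helper, and it assumes a redis-py style client (anything exposing `scan_iter`) plus the key layout visible in the output above (`job:<id>:...` behind a cache prefix). It uses `SCAN` rather than `KEYS`, which is the gentler choice on a shared Redis instance.

```python
from __future__ import annotations

from typing import Iterable, Protocol


class SupportsScan(Protocol):
    """Anything with a redis-py style scan_iter(match=...) method."""

    def scan_iter(self, match: str) -> Iterable[bytes | str]: ...


def leftover_job_keys(client: SupportsScan, job_id: int) -> list[str]:
    """Return the keys that still reference the given job (hypothetical helper)."""
    # The trailing colon keeps job 66 from also matching job 660.
    pattern = f"*job:{job_id}:*"
    keys = []
    for key in client.scan_iter(match=pattern):
        # redis-py returns bytes unless decode_responses=True is set.
        keys.append(key.decode() if isinstance(key, bytes) else key)
    return sorted(keys)
```

With a real client this would be called as `leftover_job_keys(redis.Redis(), 66)`; an empty list corresponds to the `(empty array)` result shown above.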

Job logs:

[2026-02-03 21:58:56] INFO Cleaned up NATS resources for job 66
[2026-02-03 21:58:56] INFO Cleaned up Redis state for job 66

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Summary by CodeRabbit

  • Bug Fixes

    • Improved cleanup of async resources for ML jobs to ensure proper resource management when jobs complete, fail, or are revoked. Resources are now reliably released across all job completion states.
  • Tests

    • Added comprehensive tests to validate resource cleanup across success, failure, and revocation scenarios.

@netlify

netlify bot commented Feb 3, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 165fa96
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6986a9403dd1d3000745447c

@netlify

netlify bot commented Feb 3, 2026

Deploy Preview for antenna-preview ready!

Name Link
🔨 Latest commit 165fa96
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6986a940ad484c000825787a
😎 Deploy Preview https://deploy-preview-1113--antenna-preview.netlify.app
Lighthouse
1 paths audited
Performance: 60 (🔴 down 6 from production)
Accessibility: 80 (no change from production)
Best Practices: 100 (no change from production)
SEO: 92 (no change from production)
PWA: 80 (no change from production)

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

Walkthrough

This pull request implements cleanup of async job resources (NATS streams/consumers and Redis keys) when ML jobs complete, fail, or are revoked. The core cleanup logic is refactored into a unified function, integrated into task status handlers, and validated with comprehensive tests covering all completion scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Task Orchestration: ami/jobs/tasks.py | Added _cleanup_job_if_needed() helper that conditionally triggers async resource cleanup. Integrated cleanup invocations into _update_job_progress() (100% completion), update_job_status() (revocation), and update_job_failure() (failure). Added JobState import for state checking. |
| Cleanup Implementation: ami/ml/orchestration/jobs.py | Renamed cleanup_nats_resources() to cleanup_async_job_resources() and extended functionality to perform both Redis and NATS cleanup. Added TaskStateManager.cleanup() for Redis state removal and wrapped TaskQueueManager async context for NATS resource deletion. Updated return type to boolean indicating success of all cleanup operations. |
| Integration Tests: ami/ml/orchestration/tests/test_cleanup.py | New test module validating async resource cleanup across three job completion paths: successful completion (100% progress), failure, and revocation. Tests verify both Redis key presence/absence and NATS stream/consumer lifecycle through Django integration tests with mock pipelines and images. |
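
The shape of the unified cleanup function described in the table can be sketched as below. This is an illustration under assumptions, not the PR's actual code: `cleanup_redis` and `cleanup_nats` are hypothetical stand-ins for `TaskStateManager.cleanup()` and the `TaskQueueManager` calls (their real signatures are not shown in this conversation), and `asyncio.run` is an assumed way to bridge from sync to async. Only the behavior (both cleanups attempted, boolean result covering both) comes from the summary.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


def cleanup_async_job_resources_sketch(
    job_id: int,
    cleanup_redis: Callable[[int], None],            # hypothetical stand-in
    cleanup_nats: Callable[[int], Awaitable[None]],  # hypothetical stand-in
) -> bool:
    """Attempt both cleanups independently; True only if both succeed."""
    redis_success = False
    nats_success = False

    try:
        cleanup_redis(job_id)
        redis_success = True
        logger.info("Cleaned up Redis state for job %s", job_id)
    except Exception:
        logger.exception("Error cleaning up Redis state for job %s", job_id)

    # A Redis failure does not skip the NATS step.
    try:
        # Assumes no event loop is already running in this thread.
        asyncio.run(cleanup_nats(job_id))
        nats_success = True
        logger.info("Cleaned up NATS resources for job %s", job_id)
    except Exception:
        logger.exception("Error cleaning up NATS resources for job %s", job_id)

    return redis_success and nats_success
```

Keeping the two steps in separate `try` blocks means one backend being unavailable does not leave the other backend's resources behind.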

Sequence Diagram

sequenceDiagram
    participant Job as Job Completion Event
    participant Tasks as tasks.py<br/>(Orchestration)
    participant Cleanup as jobs.py<br/>(cleanup_async_job_resources)
    participant Redis as TaskStateManager<br/>(Redis)
    participant NATS as TaskQueueManager<br/>(NATS)
    
    Note over Job,NATS: Completion triggered by: progress==100% OR failure OR revocation
    
    Job->>Tasks: _update_job_progress() / update_job_status() / update_job_failure()
    Tasks->>Tasks: Check if job_type=="ml" & async_pipeline_workers enabled
    alt Cleanup needed
        Tasks->>Cleanup: _cleanup_job_if_needed(job)
        Cleanup->>Redis: cleanup() - remove task state keys
        activate Redis
        Redis-->>Cleanup: redis_success (bool)
        deactivate Redis
        Cleanup->>NATS: TaskQueueManager context - delete streams/consumers
        activate NATS
        NATS-->>Cleanup: nats_success (bool)
        deactivate NATS
        Cleanup-->>Tasks: return redis_success AND nats_success
    else No cleanup needed
        Tasks->>Tasks: Skip cleanup
    end
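
The gating step in the diagram can be sketched as follows. Everything named here is hypothetical: `JobSketch`, its fields, and the injected `cleanup` callable stand in for the real Django model, feature flag lookup, and `cleanup_async_job_resources`. Only the condition itself (job type `ml` with async pipeline workers enabled) is taken from the walkthrough.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobSketch:
    """Hypothetical stand-in for the real job model."""

    pk: int
    job_type: str
    async_pipeline_workers: bool  # stand-in for the feature flag check


def cleanup_job_if_needed_sketch(
    job: JobSketch, cleanup: Callable[[JobSketch], bool]
) -> bool:
    """Run cleanup only for ML jobs on the async worker path.

    Returns True if cleanup was attempted, False if it was skipped.
    """
    if job.job_type != "ml" or not job.async_pipeline_workers:
        return False
    cleanup(job)
    return True
```

Per the walkthrough, the same helper is invoked from all three terminal paths (100% progress, failure, revocation), so the gating logic lives in one place.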

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 Hop, hop—when jobs complete their dance,
We scrub the Redis, clean the NATS!
No streams left hanging, keys now gone,
Fresh resources for jobs to come.

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.06% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: implementing cleanup of NATS/Redis queues when async jobs complete. |
| Description check | ✅ Passed | The description includes a summary of changes, references the related issue (#1083), provides comprehensive testing evidence with NATS/Redis screenshots and job logs, and completes all checklist items. |
| Linked Issues check | ✅ Passed | The PR successfully implements the core requirement from #1083: cleanup of NATS queues and Redis resources upon job completion, with cleanup triggered at three job endpoints (completion, failure, revocation). |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing job resource cleanup: new cleanup helper function, renamed/extended orchestration method, and comprehensive integration tests validating the cleanup functionality. |


@carlosgjs carlosgjs marked this pull request as ready for review February 4, 2026 03:03
Copilot AI review requested due to automatic review settings February 4, 2026 03:03
Contributor

Copilot AI left a comment


Pull request overview

This PR implements automatic cleanup of NATS JetStream and Redis resources when async ML jobs complete, fail, or are cancelled. This addresses issue #1083 by ensuring that temporary resources used for job orchestration are properly removed after jobs finish.

Changes:

  • Renamed cleanup_nats_resources to cleanup_async_job_resources to handle both NATS and Redis cleanup
  • Integrated cleanup into job lifecycle at three points: completion, failure, and revocation
  • Added comprehensive integration tests covering all three cleanup scenarios

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| ami/ml/orchestration/jobs.py | Enhanced cleanup function to handle both Redis and NATS resources, with proper error handling and logging |
| ami/jobs/tasks.py | Integrated cleanup calls in job completion, failure, and revocation handlers with feature flag checks |
| ami/ml/orchestration/test_cleanup.py | Added comprehensive integration tests verifying cleanup works correctly in all three scenarios |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@ami/ml/orchestration/test_cleanup.py`:
- Around line 118-159: In _verify_resources_cleaned, change the broad exception
handling inside the async check_nats_resources (which calls
manager.js.stream_info and manager.js.consumer_info via TaskQueueManager) to
only treat nats.js.errors.NotFoundError as "not found" (set
stream_exists/consumer_exists = False) and re-raise any other exceptions so
connection/infra errors fail the test; import or reference NotFoundError from
nats.js.errors and use it in the except clauses for the respective stream and
consumer checks.
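
A sketch of the narrowed exception handling requested above, assuming a nats-py JetStream context (`js`) with its `stream_info` and `consumer_info` coroutines. The helper names `stream_exists` and `consumer_exists` are hypothetical, and the `ImportError` fallback exists only so the sketch runs where nats-py is not installed; the real test would import `NotFoundError` from `nats.js.errors` directly.

```python
import asyncio

try:
    from nats.js.errors import NotFoundError
except ImportError:  # fallback for environments without nats-py
    class NotFoundError(Exception):
        """Local stand-in for nats.js.errors.NotFoundError."""


async def stream_exists(js, stream_name: str) -> bool:
    """True if the stream exists; only NotFoundError means it is gone."""
    try:
        await js.stream_info(stream_name)
    except NotFoundError:
        return False
    # Any other exception (connection loss, timeout) propagates to the caller.
    return True


async def consumer_exists(js, stream_name: str, consumer_name: str) -> bool:
    """True if the consumer exists; only NotFoundError means it is gone."""
    try:
        await js.consumer_info(stream_name, consumer_name)
    except NotFoundError:
        return False
    return True
```

With this shape, a broken NATS connection surfaces as a test error instead of being misread as "resources cleaned up".
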
🧹 Nitpick comments (1)
ami/ml/orchestration/jobs.py (1)

33-53: Capture stack traces on cleanup failures for easier diagnosis.

job.logger.error drops the traceback; job.logger.exception preserves context without changing behavior.

🔧 Suggested update
-    except Exception as e:
-        job.logger.error(f"Error cleaning up Redis state for job {job.pk}: {e}")
+    except Exception:
+        job.logger.exception(f"Error cleaning up Redis state for job {job.pk}")
...
-    except Exception as e:
-        job.logger.error(f"Error cleaning up NATS resources for job {job.pk}: {e}")
+    except Exception:
+        job.logger.exception(f"Error cleaning up NATS resources for job {job.pk}")
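
To illustrate the difference the suggestion relies on, here is a small standalone demo using only the standard `logging` module. The logger name, message text, and `log_failure` helper are illustrative and not taken from the PR; the point is that `Logger.exception` logs at ERROR level and appends the active traceback, while `Logger.error` records only the formatted message.

```python
import io
import logging


def log_failure(use_exception: bool) -> str:
    """Log a simulated cleanup failure and return the captured output."""
    stream = io.StringIO()
    logger = logging.getLogger(f"cleanup_demo_{use_exception}")
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        raise ConnectionError("redis unavailable")
    except Exception as e:
        if use_exception:
            # Same level as error(), plus the traceback of the active exception.
            logger.exception("Error cleaning up Redis state for job 66")
        else:
            logger.error(f"Error cleaning up Redis state for job 66: {e}")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

Calling `log_failure(True)` yields output containing a `Traceback (most recent call last):` section, while `log_failure(False)` yields a single line with the message.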

@carlosgjs carlosgjs requested a review from mihow February 5, 2026 01:37
@mihow mihow merged commit c92a9f7 into RolnickLab:main Feb 7, 2026
7 checks passed
Development

Successfully merging this pull request may close these issues.

PSv2: Implement queue clean-up upon job completion

3 participants