fix(spp_programs): release stuck cycle/program lock when async pipeline fails#188
fix(spp_programs): release stuck cycle/program lock when async pipeline fails#188reichie020212 wants to merge 3 commits into19.0from
Conversation
…ne fails Async entitlement / cycle / payment / enrollment pipelines acquired is_locked=True before scheduling a queue.job group, but only released the flag inside the on_done callback. If any child job failed, the on_done was cascade-failed (never executed) and the cycle remained "Operation in progress" forever — even though no active queue jobs were left to drive it. Same vulnerability on spp.program for the enrollment flow. Wires a paired on_error() callback at every async site so the lock is cleared on both success and failure paths. Adds mark_*_as_failed companions for each existing mark_*_as_done, and hardens the success path so a chatter exception can't strand the lock either. Adds an admin-only action_force_unlock action (with audit chatter) on spp.cycle and spp.program for the residual case where neither callback fires at all (e.g. server killed mid-operation, pre-fix data). Requires job_worker >= 19.0.1.1.0 for the on_error() / run_on_failure semantics. Bumps spp_programs to 19.0.2.1.0. Signed-off-by: Red <redickbutay02@gmail.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## 19.0 #188 +/- ##
==========================================
- Coverage 71.63% 71.55% -0.08%
==========================================
Files 933 941 +8
Lines 55370 55030 -340
==========================================
- Hits 39664 39379 -285
+ Misses 15706 15651 -55
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Code Review
This pull request implements a comprehensive lock recovery mechanism for asynchronous pipelines across programs, cycles, and payments. Key changes include the addition of on_error callbacks to release 'Operation in progress' locks during job failures, wrapping chatter notifications in try-except blocks to ensure locks are cleared even if messaging fails, and introducing a 'Force Unlock' button for administrators. New unit tests verify these recovery paths. Review feedback suggests adding self.ensure_one() to the eligibility check completion and failure methods in the cycle manager to maintain consistency with other manager methods.
Two issues: 1. test_async_lock_recovery setUp didn't satisfy spp_cycle's NOT NULL start_date constraint and didn't request default managers, so 7 of the new tests errored on setUp. Helpers now pass start_date / end_date and use create_default_managers=True context (as the create-program wizards do) so get_manager(...) returns a record instead of None. 2. mark_check_eligibility_as_done and mark_check_eligibility_as_failed now call self.ensure_one() to match the rest of the mark_* family — raised by gemini-code-assist review. Signed-off-by: Red <redickbutay02@gmail.com>
emjay0921
left a comment
There was a problem hiding this comment.
LGTM — the core diagnosis (cascade-failed on_done strands the lock) is correct, and the fix surface is comprehensive. CI is green except codecov/patch, which is a coverage-percentage gate on the new mark_*_as_failed methods rather than a real failure — I wouldn't block on it.
What I verified:
on_error()wiring is complete. Everymain_job.on_done(...)callsite (cycle import / prepare_entitlement / check_eligibility, entitlement set_pending / validate / cancel — base, cash, in-kind variants, payment prepare / send, program enroll_eligible) now has a pairedon_error(...). No site I can find ships the on_done without the on_error.mark_*_as_donehardening is the right call. Switching tocycle.write({"is_locked": False, "locked_reason": False})beforemessage_post, then catching exceptions around the chatter post, means even a chatter-side failure (e.g. transient mail-thread error) can't leave the lock set. That's the same bug pattern the on_error fix targets, just on the success path.action_force_unlockis appropriately gated.groups="base.group_system"on the buttons + the confirm dialog + audit chatter line is exactly what an escape hatch should look like. The no-op-when-unlocked branch is also tested.- Tests cover all four manager classes plus the cycle/program force-unlock action. They use
TransactionCaseso rollback is clean. They don't drive a real async pipeline through to failure (which would need the full job_worker infra) — that's acceptable; the unit-level guarantees on the callbacks + the sites being correctly wired is a reasonable trade-off.
Three small notes (none blocking):
- Stats-refresh asymmetry on the program path.
mark_enroll_eligible_as_donerefreshes_compute_eligible_beneficiary_count+_compute_beneficiary_count; theas_failedcounterpart doesn't. If a partial enrollment lands before the group fails, the displayed counts could be a touch stale until the next compute trigger. Probably not worth a follow-up given how rare partial async failure should be in practice, but worth noting. except Exceptionaroundmessage_postis broad. Intentional and correct here (better to log the chatter problem than strand the lock), but a Sentry-grade observability layer would want a specific exception class to filter on. Future thing.programs_view.xmlForce Unlock button indentation is one level shallow — the new<button>opens at the column of<div>, not aligned with the prior<button>. Pre-commit accepted it, so non-blocking, but aprettierre-flow would tidy it.
Approving.
…r codecov Adds 17 tests across 4 new classes: - TestMarkAsDoneHardenedPaths (6): exercises every mark_*_as_done body for cycle (import / prepare / check_eligibility), entitlement, payment, and program. These are the lock-clear-before-chatter + try/except branches added in this PR that codecov reported as 0% covered. - TestMarkAsFailedChatterContent (5): pins each mark_*_as_failed test to a body-content assertion so the cycle.message_post(body=msg) line carries a behavioral check. - TestForceUnlockOnCycle (4): noop, audit-records-user, and (none) placeholder branches for spp.cycle.action_force_unlock. - TestForceUnlockOnProgramExtra (2): noop and (none) placeholder branches for spp.program.action_force_unlock that the existing class skipped. All 25 tests in the file pass: 0 failed, 0 error(s) of 25 tests.
|
@gonzalesedwin1123 — pushed What was added (all in
What I deliberately didn't cover: the Local results: Mind giving this another look when you have a moment? |
Summary
main_job.on_error(...)callback at every async site socycle.is_locked(andprogram.is_locked) is cleared on both success and failure paths — not only on success.mark_*_as_failedcompanions for every existingmark_*_as_done, and hardens the success-path methods so a chatter exception can't strand the lock either.action_force_unlock(groupbase.group_system) onspp.cycleandspp.program, with audit chatter, for the residual case where neither callback fires at all (e.g. server killed mid-operation, or pre-fix data still stuck on production).Why
The async entitlement / cycle / payment / enrollment pipelines acquire
is_locked=Truebefore scheduling aqueue.jobgroup, and only release the flag inside the on_done callback. Today, if any child job in the group fails, the on_done is cascade-failed and never executes — the cycle stays "Operation in progress" forever, even though no active jobs are left to drive it. The view (cycle_view.xml:238-249) renders the alert purely fromis_lockedand never inspects the queue, so users see a stuck warning.Reported externally: a 4Ps cycle was stuck in "Operation in progress: Set entitlements to pending validation for cycle" while the Queue Jobs view (filtered to active states
waiting/pending/started) showed nothing — the failed jobs were infailedstate, hidden by the default filter.Sites wired
main_job.on_error(...)added next to every existingmain_job.on_done(...)— 9 call sites:entitlement_manager_base.py:_set_pending_validation_entitlements_async,_validate_entitlements_async,_cancel_entitlements_asyncentitlement_manager_inkind.py:_set_pending_validation_entitlements_async,_validate_entitlements_asyncentitlement_manager_cash.py:_validate_entitlements_asynccycle_manager_base.py:_check_eligibility_async,_prepare_entitlements_async,_add_beneficiaries_asyncpayment_manager.py:_prepare_payments_async,_send_payments_asyncprogram_manager.py:_enroll_eligible_registrants_asyncHardening of the success path
Every
mark_*_as_donemethod now writesis_locked=Falsebefore callingmessage_post, withmessage_postwrapped intry/exceptthat logs to_logger.exception. Previously a chatter failure here would also strand the lock.Force Unlock escape hatch
When neither callback can fire — server killed mid-operation, or the lock was set before this fix landed — a manager-only "Force Unlock" button appears next to the "Operation in progress" alert. It clears
is_locked/locked_reasonand posts an audit chatter line naming the user and the prior reason.Dependency
Requires
job_worker >= 19.0.1.1.0foron_error()/run_on_failuresemantics. Companion PR: https://github.com/OpenSPP/odoo-job-worker/pull/new/fix/job-worker-on-error-callback. Manifest bumped 19.0.2.0.11 → 19.0.2.1.0.