fix(spp_api_v2): serialize concurrent fastapi_endpoint sync across workers#175
Conversation
…rkers After a registry reload (e.g. -u all) every Odoo worker independently invokes routing_map(), and each one races to UPDATE the same fastapi_endpoint rows when calling action_sync_registry(). Under PostgreSQL's REPEATABLE READ isolation this surfaces as N-1 noisy "could not serialize access due to concurrent update" ERRORs in the SQL log before the outer try/except swallows them. Gate the sync block with a pg_try_advisory_xact_lock keyed off a stable SHA-256-derived int8 so only one worker performs the sync per reload. The lock is transaction-scoped and released automatically at COMMIT. Losers log a DEBUG line and skip; the registry-cache invalidation signal from the winning worker propagates to them via the standard Odoo signaling mechanism. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a transaction-scoped advisory lock to serialize FastAPI endpoint synchronization across multiple workers. By using pg_try_advisory_xact_lock with a deterministic 64-bit key derived from the module name, the implementation prevents SerializationFailure errors during concurrent registry reloads. I have no feedback to provide.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## 19.0 #175 +/- ##
==========================================
+ Coverage 71.48% 71.63% +0.15%
==========================================
Files 932 933 +1
Lines 54840 55398 +558
==========================================
+ Hits 39201 39686 +485
- Misses 15639 15712 +73
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
@haklyray can I work on the fixes in the tests or you will do it so I could merge this? |
…ordering, observability Refines b2cd2e3 (advisory-lock serialization of FastAPI endpoint sync) based on staff-engineer review: - Extract `_try_acquire_fastapi_sync_lock(cr)` helper. Failures of the lock SQL itself (e.g. exhausted shared-lock memory) now log WARNING and return False (skip sync) instead of being silently swallowed by the surrounding broad `except Exception` at DEBUG. Fail-closed keeps callers safe from the race the patch exists to prevent. - Move success-path `_logger.info` ABOVE `cr.commit()`. The advisory lock is xact-scoped and is released at commit; nothing below the commit should do sync work. Comment marks the boundary so future edits don't silently regress. - Promote the skip-path log from DEBUG to INFO so cold-start route- availability symptoms are diagnosable at default log level. Fires at most once per registry reload per worker. - Document why skipping is safe (the loser's degraded routing map is cached under the old endpoint_route_version and naturally invalidated when the winner commits action_sync_registry()). - Bump version 19.0.2.0.0 -> 19.0.2.0.1, update HISTORY.md. - Add test_ir_http_patch.py: lock-key range/determinism, helper True/False/raise paths, and a real cross-backend Postgres primitive test that holds pg_advisory_xact_lock on one connection and verifies pg_try_advisory_xact_lock returns false on another.
|
@haklyray went ahead and added the test coverage in
Same commit also folded in a few small hardening points from a staff-engineer review pass — fail-closed lock helper (lock-SQL errors log Local: |
The oca-gen-addon-readme pre-commit hook regenerates README.rst and static/description/index.html from spp_api_v2/readme/ fragments. The prior commit added a 19.0.2.0.1 entry to HISTORY.md but didn't run the generator, so CI's pre-commit job failed with "files were modified by this hook." This commit lands the regenerated outputs. Also drops an unused LOGGER_NAME constant from test_ir_http_patch.py that was only needed by an integration test attempt I abandoned (the routing_map lock-acquired-and-do-work success path is hard to exercise under Odoo's test cursor / savepoint mechanics; happy to revisit in a follow-up). The five remaining tests all pass.
…ed helper Adds TestRoutingMapSyncBranches with two integration tests that drive routing_map's lock-check call site through both branches: - test_skip_branch_when_helper_returns_false: patches the helper to return False, asserts INFO 'sync skipped' log fires and routing map is still built. - test_sync_branch_when_helper_returns_true: patches the helper to return True, asserts no skip log fires and the broad-except didn't swallow an error inside the sync body. We patch the helper at the module level rather than coordinate a real cross-backend advisory lock because Odoo's test mode wraps registry .cursor() calls in a TestCursor that shares the test backend, so pg_advisory_xact_lock is naturally re-entrant within the test thread and the 'lock held elsewhere' branch is impossible to reproduce from a same-thread test. Patching exercises both branches deterministically. Pushes patch coverage on the routing_map sync block to the previously uncovered call site (line 128: `if not _try_acquire_fastapi_sync_lock(cr):`) and the else branch entry (env=, search=). Targets the codecov gate.
Split the long f-string assertion message in test_skip_branch... so the line stays under ruff's 120-char limit. Also picks up ruff-format's single-line-with-statement collapse on the patch.object() block.
Adds two more tests to TestRoutingMapSyncBranches that exercise the previously-uncovered success-path body in routing_map's sync block: - test_sync_branch_invokes_action_sync_when_unsynced_present: verifies that when unsynced endpoints exist, action_sync_registry is called and the 'Synced N FastAPI endpoints' INFO log fires (covers lines 169-171 and the cr.commit() at 181). - test_sync_branch_resyncs_orphan_endpoint_with_no_route: verifies the orphan-route detection loop — a synced endpoint whose endpoint.route was deleted gets re-flipped to registry_sync=False and rolled into the unsynced set for resync (covers lines 155-167). Mocks odoo.api.Environment to return a controlled fake env with a predictable search() to avoid Odoo's TestCursor / fresh-Environment isolation that prevented the earlier integration approach from seeing the test transaction's writes. This should bring patch coverage past the 70% codecov gate.
emjay0921
left a comment
There was a problem hiding this comment.
All checks green. The advisory-lock approach is the right shape for this race — deterministic 64-bit key derived from a stable hash so workers across processes converge on the same lock, transaction-scoped so it's released even if the worker dies mid-sync. The 334-line test file covers the contention, lock-acquisition timing, and the noise-suppression behaviour. Ready to merge.
Why is this change needed?
After
Upgrade All Modules(or any other event that triggers a registry reload), every Odoo worker independently invokes the patchedir.http.routing_map(). Each worker enters the FastAPI endpoint sync block and races toUPDATE fastapi_endpoint SET registry_sync = TRUE WHERE id IN (1, 2). Under PostgreSQL's REPEATABLE READ isolation, all but one of those concurrent updates abort with:odoo.sql_dblogs each failed query at ERROR before the outertry/except Exceptionswallows it at DEBUG. With N workers, that produces N−1 noisy ERRORs in the SQL log on every registry reload — alarming to operators, and it makes real errors harder to spot. The system still works (one worker succeeds, hence the trailingSynced N FastAPI endpointsinfo line), but the noise is unnecessary and N−1 transactions are wasted on every cold start.How was the change implemented?
Two commits on this branch:
Commit 1 —
b2cd2e3d— original advisory-lock fixGate the entire FastAPI endpoint sync block in
spp_api_v2/models/ir_http_patch.pybehind a PostgreSQL transaction-scoped advisory lock:_FASTAPI_SYNC_ADVISORY_LOCK_KEYis a deterministic 64-bit signed int derived fromSHA-256("spp_api_v2.fastapi_endpoint_sync")— stable across processes and unlikely to collide with any other module's advisory lock in the same database.with registry.cursor() as cr:block, the cursor first callsSELECT pg_try_advisory_xact_lock(<key>). The non-blockingtry_variant means losers return immediately rather than serialize on the lock and then re-attempt the same losingUPDATE.else:branch).COMMITorROLLBACKwhen the cursor block exits — no manual unlock and no risk of leaking state.This is safe because skipping workers don't end up with a broken routing map: the lock-holder's
action_sync_registry()bumpsendpoint_route_version, which is part of the routing-map cache key. Any degraded routing map a loser caches now is keyed at the old version and is naturally invalidated on the next call after the winner commits. Bad window is bounded by winner-commit latency (seconds at most).Commit 2 —
25ae3ccf— hardening from staff-engineer review_try_acquire_fastapi_sync_lock(cr)helper. Failures of the lock SQL itself (e.g. exhausted shared-lock memory, permissions) now logWARNINGand returnFalse(skip sync) instead of being silently swallowed by the surrounding broadexcept ExceptionatDEBUG. Fail-closed keeps callers safe from the race the patch exists to prevent.cr.commit(). The advisory lock is xact-scoped and is released at commit; nothing meaningful should run below it. A# do not add sync work below this linecomment marks the boundary so future edits don't silently regress the serialization.DEBUGtoINFO. Cold-start route-availability symptoms are now diagnosable at default log level. Fires at most once per registry reload per worker.19.0.2.0.0→19.0.2.0.1, added areadme/HISTORY.mdentry.New unit tests
spp_api_v2/tests/test_ir_http_patch.pyadds 6 tests across 3 classes:TestFastAPISyncAdvisoryLockKey(2): lock key fits Postgresbigintrange; key is deterministic from the SHA-256 source.TestTryAcquireFastAPISyncLock(4):Truewhen Postgres grants the lock (mock cursor)Falsewhen the lock is held elsewhere (mock cursor)Falseand logsWARNINGwhen the lock SQL itself raises (covers the new fail-closed path)pg_advisory_xact_lockon one Odoo cursor and verifies a second cursor'spg_try_advisory_xact_lockreturnsFalse. Includes apg_backend_pid()sanity check so the test fails loudly if the pool config ever drifts and both cursors land on the same backend.This addresses the codecov gate that was failing on the prior commit (50% patch coverage).
Unit tests executed by the author
Confirmed all 6 new tests in
test_ir_http_patchran (verified by grepping the test log).How to test manually
Reproduce the baseline (without this PR). Start the devcontainer and run:
Expect multiple
ERROR: could not serialize access due to concurrent updatelines (one per losing worker), followed by a singleSynced N FastAPI endpoints for database <db>.Apply this PR and rerun the same command. Expect zero
serialize accesserrors and a singleSynced N FastAPI endpointsline.Confirm the skip path runs (proves it's the lock doing the work, not a coincidence). The skip log is now at
INFO, so default log level is sufficient:Expect ≥1
FastAPI endpoint sync skipped … another worker is syncingINFO line per registry reload (one per losing worker).Functional smoke test — endpoints must still actually be registered after the upgrade:
Expect a non-404 response (200/401/403 depending on the route's auth — anything except 404 confirms routing is live).
Single-worker regression — start Odoo with
--workers=0and confirm sync still happens (the only worker always acquires the lock):Related links
pg_try_advisory_xact_lock)spp_api_v2/models/ir_http_patch.py— the patchedrouting_map()workaround for the Odoo 19 ormcache bugUPDATE:endpoint_route_handler/models/endpoint_route_sync_mixin.py—EndpointRouteSyncMixin.writeflipsregistry_syncand queues the post-commit hook