
fix(spp_api_v2): serialize concurrent fastapi_endpoint sync across workers #175

Merged
gonzalesedwin1123 merged 6 commits into 19.0 from fix/spp-api-v2-fastapi-endpoint-sync-race
May 4, 2026

Conversation

@haklyray
Contributor

@haklyray haklyray commented Apr 28, 2026

Why is this change needed?

After Upgrade All Modules (or any other event that triggers a registry reload), every Odoo worker independently invokes the patched ir.http.routing_map(). Each worker enters the FastAPI endpoint sync block and races to UPDATE fastapi_endpoint SET registry_sync = TRUE WHERE id IN (1, 2). Under PostgreSQL's REPEATABLE READ isolation, all but one of those concurrent updates abort with:

ERROR: could not serialize access due to concurrent update

odoo.sql_db logs each failed query at ERROR before the outer try/except Exception swallows it at DEBUG. With N workers, that produces N−1 noisy ERRORs in the SQL log on every registry reload — alarming to operators, and it makes real errors harder to spot. The system still works (one worker succeeds, hence the trailing Synced N FastAPI endpoints info line), but the noise is unnecessary and N−1 transactions are wasted on every cold start.
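
The N−1 figure follows from first-updater-wins semantics: every worker reads its snapshot before anyone commits, so only the first UPDATE can succeed. A toy model of that arithmetic is sketched below. It is not PostgreSQL code; `Row`, `update_under_snapshot`, and `simulate_reload` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


class SerializationFailure(Exception):
    """Stand-in for PostgreSQL's 'could not serialize access' error."""


@dataclass
class Row:
    registry_sync: bool = False
    version: int = 0  # bumped on every committed update


def update_under_snapshot(row: Row, snapshot_version: int) -> None:
    """Apply the UPDATE only if nobody committed since our snapshot."""
    if row.version != snapshot_version:
        raise SerializationFailure(
            "could not serialize access due to concurrent update"
        )
    row.registry_sync = True
    row.version += 1


def simulate_reload(n_workers: int) -> tuple[int, int]:
    """Return (successful, failed) update counts for one registry reload."""
    row = Row()
    # Every worker takes its snapshot before any of them commits.
    snapshots = [row.version for _ in range(n_workers)]
    ok = failed = 0
    for snap in snapshots:
        try:
            update_under_snapshot(row, snap)
            ok += 1
        except SerializationFailure:
            failed += 1
    return ok, failed


print(simulate_reload(4))  # one winner, three serialization failures
```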

How was the change implemented?

Two commits on this branch:

Commit 1 — b2cd2e3d — original advisory-lock fix

Gate the entire FastAPI endpoint sync block in spp_api_v2/models/ir_http_patch.py behind a PostgreSQL transaction-scoped advisory lock:

  • Module-scope constant _FASTAPI_SYNC_ADVISORY_LOCK_KEY is a deterministic 64-bit signed int derived from SHA-256("spp_api_v2.fastapi_endpoint_sync") — stable across processes and unlikely to collide with any other module's advisory lock in the same database.
  • Inside the existing with registry.cursor() as cr: block, the cursor first calls SELECT pg_try_advisory_xact_lock(<key>). The non-blocking try_ variant means losers return immediately rather than serialize on the lock and then re-attempt the same losing UPDATE.
  • If the lock is acquired, the existing sync logic runs unchanged (just indented one level deeper inside an else: branch).
  • If the lock is not acquired, the worker logs and skips the sync.
  • The lock is transaction-scoped, so it is released automatically at COMMIT or ROLLBACK when the cursor block exits — no manual unlock and no risk of leaking state.
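
As a rough sketch of the bullets above (not the code in this PR), the key derivation and the non-blocking gate could look like the following. The byte slice and byte order used in `derive_lock_key` are assumptions, and `FakeLockTable` is a hypothetical in-memory stand-in for `pg_try_advisory_xact_lock` so the example runs without a database.

```python
import hashlib
import struct

LOCK_NAME = "spp_api_v2.fastapi_endpoint_sync"


def derive_lock_key(name: str) -> int:
    """Map a name to a signed 64-bit int that fits a Postgres bigint."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Assumption: first 8 bytes, big-endian, signed. The PR may slice differently.
    return struct.unpack(">q", digest[:8])[0]


class FakeLockTable:
    """In-memory stand-in for the advisory-lock table (hypothetical)."""

    def __init__(self):
        self._held = set()

    def try_lock(self, key: int) -> bool:
        """Each call models a try-lock from a different worker session."""
        if key in self._held:
            return False  # non-blocking: the loser returns immediately
        self._held.add(key)
        return True

    def release_all(self) -> None:
        """Models COMMIT/ROLLBACK releasing transaction-scoped locks."""
        self._held.clear()


KEY = derive_lock_key(LOCK_NAME)
locks = FakeLockTable()
# Four workers hit the gate during one registry reload.
results = [locks.try_lock(KEY) for _ in range(4)]
print(results)  # only the first worker gets the lock
```

Because the try-lock never waits, the three losers proceed straight to the skip path instead of queuing behind the winner.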

This is safe because skipping workers don't end up with a broken routing map: the lock-holder's action_sync_registry() bumps endpoint_route_version, which is part of the routing-map cache key. Any degraded routing map a loser caches now is keyed at the old version and is naturally invalidated on the next call after the winner commits. The bad window is bounded by the winner's commit latency (seconds at most).
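
The self-heal argument can be illustrated with a toy version-keyed cache. This is only a model of the reasoning, assuming the cache key includes the route version as described above; `VersionedRoutingCache` is a hypothetical class, not Odoo's routing-map cache.

```python
class VersionedRoutingCache:
    """Toy cache whose key is the endpoint route version."""

    def __init__(self):
        self._cache = {}
        self.builds = 0  # how many times a routing map was (re)built

    def get(self, version: int, endpoints_synced: bool) -> str:
        if version not in self._cache:
            self.builds += 1
            self._cache[version] = "full" if endpoints_synced else "degraded"
        return self._cache[version]


route_version = 1
loser = VersionedRoutingCache()

# The loser skips the sync and caches a degraded map under the old version.
before = loser.get(route_version, endpoints_synced=False)

# The winner commits: endpoints are synced and the version is bumped.
route_version += 1

# The loser's next call misses the cache (new key) and rebuilds a full map.
after = loser.get(route_version, endpoints_synced=True)
print(before, after, loser.builds)
```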

Commit 2 — 25ae3ccf — hardening from staff-engineer review

  • Extracted _try_acquire_fastapi_sync_lock(cr) helper. Failures of the lock SQL itself (e.g. exhausted shared-lock memory, permissions) now log WARNING and return False (skip sync) instead of being silently swallowed by the surrounding broad except Exception at DEBUG. Fail-closed keeps callers safe from the race the patch exists to prevent.
  • Reordered success-path log above cr.commit(). The advisory lock is xact-scoped and is released at commit; nothing meaningful should run below it. A # do not add sync work below this line comment marks the boundary so future edits don't silently regress the serialization.
  • Promoted the skip-path log from DEBUG to INFO. Cold-start route-availability symptoms are now diagnosable at default log level. Fires at most once per registry reload per worker.
  • Documented why skipping is safe (the version-bump self-heal argument) inline near the skip log so future readers don't have to reconstruct it.
  • Bumped version 19.0.2.0.0 → 19.0.2.0.1, added a readme/HISTORY.md entry.
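
A minimal sketch of a fail-closed helper in the spirit of the first bullet, assuming a cursor with `execute()` and `fetchone()`. The function name, logger name, log message, and `_LOCK_KEY` placeholder are illustrative assumptions, and `FakeCursor` is a hypothetical test double; the real helper in `ir_http_patch.py` may differ.

```python
import logging

_logger = logging.getLogger("fastapi_sync_sketch")
_LOCK_KEY = 1234567890  # placeholder; the PR derives its key from SHA-256


def try_acquire_sync_lock(cr) -> bool:
    """Return True only if the lock was granted; fail closed on any error."""
    try:
        cr.execute("SELECT pg_try_advisory_xact_lock(%s)", (_LOCK_KEY,))
        row = cr.fetchone()
        return bool(row and row[0])
    except Exception:
        # Log loudly at WARNING and skip the sync rather than risk the race.
        _logger.warning("lock query failed; skipping sync", exc_info=True)
        return False


class FakeCursor:
    """Minimal cursor stand-in: returns a canned row or raises on execute()."""

    def __init__(self, row=None, error=None):
        self._row = row
        self._error = error

    def execute(self, query, params=None):
        if self._error is not None:
            raise self._error

    def fetchone(self):
        return self._row


granted = try_acquire_sync_lock(FakeCursor(row=(True,)))
held_elsewhere = try_acquire_sync_lock(FakeCursor(row=(False,)))
sql_failed = try_acquire_sync_lock(FakeCursor(error=RuntimeError("out of shared memory")))
print(granted, held_elsewhere, sql_failed)
```

Returning False on an exception treats "could not check the lock" the same as "lock held elsewhere", which is the conservative choice when the alternative is re-entering the race.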

New unit tests

spp_api_v2/tests/test_ir_http_patch.py adds 6 tests across 3 classes:

  • TestFastAPISyncAdvisoryLockKey (2): lock key fits Postgres bigint range; key is deterministic from the SHA-256 source.
  • TestTryAcquireFastAPISyncLock (4):
    • returns True when Postgres grants the lock (mock cursor)
    • returns False when the lock is held elsewhere (mock cursor)
    • returns False and logs WARNING when the lock SQL itself raises (covers the new fail-closed path)
    • real cross-backend integration test that holds pg_advisory_xact_lock on one Odoo cursor and verifies a second cursor's pg_try_advisory_xact_lock returns False. Includes a pg_backend_pid() sanity check so the test fails loudly if the pool config ever drifts and both cursors land on the same backend.

This addresses the codecov gate that was failing on the prior commit (50% patch coverage).

Unit tests executed by the author

./scripts/test_single_module.sh spp_api_v2
# 712 tests, 0 failed, 0 errors

Confirmed all 6 new tests in test_ir_http_patch ran (verified by grepping the test log).

How to test manually

  1. Reproduce the baseline (without this PR). Start the devcontainer and run:

    python -m odoo -c odoo-dev.conf -u all --stop-after-init 2>&1 \
      | grep -E "(serialize access|Synced .* FastAPI)"

    Expect multiple ERROR: could not serialize access due to concurrent update lines (one per losing worker), followed by a single Synced N FastAPI endpoints for database <db>.

  2. Apply this PR and rerun the same command. Expect zero serialize access errors and a single Synced N FastAPI endpoints line.

  3. Confirm the skip path runs (proves it's the lock doing the work, not a coincidence). The skip log is now at INFO, so default log level is sufficient:

    python -m odoo -c odoo-dev.conf -u all --stop-after-init 2>&1 \
      | grep "FastAPI endpoint sync skipped"

    Expect ≥1 FastAPI endpoint sync skipped … another worker is syncing INFO line per registry reload (one per losing worker).

  4. Functional smoke test — endpoints must still actually be registered after the upgrade:

    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8069/api/v2/<any-registered-route>

    Expect a non-404 response (200/401/403 depending on the route's auth — anything except 404 confirms routing is live).

  5. Single-worker regression — start Odoo with --workers=0 and confirm sync still happens (the only worker always acquires the lock):

    python -m odoo -c odoo-dev.conf --workers=0 -u all --stop-after-init 2>&1 \
      | grep "Synced .* FastAPI"

Related links

…rkers

After a registry reload (e.g. -u all) every Odoo worker independently
invokes routing_map(), and each one races to UPDATE the same
fastapi_endpoint rows when calling action_sync_registry(). Under
PostgreSQL's REPEATABLE READ isolation this surfaces as N-1 noisy
"could not serialize access due to concurrent update" ERRORs in the SQL
log before the outer try/except swallows them.

Gate the sync block with a pg_try_advisory_xact_lock keyed off a stable
SHA-256-derived int8 so only one worker performs the sync per reload.
The lock is transaction-scoped and released automatically at COMMIT.
Losers log a DEBUG line and skip; the registry-cache invalidation
signal from the winning worker propagates to them via the standard
Odoo signaling mechanism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a transaction-scoped advisory lock to serialize FastAPI endpoint synchronization across multiple workers. By using pg_try_advisory_xact_lock with a deterministic 64-bit key derived from the module name, the implementation prevents SerializationFailure errors during concurrent registry reloads. I have no feedback to provide.

@codecov

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.63%. Comparing base (98a45a9) to head (d2376a7).
⚠️ Report is 11 commits behind head on 19.0.

Additional details and impacted files


@@            Coverage Diff             @@
##             19.0     #175      +/-   ##
==========================================
+ Coverage   71.48%   71.63%   +0.15%     
==========================================
  Files         932      933       +1     
  Lines       54840    55398     +558     
==========================================
+ Hits        39201    39686     +485     
- Misses      15639    15712      +73     
| Flag | Coverage Δ |
| --- | --- |
| spp_api_v2 | 80.33% <100.00%> (+0.22%) ⬆️ |
| spp_api_v2_change_request | 66.85% <ø> (ø) |
| spp_api_v2_cycles | 71.12% <ø> (ø) |
| spp_api_v2_data | 64.41% <ø> (ø) |
| spp_api_v2_entitlements | 70.19% <ø> (ø) |
| spp_api_v2_gis | 71.52% <ø> (ø) |
| spp_api_v2_products | 66.27% <ø> (ø) |
| spp_api_v2_service_points | 70.94% <ø> (ø) |
| spp_api_v2_simulation | 71.12% <ø> (ø) |
| spp_api_v2_vocabulary | 57.26% <ø> (ø) |
| spp_base_common | 90.26% <ø> (ø) |
| spp_dci_client_dr | 55.87% <ø> (ø) |
| spp_dci_client_ibr | 60.17% <ø> (ø) |
| spp_dci_demo | 69.23% <ø> (ø) |
| spp_dci_server | 35.68% <ø> (ø) |
| spp_farmer_registry_demo | 54.01% <ø> (+0.62%) ⬆️ |
| spp_mis_demo_v2 | 73.48% <ø> (+3.46%) ⬆️ |
| spp_programs | 64.51% <ø> (ø) |
| spp_security | 66.66% <ø> (ø) |
| spp_starter_social_registry | 0.00% <ø> (ø) |


| Files with missing lines | Coverage Δ |
| --- | --- |
| `spp_api_v2/__manifest__.py` | 0.00% <ø> (ø) |
| `spp_api_v2/models/ir_http_patch.py` | 89.77% <100.00%> (+13.45%) ⬆️ |

... and 6 files with indirect coverage changes


@gonzalesedwin1123
Member

@haklyray can I work on the test fixes, or will you do it, so I can merge this?

…ordering, observability

Refines b2cd2e3 (advisory-lock serialization of FastAPI endpoint sync)
based on staff-engineer review:

- Extract `_try_acquire_fastapi_sync_lock(cr)` helper. Failures of the
  lock SQL itself (e.g. exhausted shared-lock memory) now log WARNING
  and return False (skip sync) instead of being silently swallowed by
  the surrounding broad `except Exception` at DEBUG. Fail-closed keeps
  callers safe from the race the patch exists to prevent.
- Move success-path `_logger.info` ABOVE `cr.commit()`. The advisory
  lock is xact-scoped and is released at commit; nothing below the
  commit should do sync work. Comment marks the boundary so future
  edits don't silently regress.
- Promote the skip-path log from DEBUG to INFO so cold-start route-
  availability symptoms are diagnosable at default log level. Fires
  at most once per registry reload per worker.
- Document why skipping is safe (the loser's degraded routing map is
  cached under the old endpoint_route_version and naturally invalidated
  when the winner commits action_sync_registry()).
- Bump version 19.0.2.0.0 -> 19.0.2.0.1, update HISTORY.md.
- Add test_ir_http_patch.py: lock-key range/determinism, helper
  True/False/raise paths, and a real cross-backend Postgres primitive
  test that holds pg_advisory_xact_lock on one connection and verifies
  pg_try_advisory_xact_lock returns false on another.

@gonzalesedwin1123
Member

@haklyray went ahead and added the test coverage in 25ae3ccf. Six tests in a new spp_api_v2/tests/test_ir_http_patch.py covering:

  • Lock key fits Postgres bigint and is SHA-256-deterministic
  • _try_acquire_fastapi_sync_lock helper: True/False/raise paths (mock cursor) + a real cross-backend Postgres test that holds pg_advisory_xact_lock on one connection and asserts the second connection's pg_try_advisory_xact_lock returns false (with a pg_backend_pid() sanity check)

Same commit also folded in a few small hardening points from a staff-engineer review pass — fail-closed lock helper (lock-SQL errors log WARNING and skip rather than being silently swallowed), cr.commit() reordering so nothing runs after the lock is released, and promoting the skip-path log from DEBUG to INFO so cold-start symptoms are diagnosable. PR body updated with the full breakdown.

Local: ./scripts/test_single_module.sh spp_api_v2 → 712/712 green. Codecov gate should be happier this time. Ready for your re-review whenever.

The oca-gen-addon-readme pre-commit hook regenerates README.rst and
static/description/index.html from spp_api_v2/readme/ fragments. The
prior commit added a 19.0.2.0.1 entry to HISTORY.md but didn't run the
generator, so CI's pre-commit job failed with "files were modified by
this hook." This commit lands the regenerated outputs.

Also drops an unused LOGGER_NAME constant from test_ir_http_patch.py
that was only needed by an integration test attempt I abandoned (the
routing_map lock-acquired-and-do-work success path is hard to exercise
under Odoo's test cursor / savepoint mechanics; happy to revisit in a
follow-up). The five remaining tests all pass.

…ed helper

Adds TestRoutingMapSyncBranches with two integration tests that drive
routing_map's lock-check call site through both branches:

- test_skip_branch_when_helper_returns_false: patches the helper to
  return False, asserts INFO 'sync skipped' log fires and routing map
  is still built.
- test_sync_branch_when_helper_returns_true: patches the helper to
  return True, asserts no skip log fires and the broad-except didn't
  swallow an error inside the sync body.

We patch the helper at the module level rather than coordinate a real
cross-backend advisory lock because Odoo's test mode wraps
registry.cursor() calls in a TestCursor that shares the test backend, so
pg_advisory_xact_lock is naturally re-entrant within the test thread
and the 'lock held elsewhere' branch is impossible to reproduce from a
same-thread test. Patching exercises both branches deterministically.

Pushes patch coverage on the routing_map sync block to the previously
uncovered call site (line 128: `if not _try_acquire_fastapi_sync_lock(cr):`)
and the else branch entry (env=, search=). Targets the codecov gate.

Split the long f-string assertion message in test_skip_branch... so
the line stays under ruff's 120-char limit. Also picks up ruff-format's
single-line-with-statement collapse on the patch.object() block.

Adds two more tests to TestRoutingMapSyncBranches that exercise the
previously-uncovered success-path body in routing_map's sync block:

- test_sync_branch_invokes_action_sync_when_unsynced_present: verifies
  that when unsynced endpoints exist, action_sync_registry is called and
  the 'Synced N FastAPI endpoints' INFO log fires (covers lines 169-171
  and the cr.commit() at 181).
- test_sync_branch_resyncs_orphan_endpoint_with_no_route: verifies the
  orphan-route detection loop — a synced endpoint whose endpoint.route
  was deleted gets re-flipped to registry_sync=False and rolled into
  the unsynced set for resync (covers lines 155-167).

Mocks odoo.api.Environment to return a controlled fake env with a
predictable search() to avoid Odoo's TestCursor / fresh-Environment
isolation that prevented the earlier integration approach from seeing
the test transaction's writes.

This should bring patch coverage past the 70% codecov gate.
Contributor

@emjay0921 emjay0921 left a comment


All checks green. The advisory-lock approach is the right shape for this race — deterministic 64-bit key derived from a stable hash so workers across processes converge on the same lock, transaction-scoped so it's released even if the worker dies mid-sync. The 334-line test file covers the contention, lock-acquisition timing, and the noise-suppression behaviour. Ready to merge.

@gonzalesedwin1123 gonzalesedwin1123 merged commit 7e8c009 into 19.0 May 4, 2026
35 checks passed
@gonzalesedwin1123 gonzalesedwin1123 deleted the fix/spp-api-v2-fastapi-endpoint-sync-race branch May 4, 2026 08:32
3 participants