Currently, the EventProducer uses the incoming triggerID (from the Pub/Sub message ID or Cloud Event ID) as the workerID when acquiring the SavedSearchState lock in Spanner.
The Problem:
This approach is unsafe for distributed locking. If a worker process stalls (e.g., GC pause) and the lock expires, a second worker might pick up the retry of the same message. Since the triggerID is identical, the second worker acquires the lock with the same ID. If the first "zombie" worker wakes up, the database cannot distinguish between them because they share the same ID. This defeats the fencing token protection and could lead to data corruption (Split Brain) if the zombie worker overwrites the state.
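The failure sequence can be replayed with a small in-memory model. This is only an illustrative sketch: `lockRow` and its methods are hypothetical stand-ins for the lock columns on SavedSearchState, not the real Spanner code, and lease expiry is triggered by hand rather than by a timer.

```go
package main

import "fmt"

// lockRow is a hypothetical in-memory stand-in for the lock columns
// stored alongside SavedSearchState. It is not the real Spanner schema.
type lockRow struct {
	workerID string
	held     bool
}

// acquire models a worker taking the lock once no live lease exists.
func (l *lockRow) acquire(workerID string) {
	l.workerID = workerID
	l.held = true
}

// expire models the lease timing out while its holder is stalled.
func (l *lockRow) expire() { l.held = false }

// ownedBy models the fencing check: a write is accepted only when the
// caller's ID matches the current lock holder.
func (l *lockRow) ownedBy(workerID string) bool {
	return l.held && l.workerID == workerID
}

// zombiePassesFencing replays the scenario: worker A acquires the lock and
// stalls, the lease expires, worker B acquires the lock, then A wakes up
// and attempts a fenced write. It reports whether A's write is accepted.
func zombiePassesFencing(idA, idB string) bool {
	var row lockRow
	row.acquire(idA)
	row.expire()
	row.acquire(idB)
	return row.ownedBy(idA)
}

func main() {
	// Both attempts reuse the triggerID: the zombie is indistinguishable.
	fmt.Println(zombiePassesFencing("trigger-123", "trigger-123")) // true
	// Each attempt has its own ID: the zombie is rejected.
	fmt.Println(zombiePassesFencing("worker-a", "worker-b")) // false
}
```

With a shared ID the stale worker passes the ownership check; with distinct IDs it does not.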
The Solution:
The ProcessSearch method in the EventProducer must be updated to generate a new, random UUID (v4) at the start of every execution. This UUID should be used exclusively as the workerID for locking operations:
- TryAcquireSavedSearchStateWorkerLock
- PublishSavedSearchNotificationEvent (for the fencing check)
- ReleaseSavedSearchStateWorkerLock
Acceptance Criteria:
- ProcessSearch generates a unique workerUUID.
- The workerUUID is passed to the lock acquisition method instead of triggerID.
- The workerUUID is passed to the publish/release methods to verify ownership.
- The triggerID is maintained for tracing and as the EventID for the resulting notification, but it is not used for the lock identity.