## Current behavior
Locket's gRPC handler methods (`Lock`, `Release`, `Fetch`, `FetchAll`) pass the inbound gRPC request context directly to `db.BeginTx()`. When a client's gRPC deadline fires (default 5s for Rep's Lock RPC), the context cancellation propagates into the in-flight database transaction, killing it with `context canceled` in `failed-starting-transaction`.
This causes the following cascade under network degradation:
- Network delay pushes the Lock() RPC past the 5s gRPC deadline
- gRPC context cancellation fires
- `db.BeginTx()` receives the cancelled context → `failed-starting-transaction: context canceled`
- Lock renewal fails → TTL expires → `lock-expired`
- The cell re-acquires the lock on the next successful retry, but the cycle repeats every ~60-90s per cell
- At production scale (~100 cells/AZ), the cancelled transactions poison the DB connection pool, causing cross-zone contamination
Observed in production: ~50% cell lock drop across both AZs during a single-AZ network degradation event (2026-04-21, eu01-canary landscape).
Reproduced in test: lod-aws-0421 landscape with 2500ms delay + 10% packet loss on z1 cells:
- Baseline (no fix): 14 `lock-expired` events, all 8 z1 cells cycling
- With context detach patch: 0 `lock-expired` events under identical conditions
## Root cause
In `handlers/handler.go`, each private method passes the gRPC `ctx` to the DB layer:
```go
// handler.go line 177 (current main branch)
lock, err := h.db.Lock(ctx, logger, req.Resource, req.TtlInSeconds)
```
The DB layer uses this context for `sql.BeginTx()`. When the gRPC client deadline fires, Go's `context` package cancels the DB transaction, even if the transaction would have completed in under 100ms.
The same pattern exists at:
- `h.db.Release(ctx, ...)`
- `h.db.Fetch(ctx, ...)`
- `h.db.FetchAll(ctx, ...)`
## Desired behavior
DB operations should use an independent context (`context.Background()` with a server-side timeout) so that client-side gRPC deadline expiry does not cancel in-flight database transactions. The gRPC context should still be used for cancellation/deadline metrics in `monitorRequest`.
## Affected Version
Locket as shipped in diego-release v2.133.0 (and all prior versions — this pattern has existed since the handler was written).
## Evidence
Locket stderr during chaos test (baseline):
{"timestamp":"...","source":"locket","message":"locket.lock.failed-locking-lock","data":{"error":"failed-starting-transaction: context canceled","key":"<cell-uuid>","owner":"<cell-rep>"}}
Followed immediately by:
{"timestamp":"...","source":"locket","message":"locket.lock-expired","data":{"key":"<cell-uuid>"}}
## Fix
PR: (will link below)