[Locket] gRPC client context propagates to DB transactions causing cascading lock expiry #1129

@vlast3k

Description

Current behavior

Locket's gRPC handler methods (Lock, Release, Fetch, FetchAll) pass the inbound gRPC request context directly to db.BeginTx(). When a client's gRPC deadline fires (default 5s for Rep's Lock RPC), the cancellation propagates into the in-flight database transaction and aborts it with failed-starting-transaction: context canceled.

This causes the following cascade under network degradation:

  1. Network delay pushes Lock() RPC past the 5s gRPC deadline
  2. gRPC context cancellation fires
  3. db.BeginTx() receives the cancelled context → failed-starting-transaction: context canceled
  4. Lock renewal fails → TTL expires → lock-expired
  5. Cell re-acquires on next successful retry, but the cycle repeats every ~60-90s per cell
  6. At production scale (~100 cells/AZ), the cancelled transactions poison the DB connection pool, causing cross-zone contamination

Observed in production: ~50% cell lock drop across both AZs during a single-AZ network degradation event (2026-04-21, eu01-canary landscape).

Reproduced in test: lod-aws-0421 landscape with 2500ms delay + 10% packet loss on z1 cells:

  • Baseline (no fix): 14 lock-expired events, all 8 z1 cells cycling
  • With context detach patch: 0 lock-expired events under identical conditions

Root cause

In handlers/handler.go, each private method passes the gRPC ctx to the DB layer:

// handler.go line 177 (current main branch)
lock, err := h.db.Lock(ctx, logger, req.Resource, req.TtlInSeconds)

The DB layer uses this context for sql.DB.BeginTx(). When the gRPC client deadline fires, the cancellation aborts the in-flight DB transaction — even if the transaction would have completed in <100ms.

The same pattern exists at:

  • h.db.Release(ctx, ...)
  • h.db.Fetch(ctx, ...)
  • h.db.FetchAll(ctx, ...)

Desired behavior

DB operations should use an independent context (context.Background() with a server-side timeout) so that client-side gRPC deadline expiry does not cancel in-flight database transactions. The gRPC context should still be used for cancellation/deadline metrics in monitorRequest.

Affected Version

Locket as shipped in diego-release v2.133.0 (and all prior versions — this pattern has existed since the handler was written).

Evidence

Locket stderr during chaos test (baseline):

{"timestamp":"...","source":"locket","message":"locket.lock.failed-locking-lock","data":{"error":"failed-starting-transaction: context canceled","key":"<cell-uuid>","owner":"<cell-rep>"}}

Followed immediately by:

{"timestamp":"...","source":"locket","message":"locket.lock-expired","data":{"key":"<cell-uuid>"}}

Fix

PR: (will link below)
