
tiproxy: add spec field gracefulShutdownDeleteDelaySeconds to gracefully mark unhealthy before deleting the pods (#6829) #6894

Open

ti-chi-bot wants to merge 7 commits into pingcap:release-2.1 from ti-chi-bot:cherry-pick-6829-to-release-2.1

Conversation

@ti-chi-bot
Member

This is an automated cherry-pick of #6829

Background

I'm designing graceful restarts of TiProxy in a cloud environment.

The intended behavior is:

  1. Use a big maxSurge.
  2. Patch the existing TiProxyGroup to enable a long graceful shutdown delete delay.
  3. Restart.
  4. Old TiProxy instances are first marked unhealthy, then kept alive for a while before the old pods are actually deleted.

This is mainly for cloud load balancers: existing long-lived connections can continue to be served by the old TiProxy instances (with a large enough target_health_state.unhealthy.draining_interval_seconds on AWS, and with ConnectionDrainEnabled disabled on Alibaba Cloud), while new connections are routed to the new TiProxy instances.
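As a concrete illustration, the AWS draining interval can be raised on the load balancer's target group with the AWS CLI. This is only a sketch: the TARGET_GROUP_ARN variable and the 300-second value are placeholders, and the value should be at least as long as the delete delay you plan to configure.

aws elbv2 modify-target-group-attributes \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --attributes Key=target_health_state.unhealthy.draining_interval_seconds,Value=300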

We cannot rely on changing terminationGracePeriodSeconds for existing pods, because that would itself require restarting them. So this PR adds a controller-side graceful delete flow for TiProxy.

Design

  1. Add a new spec field spec.template.spec.gracefulShutdownDeleteDelaySeconds to TiProxyGroup / TiProxy.
  2. This field is treated as reloadable, so patching it does not trigger a rolling restart by itself.
  3. When a TiProxy object is being deleted and this field is set to a positive value:
    1. The operator first tries to call POST /api/debug/health/unhealthy.
    2. If the API is not supported (404), the operator falls back to sending SIGTERM to the TiProxy process via pods/exec.
    3. Only after TiProxy is confirmed unhealthy does the operator write core.pingcap.com/tiproxy-graceful-shutdown-begin-time on the pod and start the delete-delay timer.
    4. After the timer expires, the operator deletes the pod.
  4. If TiProxy cannot be marked unhealthy, the operator keeps retrying and does not start the delete-delay timer.
  5. When the whole Cluster is being deleted, this graceful delay is skipped and the pod is deleted directly.

This design keeps the user-facing control in spec, avoids changing terminationGracePeriodSeconds, and supports both new TiProxy versions (with the unhealthy API) and older ones (with the SIGTERM fallback), as sketched below.
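For reference, the per-pod delete flow the operator performs roughly corresponds to the following manual steps. This is a sketch only: the pod name, the TiProxy API port 3080, the availability of curl inside the container, and the timestamp format are assumptions for illustration, not part of this PR.

# Hypothetical pod name, for illustration only.
POD=pg-tiproxy-0

# 1. Mark the TiProxy instance unhealthy via its debug API
#    (assumes the default API port 3080 and curl in the container).
kubectl -n "$NS" exec "$POD" -- curl -s -X POST http://127.0.0.1:3080/api/debug/health/unhealthy

# 1b. Fallback for older TiProxy versions without that API:
#     send SIGTERM to the TiProxy process (assumed here to be PID 1).
# kubectl -n "$NS" exec "$POD" -- kill -TERM 1

# 2. Record when graceful shutdown began on the pod (timestamp format is illustrative).
kubectl -n "$NS" annotate pod "$POD" \
  core.pingcap.com/tiproxy-graceful-shutdown-begin-time="$(date -u +%FT%TZ)"

# 3. Wait out gracefulShutdownDeleteDelaySeconds, then delete the pod.
sleep 20
kubectl -n "$NS" delete pod "$POD"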

Usage

Patch an existing TiProxyGroup, so old TiProxy pods will be kept for a while after they are marked unhealthy:

kubectl --context "$CONTEXT" -n "$NS" patch tiproxygroup pg --type merge -p '{
  "spec": {
    "template": {
      "spec": {
        "gracefulShutdownDeleteDelaySeconds": 20
      }
    }
  }
}'

Then trigger a rolling restart, for example by changing the config or the image. With a large maxSurge, new TiProxy pods can come up first, and old TiProxy pods will only be deleted after entering graceful shutdown and waiting for the configured delay.
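For example, a restart could be triggered by bumping the TiProxy version. The field path spec.template.spec.version is shown here only as an assumption, and the version value is a placeholder; any change that triggers a rolling update works the same way.

kubectl --context "$CONTEXT" -n "$NS" patch tiproxygroup pg --type merge -p '{
  "spec": {
    "template": {
      "spec": {
        "version": "v1.4.0"
      }
    }
  }
}'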

@codecov-commenter

Codecov Report

❌ Patch coverage is 67.12329% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.95%. Comparing base (e5ddd8b) to head (a4aef15).

Additional details and impacted files
@@               Coverage Diff               @@
##           release-2.1    #6894      +/-   ##
===============================================
+ Coverage        37.61%   37.95%   +0.33%     
===============================================
  Files              392      393       +1     
  Lines            22483    22603     +120     
===============================================
+ Hits              8458     8579     +121     
+ Misses           14025    14024       -1     
Flag Coverage Δ
unittest 37.95% <67.12%> (+0.33%) ⬆️

Flags with carried forward coverage won't be shown.


@liubog2008
Member

/lgtm

@ti-chi-bot ti-chi-bot Bot added the lgtm label May 14, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liubog2008

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 14, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-14 14:56:56.390273398 +0000 UTC m=+363984.923052717: ☑️ agreed by liubog2008.

@ti-chi-bot ti-chi-bot Bot added the approved label May 14, 2026