backendcluster: add cluster manager and cluster-scoped topology runtime by YangKeao · Pull Request #1104 · pingcap/tiproxy

YangKeao · 2026-03-19T16:45:32Z

What problem does this PR solve?

Issue Number: close #1098

What is changed and how it works:

Introduce a backend-cluster manager that owns cluster-scoped runtime instances.

This PR adds:

a manager for configured backend clusters
one runtime per backend cluster
cluster-scoped etcd / infosync / shared clients
topology aggregation across clusters
dynamic add / update / remove handling when backend-cluster config changes
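
The add / update / remove handling listed above can be read as a reconcile step between the configured clusters and the running runtimes. The following is a minimal sketch of that idea, not the code in this PR: `ClusterConfig`, `Runtime`, `Manager`, and `Sync` are hypothetical names, and a real runtime would own the etcd / infosync / shared clients rather than a single flag.

```go
package main

import (
	"fmt"
	"sort"
)

// ClusterConfig is a hypothetical stand-in for one configured backend cluster.
type ClusterConfig struct {
	Name    string
	PDAddrs string
}

// Runtime is a hypothetical cluster-scoped runtime. A real one would own the
// etcd client, the info syncer, and the other shared clients of its cluster.
type Runtime struct {
	cfg    ClusterConfig
	closed bool
}

// Close marks the runtime as released.
func (r *Runtime) Close() { r.closed = true }

// Manager owns exactly one Runtime per configured backend cluster.
type Manager struct {
	runtimes map[string]*Runtime
}

func NewManager() *Manager { return &Manager{runtimes: map[string]*Runtime{}} }

// Sync reconciles the running runtimes with the desired configuration and
// reports which cluster names were added, updated, or removed.
func (m *Manager) Sync(desired []ClusterConfig) (added, updated, removed []string) {
	want := make(map[string]ClusterConfig, len(desired))
	for _, c := range desired {
		want[c.Name] = c
	}
	for name, rt := range m.runtimes {
		cfg, ok := want[name]
		switch {
		case !ok:
			// The cluster is no longer configured: release its runtime.
			rt.Close()
			delete(m.runtimes, name)
			removed = append(removed, name)
		case cfg != rt.cfg:
			// The config changed: replace the runtime instead of mutating it.
			rt.Close()
			m.runtimes[name] = &Runtime{cfg: cfg}
			updated = append(updated, name)
		}
	}
	for name, cfg := range want {
		if _, ok := m.runtimes[name]; !ok {
			m.runtimes[name] = &Runtime{cfg: cfg}
			added = append(added, name)
		}
	}
	// Map iteration order is random, so sort for stable reporting.
	sort.Strings(added)
	sort.Strings(updated)
	sort.Strings(removed)
	return added, updated, removed
}

func main() {
	m := NewManager()
	m.Sync([]ClusterConfig{{Name: "a", PDAddrs: "pd-a:2379"}, {Name: "b", PDAddrs: "pd-b:2379"}})
	added, updated, removed := m.Sync([]ClusterConfig{{Name: "a", PDAddrs: "pd-a-new:2379"}, {Name: "c", PDAddrs: "pd-c:2379"}})
	fmt.Println("added:", added, "updated:", updated, "removed:", removed)
}
```

Replacing a runtime on update keeps each runtime immutable, at the cost of every consumer having to look the runtime up again after a reload.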

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Notable changes

Has configuration change
Has HTTP API interfaces change
Has tiproxyctl change
Other user behavior changes

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2026-03-19T16:45:36Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-03-19T16:45:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xhebox for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov-commenter · 2026-03-19T17:45:28Z

Codecov Report

❌ Patch coverage is 65.88235% with 116 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@4d841da). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
pkg/manager/backendcluster/manager.go	60.37%	66 Missing and 18 partials ⚠️
pkg/server/server.go	58.82%	6 Missing and 1 partial ⚠️
pkg/balance/observer/health_check.go	44.44%	4 Missing and 1 partial ⚠️
pkg/proxy/backend/backend_conn_mgr.go	54.54%	4 Missing and 1 partial ⚠️
pkg/balance/router/router.go	55.55%	4 Missing ⚠️
pkg/balance/router/router_static.go	0.00%	4 Missing ⚠️
pkg/manager/namespace/manager.go	0.00%	4 Missing ⚠️
pkg/balance/router/group.go	81.25%	3 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1104   +/-   ##
=======================================
  Coverage        ?   67.13%           
=======================================
  Files           ?      143           
  Lines           ?    15067           
  Branches        ?        0           
=======================================
  Hits            ?    10115           
  Misses          ?     4258           
  Partials        ?      694

Flag	Coverage Δ
unit	`67.13% <65.88%> (?)`

YangKeao · 2026-03-20T04:53:24Z

/test all

YangKeao · 2026-03-20T05:02:29Z

TestLocalDirWatcher is not stable. Let me try to stabilize it.

/retest

YangKeao · 2026-03-20T06:22:20Z

/test all

djshow832 · 2026-03-27T03:01:59Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82ae03a9dc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-27T03:08:39Z

pkg/server/server.go

+	if cluster := srv.clusterManager.PrimaryCluster(); cluster != nil {
+		srv.etcdCli = cluster.EtcdClient()
+		promFetcher = cluster
+	}


Handle multi-cluster startup without nil ETCD client

When more than one backend cluster is configured, PrimaryCluster() returns nil and this block leaves both srv.etcdCli and promFetcher unset, but the metrics reader is still started afterward. In that configuration, Prometheus fetch is unavailable and backend metrics fallback eventually calls BackendReader.queryAllOwners/etcd.GetKVs with a nil etcd client, which panics and stops the metrics loop; this makes multi-cluster deployments lose metrics-driven balancing right after startup.
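
One way to picture the failure mode and a possible guard is the sketch below. It is an illustration only: `EtcdClient`, `Cluster`, `ClusterManager`, `BackendReader`, `QueryOwners`, and `ErrNoEtcdClient` are hypothetical stand-ins, and the rule that `PrimaryCluster` returns nil unless exactly one cluster is configured is an assumption extrapolated from this comment.

```go
package main

import (
	"errors"
	"fmt"
)

// EtcdClient is a hypothetical stand-in for the real etcd client type.
type EtcdClient struct{ endpoint string }

// Cluster is a hypothetical cluster runtime that owns an etcd client.
type Cluster struct{ etcd *EtcdClient }

func (c *Cluster) EtcdClient() *EtcdClient { return c.etcd }

// ClusterManager mimics the behavior described in the review comment.
// Assumption: PrimaryCluster is nil unless exactly one cluster is configured.
type ClusterManager struct{ clusters []*Cluster }

func (m *ClusterManager) PrimaryCluster() *Cluster {
	if len(m.clusters) != 1 {
		return nil
	}
	return m.clusters[0]
}

// ErrNoEtcdClient lets callers skip the etcd fallback instead of panicking.
var ErrNoEtcdClient = errors.New("no etcd client: backend metrics fallback is unavailable")

// BackendReader is hypothetical; it checks for a nil client before use.
type BackendReader struct{ etcdCli *EtcdClient }

// QueryOwners returns an error rather than dereferencing a nil client.
func (r *BackendReader) QueryOwners() ([]string, error) {
	if r.etcdCli == nil {
		return nil, ErrNoEtcdClient
	}
	return []string{"owner@" + r.etcdCli.endpoint}, nil
}

// newBackendReader wires the reader the same way the diff wires srv.etcdCli.
func newBackendReader(mgr *ClusterManager) *BackendReader {
	r := &BackendReader{}
	if c := mgr.PrimaryCluster(); c != nil {
		r.etcdCli = c.EtcdClient()
	}
	return r
}

func main() {
	multi := &ClusterManager{clusters: []*Cluster{
		{etcd: &EtcdClient{endpoint: "pd-a:2379"}},
		{etcd: &EtcdClient{endpoint: "pd-b:2379"}},
	}}
	_, err := newBackendReader(multi).QueryOwners()
	fmt.Println("multi-cluster:", err)
}
```

Returning an error keeps the metrics loop alive, but it does not restore metrics-driven balancing for multi-cluster setups; that would still need a per-cluster reader.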

chatgpt-codex-connector · 2026-03-27T03:08:39Z

pkg/server/server.go

 	{
 		healthCheckCfg := config.NewDefaultHealthCheckConfig()
-		srv.metricsReader = metricsreader.NewDefaultMetricsReader(lg.Named("mr"), srv.infoSyncer, srv.infoSyncer, srv.httpCli, srv.etcdCli, healthCheckCfg, srv.configManager)
+		srv.metricsReader = metricsreader.NewDefaultMetricsReader(lg.Named("mr"), promFetcher, srv.clusterManager, srv.httpCli, srv.etcdCli, healthCheckCfg, srv.configManager)


Avoid pinning readers to the initial primary cluster

This constructs metricsReader with the one-time promFetcher/srv.etcdCli selected at startup, but backend-cluster runtime is now hot-reloaded and old clusters are explicitly closed during syncClusters. After a backend-cluster PD update in a running node, the reader/VIP paths keep using stale handles from the old cluster instead of the newly active runtime, so topology/prom/election operations can fail permanently after config reload.
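
The difference between a pinned handle and one resolved on each call can be sketched as follows. All names here (`Cluster`, `ClusterManager.Reload`, `PinnedReader`, `LazyReader`, `Target`) are hypothetical and only model the hot-reload behavior described in this comment; the real reader takes more dependencies than a single cluster handle.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Cluster is a hypothetical runtime that is closed when its config is replaced.
type Cluster struct {
	pdAddr string
	closed atomic.Bool
}

// ClusterManager swaps the active runtime on reload (hypothetical).
type ClusterManager struct {
	mu      sync.RWMutex
	primary *Cluster
}

func (m *ClusterManager) PrimaryCluster() *Cluster {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.primary
}

// Reload closes the old runtime and installs a new one, as a config
// reload with a changed PD address would.
func (m *ClusterManager) Reload(pdAddr string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.primary != nil {
		m.primary.closed.Store(true)
	}
	m.primary = &Cluster{pdAddr: pdAddr}
}

// PinnedReader captures the runtime once, at construction time.
type PinnedReader struct{ cluster *Cluster }

func (r *PinnedReader) Target() (string, bool) {
	if r.cluster == nil || r.cluster.closed.Load() {
		return "", false
	}
	return r.cluster.pdAddr, true
}

// LazyReader resolves the active runtime on every call.
type LazyReader struct{ resolve func() *Cluster }

func (r *LazyReader) Target() (string, bool) {
	c := r.resolve()
	if c == nil || c.closed.Load() {
		return "", false
	}
	return c.pdAddr, true
}

func main() {
	mgr := &ClusterManager{}
	mgr.Reload("pd-old:2379")
	pinned := &PinnedReader{cluster: mgr.PrimaryCluster()}
	lazy := &LazyReader{resolve: mgr.PrimaryCluster}

	mgr.Reload("pd-new:2379") // simulates a backend-cluster PD update

	_, ok := pinned.Target()
	fmt.Println("pinned usable after reload:", ok)
	addr, ok := lazy.Target()
	fmt.Println("lazy target after reload:", addr, ok)
}
```

Passing a resolver instead of a handle adds a lock acquisition per call, which is usually negligible next to the network request that follows.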

djshow832 · 2026-03-27T06:32:29Z

pkg/manager/namespace/manager.go

+	// Namespace always receives a topology fetcher from the cluster manager. PDFetcher preserves
+	// legacy static backend.instances compatibility by falling back internally before any backend
+	// cluster is configured.
+	fetcher := observer.NewPDFetcher(mgr.tpFetcher, cfg.Backend.Instances, logger.Named("be_fetcher"), healthCheckCfg)


I don't understand why you wrap the StaticFetcher in the PDFetcher. It makes PDFetcher even more complicated. StaticFetcher is only used for testing, especially mysql-connector-test.
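
A rough sketch of the alternative this comment seems to point at: pick the fetcher once at construction time instead of nesting the static fallback inside the PD-backed fetcher. The interface and the types below (`BackendFetcher`, `staticFetcher`, `pdFetcher`, `TopologySource`, `newFetcher`) are hypothetical and do not mirror the real `observer` package signatures.

```go
package main

import "fmt"

// BackendFetcher is a hypothetical minimal fetcher interface.
type BackendFetcher interface {
	GetBackendList() ([]string, error)
}

// staticFetcher returns a fixed list, as used for tests with static instances.
type staticFetcher struct{ addrs []string }

func (f *staticFetcher) GetBackendList() ([]string, error) { return f.addrs, nil }

// TopologySource is a hypothetical hook that yields the cluster topology.
type TopologySource func() ([]string, error)

// pdFetcher only knows about the topology source and has no static fallback.
type pdFetcher struct{ source TopologySource }

func (f *pdFetcher) GetBackendList() ([]string, error) { return f.source() }

// newFetcher chooses the implementation once, so neither fetcher needs to
// know about the other.
func newFetcher(staticAddrs []string, source TopologySource) BackendFetcher {
	if len(staticAddrs) > 0 {
		return &staticFetcher{addrs: staticAddrs}
	}
	return &pdFetcher{source: source}
}

func main() {
	source := func() ([]string, error) { return []string{"tidb-0:4000", "tidb-1:4000"}, nil }

	list, _ := newFetcher([]string{"127.0.0.1:4000"}, source).GetBackendList()
	fmt.Println("static:", list)

	list, _ = newFetcher(nil, source).GetBackendList()
	fmt.Println("topology:", list)
}
```

The trade-off is that a choice made at construction time cannot switch from static instances to a cluster topology later without rebuilding the fetcher, which may be the case the code comment in the diff is trying to cover.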

djshow832 · 2026-03-27T06:40:46Z

pkg/balance/observer/health_check.go

 }

-func (dhc *DefaultHealthCheck) Check(ctx context.Context, addr string, info *BackendInfo, lastBh *BackendHealth) *BackendHealth {
+func (dhc *DefaultHealthCheck) Check(ctx context.Context, _ string, info *BackendInfo, lastBh *BackendHealth) *BackendHealth {


If you do not need addr, remove it from the param list.

ti-chi-bot bot added the do-not-merge/work-in-progress label Mar 19, 2026

ti-chi-bot bot requested review from bb7133 and xhebox March 19, 2026 16:45

ti-chi-bot bot added the size/XXL label Mar 19, 2026

YangKeao force-pushed the pr/03-multi-cluster-runtime branch from 98ea284 to 3993ee3 Compare March 19, 2026 17:33

YangKeao marked this pull request as ready for review March 19, 2026 17:37

ti-chi-bot bot removed the do-not-merge/work-in-progress label Mar 19, 2026

ti-chi-bot bot requested a review from djshow832 March 19, 2026 17:37

YangKeao marked this pull request as draft March 20, 2026 04:37

ti-chi-bot bot added the do-not-merge/work-in-progress label Mar 20, 2026

YangKeao force-pushed the pr/03-multi-cluster-runtime branch from 3993ee3 to a7c5384 Compare March 20, 2026 04:52

YangKeao force-pushed the pr/03-multi-cluster-runtime branch 2 times, most recently from 25a54ca to cd317e4 Compare March 20, 2026 06:20

YangKeao force-pushed the pr/03-multi-cluster-runtime branch 6 times, most recently from 9eea95f to f97603c Compare March 24, 2026 13:13

YangKeao force-pushed the pr/03-multi-cluster-runtime branch from f97603c to 1add4f8 Compare March 24, 2026 13:52

YangKeao marked this pull request as ready for review March 24, 2026 13:57

ti-chi-bot bot removed the do-not-merge/work-in-progress label Mar 24, 2026

backendcluster: add cluster manager and cluster-scoped topology runtime

af488c9

Introduce the backend cluster manager, cluster-scoped InfoSync runtime, topology aggregation, and single-cluster compatibility hooks.

YangKeao force-pushed the pr/03-multi-cluster-runtime branch from 1add4f8 to af488c9 Compare March 24, 2026 14:21

Merge branch 'main' into pr/03-multi-cluster-runtime

82ae03a

chatgpt-codex-connector bot reviewed Mar 27, 2026

djshow832 reviewed Mar 27, 2026

YangKeao wants to merge 2 commits into pingcap:main from YangKeao:pr/03-multi-cluster-runtime


3 participants
