backendcluster: add cluster manager and cluster-scoped topology runtime#1104

Open
YangKeao wants to merge 2 commits into pingcap:main from YangKeao:pr/03-multi-cluster-runtime

@YangKeao
Member

@YangKeao YangKeao commented Mar 19, 2026

What problem does this PR solve?

Issue Number: close #1098

What is changed and how it works:

Introduce a backend-cluster manager that owns cluster-scoped runtime instances.

This PR adds:

  • a manager for configured backend clusters
  • one runtime per backend cluster
  • cluster-scoped etcd / infosync / shared clients
  • topology aggregation across clusters
  • dynamic add / update / remove handling when backend-cluster config changes
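
For illustration, the reconcile behavior listed above (one runtime per cluster, add / update / remove on config change, topology aggregated across clusters) could be sketched roughly as follows. This is a minimal sketch and not the code in this PR: `ClusterConfig`, `Runtime`, `Manager.Sync`, and the faked `Topology` results are hypothetical names and data, and the sketch assumes an updated cluster is handled by closing the old runtime and creating a new one.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// ClusterConfig is a hypothetical stand-in for one backend-cluster config entry.
type ClusterConfig struct {
	Name    string
	PDAddrs string
}

// Runtime is a hypothetical per-cluster runtime. A real one would own the
// cluster-scoped etcd / infosync / shared clients.
type Runtime struct {
	cfg    ClusterConfig
	closed bool
}

// Close marks the runtime as released; a real runtime would close its clients.
func (r *Runtime) Close() { r.closed = true }

// Topology returns the backends known to this cluster (faked for the sketch).
func (r *Runtime) Topology() []string { return []string{r.cfg.Name + "/tidb-0"} }

// Manager owns one Runtime per configured backend cluster.
type Manager struct {
	mu       sync.RWMutex
	clusters map[string]*Runtime
}

func NewManager() *Manager { return &Manager{clusters: map[string]*Runtime{}} }

// Sync reconciles the running runtimes with the desired configuration.
func (m *Manager) Sync(desired []ClusterConfig) {
	m.mu.Lock()
	defer m.mu.Unlock()

	want := make(map[string]ClusterConfig, len(desired))
	for _, c := range desired {
		want[c.Name] = c
	}
	// Close runtimes that were removed or whose config changed.
	for name, rt := range m.clusters {
		if c, ok := want[name]; !ok || c != rt.cfg {
			rt.Close()
			delete(m.clusters, name)
		}
	}
	// Create runtimes for new clusters and for the changed ones closed above.
	for name, c := range want {
		if _, ok := m.clusters[name]; !ok {
			m.clusters[name] = &Runtime{cfg: c}
		}
	}
}

// Topology aggregates the backend lists of all live clusters, sorted.
func (m *Manager) Topology() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()

	var all []string
	for _, rt := range m.clusters {
		all = append(all, rt.Topology()...)
	}
	sort.Strings(all)
	return all
}

func main() {
	m := NewManager()
	m.Sync([]ClusterConfig{
		{Name: "a", PDAddrs: "pd-a:2379"},
		{Name: "b", PDAddrs: "pd-b:2379"},
	})
	// "a" changes its PD address and "b" is removed.
	m.Sync([]ClusterConfig{{Name: "a", PDAddrs: "pd-a2:2379"}})
	fmt.Println(m.Topology())
}
```

Replacing a changed runtime rather than mutating it in place keeps each runtime immutable after creation, at the cost of briefly dropping that cluster's clients during a reload.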

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Notable changes

  • Has configuration change
  • Has HTTP API interfaces change
  • Has tiproxyctl change
  • Other user behavior changes

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot

ti-chi-bot bot commented Mar 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot

ti-chi-bot bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xhebox for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL label Mar 19, 2026
@YangKeao YangKeao force-pushed the pr/03-multi-cluster-runtime branch from 98ea284 to 3993ee3 Compare March 19, 2026 17:33
@YangKeao YangKeao marked this pull request as ready for review March 19, 2026 17:37
@ti-chi-bot ti-chi-bot bot requested a review from djshow832 March 19, 2026 17:37
@codecov-commenter

codecov-commenter commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 65.88235% with 116 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@4d841da). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/manager/backendcluster/manager.go | 60.37% | 66 Missing and 18 partials ⚠️ |
| pkg/server/server.go | 58.82% | 6 Missing and 1 partial ⚠️ |
| pkg/balance/observer/health_check.go | 44.44% | 4 Missing and 1 partial ⚠️ |
| pkg/proxy/backend/backend_conn_mgr.go | 54.54% | 4 Missing and 1 partial ⚠️ |
| pkg/balance/router/router.go | 55.55% | 4 Missing ⚠️ |
| pkg/balance/router/router_static.go | 0.00% | 4 Missing ⚠️ |
| pkg/manager/namespace/manager.go | 0.00% | 4 Missing ⚠️ |
| pkg/balance/router/group.go | 81.25% | 3 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1104   +/-   ##
=======================================
  Coverage        ?   67.13%           
=======================================
  Files           ?      143           
  Lines           ?    15067           
  Branches        ?        0           
=======================================
  Hits            ?    10115           
  Misses          ?     4258           
  Partials        ?      694           
| Flag | Coverage Δ |
|---|---|
| unit | 67.13% <65.88%> (?) |


@YangKeao
Member Author

/test all

@YangKeao
Member Author

TestLocalDirWatcher is not stable. Let me try to stabilize it.

/retest

@YangKeao YangKeao force-pushed the pr/03-multi-cluster-runtime branch 2 times, most recently from 25a54ca to cd317e4 Compare March 20, 2026 06:20
@YangKeao
Member Author

/test all

@YangKeao YangKeao force-pushed the pr/03-multi-cluster-runtime branch 6 times, most recently from 9eea95f to f97603c Compare March 24, 2026 13:13
@YangKeao YangKeao force-pushed the pr/03-multi-cluster-runtime branch from f97603c to 1add4f8 Compare March 24, 2026 13:52
@YangKeao YangKeao marked this pull request as ready for review March 24, 2026 13:57
Introduce the backend cluster manager, cluster-scoped InfoSync runtime, topology aggregation, and single-cluster compatibility hooks.
@YangKeao YangKeao force-pushed the pr/03-multi-cluster-runtime branch from 1add4f8 to af488c9 Compare March 24, 2026 14:21
@djshow832
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82ae03a9dc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +115 to +118
if cluster := srv.clusterManager.PrimaryCluster(); cluster != nil {
	srv.etcdCli = cluster.EtcdClient()
	promFetcher = cluster
}

P1 Badge Handle multi-cluster startup without nil ETCD client

When more than one backend cluster is configured, PrimaryCluster() returns nil and this block leaves both srv.etcdCli and promFetcher unset, but the metrics reader is still started afterward. In that configuration, Prometheus fetch is unavailable and backend metrics fallback eventually calls BackendReader.queryAllOwners/etcd.GetKVs with a nil etcd client, which panics and stops the metrics loop; this makes multi-cluster deployments lose metrics-driven balancing right after startup.
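
One way to avoid the panic described above is to have the reader return an error when no etcd client is available instead of dereferencing it. The following is only a sketch under assumptions: `KVClient`, `mapKV`, `ErrNoEtcd`, and this simplified `BackendReader` with its `QueryOwners` method are hypothetical stand-ins, not the real TiProxy types or the etcd client API.

```go
package main

import (
	"errors"
	"fmt"
)

// KVClient is a hypothetical minimal interface standing in for an etcd client.
type KVClient interface {
	Get(key string) (string, error)
}

// mapKV is an in-memory fake used only for this sketch.
type mapKV map[string]string

func (m mapKV) Get(key string) (string, error) {
	v, ok := m[key]
	if !ok {
		return "", fmt.Errorf("key %q not found", key)
	}
	return v, nil
}

// ErrNoEtcd reports that no etcd client is available for the fallback path.
var ErrNoEtcd = errors.New("no etcd client available")

// BackendReader is a simplified stand-in for the metrics fallback reader.
type BackendReader struct {
	etcd KVClient // nil when no single primary cluster was selected
}

// QueryOwners returns an error rather than panicking when etcd is missing.
// Note: this check only catches a nil interface value. If a typed nil pointer
// is stored in the field, the constructor should normalize it to nil first.
func (r *BackendReader) QueryOwners(key string) (string, error) {
	if r.etcd == nil {
		return "", ErrNoEtcd
	}
	return r.etcd.Get(key)
}

func main() {
	multi := &BackendReader{} // multi-cluster case: no primary, no client
	if _, err := multi.QueryOwners("owner"); errors.Is(err, ErrNoEtcd) {
		fmt.Println("skipping etcd fallback:", err)
	}

	single := &BackendReader{etcd: mapKV{"owner": "tiproxy-0"}}
	owner, err := single.QueryOwners("owner")
	fmt.Println(owner, err)
}
```

With a guard like this the metrics loop can log and continue. It does not by itself restore metrics-driven balancing in the multi-cluster case.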

{
healthCheckCfg := config.NewDefaultHealthCheckConfig()
- srv.metricsReader = metricsreader.NewDefaultMetricsReader(lg.Named("mr"), srv.infoSyncer, srv.infoSyncer, srv.httpCli, srv.etcdCli, healthCheckCfg, srv.configManager)
+ srv.metricsReader = metricsreader.NewDefaultMetricsReader(lg.Named("mr"), promFetcher, srv.clusterManager, srv.httpCli, srv.etcdCli, healthCheckCfg, srv.configManager)

P1 Badge Avoid pinning readers to the initial primary cluster

This constructs metricsReader with the one-time promFetcher/srv.etcdCli selected at startup, but backend-cluster runtime is now hot-reloaded and old clusters are explicitly closed during syncClusters. After a backend-cluster PD update in a running node, the reader/VIP paths keep using stale handles from the old cluster instead of the newly active runtime, so topology/prom/election operations can fail permanently after config reload.
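
A common way to avoid pinning a stale handle is to pass the reader a function that resolves the current cluster on every call, rather than a value captured once at startup. The sketch below illustrates that idea only. `Cluster`, `SetPrimary`, `Reader`, and `NewReader` are hypothetical, and only the method name `PrimaryCluster` is taken from the diff above.

```go
package main

import (
	"fmt"
	"sync"
)

// Cluster is a hypothetical stand-in for a cluster-scoped runtime.
type Cluster struct{ Name string }

// Manager is a simplified cluster manager whose primary can be hot-reloaded.
type Manager struct {
	mu      sync.RWMutex
	primary *Cluster
}

// PrimaryCluster returns the currently active primary, or nil if there is none.
func (m *Manager) PrimaryCluster() *Cluster {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.primary
}

// SetPrimary simulates a config reload that swaps the runtime (hypothetical).
func (m *Manager) SetPrimary(c *Cluster) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.primary = c
}

// Reader looks the cluster up lazily instead of storing a pointer to it.
type Reader struct {
	current func() *Cluster
}

func NewReader(current func() *Cluster) *Reader { return &Reader{current: current} }

// Target resolves the active cluster at call time, so reloads are observed.
func (r *Reader) Target() string {
	if c := r.current(); c != nil {
		return c.Name
	}
	return ""
}

func main() {
	m := &Manager{}
	m.SetPrimary(&Cluster{Name: "old"})

	pinned := m.PrimaryCluster()     // captured once: goes stale after reload
	r := NewReader(m.PrimaryCluster) // method value: resolved on every call

	m.SetPrimary(&Cluster{Name: "new"}) // simulated hot reload

	fmt.Println("pinned:", pinned.Name)
	fmt.Println("dynamic:", r.Target())
}
```

The indirection costs one read lock per call. Callers still have to handle the nil case, since there may be no primary at the moment of the call.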

// Namespace always receives a topology fetcher from the cluster manager. PDFetcher preserves
// legacy static backend.instances compatibility by falling back internally before any backend
// cluster is configured.
fetcher := observer.NewPDFetcher(mgr.tpFetcher, cfg.Backend.Instances, logger.Named("be_fetcher"), healthCheckCfg)
Collaborator

I don't understand why you wrap the StaticFetcher in the PDFetcher. It makes PDFetcher even more complicated. StaticFetcher is only used for testing, especially mysql-connector-test.
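
For reference, the alternative this comment points toward could look roughly like the sketch below: the two fetchers stay separate and the caller picks one at construction time, so `PDFetcher` needs no internal fallback. All types here are simplified, hypothetical stand-ins for the ones in `pkg/balance/observer`. Which condition should select the static path is also an assumption.

```go
package main

import "fmt"

// BackendFetcher is a hypothetical minimal interface shared by both fetchers.
type BackendFetcher interface {
	GetBackendList() ([]string, error)
}

// StaticFetcher returns a fixed list of backends (used mainly by tests).
type StaticFetcher struct{ addrs []string }

func (f *StaticFetcher) GetBackendList() ([]string, error) { return f.addrs, nil }

// TopologyFunc is a hypothetical hook that fetches topology from the clusters.
type TopologyFunc func() ([]string, error)

// PDFetcher only knows the cluster topology path; it has no static fallback.
type PDFetcher struct{ topo TopologyFunc }

func (f *PDFetcher) GetBackendList() ([]string, error) { return f.topo() }

// NewBackendFetcher chooses the implementation once, at the call site.
// Assumption: a non-empty static list means static mode is wanted.
func NewBackendFetcher(static []string, topo TopologyFunc) BackendFetcher {
	if len(static) > 0 {
		return &StaticFetcher{addrs: static}
	}
	return &PDFetcher{topo: topo}
}

func main() {
	topo := func() ([]string, error) { return []string{"tidb-0:4000"}, nil }

	f := NewBackendFetcher([]string{"127.0.0.1:4000"}, topo)
	list, _ := f.GetBackendList()
	fmt.Printf("%T %v\n", f, list)

	g := NewBackendFetcher(nil, topo)
	list, _ = g.GetBackendList()
	fmt.Printf("%T %v\n", g, list)
}
```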

}

- func (dhc *DefaultHealthCheck) Check(ctx context.Context, addr string, info *BackendInfo, lastBh *BackendHealth) *BackendHealth {
+ func (dhc *DefaultHealthCheck) Check(ctx context.Context, _ string, info *BackendInfo, lastBh *BackendHealth) *BackendHealth {
Collaborator

If you do not need addr, remove it from the param list.

Development

Successfully merging this pull request may close these issues.

Add cluster manager to manager cluster-scoped topology.

3 participants