feat(tidb): add TiFlash replication lag, PD metrics, and dashboard-aligned metrics#2982

Open
premal wants to merge 3 commits into DataDog:master from 6si:tiflash-replication-lag

premal commented Apr 24, 2026

Summary

  • Add tiflash_syncing_data_freshness histogram to track TiFlash replication lag from TiKV (avg/p50/p95/p99)
  • Add PD_METRICS list: pd_client_cmd_handle_cmds_duration_seconds, pd_client_request_handle_requests_duration_seconds
  • Add TiDB session phase duration metrics: parse, compile, execute, transaction
  • Add TiDB connection/server metrics: get_token, conn_idle, query_total, disconnection_total, plan_cache_total, plan_cache_miss_total
  • Add tidb_tikvclient_request_seconds (TiKV client latency from TiDB's perspective)
  • Add TiKV raftstore metrics: append/apply/commit log duration, store/apply duration, async storage request duration
  • Add TiKV gRPC, engine flow, and thread CPU metrics
  • Update check.py to include PD_METRICS in default metric list
  • Add fixture data and unit tests for all new metrics (EXPECTED_PD, extended EXPECTED_TIDB/EXPECTED_TIKV)
  • Update metadata.csv with all new metric definitions
  • Add 4 new dashboard widget groups in overview.json: TiDB query internals, TiFlash replication, TiKV raftstore & gRPC, PD client
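As a rough illustration of the check.py change listed above, the sketch below shows one way a new metric group could be folded into a default list. The metric names come from this description; the list-of-strings shape, the `TIFLASH_METRICS` and `TIDB_METRICS` groupings shown here, and the `build_default_metrics` helper are assumptions made for illustration and may not match the actual files.

```python
# Hypothetical sketch: metric names are taken from the PR description, but the
# data shape and the helper below are assumptions, not the real check.py.

PD_METRICS = [
    "pd_client_cmd_handle_cmds_duration_seconds",
    "pd_client_request_handle_requests_duration_seconds",
]
TIFLASH_METRICS = ["tiflash_syncing_data_freshness"]
TIDB_METRICS = ["tidb_tikvclient_request_seconds"]


def build_default_metrics(*groups):
    """Concatenate metric groups, dropping duplicates while keeping order."""
    seen = set()
    merged = []
    for group in groups:
        for name in group:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


# PD metrics are now part of the default set alongside the other groups.
DEFAULT_METRICS = build_default_metrics(TIDB_METRICS, TIFLASH_METRICS, PD_METRICS)
```

Deduplicating while preserving order keeps the default list stable if a metric name ends up in more than one group.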

Motivation

The existing integration collected only a small subset of the metrics visible on TiDB Dashboard's Monitoring page. This PR aligns the Datadog integration with the full set of metrics that TiDB operators use for day-to-day monitoring, and adds TiFlash replication lag, which was previously missing entirely.

Test plan

  • All new metrics have fixture data in tests/fixtures/
  • New test_pd_mock_metrics unit test added
  • EXPECTED_TIDB and EXPECTED_TIKV extended with representative tags for each new metric
  • metadata.csv updated with type, unit, and description for all new entries
  • Dashboard JSON validated (valid JSON, all metric names match metadata.csv)
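The dashboard validation step above could be scripted along these lines. This is a minimal sketch, not the procedure actually used for this PR: it assumes metadata.csv has a `metric_name` column, that dashboard queries look like `avg:<metric>{...}`, and the function names are hypothetical.

```python
import csv
import io
import json
import re

# Matches the metric name in a query string such as "avg:some.metric{*}".
# Assumption: only these four aggregators appear in the dashboard queries.
QUERY_METRIC = re.compile(r"\b(?:avg|sum|min|max):([A-Za-z0-9_.]+)\{")


def metadata_metric_names(csv_text):
    """Return the set of names in the (assumed) metric_name column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["metric_name"] for row in reader}


def dashboard_metric_names(node):
    """Recursively collect metric names from every string in the dashboard."""
    names = set()
    if isinstance(node, dict):
        for value in node.values():
            names |= dashboard_metric_names(value)
    elif isinstance(node, list):
        for value in node:
            names |= dashboard_metric_names(value)
    elif isinstance(node, str):
        names |= set(QUERY_METRIC.findall(node))
    return names


def missing_from_metadata(dashboard_text, csv_text):
    """List dashboard metrics that have no row in metadata.csv."""
    dashboard = json.loads(dashboard_text)  # raises ValueError on invalid JSON
    found = dashboard_metric_names(dashboard)
    return sorted(found - metadata_metric_names(csv_text))
```

An empty result from `missing_from_metadata` means every metric referenced by a widget query has a metadata row; a `ValueError` means the dashboard file is not valid JSON.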

🤖 Generated with Claude Code

premal and others added 2 commits April 23, 2026 11:37
…trics

- Add tiflash_syncing_data_freshness histogram (TiFlash replication lag from TiKV)
- Add PD_METRICS list: pd_client_cmd_handle_cmds_duration_seconds, pd_client_request_handle_requests_duration_seconds
- Add TiDB session phase duration metrics (parse/compile/execute/transaction)
- Add TiDB connection metrics (get_token, conn_idle) and server metrics (query_total, disconnection, plan_cache)
- Add tidb_tikvclient_request_seconds (TiKV client latency seen from TiDB)
- Add TiKV raftstore metrics (append/apply/commit log, store/apply duration)
- Add TiKV gRPC, engine flow, and async storage request metrics
- Update check.py to include PD_METRICS in default metric list
- Add fixture data and unit tests for all new metrics (EXPECTED_PD, extended EXPECTED_TIDB/TIKV)
- Update metadata.csv with all new metric definitions

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add 4 new dashboard groups covering metrics not previously visualized:
- TiDB query internals: total QPS, plan cache, parse/compile/execute duration,
  TiKV client request latency, connection idle duration
- TiFlash replication: replication lag histogram (avg/p50/p95/p99)
- TiKV raftstore & gRPC: raftstore log/store/apply duration, gRPC message
  duration, engine flow bytes, async storage request duration
- PD client: PD command and request handling duration (avg/p99)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
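For readers unfamiliar with how avg/p50/p95/p99 values can be derived from a Prometheus-style histogram such as the replication lag metric, here is a minimal sketch. It uses the common linear-interpolation estimate over cumulative buckets; how the dashboard widgets in this PR compute these values is not stated, so treat the functions and sample numbers below as illustrative assumptions.

```python
import math


def histogram_avg(total_sum, total_count):
    """Average observation: the histogram's _sum divided by its _count."""
    return total_sum / total_count if total_count else math.nan


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the +Inf bucket. Observations are assumed to be
    spread evenly inside each bucket, with 0 as the lowest bound.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the overflow bucket: report the highest
                # finite bound rather than infinity.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up sample: 100 observations with bucket bounds in seconds.
sample = [(1.0, 10), (5.0, 90), (10.0, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, sample)
p95 = histogram_quantile(0.95, sample)
```

The estimate is only as precise as the bucket boundaries, which matters when reading a p99 lag value that lands in a wide bucket.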
premal requested review from a team as code owners April 24, 2026 02:15
premal requested a review from JoshPatel13 April 24, 2026 02:15
….count expectation

The OpenMetrics base check emits histogram count/sum rows with an upper_bound:none tag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
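The commit note above implies that test expectations for histogram count/sum rows need the extra tag. A minimal sketch of how a test helper might account for that follows; the `expected_tags` function, the metric names, and the base tags are hypothetical and are not taken from the PR's test files.

```python
def expected_tags(metric_name, base_tags):
    """Return the tags a test might expect for one emitted metric row.

    Assumption (from the commit note): histogram .count and .sum rows carry
    an extra upper_bound:none tag; other rows keep their base tags only.
    """
    tags = list(base_tags)
    if metric_name.endswith((".count", ".sum")):
        tags.append("upper_bound:none")
    return tags


# Hypothetical metric name and tag, used only to show the shape of the result.
count_tags = expected_tags("example.histogram.count", ["instance:mock"])
```

Copying `base_tags` before appending avoids mutating a tag list that is shared between several expectations.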