From c128d554927dc9459e1967b75f2a037df796710a Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Mon, 23 Mar 2026 16:15:45 +0800
Subject: [PATCH 1/6] Add temp.md

---
 temp.md | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 temp.md

diff --git a/temp.md b/temp.md
new file mode 100644
index 0000000000000..af27ff4986a7b
--- /dev/null
+++ b/temp.md
@@ -0,0 +1 @@
+This is a test file.
\ No newline at end of file

From d42f95f207d914691828854af682fe89aaa6bd94 Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Mon, 23 Mar 2026 16:15:51 +0800
Subject: [PATCH 2/6] Delete temp.md

---
 temp.md | 1 -
 1 file changed, 1 deletion(-)
 delete mode 100644 temp.md

diff --git a/temp.md b/temp.md
deleted file mode 100644
index af27ff4986a7b..0000000000000
--- a/temp.md
+++ /dev/null
@@ -1 +0,0 @@
-This is a test file.
\ No newline at end of file

From f5fda69699a24170f788c3fc34a6266c2a332a56 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Mon, 23 Mar 2026 08:16:58 +0000
Subject: [PATCH 3/6] Auto-sync: Update English docs from Chinese PR

Synced from: https://github.com/pingcap/docs-cn/pull/20664
Target PR: https://github.com/pingcap/docs/pull/22612
AI Provider: gemini

Co-authored-by: github-actions[bot]
---
 alert-rules.md | 41 +++++++++++++----------------------------
 1 file changed, 13 insertions(+), 28 deletions(-)

diff --git a/alert-rules.md b/alert-rules.md
index 359befc35dbb9..b50be8430cfdb 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -244,19 +244,19 @@ This section gives the alert rules for the PD component.
     * If you confirm that the TiKV/TiFlash instance cannot be recovered, you can make it offline.
     * If you confirm that the TiKV/TiFlash instance can be recovered, but not in the short term, you can consider increasing the value of `max-down-time`. It will prevent the TiKV/TiFlash instance from being considered as irrecoverable and the data from being removed from the TiKV/TiFlash.
 
-#### `PD_cluster_unhealthy_tikv_nums`
+#### `PD_cluster_unhealthy_store_nums`
 
-* Alert rule:
+* Alarm Rule:
 
     `(sum(pd_cluster_status{type="store_unhealth_count"}) by (instance) > 0) and (sum(etcd_server_is_leader) by (instance) > 0)`
 
-* Description:
+* Rule Description:
 
-    Indicates that there are unhealthy stores. If the situation persists for some time (configured by [`max-store-down-time`](/pd-configuration-file.md#max-store-down-time), defaults to `30m`), the store is likely to change to `Offline` state, which triggers the [`PD_cluster_down_store_nums`](#pd_cluster_down_store_nums) alert.
+    Indicates that there are stores in an abnormal state. If this state persists for a period of time (depending on the configured [`max-store-down-time`](/pd-configuration-file.md#max-store-down-time), which defaults to `30m`), the store may enter the `Offline` state and trigger the [`PD_cluster_down_store_nums`](#pd_cluster_down_store_nums) alarm.
 
-* Solution:
+* Handling Suggestions:
 
-    Check the state of the TiKV stores.
+    Check the status of TiKV/TiFlash.
 
 #### `PD_cluster_low_space`
 
@@ -355,20 +355,20 @@ This section gives the alert rules for the PD component.
     * Check the network and system load status.
     * If the problematic PD instance cannot be recovered due to environmental factors, make it offline and replace it.
 
-#### `TiKV_space_used_more_than_80%`
+#### `PD_cluster_store_space_used_more_than_80%`
 
-* Alert rule:
+* Alarm Rule:
 
     `sum(pd_cluster_status{type="storage_size"}) / sum(pd_cluster_status{type="storage_capacity"}) * 100 > 80`
 
-* Description:
+* Rule Description:
 
-    Over 80% of the cluster space is occupied.
+    The cluster space utilization exceeds 80%.
 
-* Solution:
+* Handling Suggestions:
 
-    * Check whether it is needed to increase capacity.
-    * Check whether there is any file that occupies a large amount of disk space, such as the log, snapshot, and core dump.
+    * Confirm if capacity expansion is needed.
+    * Investigate if any files are occupying a large amount of disk space, such as log files, snapshots, or core dump files.
 
 #### `PD_system_time_slow`
 
@@ -384,21 +384,6 @@ This section gives the alert rules for the PD component.
 
     Check whether the system time is configured correctly.
 
-#### `PD_no_store_for_making_replica`
-
-* Alert rule:
-
-    `increase(pd_checker_event_count{type="replica_checker", name="no_target_store"}[1m]) > 0`
-
-* Description:
-
-    There is no appropriate store for additional replicas.
-
-* Solution:
-
-    * Check whether there is enough space in the store.
-    * Check whether there is any store for additional replicas according to the label configuration if it is configured.
-
 #### `PD_cluster_slow_tikv_nums`
 
 * Alert rule:

From 746c264272bfea00d53e16e1c52d2c1fe322c52f Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Mon, 23 Mar 2026 16:24:54 +0800
Subject: [PATCH 4/6] Apply suggestions from code review

---
 alert-rules.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/alert-rules.md b/alert-rules.md
index b50be8430cfdb..593f93c6c877c 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -246,15 +246,15 @@ This section gives the alert rules for the PD component.
 
 #### `PD_cluster_unhealthy_store_nums`
 
-* Alarm Rule:
+* Alert Rule:
 
     `(sum(pd_cluster_status{type="store_unhealth_count"}) by (instance) > 0) and (sum(etcd_server_is_leader) by (instance) > 0)`
 
-* Rule Description:
+* Description:
 
-    Indicates that there are stores in an abnormal state. If this state persists for a period of time (depending on the configured [`max-store-down-time`](/pd-configuration-file.md#max-store-down-time), which defaults to `30m`), the store may enter the `Offline` state and trigger the [`PD_cluster_down_store_nums`](#pd_cluster_down_store_nums) alarm.
+    Indicates that there are stores in an abnormal state. If this state persists for a period of time (depending on the configured [`max-store-down-time`](/pd-configuration-file.md#max-store-down-time), which defaults to `30m`), the store might enter the `Offline` state and trigger the [`PD_cluster_down_store_nums`](#pd_cluster_down_store_nums) alarm.
 
-* Handling Suggestions:
+* Suggestion:
 
     Check the status of TiKV/TiFlash.
 
@@ -357,18 +357,18 @@ This section gives the alert rules for the PD component.
 
 #### `PD_cluster_store_space_used_more_than_80%`
 
-* Alarm Rule:
+* Alert Rule:
 
     `sum(pd_cluster_status{type="storage_size"}) / sum(pd_cluster_status{type="storage_capacity"}) * 100 > 80`
 
-* Rule Description:
+* Description:
 
     The cluster space utilization exceeds 80%.
 
-* Handling Suggestions:
+* Suggestion:
 
-    * Confirm if capacity expansion is needed.
-    * Investigate if any files are occupying a large amount of disk space, such as log files, snapshots, or core dump files.
+    * Check whether it is needed to increase capacity.
+    * Check whether there is any file that occupies a large amount of disk space, such as the log, snapshot, and core dump.
 
 #### `PD_system_time_slow`
 

From e1828c8ca400eb5b685b29f69ce65e335333ca9a Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Mon, 23 Mar 2026 16:26:20 +0800
Subject: [PATCH 5/6] Apply suggestions from code review

---
 alert-rules.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/alert-rules.md b/alert-rules.md
index 593f93c6c877c..4f6bb9cd215d3 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -246,15 +246,15 @@ This section gives the alert rules for the PD component.
 
 #### `PD_cluster_unhealthy_store_nums`
 
-* Alert Rule:
+* Alert rule:
 
     `(sum(pd_cluster_status{type="store_unhealth_count"}) by (instance) > 0) and (sum(etcd_server_is_leader) by (instance) > 0)`
 
 * Description:
 
-    Indicates that there are stores in an abnormal state. If this state persists for a period of time (depending on the configured [`max-store-down-time`](/pd-configuration-file.md#max-store-down-time), which defaults to `30m`), the store might enter the `Offline` state and trigger the [`PD_cluster_down_store_nums`](#pd_cluster_down_store_nums) alarm.
+    Indicates that there are unhealthy stores. If the situation persists for some time (configured by [`max-store-down-time`](/pd-configuration-file.md#max-store-down-time), defaults to `30m`), the store is likely to change to `Offline` state, which triggers the [`PD_cluster_down_store_nums`](#pd_cluster_down_store_nums) alert.
 
-* Suggestion:
+* Solution:
 
     Check the status of TiKV/TiFlash.
 
@@ -357,7 +357,7 @@ This section gives the alert rules for the PD component.
 
 #### `PD_cluster_store_space_used_more_than_80%`
 
-* Alert Rule:
+* Alert rule:
 
     `sum(pd_cluster_status{type="storage_size"}) / sum(pd_cluster_status{type="storage_capacity"}) * 100 > 80`
 
@@ -365,7 +365,7 @@ This section gives the alert rules for the PD component.
 
     The cluster space utilization exceeds 80%.
 
-* Suggestion:
+* Solution:
 
     * Check whether it is needed to increase capacity.
     * Check whether there is any file that occupies a large amount of disk space, such as the log, snapshot, and core dump.

From 1af851e3ceb9848b8c32d8f95efe3b47ad045c1d Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Mon, 23 Mar 2026 16:27:00 +0800
Subject: [PATCH 6/6] Update alert-rules.md

---
 alert-rules.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/alert-rules.md b/alert-rules.md
index 4f6bb9cd215d3..0539328cd47e2 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -363,7 +363,7 @@ This section gives the alert rules for the PD component.
 
 * Description:
 
-    The cluster space utilization exceeds 80%.
+    Over 80% of the cluster space is occupied.
 
 * Solution:
 