From 8acce78d3713db9d3b240a4139fb084d4fc02367 Mon Sep 17 00:00:00 2001
From: Didar Shayarov
Date: Wed, 13 May 2026 23:17:48 +0300
Subject: [PATCH 1/3] IGNITE-28671 Describe healthy cluster behavior in general tips guide

---
 .../general-perf-tips.adoc | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc b/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
index 99ec7de8f4d80..800c0dd67993e 100644
--- a/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
+++ b/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
@@ -12,7 +12,7 @@
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
-= Generic Performance Tips
+= General Performance Tips
 
 Ignite as distributed storages and platforms require certain optimization techniques. Before you dive into the more advanced techniques described in this and other articles, consider the following basic checklist:
@@ -47,3 +47,20 @@ queries with JOINs at massive scale and expect significant performance benefits.
 
 * Adjust link:data-rebalancing[data rebalancing settings] to ensure that rebalancing completes faster when your cluster topology changes.
 
+== What healthy cluster behavior looks like
+
+A healthy Ignite cluster is not defined by a single latency, CPU, or memory number. In practice, it is a cluster whose topology is stable, whose cluster state and baseline match the intended deployment, whose partitions are not lost or divergent, whose rebalancing and checkpointing complete in bounded time, and whose execution queues and memory pools return to a steady level after short-lived spikes. Ignite exposes these signals through built-in metrics, system views, and the control script rather than through a single aggregate health score.
+
+When checking whether a cluster is healthy, start with topology and cluster state. The cluster should be in the expected state, usually ACTIVE, and the number of server and client nodes should be stable. If native persistence is enabled, the baseline should also be in the expected shape: for a stable deployment, the nodes that are expected to be online should appear online both in baseline-related metrics and in the SYS.BASELINE_NODES system view. Frequent unexpected topology changes are not normal and should be treated as a sign of node instability or network problems.
+
+Then check data safety and convergence. A healthy cluster does not have lost partitions, and consistency checks such as control.sh --cache idle_verify should not report conflict partitions when the cluster is idle. After a topology event, transient rebalancing is expected, but it should converge: KeysToRebalanceLeft should trend to zero, and partition states should settle back to OWNING rather than remain in MOVING, RENTING, or LOST.
+
+Next, check execution pressure. Communication, discovery, and thread-pool queues may spike under load, but they should not grow continuously. In Ignite, sustained growth of OutboundMessagesQueueSize, MessageWorkerQueueSize, or thread pool queue sizes means that the node is not keeping up with the workload or that message processing is impaired. The same logic applies to the striped executor: temporary backlog can happen, but a persistent backlog or repeating starvation warnings are signs of contention, hot partitions, or blocked internal processing. Use SYS.STRIPED_THREADPOOL_QUEUE, SYS.TRANSACTIONS, and SYS.SQL_QUERIES for a live view of the work that is not draining.
+
+Checkpointing and transactions should also remain bounded. Checkpoint activity can slow the cluster down, so LastCheckpointDuration should be monitored together with dirty pages and disk behavior. Transactions and queries can legitimately take longer during bursts, but healthy steady-state behavior means that lock-holding transactions, long-running transactions, and long-running SQL queries do not accumulate over time. If long transactions repeatedly block partition map exchange, use transaction timeout settings such as TxTimeoutOnPartitionMapExchange and investigate the application path that keeps transactions open.
+
+Finally, check the underlying JVM and critical workers. Ignite treats IgniteOutOfMemoryException, OutOfMemoryError, system worker termination, system worker hangs, and cluster node segmentation as critical failures. A healthy cluster should not emit blocked system-critical worker messages, and JVM resource pools should stay comfortably below exhaustion. In practice, monitor heap usage, direct buffer usage, and open file descriptors continuously, because all three are finite pools and approaching their limits usually means the node is already close to a failure condition rather than merely under benign load.
+
+If CDC is enabled, include it in the health model explicitly. A healthy CDC pipeline is not just a running process: it is a process whose WAL segment consumption advances, whose ignite-cdc.log does not contain repeated warnings or errors, and whose segment sequence has no gaps. Missed segments indicate lost changes and should be treated as a data-delivery problem, not as a minor logging issue.
+
+As a practical rule, if topology is stable, baseline is correct, partitions are consistent, queues drain, checkpointing and rebalancing converge, long-lived transactions and queries do not accumulate, and JVM and OS resource pools remain below saturation, the cluster is usually in a normal operational state. Temporary spikes are acceptable during rebalancing, index rebuild, snapshots, bulk loading, or analytical queries, but healthy systems converge back to their baseline after the event ends.
From ed59869e344c99d5c9c3fbce00f0ff608d74a05b Mon Sep 17 00:00:00 2001
From: Didar Shayarov
Date: Thu, 14 May 2026 22:53:27 +0300
Subject: [PATCH 2/3] fix review, add cross links, reduce cdc section, add minors

---
 .../general-perf-tips.adoc | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc b/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
index 800c0dd67993e..38c2979dfad04 100644
--- a/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
+++ b/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
@@ -49,18 +49,18 @@ queries with JOINs at massive scale and expect significant performance benefits.
 
 == What healthy cluster behavior looks like
 
-A healthy Ignite cluster is not defined by a single latency, CPU, or memory number. In practice, it is a cluster whose topology is stable, whose cluster state and baseline match the intended deployment, whose partitions are not lost or divergent, whose rebalancing and checkpointing complete in bounded time, and whose execution queues and memory pools return to a steady level after short-lived spikes. Ignite exposes these signals through built-in metrics, system views, and the control script rather than through a single aggregate health score.
+A healthy Ignite cluster has a stable topology, the expected cluster state and baseline, a consistent partition state, rebalancing and checkpointing activity that converges, and execution queues that return to normal levels after temporary load spikes. Ignite exposes these signals through metrics, system views, logs, and the control script; there is no single aggregate health indicator that can describe all deployments and workloads.
 
-When checking whether a cluster is healthy, start with topology and cluster state. The cluster should be in the expected state, usually ACTIVE, and the number of server and client nodes should be stable. If native persistence is enabled, the baseline should also be in the expected shape: for a stable deployment, the nodes that are expected to be online should appear online both in baseline-related metrics and in the SYS.BASELINE_NODES system view. Frequent unexpected topology changes are not normal and should be treated as a sign of node instability or network problems.
+Start with topology and cluster state. The cluster should be in the expected state, usually link:../monitoring-metrics/cluster-states[ACTIVE] for read-write workloads, and the number of server and client nodes should match the intended deployment. If native persistence is enabled, check the link:../clustering/baseline-topology[baseline topology] as well: nodes that are expected to store data should be present and online according to `control.sh --baseline` and `SYS.BASELINE_NODES`. Frequent unexpected topology changes are not normal and usually indicate node instability, network issues, or misconfigured discovery.
 
-Then check data safety and convergence. A healthy cluster does not have lost partitions, and consistency checks such as control.sh --cache idle_verify should not report conflict partitions when the cluster is idle. After a topology event, transient rebalancing is expected, but it should converge: KeysToRebalanceLeft should trend to zero, and partition states should settle back to OWNING rather than remain in MOVING, RENTING, or LOST.
+Then check data safety and convergence. A healthy cluster does not have lost partitions, and consistency checks such as `control.sh --cache idle_verify` should not report conflict partitions when the cluster is idle. After a topology event, transient rebalancing is expected, but it should converge: rebalance progress should move toward completion, and partition states should settle back to `OWNING` rather than remain in `MOVING`, `RENTING`, or `LOST`. Use `SYS.PARTITION_STATES` to inspect the current partition state.
 
-Next, check execution pressure. Communication, discovery, and thread-pool queues may spike under load, but they should not grow continuously. In Ignite, sustained growth of OutboundMessagesQueueSize, MessageWorkerQueueSize, or thread pool queue sizes means that the node is not keeping up with the workload or that message processing is impaired. The same logic applies to the striped executor: temporary backlog can happen, but a persistent backlog or repeating starvation warnings are signs of contention, hot partitions, or blocked internal processing. Use SYS.STRIPED_THREADPOOL_QUEUE, SYS.TRANSACTIONS, and SYS.SQL_QUERIES for a live view of the work that is not draining.
+Next, check execution pressure. Communication, discovery, and thread-pool queues may grow for a short time under load, but they should not grow continuously. Sustained queue growth means that a node is not keeping up with the workload or that message processing is impaired. The same logic applies to the striped executor: temporary backlog can happen, but persistent backlog or repeating starvation warnings are signs of contention, hot partitions, or blocked internal processing. Use `SYS.STRIPED_THREADPOOL_QUEUE`, `SYS.TRANSACTIONS`, and `SYS.SQL_QUERIES` to inspect work that is not draining.
 
-Checkpointing and transactions should also remain bounded. Checkpoint activity can slow the cluster down, so LastCheckpointDuration should be monitored together with dirty pages and disk behavior. Transactions and queries can legitimately take longer during bursts, but healthy steady-state behavior means that lock-holding transactions, long-running transactions, and long-running SQL queries do not accumulate over time. If long transactions repeatedly block partition map exchange, use transaction timeout settings such as TxTimeoutOnPartitionMapExchange and investigate the application path that keeps transactions open.
+Checkpointing, transactions, and queries should also remain bounded. Checkpoint activity can slow the cluster down, so monitor checkpoint duration together with dirty pages and storage I/O. Transactions and queries can legitimately take longer during short bursts, but in a healthy steady state lock-holding transactions, long-running transactions, and long-running SQL queries do not accumulate over time. If long transactions repeatedly block partition map exchange, consider configuring a partition-map-exchange transaction timeout, such as `TransactionConfiguration.setTxTimeoutOnPartitionMapExchange(...)`, and investigate the application path that keeps transactions open.
 
-Finally, check the underlying JVM and critical workers. Ignite treats IgniteOutOfMemoryException, OutOfMemoryError, system worker termination, system worker hangs, and cluster node segmentation as critical failures. A healthy cluster should not emit blocked system-critical worker messages, and JVM resource pools should stay comfortably below exhaustion. In practice, monitor heap usage, direct buffer usage, and open file descriptors continuously, because all three are finite pools and approaching their limits usually means the node is already close to a failure condition rather than merely under benign load.
+Finally, check JVM resources and critical workers. Ignite reports conditions such as `IgniteOutOfMemoryException`, `OutOfMemoryError`, system worker termination, system worker hangs, and cluster node segmentation as critical failures and passes them to the configured failure handler. Depending on the failure handler, this may result in node invalidation, failover handling, node stop, or JVM termination. A healthy cluster should not repeatedly emit blocked system-critical worker messages, and JVM/OS resource pools should remain below configured or environment-specific limits. Monitor heap usage, direct buffer usage, and open file descriptors continuously because all three are finite resources.
 
-If CDC is enabled, include it in the health model explicitly. A healthy CDC pipeline is not just a running process: it is a process whose WAL segment consumption advances, whose ignite-cdc.log does not contain repeated warnings or errors, and whose segment sequence has no gaps. Missed segments indicate lost changes and should be treated as a data-delivery problem, not as a minor logging issue.
+If CDC is enabled, include it in the same operational check. A healthy CDC process consumes WAL segments without gaps; repeated warnings or errors in `ignite-cdc.log`, or missed WAL segments, indicate a data-delivery risk and should not be treated as a normal performance spike.
 
-As a practical rule, if topology is stable, baseline is correct, partitions are consistent, queues drain, checkpointing and rebalancing converge, long-lived transactions and queries do not accumulate, and JVM and OS resource pools remain below saturation, the cluster is usually in a normal operational state. Temporary spikes are acceptable during rebalancing, index rebuild, snapshots, bulk loading, or analytical queries, but healthy systems converge back to their baseline after the event ends.
+As a practical rule, if topology is stable, baseline is correct, partitions are consistent, queues drain, checkpointing and rebalancing converge, long-lived transactions and queries do not accumulate, and JVM and OS resource pools remain below their limits, the cluster is usually in a normal operational state. Temporary spikes are acceptable during rebalancing, index rebuild, snapshots, bulk loading, or analytical queries, but healthy systems converge back to their baseline after the event ends.
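The difference between a temporary spike and sustained growth in the practical rule above can be made concrete with a small check over sampled values. This is a minimal sketch and not part of Ignite: the class `QueueTrend`, its methods, the window size, and the sample values are all hypothetical. It assumes the samples are collected at a fixed interval by whatever monitoring already reads the queue-size metrics or system views.

```java
import java.util.Arrays;

/** Hypothetical helper (not an Ignite API): classifies a series of queue-size samples. */
final class QueueTrend {
    private QueueTrend() {}

    /**
     * Returns true when the last {@code window} samples never decrease and end
     * strictly higher than they started, i.e. the queue did not drain at all.
     */
    static boolean isSustainedGrowth(long[] samples, int window) {
        if (window < 2 || samples.length < window) {
            return false; // not enough data to judge a trend
        }
        int start = samples.length - window;
        for (int i = start + 1; i < samples.length; i++) {
            if (samples[i] < samples[i - 1]) {
                return false; // the queue drained at least once inside the window
            }
        }
        return samples[samples.length - 1] > samples[start];
    }

    /** Returns true when the latest sample is at or below baseline + tolerance. */
    static boolean returnedToBaseline(long[] samples, long baseline, long tolerance) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        return samples[samples.length - 1] <= baseline + tolerance;
    }

    public static void main(String[] args) {
        long[] spike = {5, 6, 400, 350, 20, 6};   // illustrative values: burst that drains
        long[] growth = {5, 9, 20, 48, 90, 170};  // illustrative values: backlog that keeps growing
        for (long[] series : new long[][] {spike, growth}) {
            System.out.println(Arrays.toString(series)
                + " sustainedGrowth=" + isSustainedGrowth(series, 4)
                + " backToBaseline=" + returnedToBaseline(series, 5, 2));
        }
    }
}
```

The window and tolerance are deliberately left to the caller, because what counts as a short-lived spike depends on the workload and on the sampling interval.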
From f3e3216e0240dfc407633b38af94af23d8edf0bf Mon Sep 17 00:00:00 2001
From: Didar Shayarov
Date: Tue, 19 May 2026 00:15:03 +0300
Subject: [PATCH 3/3] IGNITE-28671 Simplify adoc links

---
 docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc b/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
index 38c2979dfad04..d11a544b2b638 100644
--- a/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
+++ b/docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc
@@ -51,7 +51,7 @@ queries with JOINs at massive scale and expect significant performance benefits.
 
 A healthy Ignite cluster has a stable topology, the expected cluster state and baseline, a consistent partition state, rebalancing and checkpointing activity that converges, and execution queues that return to normal levels after temporary load spikes. Ignite exposes these signals through metrics, system views, logs, and the control script; there is no single aggregate health indicator that can describe all deployments and workloads.
 
-Start with topology and cluster state. The cluster should be in the expected state, usually link:../monitoring-metrics/cluster-states[ACTIVE] for read-write workloads, and the number of server and client nodes should match the intended deployment. If native persistence is enabled, check the link:../clustering/baseline-topology[baseline topology] as well: nodes that are expected to store data should be present and online according to `control.sh --baseline` and `SYS.BASELINE_NODES`. Frequent unexpected topology changes are not normal and usually indicate node instability, network issues, or misconfigured discovery.
+Start with topology and cluster state. The cluster should be in the expected state, usually link:monitoring-metrics/cluster-states[ACTIVE] for read-write workloads, and the number of server and client nodes should match the intended deployment. If native persistence is enabled, check the link:clustering/baseline-topology[baseline topology] as well: nodes that are expected to store data should be present and online according to `control.sh --baseline` and `SYS.BASELINE_NODES`. Frequent unexpected topology changes are not normal and usually indicate node instability, network issues, or misconfigured discovery.
 
 Then check data safety and convergence. A healthy cluster does not have lost partitions, and consistency checks such as `control.sh --cache idle_verify` should not report conflict partitions when the cluster is idle. After a topology event, transient rebalancing is expected, but it should converge: rebalance progress should move toward completion, and partition states should settle back to `OWNING` rather than remain in `MOVING`, `RENTING`, or `LOST`. Use `SYS.PARTITION_STATES` to inspect the current partition state.
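The partition-state rule in the last paragraph can be read as a three-way classification of one snapshot of states. The sketch below is illustrative only: `PartitionConvergence` and `Verdict` are hypothetical names, not Ignite APIs. It assumes the state names have already been read as plain strings, for example from `SYS.PARTITION_STATES`, and it handles only the four states named in the text.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not an Ignite API): classifies one snapshot of partition states. */
final class PartitionConvergence {
    enum Verdict { SETTLED, UNSETTLED, DATA_LOSS }

    private PartitionConvergence() {}

    /**
     * LOST anywhere means data loss; MOVING or RENTING means rebalancing has
     * not settled yet; otherwise the snapshot is treated as settled.
     * Any state other than the four named in the text is ignored here.
     */
    static Verdict classify(List<String> states) {
        boolean transitional = false;
        for (String state : states) {
            if ("LOST".equals(state)) {
                return Verdict.DATA_LOSS; // takes priority over everything else
            }
            if ("MOVING".equals(state) || "RENTING".equals(state)) {
                transitional = true;
            }
        }
        return transitional ? Verdict.UNSETTLED : Verdict.SETTLED;
    }

    public static void main(String[] args) {
        // Illustrative input, not real cluster output.
        List<String> snapshot = Arrays.asList("OWNING", "MOVING", "OWNING");
        System.out.println(classify(snapshot));
    }
}
```

A single `UNSETTLED` snapshot is expected right after a topology event. What matters is whether repeated snapshots eventually return `SETTLED`.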