---
title: Kubernetes cross-cluster UI (Preview)
tags:
- Integrations
- Kubernetes integration
- Understand and use data
metaDescription: Explore and triage your entire fleet of clusters by using K8s cross-cluster UI
redirects:
- /docs/integrations/kubernetes-integration/understand-use-data/kubernetes-cross-cluster
- /docs/integrations/kubernetes-integration/cluster-explorer/kubernetes-cross-cluster
- /docs/kubernetes-pixie/kubernetes-integration/understand-use-data
freshnessValidatedDate: never
---

<Callout title="Preview">
We're still working on this feature, but we'd love for you to try it out!

This feature is currently provided as part of a preview program pursuant to our [pre-release policies](/docs/licenses/license-information/referenced-policies/new-relic-pre-release-policy).
</Callout>


# Kubernetes Cross-Cluster UI: Fleet-Wide Clarity for Complex Environments

Modern Kubernetes environments have evolved into complex, **multi-cluster fleets**, but traditional observability tools often provide only a fractured, siloed view of individual clusters. Our **Kubernetes cross-cluster UI** provides a unified command center that transforms this multi-cluster complexity into **fleet-wide clarity**.

This unified view helps Platform Engineers and SREs to:
* **Unify fleet-wide observability** in a single view, whether your clusters are monitored by New Relic agents or OpenTelemetry.
* **Accelerate root cause analysis** with a guided triage workflow.
* **Reduce costs** by identifying wasted resources across the entire fleet.
* **Empower developers** with an application-centric view for self-service.

<img
title="K8s cross-cluster UI"
alt="K8s cross-cluster UI"
src="/images/k8s-cross-cluster.webp"
/>

## Accessing the New UI

You can access the Kubernetes Cross-Cluster UI from the main New Relic menu.

1. Navigate to the **Kubernetes** option in the main menu.
2. Click the blue **"Try it out"** button located in the top right corner of the page.


## Sections and Navigation

The Kubernetes cross-cluster UI is structured to provide both a high-level overview and deep-dive capabilities into your entire Kubernetes fleet.

* **Feedback Button**: Allows users to easily provide input on their experience with the UI.
* **Entity Filterbar**: Offers the ability to drill down and triage issues by applying filters based on various **tags/values** at the **cluster or node level**.
* **Cluster Filter**: Enables cluster filtering to isolate the UI to a single or a smaller set of Kubernetes clusters.
* **Node Filter**: Enables node filtering to isolate the UI to a single or a smaller set of worker nodes (hosts).
* **Colored Scorecards**: Display relevant high-level metrics to surface the most important issues that need attention across the entire fleet.
* Clicking a scorecard **orders the table** in the context of that metric, surfacing the clusters and nodes with the most critical issues at the top.
* It also enables a **line chart** to visualize the metric's evolution over the selected time range.
* **Line Chart**: Provides the **evolution of the metric** over the selected time frame for trend analysis.
* **Table**: Lists all entities in the fleet and provides **detailed metric data** for each cluster under the five main tabs.
* **Health Focus**: The table provides a filter to list only "unhealthy" entities, so you can focus your attention where it's needed.
* **Search**: A text search bar allows you to filter table rows based on **free text** entered.
* **Color-Coding**: Like the scorecards, table cells are color-coded (Green, Yellow, Red) to quickly draw attention to entities with issues that meet a specific **severity threshold**.
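The Green/Yellow/Red color-coding described above is essentially a threshold lookup. The sketch below is illustrative only (not the UI's actual implementation); the example cutoffs use the fleet-level Unhealthy Pods bands from the Overview tab (Yellow from 1% to 5%, Red above 5%).

```python
def severity(value: float, yellow_low: float, red_above: float) -> str:
    """Map a metric value to a color band given the Yellow floor and Red ceiling."""
    if value > red_above:
        return "Red"
    if value >= yellow_low:
        return "Yellow"
    return "Green"

# Fleet-level Unhealthy Pods %: Yellow starts at 1%, Red above 5%
print(severity(0.4, yellow_low=1.0, red_above=5.0))  # Green
print(severity(3.2, yellow_low=1.0, red_above=5.0))  # Yellow
print(severity(7.5, yellow_low=1.0, red_above=5.0))  # Red
```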

## Drill-Down Capabilities for Triage

The cross-cluster UI is specifically designed to accelerate root cause analysis with a **guided triage workflow**. The primary mechanism for this deep investigation is the ability to drill down from a fleet-wide metric to a single cluster's view:

* **From Scorecard to Cluster Ranking:** When you click a **Colored Scorecard** (e.g., Unhealthy Pods), the Clusters table immediately re-sorts. This places the clusters with the worst performance for that specific metric at the top, enabling you to identify where the issue is most critical across your entire fleet.
* **From Table Metric to Kubernetes Navigator:** By clicking a metric value within any cell of the table for a specific cluster or node, the system automatically launches the **Kubernetes Navigator**. The Navigator's view is automatically **predefined/filtered** for that specific cluster and metric of interest, allowing you to continue your investigation with a deep dive.

## Tab-Specific Metrics

The UI is organized into **five main tabs**: **Overview**, **Health**, **Performance**, **Resources**, and **Workloads**, each providing specific metrics both at the aggregate (Scorecard) and cluster/node (Metric) level, along with their associated color-coded thresholds for quick triage.

### 1. Overview Tab

The **Overview** tab provides a general summary of the **operational health, capacity, and risk** across your entire Kubernetes fleet.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **Total Clusters** | The total count of Kubernetes clusters currently reporting telemetry. | N/A | Verifies the connectivity and scope of your entire fleet monitoring estate. | N/A |
| **Unhealthy Nodes** | Percentage of nodes in the fleet reporting **NotReady** or **Unknown**. | Percentage of nodes in the specific cluster reporting **NotReady**. | Signals widespread instability and potential zone failures across the infrastructure. | **Fleet:** Green: 0%; Yellow: 1% to 5%; Red: > 5%. **Cluster:** Yellow: 1% to 5%; Red: > 5%. |
| **Unhealthy Pods** | Percentage of pods in the fleet in **Pending, Failed, or Unknown** states. | Percentage of pods in the specific cluster in **Pending, Failed, or Unknown** states. | Measures widespread application instability across all clusters. | **Fleet:** Green: < 1%; Yellow: 1% to 5%; Red: > 5%. **Cluster:** Yellow: 1% to 5%; Red: > 5%. |
| **Unhealthy Workloads** | Count of workloads (Deployments, DaemonSets, or StatefulSets) in the fleet with **missing replicas**. | Count of workloads in the specific cluster with **missing replicas**. | Indicates failure to reconcile desired states across the organization. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Cluster:** Yellow: 1 to 3; Red: ≥ 4. |
| **CPU Usage % (vs Total)** | Percentage of total fleet CPU capacity currently consumed. | Percentage of the specific cluster's CPU capacity currently consumed. | Measures aggregate node saturation risk across the entire estate. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Memory Usage % (vs Total)** | Percentage of total fleet memory capacity currently consumed. | Percentage of the specific cluster's memory capacity currently consumed. | High values indicate risk of aggregate eviction storms across clusters. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Alerts** | Count of **active high-severity alerts** across the fleet. | Count of **active high-severity alerts** for this specific cluster. | Tracks active incidents that require operator intervention. | **Fleet:** Green: 0; Yellow: 1 to 10; Red: ≥ 11. **Cluster:** Yellow: 1 to 5; Red: ≥ 6. |
| **Warning Events** | Volume of **Warning type K8s cluster events** across the fleet. | Volume of **Warning type K8s cluster events** in this cluster. | High noise levels obscure critical alerts and indicate config debt. | **Fleet:** Green: 0; Yellow: 1 to 50; Red: > 50. **Cluster:** Yellow: 1 to 20; Red: ≥ 21. |

</details>
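The Unhealthy Nodes and Unhealthy Pods figures above are straightforward ratios of bad-state entities to total entities. The sketch below is hypothetical arithmetic for illustration only, assuming you already have status counts from your own tooling; it is not how New Relic computes the values.

```python
def unhealthy_pct(unhealthy_count: int, total: int) -> float:
    """Share of entities in a bad state, as a percentage (empty fleet -> 0.0)."""
    return 100.0 * unhealthy_count / total if total else 0.0

# 3 of 120 nodes are NotReady/Unknown -> 2.5%, inside the 1% to 5% Yellow band
print(round(unhealthy_pct(3, 120), 1))  # 2.5
```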

---

### 2. Health Tab

The **Health** tab focuses on metrics related to the **availability and current status** of your nodes and pods to detect immediate failure risks.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **Nodes Total** | Total count of nodes detected in the fleet. | Count of nodes in the specific cluster. | Zero indicates fundamental control plane failure. | Red: 0. |
| **Nodes Ready** | Percentage of nodes in the fleet reporting **Ready** status. | Percentage of nodes in the cluster reporting **Ready** status. | Availability metric; values below 95% signal severe fleet instability. | **Fleet:** Green: 100%; Yellow: 95% to 99%; Red: < 95%. **Cluster:** Yellow: 99%; Red: < 99%. |
| **Nodes Memory Pressure** | Percentage of nodes in the fleet rejecting pods due to **low memory**. | Percentage of nodes in the cluster rejecting pods due to **low memory**. | Predicts widespread eviction storms. | **Fleet:** Green: 0%; Yellow: > 0% to 5%; Red: ≥ 6%. **Cluster:** Yellow: > 0% to < 1%; Red: ≥ 1%. |
| **Nodes Disk Pressure** | Percentage of nodes in the fleet with **low disk availability**. | Percentage of nodes in the cluster with **low disk availability**. | Risk of widespread node failures due to disk exhaustion. | **Fleet:** Green: 0%; Yellow: > 0% to 5%; Red: ≥ 6%. **Cluster:** Yellow: > 0% to < 1%; Red: ≥ 1%. |
| **Pods Running** | Percentage of fleet pods successfully in the **Running** phase. | Percentage of cluster pods successfully in the **Running** phase. | Operational success rate; < 90% indicates widespread failure. | **Fleet:** Green: 95% to 100%; Yellow: 90% to 94%; Red: < 90%. **Cluster:** Yellow: 95% to 99%; Red: < 95%. |
| **Pods Pending** | Sustained count of pods **waiting to be scheduled** across the fleet. | Sustained count of pods **waiting to be scheduled**. | Indicates scheduler failure or capacity starvation. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Cluster:** Yellow: 1; Red: ≥ 2. |
| **Container Restarts** | Count of restarts **per period** across the fleet. | Count of restarts **per period**. | Symptom of CrashLoopBackOff or OOMKills. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Cluster:** Yellow: 1; Red: ≥ 2. |
| **Network Errors / sec** | Rate of **packet drops/errors** across the fleet. | Rate of **packet drops/errors**. | Symptom of hardware failure or systemic CNI issues. | **Fleet:** Green: < 10; Yellow: 11 to 50; Red: > 50. **Cluster:** Yellow: 5 to 10; Red: ≥ 11. |

</details>
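The node-level signals in this tab (Ready, memory pressure, disk pressure) correspond to standard Kubernetes node conditions. A minimal sketch of tallying those conditions, assuming a simplified node representation rather than the real Kubernetes API object shape:

```python
from collections import Counter

def tally_conditions(nodes):
    """Count how many nodes report each condition as True (Ready, MemoryPressure, ...)."""
    counts = Counter()
    for node in nodes:
        for condition, is_true in node["conditions"].items():
            if is_true:
                counts[condition] += 1
    return counts

nodes = [
    {"name": "n1", "conditions": {"Ready": True, "MemoryPressure": False}},
    {"name": "n2", "conditions": {"Ready": False, "MemoryPressure": True}},
]
counts = tally_conditions(nodes)  # one Ready node, one node under memory pressure
print(counts["Ready"], counts["MemoryPressure"])  # 1 1
```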

---

### 3. Performance Tab

The **Performance** tab measures **utilization against total capacity and limits** to identify node saturation and risk of performance degradation.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **CPU Usage % vs Total** | % of fleet **CPU capacity** used. | % of cluster/node **CPU capacity** used. | Measures saturation risk across the entire estate. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **CPU Usage % vs Limits** | % of **CPU hard limits** consumed across the fleet. | % of **CPU hard limits** consumed in the cluster. | Measures proximity to throttling. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Memory Usage % vs Total** | % of fleet **memory capacity** used. | % of cluster/node **memory capacity** used. | Proximity to eviction thresholds. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Memory Usage % vs Limits** | % of **RAM hard limits** consumed across the fleet. | % of **RAM hard limits** consumed. | Proximity to OOMKill events. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **FS Usage %** | % of **disk capacity** consumed across the fleet. | % of **disk capacity** consumed. | Critical to prevent kubelet crashes. | Yellow: 80% to 95%; Red: > 95%. |
| **Pods Capacity %** | % of **max allowed pod count** scheduled across the fleet. | % of **max allowed pod count** scheduled. | Measures resource density and slot exhaustion. | **Fleet:** Green: < 85%; Yellow: 85% to 100%; Red: 100%. **Cluster:** Yellow: 85% to 100%; Red: 100%. |

</details>
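The "vs Limits" metrics above express current usage as a share of the configured hard limit. A hedged sketch of that arithmetic, with hypothetical core counts chosen for illustration:

```python
def usage_vs_limits_pct(usage_cores: float, limit_cores: float) -> float:
    """CPU used as a percentage of the configured hard limit (no limit -> 0.0)."""
    return 100.0 * usage_cores / limit_cores if limit_cores else 0.0

# 1.5 cores used against a 2-core limit -> 75.0%, the start of the Yellow band
print(usage_vs_limits_pct(1.5, 2.0))  # 75.0
```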

---

### 4. Resources Tab

The **Resources** tab analyzes **Request vs. Usage and Request vs. Limits** to assess resource allocation efficiency and identify waste or sizing inaccuracies.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **CPU Usage % vs Request** | % of **CPU usage vs requests** across the fleet. | % of **CPU usage vs requests**. | Governance metric for sizing accuracy and waste. | Yellow: < 70% or > 200%. |
| **CPU Request % vs Limits** | % of **CPU request vs limit** across the fleet. | % of **CPU request vs limit**. | Measures performance headroom and burst capacity. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Memory Usage % vs Request** | % of **memory usage vs requests** across the fleet. | % of **memory usage vs requests**. | Governance metric for memory sizing accuracy. | Yellow: < 70% or > 200%. |
| **Memory Request % vs Limits** | % of **limit reserved by requests** across the fleet. | % of **limit reserved by requests**. | Helps assess resource allocation efficiency and potential overprovisioning. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Pod Evictions** | Count of **forced pod terminations by the kubelet** in the fleet. | Count of **forced pod terminations by the kubelet**. | Symptom of node resource starvation. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Cluster:** Yellow: 1; Red: ≥ 2. |

</details>
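The "Usage % vs Request" thresholds above flag both under-use (below 70% of the request, suggesting overprovisioning) and over-use (above 200%, suggesting undersized requests). An illustrative sketch of that band check, not New Relic's implementation:

```python
def sizing_flag(usage: float, request: float) -> str:
    """Flag over- or under-provisioning relative to the resource request."""
    if not request:
        return "No request set"
    pct = 100.0 * usage / request
    return "Yellow" if pct < 70 or pct > 200 else "Green"

print(sizing_flag(usage=0.3, request=1.0))  # Yellow (30% used: overprovisioned)
print(sizing_flag(usage=0.9, request=1.0))  # Green
print(sizing_flag(usage=2.5, request=1.0))  # Yellow (250%: request undersized)
```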

---

### 5. Workloads Tab

The **Workloads** tab provides an application-centric view, focusing on metrics such as **throttling, restarts, and missing replicas** to diagnose application instability and poor QoS.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **CPU Usage % vs Limits** | % of **CPU hard limits** consumed by workloads. | % of **CPU hard limits** consumed by the specific workload. | Measures performance cap proximity. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **CPU Throttling %** | Ratio of **throttled time to active time** across the fleet. | Ratio of **throttled time to active time** for the workload. | Measures latency/lag caused by CFS quotas. | **Fleet:** Green: < 10%; Yellow: 10% to 50%; Red: > 50%. **Workload:** Yellow: 10% to 50%; Red: > 50%. |
| **Memory Usage % vs Limits** | % of **memory limits** consumed across workloads. | % of **memory limits** consumed by the specific workload. | Proximity to OOMKill. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Network Errors / sec** | Error rate on **network interfaces** across the fleet. | Error rate on **network interfaces** for the workload. | Signals application failure or connectivity loss. | **Fleet:** Green: near 0; Yellow: ≤ 10; Red: ≥ 51. **Workload:** Green: ≤ 2; Yellow: 3 to 15; Red: > 15. |
| **Pods Total** | Total count of pods in fleet workloads. | Total count of pods for the specific workload. | Baseline metric for capacity. | Red: 0. |
| **Pods Missing** | Gap between **desired and ready replicas** in the fleet. | Gap between **desired and ready replicas** for the workload. | Indicates service degradation. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Workload:** Yellow: 1; Red: ≥ 2. |
| **Pod Evictions** | Count of **evictions** for workloads in the fleet. | Count of **evictions** for the specific workload. | The workload is causing node pressure or is misconfigured. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Workload:** Yellow: 1; Red: ≥ 2. |
| **Container Restarts** | Count of **restarts** for workloads in the fleet. | Count of **restarts** for the specific workload. | Persistent failure state (CrashLoopBackOff). | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Workload:** Yellow: 1; Red: ≥ 2. |
| **Container Images Restarting** | Restarts caused by **image/config errors** in the fleet. | Restarts caused by **image/config errors** for the workload. | Specific symptom of registry failure or bad tags. | **Fleet:** Green: 0; Yellow: 1; Red: ≥ 2. **Workload:** Yellow: 1; Red: ≥ 2. |

</details>
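The Pods Missing metric above is the gap between a workload's desired replica count and its ready replicas. A minimal sketch of that computation, with hypothetical replica counts:

```python
def pods_missing(desired: int, ready: int) -> int:
    """Gap between desired and ready replicas for a workload (never negative)."""
    return max(desired - ready, 0)

# A Deployment wants 5 replicas but only 3 are ready -> 2 missing,
# which crosses the workload-level Red threshold (>= 2)
print(pods_missing(desired=5, ready=3))  # 2
```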