---
title: Kubernetes cross-cluster UI (Preview)
tags:
- Integrations
- Kubernetes integration
- Understand and use data
metaDescription: Explore and triage your entire fleet of clusters by using K8s cross-cluster UI
redirects:
- /docs/integrations/kubernetes-integration/understand-use-data/kubernetes-cross-cluster
- /docs/integrations/kubernetes-integration/cluster-explorer/kubernetes-cross-cluster
- /docs/kubernetes-pixie/kubernetes-integration/understand-use-data
freshnessValidatedDate: never
---

<Callout title="Preview">
We're still working on this feature, but we'd love for you to try it out!

This feature is currently provided as part of a preview program pursuant to our [pre-release policies](/docs/licenses/license-information/referenced-policies/new-relic-pre-release-policy).
</Callout>


# Kubernetes Cross-Cluster UI: Fleet-Wide Clarity for Complex Environments

Modern Kubernetes environments have evolved into complex, **multi-cluster fleets**, but traditional observability tools often provide only a fractured, siloed view of individual clusters. Our **Kubernetes cross-cluster UI** provides a unified command center that transforms this multi-cluster complexity into **fleet-wide clarity**.

This unified view helps Platform Engineers and SREs to:
* **Unify fleet-wide observability** in a single view, whether your clusters are monitored by New Relic agents or OpenTelemetry.
* **Accelerate root cause analysis** with a guided triage workflow.
* **Reduce costs** by identifying wasted resources across the entire fleet.
* **Empower developers** with an application-centric view for self-service.

<img
title="K8s cross-cluster UI"
alt="K8s cross-cluster UI"
src="/images/k8s-cross-cluster.webp"
/>

## Accessing the New UI

You can access the Kubernetes Cross-Cluster UI from the main New Relic menu.

1. Navigate to the **Kubernetes** option in the main menu.
2. Click the blue **"Try it out"** button located in the top right corner of the page.


## Sections and Navigation

The Kubernetes cross-cluster UI is structured to provide both a high-level overview and deep-dive capabilities into your entire Kubernetes fleet.

* **Feedback Button**: Allows users to easily provide input on their experience with the UI.
* **Entity Filterbar**: Offers the ability to drill down and triage issues by applying filters based on various **tags/values** at the **cluster or node level**.
* **Cluster Filter**: Enables cluster filtering to isolate the UI to a single or a smaller set of Kubernetes clusters.
* **Node Filter**: Enables node filtering to isolate the UI to a single or a smaller set of worker nodes (hosts).
* **Colored Scorecards**: Display relevant high-level metrics to surface the most important issues that need attention across the entire fleet.
* Clicking a scorecard **orders the table** in the context of that metric, surfacing the clusters and nodes with the most critical issues at the top.
* It also enables a **line chart** to visualize the metric's evolution over the selected time range.
* **Line Chart**: Provides the **evolution of the metric** over the selected time frame for trend analysis.
* **Table**: Lists all entities in the fleet and provides **detailed metric data** for each cluster under the five main tabs.
* **Health Focus**: The table provides a filter to list only "unhealthy" entities, so you can focus your attention where it's needed.
* **Search**: A text search bar allows you to filter table rows based on **free text** entered.
* **Color-Coding**: Like the scorecards, table cells are color-coded (Green, Yellow, Red) to quickly draw attention to entities with issues that meet a specific **severity threshold**.
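The Green/Yellow/Red color-coding described above is essentially a threshold lookup. The sketch below is illustrative only (not the UI's actual implementation); the example cutoffs use the fleet-level Unhealthy Pods bands from the Overview tab (Yellow from 1% to 5%, Red above 5%).

```python
def severity(value: float, yellow_low: float, red_above: float) -> str:
    """Map a metric value to a color band given the Yellow floor and Red ceiling."""
    if value > red_above:
        return "Red"
    if value >= yellow_low:
        return "Yellow"
    return "Green"

# Fleet-level Unhealthy Pods %: Yellow starts at 1%, Red above 5%
print(severity(0.4, yellow_low=1.0, red_above=5.0))  # Green
print(severity(3.2, yellow_low=1.0, red_above=5.0))  # Yellow
print(severity(7.5, yellow_low=1.0, red_above=5.0))  # Red
```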

## Drill-Down Capabilities for Triage

The cross-cluster UI is specifically designed to accelerate root cause analysis with a **guided triage workflow**. The primary mechanism for this deep investigation is the ability to drill down from a fleet-wide metric to a single cluster's view:

* **From Scorecard to Cluster Ranking:** When you click a **Colored Scorecard** (e.g., Unhealthy Pods), the Clusters table immediately re-sorts. This places the clusters with the worst performance for that specific metric at the top, enabling you to identify where the issue is most critical across your entire fleet.
* **From Table Metric to Kubernetes Navigator:** By clicking a metric value within any cell of the table for a specific cluster or node, the system automatically launches the **Kubernetes Navigator**. The Navigator's view is automatically **predefined/filtered** for that specific cluster and metric of interest, allowing you to continue your investigation with a deep dive.

## Tab-Specific Metrics

The UI is organized into **five main tabs**: **Overview**, **Health**, **Performance**, **Resources**, and **Workloads**, each providing specific metrics both at the aggregate (Scorecard) and cluster/node (Metric) level, along with their associated color-coded thresholds for quick triage.

### 1. Overview Tab

The **Overview** tab provides a general summary of the **operational health, capacity, and risk** across your entire Kubernetes fleet.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **Total Clusters** | The total count of Kubernetes clusters currently reporting telemetry. | N/A | Verifies the connectivity and scope of your entire fleet monitoring estate. | N/A |
| **Unhealthy Nodes** | Percentage of nodes in the fleet reporting **NotReady** or **Unknown**. | Percentage of nodes in the specific cluster reporting **NotReady**. | Signals widespread instability and potential zone failures across the infrastructure. | **Fleet:** Green: 0%; Yellow: 1% to 5%; Red: > 5%. **Cluster:** Yellow: 1% to 5%; Red: > 5%. |
| **Unhealthy Pods** | Percentage of pods in the fleet in **Pending, Failed, or Unknown** states. | Percentage of pods in the specific cluster in **Pending, Failed, or Unknown** states. | Measures widespread application instability across all clusters. | **Fleet:** Green: < 1%; Yellow: 1% to 5%; Red: > 5%. **Cluster:** Yellow: 1% to 5%; Red: > 5%. |
| **Unhealthy Workloads** | Count of workloads (Deployments, DaemonSets, or StatefulSets) in the fleet with **missing replicas**. | Count of workloads in the specific cluster with **missing replicas**. | Indicates failure to reconcile desired states across the organization. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Cluster:** Yellow: 1 to 3; Red: ≥ 4. |
| **CPU Usage % (vs Total)** | Percentage of total fleet CPU capacity currently consumed. | Percentage of the specific cluster's CPU capacity currently consumed. | Measures aggregate node saturation risk across the entire estate. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Memory Usage % (vs Total)** | Percentage of total fleet memory capacity currently consumed. | Percentage of the specific cluster's memory capacity currently consumed. | High values indicate risk of aggregate eviction storms across clusters. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Alerts** | Count of **active high-severity alerts** across the fleet. | Count of **active high-severity alerts** for this specific cluster. | Tracks active incidents that require operator intervention. | **Fleet:** Green: 0; Yellow: 1 to 10; Red: ≥ 11. **Cluster:** Yellow: 1 to 5; Red: ≥ 6. |
| **Warning Events** | Volume of **Warning type K8s cluster events** across the fleet. | Volume of **Warning type K8s cluster events** in this cluster. | High noise levels obscure critical alerts and indicate config debt. | **Fleet:** Green: 0; Yellow: 1 to 50; Red: > 50. **Cluster:** Yellow: 1 to 20; Red: ≥ 21. |

</details>
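The Unhealthy Nodes and Unhealthy Pods figures above are straightforward ratios of bad-state entities to total entities. The sketch below is hypothetical arithmetic for illustration only, assuming you already have status counts from your own tooling; it is not how New Relic computes the values.

```python
def unhealthy_pct(unhealthy_count: int, total: int) -> float:
    """Share of entities in a bad state, as a percentage (empty fleet -> 0.0)."""
    return 100.0 * unhealthy_count / total if total else 0.0

# 3 of 120 nodes are NotReady/Unknown -> 2.5%, inside the 1% to 5% Yellow band
print(round(unhealthy_pct(3, 120), 1))  # 2.5
```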

---

### 2. Health Tab

The **Health** tab focuses on metrics related to the **availability and current status** of your nodes and pods to detect immediate failure risks.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **Nodes Total** | Total count of nodes detected in the fleet. | Count of nodes in the specific cluster. | Zero indicates fundamental control plane failure. | Red: 0. |
| **Nodes Ready** | Percentage of nodes in the fleet reporting **Ready** status. | Percentage of nodes in the cluster reporting **Ready** status. | Availability metric; values below 95% signal severe fleet instability. | **Fleet:** Green: 100%; Yellow: 95% to 99%; Red: < 95%. **Cluster:** Yellow: 99%; Red: < 99%. |
| **Nodes Memory Pressure** | Percentage of nodes in the fleet rejecting pods due to **low memory**. | Percentage of nodes in the cluster rejecting pods due to **low memory**. | Predicts widespread eviction storms. | **Fleet:** Green: 0%; Yellow: > 0% to 5%; Red: ≥ 6%. **Cluster:** Yellow: > 0% to < 1%; Red: ≥ 1%. |
| **Nodes Disk Pressure** | Percentage of nodes in the fleet with **low disk availability**. | Percentage of nodes in the cluster with **low disk availability**. | Risk of widespread node failures due to disk exhaustion. | **Fleet:** Green: 0%; Yellow: > 0% to 5%; Red: ≥ 6%. **Cluster:** Yellow: > 0% to < 1%; Red: ≥ 1%. |
| **Pods Running** | Percentage of fleet pods successfully in the **Running** phase. | Percentage of cluster pods successfully in the **Running** phase. | Operational success rate; < 90% indicates widespread failure. | **Fleet:** Green: 95% to 100%; Yellow: 90% to 94%; Red: < 90%. **Cluster:** Yellow: 95% to 99%; Red: < 95%. |
| **Pods Pending** | Sustained count of pods **waiting to be scheduled** across the fleet. | Sustained count of pods **waiting to be scheduled**. | Indicates scheduler failure or capacity starvation. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Cluster:** Yellow: 1; Red: ≥ 2. |
| **Container Restarts** | Count of restarts **per period** across the fleet. | Count of restarts **per period**. | Symptom of CrashLoopBackOff or OOMKills. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Cluster:** Yellow: 1; Red: ≥ 2. |
| **Network Errors / sec** | Rate of **packet drops/errors** across the fleet. | Rate of **packet drops/errors**. | Symptom of hardware failure or systemic CNI issues. | **Fleet:** Green: < 10; Yellow: 11 to 50; Red: > 50. **Cluster:** Yellow: 5 to 10; Red: ≥ 11. |

</details>
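The node-level signals in this tab (Ready, memory pressure, disk pressure) correspond to standard Kubernetes node conditions. A minimal sketch of tallying those conditions, assuming a simplified node representation rather than the real Kubernetes API object shape:

```python
from collections import Counter

def tally_conditions(nodes):
    """Count how many nodes report each condition as True (Ready, MemoryPressure, ...)."""
    counts = Counter()
    for node in nodes:
        for condition, is_true in node["conditions"].items():
            if is_true:
                counts[condition] += 1
    return counts

nodes = [
    {"name": "n1", "conditions": {"Ready": True, "MemoryPressure": False}},
    {"name": "n2", "conditions": {"Ready": False, "MemoryPressure": True}},
]
counts = tally_conditions(nodes)  # one Ready node, one node under memory pressure
print(counts["Ready"], counts["MemoryPressure"])  # 1 1
```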

---

### 3. Performance Tab

The **Performance** tab measures **utilization against total capacity and limits** to identify node saturation and risk of performance degradation.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **CPU Usage % vs Total** | % of fleet **CPU capacity** used. | % of cluster/node **CPU capacity** used. | Measures saturation risk across the entire estate. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **CPU Usage % vs Limits** | % of **CPU hard limits** consumed across the fleet. | % of **CPU hard limits** consumed in the cluster. | Measures proximity to throttling. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Memory Usage % vs Total** | % of fleet **memory capacity** used. | % of cluster/node **memory capacity** used. | Proximity to eviction thresholds. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Memory Usage % vs Limits** | % of **RAM hard limits** consumed across the fleet. | % of **RAM hard limits** consumed. | Proximity to OOMKill events. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **FS Usage %** | % of **disk capacity** consumed across the fleet. | % of **disk capacity** consumed. | Critical to prevent kubelet crashes. | Yellow: 80% to 95%; Red: > 95%. |
| **Pods Capacity %** | % of **max allowed pod count** scheduled across the fleet. | % of **max allowed pod count** scheduled. | Measures resource density and slot exhaustion. | **Fleet:** Green: < 85%; Yellow: 85% to 100%; Red: 100%. **Cluster:** Yellow: 85% to 100%; Red: 100%. |

</details>
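The "vs Limits" metrics above express current usage as a share of the configured hard limit. A hedged sketch of that arithmetic, with hypothetical core counts chosen for illustration:

```python
def usage_vs_limits_pct(usage_cores: float, limit_cores: float) -> float:
    """CPU used as a percentage of the configured hard limit (no limit -> 0.0)."""
    return 100.0 * usage_cores / limit_cores if limit_cores else 0.0

# 1.5 cores used against a 2-core limit -> 75.0%, the start of the Yellow band
print(usage_vs_limits_pct(1.5, 2.0))  # 75.0
```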

---

### 4. Resources Tab

The **Resources** tab analyzes **Request vs. Usage and Request vs. Limits** to assess resource allocation efficiency and identify waste or sizing inaccuracies.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **CPU Usage % vs Request** | % of **CPU usage vs requests** across the fleet. | % of **CPU usage vs requests**. | Governance metric for sizing accuracy and waste. | Yellow: < 70% or > 200%. |
| **CPU Request % vs Limits** | % of **CPU request vs limit** across the fleet. | % of **CPU request vs limit**. | Measures performance headroom and burst capacity. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Memory Usage % vs Request** | % of **memory usage vs requests** across the fleet. | % of **memory usage vs requests**. | Governance metric for memory sizing accuracy. | Yellow: < 70% or > 200%. |
| **Memory Request % vs Limits** | % of **limit reserved by requests** across the fleet. | % of **limit reserved by requests**. | Helps assess resource allocation efficiency and potential overprovisioning. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Pod Evictions** | Count of **forced pod terminations by the kubelet** in the fleet. | Count of **forced pod terminations by the kubelet**. | Symptom of node resource starvation. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Cluster:** Yellow: 1; Red: ≥ 2. |

</details>
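The "Usage % vs Request" thresholds above flag both under-use (below 70% of the request, suggesting overprovisioning) and over-use (above 200%, suggesting undersized requests). An illustrative sketch of that band check, not New Relic's implementation:

```python
def sizing_flag(usage: float, request: float) -> str:
    """Flag over- or under-provisioning relative to the resource request."""
    if not request:
        return "No request set"
    pct = 100.0 * usage / request
    return "Yellow" if pct < 70 or pct > 200 else "Green"

print(sizing_flag(usage=0.3, request=1.0))  # Yellow (30% used: overprovisioned)
print(sizing_flag(usage=0.9, request=1.0))  # Green
print(sizing_flag(usage=2.5, request=1.0))  # Yellow (250%: request undersized)
```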

---

### 5. Workloads Tab

The **Workloads** tab provides an application-centric view, focusing on metrics such as **throttling, restarts, and missing replicas** to diagnose application instability and poor QoS.

<details>
<summary>Metrics</summary>


| Metric Name | Scorecard Explanation | Metric Explanation (Cluster/Node) | Importance | Thresholds |
| :--- | :--- | :--- | :--- | :--- |
| **CPU Usage % vs Limits** | % of **CPU hard limits** consumed by workloads. | % of **CPU hard limits** consumed by the specific workload. | Measures performance cap proximity. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **CPU Throttling %** | Ratio of **throttled time to active time** across the fleet. | Ratio of **throttled time to active time** for the workload. | Measures latency/lag caused by CFS quotas. | **Fleet:** Green: < 10%; Yellow: 10% to 50%; Red: > 50%. **Workload:** Yellow: 10% to 50%; Red: > 50%. |
| **Memory Usage % vs Limits** | % of **memory limits** consumed across workloads. | % of **memory limits** consumed by the specific workload. | Proximity to OOMKill. | Yellow: 75% to 95%; Red: 95% to 100%. |
| **Network Errors / sec** | Error rate on **network interfaces** across the fleet. | Error rate on **network interfaces** for the workload. | Signals application failure or connectivity loss. | **Fleet:** Green: near 0; Yellow: ≤ 10; Red: ≥ 51. **Workload:** Green: ≤ 2; Yellow: 3 to 15; Red: > 15. |
| **Pods Total** | Total count of pods in fleet workloads. | Total count of pods for the specific workload. | Baseline metric for capacity. | Red: 0. |
| **Pods Missing** | Gap between **desired and ready replicas** in the fleet. | Gap between **desired and ready replicas** for the workload. | Indicates service degradation. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Workload:** Yellow: 1; Red: ≥ 2. |
| **Pod Evictions** | Count of **evictions** for workloads in the fleet. | Count of **evictions** for the specific workload. | The workload is causing node pressure or is misconfigured. | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Workload:** Yellow: 1; Red: ≥ 2. |
| **Container Restarts** | Count of **restarts** for workloads in the fleet. | Count of **restarts** for the specific workload. | Persistent failure state (CrashLoopBackOff). | **Fleet:** Green: 0; Yellow: 1 to 5; Red: ≥ 6. **Workload:** Yellow: 1; Red: ≥ 2. |
| **Container Images Restarting** | Restarts caused by **image/config errors** in the fleet. | Restarts caused by **image/config errors** for the workload. | Specific symptom of registry failure or bad tags. | **Fleet:** Green: 0; Yellow: 1; Red: ≥ 2. **Workload:** Yellow: 1; Red: ≥ 2. |

</details>
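The Pods Missing metric above is the gap between a workload's desired replica count and its ready replicas. A minimal sketch of that computation, with hypothetical replica counts:

```python
def pods_missing(desired: int, ready: int) -> int:
    """Gap between desired and ready replicas for a workload (never negative)."""
    return max(desired - ready, 0)

# A Deployment wants 5 replicas but only 3 are ready -> 2 missing,
# which crosses the workload-level Red threshold (>= 2)
print(pods_missing(desired=5, ready=3))  # 2
```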