K8 Cluster Entities Dashboard¶

Overview¶

The K8 Cluster Entities dashboard offers detailed visibility into resource utilization across nodes, pods, and microservices (containers) within a DataOS-managed Kubernetes cluster using Prometheus as the source. It extends beyond high-level dashboards by enabling precise analysis of performance issues, workload planning, and resource efficiency for each entity.

Supporting both real-time monitoring and historical trend analysis, the dashboard is essential for capacity planning, troubleshooting, and optimization. Users can examine metrics from the cluster level down to individual node performance, helping catch bottlenecks and resource constraints before they affect service availability. This detailed insight enables data-driven decisions about scaling, resource allocation, and workload placement.

The dashboard is organized into three key sections:

Node Resource Overview
Pod Resource Overview
Microservices (Container Name) Resources Overview

Node overview¶

This section shows per-node statistics including pod limits, active pods, resource usage percentages (CPU, memory, disk), and absolute values for CPU cores and memory. This overview provides a quick assessment of node health and capacity utilization, enabling proactive resource management and scaling decisions. The panel displays current values and trends that help identify potential bottlenecks or performance issues before they impact service availability.

Node details¶

This panel offers a per-node breakdown of CPU, memory, disk, and pod utilization metrics across three nodes in the sentinel namespace and pro-alien.dataos.app environment.

Each row represents a single node and exposes critical information used to evaluate scheduling pressure, resource overcommitment, and provisioning behavior.

Below are the color ranges based on the monitoring thresholds:

Range (%)	Representative Color	Hex / RGB	Color Name
0–10%	Dark green	`rgb(41, 156, 70)` ≈ `#299C46`	Dark Green
11–30%	Light green	`rgb(126, 191, 80)` → `#7EBF50` and `#75AD3F`	Light Green
31–60%	Yellow	`rgb(242, 204, 12)` ≈ `#F2CC0C`	Yellow
61–80%	Orange	`rgb(255, 165, 0)` ≈ `#FFA500`	Orange
81–100%	Red	`rgb(212, 74, 58)` ≈ `#D44A3A`	Red

💡 In this dashboard, percentage-based metrics such as CPU Usage, Memory Usage, and CPU Limit are rendered using a gradient background color, specifically the `Green-Yellow-Red (by value)` scheme. The color is not applied in discrete bands, but rather through continuous interpolation, meaning each percentage value gets a distinct shade depending on its exact numeric position. For example, `1%`, `5%`, and `10%` CPU usage each show visibly different shades of green, and similarly, `41%`, `45%`, and `55%` memory usage display incremental shades from yellow to orange. This is because the background gradient is dynamically applied across the entire value range present in the table column, not fixed at global thresholds like 0%, 50%, 100%. As a result, visual interpretation must account for this fine-grained gradient scale, where color changes can represent even small metric variations.

Node¶

The Node column identifies the physical or virtual machine registered with the Kubernetes control plane. Each node hosts a kubelet agent and acts as the execution environment for pods. Node names are autogenerated by the infrastructure provider (e.g., AKS). From an operational perspective, this field is essential for tracing resource allocation, debugging workload behavior, and enforcing placement constraints like affinity/anti-affinity or taints.

Query:

kube_node_status_condition{origin_prometheus=~".*",status="true",node=~"^.*$"} == 1

Explanation:

kube_node_status_condition: Reports health conditions of nodes (e.g., Ready, DiskPressure).
origin_prometheus=~".*": Includes all Prometheus origins.
status="true": Filters only conditions marked as true.
node=~"^.*$": Matches all nodes using regex.
== 1: Confirms the node condition is truly healthy.

Purpose: Ensures that only healthy (e.g., Ready) nodes are included in the analysis.

Pod Limit¶

Pod Limit represents the maximum number of pods a node can support. This limit is imposed by the kubelet and is influenced by the --max-pods flag and the network plugin in use, as certain CNIs assign a fixed number of IP addresses per node. This value defines the scheduling ceiling for the node: even if CPU and memory are available, no new pods will be placed once this threshold is reached. Monitoring this helps anticipate scheduling failures and drive node scaling decisions.

Let's break down how pod limits work in simple terms:

A pod limit represents the maximum number of containers a node can safely manage. Think of it like determining how many boxes can fit in a room while ensuring:

Adequate space for movement (network addresses)
Structural support (memory capacity)
Proper air circulation (CPU resources)

The system automatically determines this limit by evaluating:

The node's default maximum (typically 110 pods)
Available network addresses (ensuring each pod has a unique address)
The node's memory capacity (confirming sufficient CPU and memory for all pods)

The system uses the lowest of these values as the final limit. For instance, if a node could theoretically support 110 pods but only has 100 network addresses available, the limit becomes 100. This approach prevents node overload and maintains smooth pod operation, similar to avoiding room overcrowding.

The pod limit is calculated by the following Prometheus query:

count(kube_pod_info{origin_prometheus=~"",created_by_kind!~"<none>|Job",node=~"^.*$"}) by (node)

Components:

kube_pod_info: A metric that provides basic information about pods⁠⁠

origin_prometheus=~"": Filters metrics based on their Prometheus origin created_by_kind!~"|Job": Excludes pods that have no creator or were created by Jobs

node=~"^.*$": Matches all nodes using a regex pattern

count(...) by (node): Counts the number of pods and groups results by node

Number of Pods¶

This column reflects the current number of pods actively running on the node. This includes all system and workload pods that have passed the scheduling phase and are bound to the node. Comparing this value against the Pod Limit helps determine whether the node is approaching saturation in terms of pod density. High pod counts may also signal a need to scale horizontally, especially when service quality degrades despite low resource usage.

PromQL Query:

count(kube_pod_info{created_by_kind!~"<none>|Job", node=~"^.*$"}) by (node)

Let's break down each component of this Prometheus query:

count(...): Counts the total number of time series that match the conditions inside the parentheses
kube_pod_info: A metric that provides basic information about pods in the Kubernetes cluster
created_by_kind!~"|Job": Excludes pods that either have no creator or were created by Jobs
node=~"^.*$": A regex pattern that matches all node names
by (node): Groups the count results by node name, giving us a per-node pod count

This query is used to determine the current number of actively running pods on each node, excluding temporary job pods

CPU Usage %¶

CPU Usage% shows the real-time percentage of CPU consumed by all workloads and system processes on the node, relative to the node’s allocatable CPU capacity. Unlike requests or limits, this reflects actual compute activity, making it a direct indicator of performance pressure. Persistent high CPU usage may result in throttling, reduced throughput, or pod eviction in overcommitted environments.

PromQL Query:

sum(irate(container_cpu_usage_seconds_total{container!="", node=~"^.*$"}[2m])) by (node)
/
sum(kube_node_status_allocatable{resource="cpu", unit="core", node=~"^.*$"}) by (node)

Explanation:

container_cpu_usage_seconds_total: Tracks CPU time consumed by containers.
irate(...[2m]): Calculates per-second CPU usage over the last 2 minutes.
container!="": Filters out infra/hidden containers.
sum(... by node): Aggregates CPU usage per node.

Purpose: Shows real-time CPU consumption across all user workloads.

Memory Usage %¶

Memory Usage% reflects the actual memory consumption on the node relative to its allocatable memory. It includes memory used by containers, the kubelet, and system daemons, but excludes buffers and cache if not adjusted. Unlike CPU, memory cannot be throttled; sustained high usage can lead to kernel-level out-of-memory (OOM) kills. This metric is critical for preemptive capacity planning and early detection of memory leaks or misbehaving pods.

PromQL Query:

sum(container_memory_working_set_bytes{container!="", node=~"^.*$"}) by (node)
/
sum(kube_node_status_allocatable{resource="memory", unit="byte", node=~"^.*$"}) by (node)

Let's break down this Prometheus query that calculates memory usage percentage:

Numerator:

sum(container_memory_working_set_bytes{container!="", node=~"^.*$"}) by (node)
- Measures actual memory usage in bytes for all containers
- container!="" filters out non-container memory usage
- node=~"^.*$" matches all nodes using regex
- sum(...) by (node) aggregates memory usage per node

Denominator:

sum(kube_node_status_allocatable{resource="memory", unit="byte", node=~"^.*$"}) by (node)
- Represents the total allocatable memory on each node
- Excludes memory reserved for system operations

The division operation produces a ratio showing what percentage of each node's available memory is currently being used

Disk Usage %¶

This metric captures the percentage of local disk space utilized on the node. It typically includes container images, ephemeral storage volumes (emptyDir), logs, and kubelet data. While it does not include persistent volumes (like PVCs), this metric is vital for ensuring that root volumes and node-local storage do not become bottlenecks. Exceeding safe thresholds can cause pod failures, logging issues, and eventually node taints due to disk pressure.

PromQL Query:

sum(container_fs_usage_bytes{device=~"^/dev/.*$", id="/", node=~"^.*$"}) by (node)
/
sum(container_fs_limit_bytes{device=~"^/dev/.*$", id="/", node=~"^.*$"}) by (node)

Let's break down the components of this Prometheus query that calculates disk usage percentage:

Numerator:

sum(container_fs_usage_bytes{device=~"^/dev/.*$", id="/", node=~"^.*$"}) by (node)
- Measures actual filesystem usage in bytes
- device=~"^/dev/.*$" filters for physical devices
- id="/" specifies the root filesystem
- Group results by node name

Denominator:

sum(container_fs_limit_bytes{device=~"^/dev/.*$", id="/", node=~"^.*$"}) by (node)
- Measures total filesystem capacity in bytes
- Uses the same filtering criteria as the numerator
- Division produces a percentage of disk space used per node

CPU Total Cores¶

CPU Total Cores represents the number of logical CPU cores allocatable to pods on the node. This is the denominator used to calculate CPU-related metrics such as usage, requests, and limits. It excludes cores reserved for the operating system and is a fixed property of the node’s instance type or physical configuration.

PromQL Query:

kube_node_status_allocatable{resource="cpu", unit="core", node=~"^.*$"}

Let's break down each component of this Prometheus query:

kube_node_status_allocatable: A metric that shows the resources available for scheduling on each node
resource="cpu": Specifies that we're looking at CPU resources specifically
unit="core": Indicates that the measurement is in CPU cores
node=~"^.*$": A regex pattern that matches all node names, effectively selecting all nodes in the cluster

This query returns the total number of CPU cores that are available for pod scheduling on each node, excluding any resources reserved for system processes.

CPU Core Usage¶

This value shows the actual number of CPU cores currently in use on the node, as a raw figure rather than a percentage. It enables precise cross-node comparisons, especially in heterogeneous environments where some nodes have more physical resources than others. This is particularly useful for performance profiling or validating load distribution.

PromQL Query:

sum(irate(container_cpu_usage_seconds_total{container!="", node=~"^.*$"}[2m])) by (node)

Let's break down each component of this Prometheus query:

sum(...) by (node): Aggregates the CPU usage values and groups them by node
irate(...[2m]): Calculates the per-second rate of CPU usage over the last 2-minute interval
container_cpu_usage_seconds_total: The metric that tracks total CPU time consumed by containers
container!="": Filters out infrastructure or hidden containers
node=~"^.*$": A regex pattern that matches all node names in the cluster

This query is designed to show real-time CPU consumption across all user workloads

Total Memory¶

Total Memory indicates the total allocatable memory on the node, excluding the memory reserved for the operating system and system daemons. This forms the basis for calculating memory requests, limits, and usage percentages. It is a static property determined by the instance type or physical configuration of the node.

PromQL Query:

kube_node_status_allocatable{resource="memory", unit="byte", node=~"^.*$"}

Let's break down each component of this Prometheus query:

kube_node_status_allocatable: A metric that shows the allocatable resources on each node
resource="memory": Specifies that we're querying for memory resources specifically
unit="byte": Indicates that the measurement is in bytenode=~"^.*$": A regex pattern that matches all node names in the cluster

This query returns the total amount of memory that is available for pod scheduling on each node, excluding any resources reserved for system processes

Memory Usage¶

This column shows the absolute amount of memory (in MiB or GiB) currently being used on the node. It includes both workload and system memory consumption and provides a more interpretable value than the corresponding percentage column. This is critical for memory-constrained environments and for validating if workloads are adhering to their assigned resource budgets.

PromQL Query:

sum(container_memory_working_set_bytes{container!="", node=~"^.*$"}) by (node)

Let's break down each component of this memory usage query:

sum(...) by (node): Aggregates memory usage values and groups results by node name
container_memory_working_set_bytes: Measures actual memory usage in bytes for all containers
container!="": Filters out non-container memory usage
node=~"^.*$": A regex pattern that matches all node names in the cluster

This query is designed to show the actual amount of memory being used by containers across all nodes

Disk Total¶

This value reflects the total available node-local storage capacity on the node. It includes the filesystem volume used for container storage, image layers, and logs, but does not include network-attached or persistent volumes provisioned through PVCs. It is used in conjunction with disk usage to monitor local storage health.

PromQL Query:

sum(container_fs_limit_bytes{device=~"^/dev/.*$", id="/", node=~"^.*$"}) by (node)

Let's break down each component of this Prometheus query:

sum(...) by (node): Aggregates filesystem limit values and groups results by node name
container_fs_limit_bytes: Measures the total filesystem capacity in bytes
device=~"^/dev/.*$": A filter that matches only physical devices
id="/": Specifies that we're looking at the root filesystem
node=~"^.*$": A regex pattern that matches all node names, effectively selecting all nodes in the cluster

Disk Usage¶

Disk Usage shows the absolute amount of node-local disk consumed, usually measured in GiB. This metric is important for alerting on log saturation, image bloat, or dangling volumes. Since Kubernetes will taint a node under disk pressure, persistent monitoring of this value is necessary for maintaining pod availability and scheduling stability.

PromQL Query:

sum(container_fs_usage_bytes{device=~"^/dev/.*$", id="/", node=~"^.*$"}) by (node)

Let's break down each component of this Prometheus query:

sum(...) by (node): Aggregates filesystem usage values and groups results by node name
container_fs_usage_bytes: Measures the actual filesystem usage in bytes for containers
device=~"^/dev/.*$": A filter that matches only physical devices
id="/": Specifies that we're looking at the root filesystem
node=~"^.*$": A regex pattern that matches all node names in the cluster

This query is designed to show the actual amount of disk space being used across all nodes

CPU Requests %¶

This column shows the total amount of CPU requested by all pods on the node as a percentage of the node’s allocatable CPU capacity. Requests are soft guarantees and are used by the Kubernetes scheduler to determine where pods should be placed. Values exceeding 100% indicate that more CPU has been requested than the node can technically provide, which can lead to CPU contention during peak usage.

PromQL Query:

sum(kube_pod_container_resource_requests{resource="cpu", unit="core", node=~"^.*$"}) by (node)
/
sum(kube_node_status_allocatable{resource="cpu", unit="core", node=~"^.*$"}) by (node)

Let's break down this PromQL query that calculates CPU requests as a percentage.

Numerator:

sum(kube_pod_container_resource_requests{resource="cpu", unit="core", node=~"^.*$"}) by (node)
- Sums up all CPU resource requests for containers on each node
- resource="cpu" specifies we're looking at CPU requests
- unit="core" indicates measurement in CPU cores
- Groups results by node name

Denominator:

sum(kube_node_status_allocatable{resource="cpu", unit="core", node=~"^.*$"}) by (node)
- Measures the total allocatable CPU cores on each node
- Excludes cores reserved for system processes
- Uses the same grouping by node

The division operation produces a ratio showing what percentage of each node's available CPU capacity has been requested by pods

Memory Requests %¶

Memory Requests% represents the sum of all memory requests made by pods on the node, shown as a percentage of the node’s allocatable memory. Unlike limits, requests are used purely for scheduling. If this value is high, it reduces the node’s ability to accept additional pods, even if actual memory usage is low. Persistent over-requesting leads to underutilization, whereas under-requesting can lead to OOM events during bursts.

PromQL Query:

sum(kube_pod_container_resource_requests{resource="memory", unit="byte", node=~"^.*$"}) by (node)
/
sum(kube_node_status_allocatable{resource="memory", unit="byte", node=~"^.*$"}) by (node)

Let's break down each component of this PromQL query:

Numerator:
- sum(...) by (node) - Aggregates memory requests for each node
- kube_pod_container_resource_requests - Metric that tracks resource requests for containers
- resource="memory" - Specifies we're looking at memory requests
- unit="byte" - Indicates measurement in bytes
- node=~"^.*$" - Matches all node names
Denominator:
- sum(...) by (node) - Aggregates allocatable memory for each node
- kube_node_status_allocatable - Metric showing available resources on nodes
- Same filters applied as the numerator for consistency

The division of these components produces a ratio showing what percentage of each node's available memory has been requested by pods.

CPU Limit %¶

CPU Limit% shows the total amount of CPU that pods on the node are allowed to use, relative to the node’s allocatable capacity. These limits are enforced by the container runtime (using Linux cgroups) and define the maximum CPU a pod can consume. If limits are excessively high, it may result in unfair CPU scheduling or runtime contention, especially under load.

PromQL Query:

sum(kube_pod_container_resource_limits{resource="cpu", unit="core", node=~"^.*$"}) by (node)
/
sum(kube_node_status_allocatable{resource="cpu", unit="core", node=~"^.*$"}) by (node)

Let's break down each component of this PromQL query:

Numerator:

sum(...) by (node) - Aggregates CPU limits across all containers per node
kube_pod_container_resource_limits - Metric that tracks resource limits set for containers
resource="cpu" - Specifies we're looking at CPU limits
unit="core" - Indicates measurement in CPU cores
node=~"^.*$" - Matches all node names in the cluster

Denominator:

sum(...) by (node) - Aggregates total allocatable CPU per node
kube_node_status_allocatable - Metric showing available CPU resources on nodes
Same filters applied as the numerator for consistency

The division of these components produces a ratio showing what percentage of each node's available CPU has been set as limits for pods.

Memory Limit %¶

Memory Limit% shows the total capped memory usage allowed by all pods on the node as a percentage of allocatable memory. Memory limits are enforced hard—if a pod exceeds its memory limit, it is terminated. Monitoring this ensures pods are neither overconstrained nor permitted to consume excessive memory, which could starve other workloads.

sum(kube_pod_container_resource_limits{resource="memory",unit="byte",node=~"^.*$"}) by (node) / sum(kube_node_status_allocatable{resource="memory",unit="byte",node=~"^.*$"}) by (node)

Let's break down the components of this PromQL query that calculates memory limits as a percentage:

Numerator:

sum(...) by (node) - Aggregates memory limits for all containers on each node
kube_pod_container_resource_limits - Metric that tracks resource limits set for containers
resource="memory" - Specifies we're looking at memory limits
unit="byte" - Indicates measurement in bytes

Denominator:

kube_node_status_allocatable - Metric showing available memory resources on nodes
The same filters are applied to the numerator for consistency

The division of these components produces a ratio showing what percentage of each node's available memory has been set as limits for pods.

CPU Request & CPU Limit (Cores)¶

These fields provide the actual total CPU requests and limits on the node, measured in cores (e.g., 2.61, 30.8). Unlike percentages, core values offer more intuitive insight into absolute resource allocation. This is especially useful when nodes have varying core counts or when aligning provisioning with known workload benchmarks.

sum(kube_pod_container_resource_requests{origin_prometheus=~"",resource="cpu", unit="core",node=~"^.*$"}) by (node)

Let's break down each component of this PromQL query:

sum(...) by (node): Aggregates the CPU resource requests and groups the results by node name.
kube_pod_container_resource_requests: This metric tracks the resource requests set for containers.
resource="cpu": Specifies that we're looking at CPU requests.
unit="core": Indicates that the measurement is in CPU cores.
node=~"^.*$": A regex pattern that matches all node names in the cluster, effectively selecting all nodes.

sum(kube_pod_container_resource_limits{origin_prometheus=~"",resource="cpu", unit="core",node=~"^.*$"}) by (node)

Let's break down each component of this PromQL query:

sum(...) by (node): Aggregates CPU resource limits and groups the results by node name
kube_pod_container_resource_limits: Metric that tracks the resource limits set for containers
resource="cpu": Specifies that we're looking at CPU limits
unit="core": Indicates that the measurement is in CPU cores
node=~"^.*$": A regex pattern that matches all node names in the cluster, effectively selecting all nodes

Memory Requests & Memory Limits¶

These represent the total requested and capped memory on the node, shown in absolute units (e.g., GiB). This allows infrastructure operators to compare node memory allocation against instance type capacity and verify whether workloads are efficiently sized or need tuning.

PromQL Query:

sum(kube_pod_container_resource_limits{resource="memory", unit="byte", node=~"^.*$"}) by (node)

Let's analyze each part of this PromQL query:

sum(...) by (node) - Aggregates memory limits across all containers and groups results by node name
kube_pod_container_resource_limits - The metric that tracks resource limits configured for containers
resource="memory" - Filter to only look at memory-related limits
unit="byte" - Specifies that measurements are in bytes
node=~"^.*$" - A regex pattern that matches all node names in the cluster

PromQL Query:

sum(kube_pod_container_resource_requests{resource="memory", unit="byte", node=~"^.*$"}) by (node)

Let's break down each component of this PromQL query:

sum(...) by (node): Aggregates memory resource requests and groups results by node name
kube_pod_container_resource_requests: The metric that tracks container resource requests
resource="memory": Filters to only look at memory-related requests
unit="byte": Specifies that measurements are in bytes
node=~"^.*$": A regex pattern that matches all node names in the cluster

Env and Namespace¶

These final columns contextualize node usage by environment (env) and dominant workload ownership (namespace). In multi-tenant or multi-env clusters, this provides visibility into which workloads are consuming resources and allows teams to attribute costs or investigate tenant-specific behaviors. env and namespace are pod labels already present on the kube-state-metrics series for which no separate PromQL query is needed.

Node Memory Ratio¶

This panel summarizes memory usage across all nodes as percentages of the total allocatable memory. It shows three distinct metrics:

Memory Utilization reflects the actual memory in use (15.4% in this case), measured by aggregating memory used by all running containers and system components.
Memory Requests (65.9%) represents the amount of memory requested by all scheduled pods. Kubernetes guarantees this memory availability during scheduling.
Memory Limits (69.1%) indicates the upper bound on memory usage across all pods. If a pod exceeds its limit, it will be terminated by the container runtime.

These values help assess memory efficiency. A large delta between utilization and requests often points to overprovisioning, while a narrow margin between usage and limits could signal risk of memory pressure and OOM kills. Here's a description of each PromQL queries that are used in the Node Memory Ratio panel:

sum(container_memory_working_set_bytes{origin_prometheus=~"",container!="",node=~"^.*$"}) / sum(kube_node_status_allocatable{origin_prometheus=~"",resource="memory", unit="byte", node=~"^.*$"})

This query calculates the actual memory utilization percentage across all nodes. It divides the sum of all container memory working set bytes (actual memory in use) by the total allocatable memory across all nodes. This shows what percentage of available memory is actively being consumed by running containers.

sum(kube_pod_container_resource_requests{origin_prometheus=~"",resource="memory", unit="byte",node=~"^.*$"}) / sum(kube_node_status_allocatable{origin_prometheus=~"",resource="memory", unit="byte", node=~"^.*$"})

This query calculates the memory requests percentage. It sums up all memory requests made by pods across all nodes and divides by the total allocatable memory. This represents what percentage of the cluster's memory has been reserved for guarantees during scheduling.

sum(kube_pod_container_resource_limits{origin_prometheus=~"",resource="memory", unit="byte",node=~"^.*$"}) / sum(kube_node_status_allocatable{origin_prometheus=~"",resource="memory", unit="byte", node=~"^.*$"})

This query calculates the memory limits percentage. It divides the sum of all memory limits defined across pods by the total allocatable memory across all nodes. This shows the upper bound of memory usage that pods are allowed to reach before being terminated.

Node CPU cores¶

This widget shows aggregated CPU metrics across the cluster in core units:

Total Cores refers to the sum of allocatable cores across all nodes (94.1 cores total).
Usage shows the current total CPU actively consumed (6.9 cores), which is a real-time snapshot of workload processing demand.
Requests (167.6 cores) and Limits (232.2 cores) indicate what pods have reserved and are allowed to consume, respectively.

These numbers are critical in clusters with mixed node types. If requests or limits exceed total cores (as shown here), the cluster is overcommitted, which may cause CPU throttling or degraded performance under peak load.

sum(kube_node_status_allocatable{origin_prometheus=~"",resource="cpu", unit="core", node=~"^.*$"})

This PromQL query calculates the total CPU capacity available across all nodes in a Kubernetes cluster. It sums up the allocatable CPU cores from all nodes by using the kube_node_status_allocatable metric, filtering specifically for CPU resources measured in core units. The query applies regex pattern matching to include all nodes in the cluster, providing a comprehensive view of the total CPU resources that can be allocated to workloads.

Node storage¶

This panel outlines the state of node-local ephemeral storage:

Utilization Rate indicates the real-time I/O rate or data inflow. At 0.1 B/s, it shows minimal write activity across nodes.
Usage shows current disk space consumed (517.7 GiB), including logs, container images, and temporary files.
Total is the aggregated storage capacity across nodes (4 TiB).

Storage usage must be monitored to prevent node taints due to disk pressure. A high utilization rate or increasing usage trend can degrade scheduling and container startup times.

sum (container_fs_usage_bytes{origin_prometheus=~"",device=~"^/dev/.*$",id="/",node=~"^.*$"}) / sum (container_fs_limit_bytes{origin_prometheus=~"",device=~"^/dev/.*$",id="/",node=~"^.*$"})

This PromQL query calculates the overall disk usage ratio across all nodes in a Kubernetes cluster. It divides the sum of all container filesystem usage bytes by the sum of all filesystem limits across nodes. The query filters for actual block devices (matching /dev/.* pattern), targets the root filesystem (id="/"), and includes all nodes. This provides a single percentage value representing cluster-wide storage utilization, which is displayed as 0.1 B/s in the Node Storage Information panel.

Node memory¶

This panel gives absolute memory metrics:

Total Memory shows the aggregate allocatable memory across all nodes (528.1 GiB).
Usage is the actual used memory (81.6 GiB).
Requests (347.9 GiB) and Limits (365.1 GiB) represent the total memory reservation and cap values defined in pod specs.

These figures help quantify the gap between requested vs. used memory. Significant over-requesting (as seen here) may lead to wasted resources, whereas narrow margins require careful monitoring to avoid instability during workload bursts.

sum(kube_node_status_allocatable{origin_prometheus=~"",resource="memory", unit="byte", node=~"^.*$"})

This PromQL query calculates the total allocatable memory across all nodes in a Kubernetes cluster. It sums up the memory capacity reported by the kube_node_status_allocatable metric, specifically filtering for memory resources measured in bytes. The query uses a regex pattern (node=~"^.*$") to match all nodes in the cluster, providing a comprehensive view of the total memory that can be allocated to workloads. This metric is essential for capacity planning and utilization analysis, showing the baseline against which memory requests and usage should be compared.

Node CPU Ratio¶

This widget provides percentage-based CPU indicators:

CPU Utilization (6.2%) shows how much of the total CPU is being consumed in real-time.
CPU Requests (178.0%) and CPU Limits (246.6%) illustrate how much CPU has been reserved and capped for pods relative to allocatable capacity.

Both request and limit values exceeding 100% confirm CPU overcommitment. While common in containerized environments, overcommitment must be balanced carefully. Excessive values risk pod throttling or runtime contention when multiple high-demand pods run simultaneously.

sum (irate(container_cpu_usage_seconds_total{origin_prometheus=~"",container!="",node=~"^.*$"}[2m])) / sum(kube_node_status_allocatable{origin_prometheus=~"",resource="cpu", unit="core", node=~"^.*$"})

This PromQL query calculates the overall CPU utilization ratio across all nodes in a Kubernetes cluster. It divides the sum of CPU usage rate (measured over a 2-minute window using the irate function) by the total allocatable CPU cores across all nodes. The query specifically excludes system containers and considers all nodes in the cluster using regex pattern matching. This metric provides a percentage value representing how much of the cluster's total CPU capacity is currently being consumed by all workloads.

Nodes with pods¶

This panel presents a high-level distribution of pods:

Number of Nodes is the total number of active nodes running pods (9 nodes).
Pod Number of Instances reflects the actual number of pods across the cluster (588).
Upper Limit Pod is the sum of all pod limits defined on each node (1830), which acts as the cluster-wide scheduling ceiling.

This information is essential for assessing current pod density and for scaling decisions. If the number of pods nears the upper limit, Kubernetes will not be able to place new pods, even if memory and CPU are available.

sum(kube_node_status_allocatable{origin_prometheus=~"",resource="pods", unit="integer",node=~"^.*$"})

This PromQL query calculates the total maximum number of pods that can be scheduled across all nodes in a Kubernetes cluster. It sums the allocatable pod capacity from each node by using the kube_node_status_allocatable metric, specifically filtering for pod resources measured in integer units. The query uses a regex pattern (node=~"^.*$") to match all nodes in the cluster, providing a comprehensive view of the cluster's total pod capacity. This metric is essential for capacity planning, as it represents the upper limit of pods that can be scheduled before the cluster reaches its pod density ceiling.

Network Overview (Associable nodes and namespaces)¶

The Network Overview panel provides a real-time visualization of aggregate network throughput across all associated Kubernetes nodes and namespaces. It tracks both inbound (receive) and outbound (send) traffic measured in GiB/s, capturing overall data flow between pods, services, nodes, and external systems. The graph illustrates temporal patterns of bandwidth usage, with two separate lines representing the rate of incoming traffic (green) and outgoing traffic (yellow).

The Receive line reflects how much data nodes are ingesting from other nodes, external APIs, or clients. The Send line represents how much traffic is being sent out, whether to upstream services, other workloads, or external consumers. The panel reports a recent spike reaching nearly 9.44 GiB/s (receive) and 9.48 GiB/s (send), suggesting a surge in data processing, API activity, or inter-pod communication.

This panel is powered by metrics collected from node-level agents, which aggregate per-interface statistics and expose them via Prometheus. Values are summed across nodes to give a global view.

Operationally, this panel is essential for detecting:

Sudden traffic surges or DoS-like behavior
Bandwidth saturation is impacting latency-sensitive workloads
Periodic spikes caused by scheduled jobs, data pipelines, or backups

Sustained high throughput should be evaluated against network interface limits of the underlying VM type to prevent packet loss or throttling. It may also necessitate tuning service meshes, ingress rules, or load balancing policies.

sum (irate(container_network_transmit_bytes_total{origin_prometheus=~"",node=~"^.*$",namespace=~".*"}[2m]))*8

This PromQL query calculates the total outbound network traffic rate across all nodes in a Kubernetes cluster. It uses the irate function to measure the rate of change in transmitted bytes over a 2-minute window, aggregating data from all nodes and namespaces. The query sums these values and multiplies by 8 to convert from bytes to bits, which is typically how network bandwidth is measured. This metric is useful for monitoring overall cluster network utilization, detecting traffic spikes, and capacity planning for network resources.

Namespace resources¶

This table provides a consolidated view of the resource footprint of each Kubernetes namespace across several key dimensions: microservices, pods, services, configuration data, and secrets (passwords). It allows platform teams to monitor tenancy boundaries, identify namespace-level scaling trends, and enforce resource governance policies.

Each row represents a namespace, while each column quantifies the number of a specific resource type associated with that namespace.

Spaces¶

This column lists the Kubernetes namespaces. A namespace is a logical grouping mechanism that isolates and organizes Kubernetes resources. Each namespace functions independently, allowing multiple teams or systems to run workloads in parallel without interfering with each other. Namespaces are often used to segment environments, services, or infrastructure components within a single cluster.

Microservices¶

This column indicates the number of distinct microservices deployed in each namespace. A microservice refers to an independent service component typically backed by a Kubernetes workload object such as a Deployment, StatefulSet, or Job. The number shown here represents the total count of such independent service components currently defined in the namespace. It is useful for understanding the architectural granularity and service footprint in each environment.

Pod¶

This column displays the total number of pods currently running in each namespace. A pod is the smallest deployable unit in Kubernetes and may contain one or more containers that share network and storage. This metric reflects the volume of active workloads operating under each namespace. Pod counts are critical for workload sizing, resource capacity assessment, and operational scaling.

SVC¶

This column shows the number of Kubernetes Services defined in each namespace. A Service is an abstract resource that enables stable access to a set of pods. It provides automatic load balancing and internal DNS resolution. The number in this column reflects how many service definitions are present, which can influence traffic routing, service discovery, and exposure models within that namespace.

Configuration¶

This column represents the number of configuration resources created within each namespace. These typically include ConfigMaps, which hold environment-specific parameters, application settings, or other runtime configuration data. These values are mounted into pods or passed as environment variables. This count helps identify the configuration footprint and parameter management complexity in each namespace.

Passwords¶

This column displays the count of secret resources that hold sensitive data within the namespace. These are typically Kubernetes Secrets and may contain passwords, API tokens, certificates, or any confidential string-based values. The secrets are used to securely inject runtime credentials into pods. This count is relevant for assessing the security footprint and secret management needs across namespaces.

Total¶

The row labeled Total aggregates the values from all individual namespaces for each column. It provides a cluster-wide summary of how many microservices, pods, services, configuration objects, and secret entries are present in total. This summary supports overall environment sizing, cluster governance, and trend analysis.

CPU Usage by Node¶

This panel visualizes the percentage of CPU usage per node by calculating the per-second CPU consumption rate over the past 2 minutes and comparing it to each node’s allocatable CPU capacity. It’s useful for identifying CPU bottlenecks at the node level.

sum(irate(container_cpu_usage_seconds_total{origin_prometheus=~"", container!="", node=~"^.*$"}[2m])) by (node)
/
sum(kube_node_status_allocatable{origin_prometheus=~"", resource="cpu", unit="core", node=~"^.*$"}) by (node) * 100

This PromQL query calculates the CPU utilization percentage for each node in a Kubernetes cluster. It divides the per-second CPU consumption rate (measured over a 2-minute window using the irate function) by each node's allocatable CPU capacity, then multiplies by 100 to express the result as a percentage. The query filters out system containers and considers all nodes in the cluster using regex pattern matching. This metric provides node-level visibility into how much of each node's CPU capacity is being actively used, helping to identify potential bottlenecks or imbalanced workload distribution across the cluster.

Node Memory Breakdown¶

This panel shows the percentage of memory currently being used on each node, based on the working set bytes divided by total allocatable memory. It helps detect potential memory saturation or underutilization.

sum(container_memory_working_set_bytes{origin_prometheus=~"", container!="", node=~"^.*$"}) by (node)
/
sum(kube_node_status_allocatable{origin_prometheus=~"", resource="memory", unit="byte", node=~"^.*$"}) by (node) * 100

This PromQL query calculates the percentage of memory currently being used on each node in a Kubernetes cluster. It divides the sum of container memory working set bytes (active memory usage) by the total allocatable memory for each node, then multiplies by 100 to express the result as a percentage. The query filters out system containers and considers all nodes in the cluster using regex pattern matching. This metric provides node-level visibility into memory utilization, helping to identify nodes that may be approaching memory saturation or showing memory resource imbalances across the cluster.

Node disk pressure¶

This table reports whether any node in the cluster is experiencing disk pressure (status="true"). A value of 1 indicates active disk pressure, while 0 signifies normal disk health.

 sum(kube_node_status_condition{condition="DiskPressure",status="true"}) by (node, env)

This PromQL query calculates the sum of nodes experiencing disk pressure within a Kubernetes cluster, grouped by node name and environment. It specifically filters for nodes where the condition "DiskPressure" has a status value of "true", indicating storage constraints. When this metric returns a value of 1 for any node, it signals that the node is running low on available disk space, which could potentially impact pod scheduling and container operations.

Check Node Startup Duration¶

This panel shows the maximum time taken by each node to start the kubelet process. Higher values may signal delayed startups due to infrastructure, bootstrapping, or image pull latency.

max(kubelet_node_startup_duration_seconds) by (node, env)

This PromQL query calculates the maximum startup duration of the kubelet process across all nodes in a Kubernetes cluster. The query uses the max() function on the kubelet_node_startup_duration_seconds metric and groups the results by node name and environment (by (node, env)). This metric provides visibility into how long it takes for the kubelet service to fully initialize on each node, which can help identify nodes experiencing delayed startups due to infrastructure issues, bootstrapping problems, or image pull latency.

Node-wise Image Pull Duration¶

This panel tracks the average image pull durations across nodes, segmented by image size. It helps in debugging container startup performance and identifying slow image pulls due to registry or network issues.

 avg(kubelet_image_pull_duration_seconds_sum{node=~"^.*$"}) by (image_size_in_bytes, node)

This PromQL query calculates the average image pull duration across all nodes in a Kubernetes cluster, grouped by image size and node name. It uses the kubelet_image_pull_duration_seconds_sum metric and filters to include all nodes with the regex pattern node=~"^.*$".

Node network latency¶

This panel is intended to visualize network latency between nodes based on traffic through the kube-rbac-proxy.

rate(node_network_receive_packets_total[5m])

This PromQL query calculates the rate of network packets received by each node in a Kubernetes cluster over a 5-minute time window. It uses the rate() function applied to the node_network_receive_packets_total metric, which tracks the cumulative count of packets received on network interfaces. This metric is useful for monitoring network traffic patterns, detecting potential packet drops, and identifying nodes experiencing unusual network activity or load.

Cluster node health¶

This panel gives a holistic view of node readiness. Useful for monitoring cluster health and quickly spotting unhealthy nodes

The green line shows total nodes (count(kube_node_info)).
The yellow line tracks nodes not in a Ready state (status="false").

count(kube_node_info)

This PromQL query simply counts the total number of nodes in a Kubernetes cluster by using the count() function on the kube_node_info metric. It provides a straightforward measure of cluster size by returning the absolute number of worker nodes currently registered with the Kubernetes control plane, regardless of their status or condition.

sum(kube_node_status_condition{condition="Ready",status="false"})

This PromQL query calculates the total number of Kubernetes nodes that are in a non-ready state across the cluster. It sums all instances where nodes have the condition "Ready" with a status value of "false". When this metric returns a non-zero value, it indicates that one or more nodes are experiencing problems and are not available for workload scheduling. This is a critical metric for cluster health monitoring as it directly impacts application availability and resource capacity.

Avg node spin-up time¶

It tracks the average time taken for nodes to spin up, offering insight into provisioning efficiency.

 rate(kubelet_node_startup_duration_seconds[5m])

This PromQL query calculates the rate of change in node startup duration over a 5-minute time window. It uses the rate() function applied to the kubelet_node_startup_duration_seconds metric, which measures how long it takes for nodes to complete their startup process. This metric is useful for monitoring cluster node provisioning efficiency, detecting changes in node initialization performance, and identifying potential bottlenecks in the node bootstrapping process.

Pre-registration spin-up time¶

This metric evaluates how long it takes for a node to be ready before joining the cluster, highlighting delays during the bootstrap phase. No activity is currently observed.

Query:

 rate(kubelet_node_startup_pre_registration_duration_seconds[5m])

This PromQL query calculates the rate of change in node pre-registration duration over a 5-minute time window. It uses the rate() function applied to the kubelet_node_startup_pre_registration_duration_seconds metric, which measures how long it takes for nodes to complete their pre-registration process before officially joining the cluster. This metric helps identify potential bottlenecks in the early stages of node bootstrapping, such as network connectivity issues, credential verification delays, or other factors that might slow down node integration into the Kubernetes cluster.

Node memory pressure¶

Displays whether any node is under memory pressure. A value of 1 signals high memory usage. The current graph shows all nodes are healthy in terms of memory.

Query:

 kube_node_status_condition{condition="MemoryPressure",status="true"}

This PromQL query (kube_node_status_condition{condition="MemoryPressure",status="true"}) monitors Kubernetes nodes experiencing memory pressure. It specifically filters for nodes where the condition "MemoryPressure" has a status value of "true", which indicates the node is running low on available memory resources. When this metric returns a value of 1 for any node, it signals that the node's memory is constrained, which could impact pod scheduling and container operations. This metric is critical for proactive cluster health monitoring as memory pressure can lead to pod evictions and degraded application performance.

Pod overview¶

The Pod Resource Overview section provides insights into individual pod behavior and resource utilization patterns across the Kubernetes cluster. This visualization helps DevOps teams and SREs identify resource-intensive workloads, potential bottlenecks, and anomalous pod behaviors that may impact overall system performance.

Top 10 CPU Consuming Pods¶

This bar graph visualizes the top 10 pods in terms of CPU usage across the cluster. Each bar represents a pod, with the percentage of total CPU consumed displayed on the right. The longest bar belongs to poros-collation-worker-pods-0, consuming 32.8%, followed by aws-delete-buckets and grafana-* with lesser usage.

topk(10, sum by (pod) (rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m])))

This calculates the per-second CPU usage rate over a 5-minute window for each pod, excluding the POD placeholder container, sums by pod, and returns the top 10.

Top 10 Memory Consuming Pods¶

This panel highlights pods that are consuming the most memory. pulsar-broker-0 tops the list at 2.72 GiB, followed by multiple pulsar-bookie-* pods.

topk(10, sum by (pod) (container_memory_working_set_bytes{container!="POD", container!=""}))

This sums the working set memory used by each pod (excluding POD), and lists the top 10 consumers.

Pod CPU utilization¶

This line chart shows CPU usage as a percentage (0–100%) across multiple pods. Each line represents a pod’s CPU consumption over time. This is useful to spot spikes in CPU utilization and understand per-pod behavior.

max(irate(container_cpu_usage_seconds_total{origin_prometheus=~"",pod=~".*",container =~".*",container !="",container!="POD",node=~"^.*$",namespace=~".*"}[2m])) by (container, pod) / (max(container_spec_cpu_quota{origin_prometheus=~"",pod=~".*",container =~".*",container !="",container!="POD",node=~"^.*$",namespace=~".*"}/100000) by (container, pod)) * 100

This PromQL query calculates CPU utilization as a percentage of allocated quota for containers in Kubernetes pods. It first measures the instantaneous CPU usage rate over a 2-minute window using irate(), then divides by the CPU quota (converting from microseconds to cores by dividing by 100000), and multiplies by 100 to express as a percentage. The query excludes empty containers and Kubernetes internal "POD" containers, groups results by container and pod names, and finds the maximum utilization for each. This helps identify pods approaching or exceeding their CPU limits, which could indicate performance issues or resource constraints.

Monitor Pod Restart¶

This panel tracks the top 10 pods with the highest restart counts. It shows constant values indicating no new restarts during the selected time window, but historic restart values are visible. flash-service-training-* has the highest count at 88, followed by lakehouse-ms-* at 50.

 topk(10, sum(kube_pod_container_status_restarts_total) by (pod))

This returns the total container restarts per pod and lists the top 10.

Pod CPU Quota Utilization¶

This graph displays CPU usage as a percentage of the defined quota for each pod. The y-axis reflects values exceeding 1.0 (100%), indicating quota breaches. For example, azure-ip-masq-agent-* exceeds 2.0, showing significant overuse.

max(irate(container_cpu_usage_seconds_total{...}[2m])) by (container, pod)
/
max(container_spec_cpu_quota{...}) by (container, pod) / 100000
* 100

This computes the max rate of CPU usage divided by the CPU quota (converted from microseconds to cores), giving CPU usage as a percentage of quota.

Pending Pod Count¶

This panel tracks the total number of pods across the cluster that are currently in the "Pending" phase. A pod enters the "Pending" state when it is accepted by the Kubernetes system, but one or more of the containers within it have not been created yet, typically due to scheduling delays or image pulls. The graph consistently shows a value of 18, indicating a static count of pending pods during the observed window.

 sum(kube_pod_status_phase{phase="Pending"})

This PromQL query sum(kube_pod_status_phase{phase="Pending"}) calculates the total number of Kubernetes pods that are currently in the "Pending" phase across the entire cluster. The query filters for pods with the phase status of "Pending" and then sums them up.

Pod network bandwidth¶

This panel shows real-time pod-level network bandwidth consumption, focused on receive traffic (inflow). The query calculates the per-second rate of bytes received over the last 2-minute window, converted to bits per second by multiplying by 8. The panel displays data for specific pods (e.g., starrocks-in-share-data-mode), and includes metrics such as mean, max, and latest bandwidth. The inflow remains relatively low, with the highest pod reaching ~11.5 KiB/s.

 max(max(irate(container_network_receive_bytes_total{...}[2m])) by (pod)) * 8

This PromQL query calculates the maximum network receive bandwidth across all pods in a Kubernetes cluster. It works by first computing the instantaneous rate of network bytes received over a 2-minute window for each pod using irate(container_network_receive_bytes_total{...}[2m]), then finding the maximum value per pod with max(...) by (pod), before selecting the absolute maximum across all pods with the outer max(). Finally, it multiplies the result by 8 to convert from bytes per second to bits per second, which is the standard unit for network bandwidth measurement.

Cluster pod count¶

This panel provides the total number of active pods across the entire Kubernetes cluster by counting all available time series with the kube_pod_info metric. It includes pods in all phases and namespaces. The graph remains stable at around 392, which reflects a healthy and consistent pod footprint in the cluster without major deployment activity during the interval.

count(kube_pod_info)

This PromQL query count(kube_pod_info) counts the total number of active pods across the entire Kubernetes cluster. It works by counting all available time series with the kube_pod_info metric, which represents basic information about each pod. The query returns a single value showing the total pod count regardless of their status, namespace, or other attributes, providing a simple but effective overview of the cluster's overall pod footprint.

Cluster memory usage¶

This graph shows memory consumption (in GiB) across all namespaces, calculated from container_memory_working_set_bytes (the actual memory in use that cannot be reclaimed). It is aggregated by namespace and converted from bytes to GiB. The namespace public stands out with the highest usage of ~31 GiB, while others like metis and observability report considerably lower usage. This allows platform operators to identify high memory consumers in the cluster quickly.

 sum(container_memory_working_set_bytes) by (namespace) / (1024 * 1024 * 1024)

This PromQL query calculates the total memory usage per Kubernetes namespace by summing up the working set memory bytes across all containers and then dividing by 1,073,741,824 (1024³) to convert from bytes to gibibytes (GiB). The working set memory represents the actual memory in active use that cannot be reclaimed by the kernel, making it a good indicator of real memory consumption. By aggregating this metric by namespace, the query enables administrators to quickly identify which namespaces are consuming the most memory resources across the cluster.

Microservices overview¶

This section provides insights into container-level resource consumption and performance metrics across the Kubernetes cluster. It enables operations teams to identify specific microservices causing resource pressure, track container lifecycle states, and understand utilization patterns at a granular level. By monitoring containers directly, platform engineers can pinpoint application-specific issues that might be obscured in pod or node-level views, allowing for targeted optimization and more effective troubleshooting of application components.

Containers Running¶

This query sums up the number of containers that are currently in the running state across all namespaces. The resulting value 294 reflects the count of running containers in the cluster.

 sum(kube_pod_container_status_running{namespace=~".*"})

This PromQL query calculates the total number of running containers across all namespaces in the Kubernetes cluster. It works by counting all container instances where the status is "running" as reported by the kube_pod_container_status_running metric. The namespace=~".*" filter uses a regular expression to match all namespaces, ensuring a comprehensive count across the entire cluster. This metric is crucial for monitoring overall container health and availability within the platform.

Containers Waiting¶

This query counts containers in the waiting state. A waiting container is usually initializing, pulling images, or blocked due to resource limits. The panel shows 1 such a container at the time of capture.

 sum(kube_pod_container_status_waiting{namespace=~".*"})

This PromQL query sums the total number of containers across all namespaces that are currently in the "waiting" state. It uses the kube_pod_container_status_waiting metric and applies a regex pattern namespace=~".*" to match all namespaces. Containers in the waiting state are typically initializing, pulling images, or waiting due to resource constraints or other dependencies. This metric is important for identifying potential deployment issues or bottlenecks in the container lifecycle.

Containers Terminated¶

This returns the number of containers that have terminated, either because their process completed or due to a failure. In this case, the count is 198.

 sum(kube_pod_container_status_terminated{namespace=~".*"})

This PromQL query sum(kube_pod_container_status_terminated{namespace=~".*"}) calculates the total number of containers across all namespaces that are currently in the "terminated" state. The query uses the kube_pod_container_status_terminated metric and applies a regex pattern namespace=~".*" to match all namespaces. Containers enter the terminated state when they have completed execution or crashed. This metric is valuable for monitoring the container lifecycle and identifying potential issues with application stability.

Containers Restarts (Last 30 Minutes)¶

This tracks the delta (change) in the number of container restarts over the past 30 minutes, summed across all namespaces. A value of 0 indicates there were no container restarts in that period.

 sum(delta(kube_pod_container_status_restarts{namespace=~".*"}[30m]))

This PromQL query calculates the total number of container restarts across all namespaces in the Kubernetes cluster within the last 30 minutes. It works by using the delta() function to measure the change in the kube_pod_container_status_restarts metric over a 30-minute window, then aggregates these changes using sum(). The namespace filter namespace=~".*" ensures all namespaces are included in the calculation. This metric is particularly useful for detecting recent stability issues, as container restarts often indicate application crashes or resource constraints.

Microservices (Container Name) Resource Statistics¶

This panel presents a comprehensive overview of resource metrics for microservice containers across different namespaces. It includes values for CPU usage, CPU restrictions, CPU consumption, memory limits, and memory requests per container. The data is aggregated and normalized using multiple PromQL queries. Below is a detailed explanation of each column, along with the exact query used to compute the value.

Average CPU Usage %¶

This column reflects the normalized average CPU consumption as a percentage of the total CPU quota set for each container.

Query:

sum(irate(container_cpu_usage_seconds_total{origin_prometheus=~".*", container=~".*", container!="" ,container!="POD", namespace=~".*"}[2m])) by (container)
/
sum(container_spec_cpu_quota{origin_prometheus=~".*", container=~".*", container!="" ,container!="POD", namespace=~".*"}) by (container)

This calculates the average rate of CPU usage per container (in cores per second) over a 2-minute window, then normalizes it by dividing by the CPU quota of the container.

Total CPU Restrictions¶

This value indicates the cumulative CPU quota (in cores) imposed on the container using Kubernetes resource limits.

Query:

sum(kube_pod_container_resource_limits{origin_prometheus=~".*", resource="cpu", unit="core", container=~".*", container!="" ,container!="POD", namespace=~".*"}) by (container)

This query fetches the statically defined CPU limit from the container specifications, expressed in CPU cores.

Total CPU Consumed (Raw CPU Seconds)¶

This shows the cumulative raw CPU usage by the container, represented as total CPU seconds consumed over time.

Query:

sum(irate(container_cpu_usage_seconds_total{origin_prometheus=~".*", container=~".*", container!="" ,container!="POD", namespace=~".*"}[2m])) by (container)

This irate function computes the per-second rate of CPU usage over a 2-minute interval, aggregated per container.

Total Memory Limit¶

This reflects the maximum memory limit set on the container through Kubernetes resource specifications.

Query:

sum(kube_pod_container_resource_limits{origin_prometheus=~".*", resource="memory", unit="byte", container=~".*", container!="" ,container!="POD", namespace=~".*"}) by (container)

This query returns the hard upper memory bound configured for each container, in bytes.

Total CPU (CPL: CPU per Logical Unit)¶

This represents CPU usage in logical units, likely normalized or abstracted depending on internal policy (e.g., CPU per service or per tenant).

Query:

irate(container_cpu_usage_seconds_total{...}) * normalization_factor

This query calculates the instantaneous CPU usage rate over a short time window for containers, then multiplies it by a normalization factor to standardize CPU measurements across different container types or environments.

Total Memory Requirements¶

This represents the memory requested by containers during scheduling, which is the guaranteed memory Kubernetes reserves.

Query:

sum by (container) (
  kube_pod_container_resource_requests{origin_prometheus=~".*", resource="memory", unit="byte", container=~".*", container!="" ,container!="POD", namespace=~".*"}
)

This PromQL query sums up the memory resource requests for all containers across all namespaces, grouping the results by container name. It excludes the "POD" containers and empty container names while filtering for memory requests measured in bytes.

Microservices CPU usage %¶

Displays the CPU usage percentage of each container, normalized to 100% of their allocated CPU quota. This panel divides the per-second CPU usage (irate) by the CPU quota to calculate a percentage. The multiplication by 100 normalizes it to the 0–100% scale.

sum(irate(container_cpu_usage_seconds_total{origin_prometheus=~".*", container=~".*", container!="", container!="POD", namespace=~".*"}[2m])) by (container)
/
sum(container_spec_cpu_quota{origin_prometheus=~".*", container=~".*", container!="", container!="POD", namespace=~".*"}) by (container) * 100

This PromQL query calculates the average CPU usage of containers as a percentage of their allocated CPU quota. It divides the per-second CPU usage rate (measured over a 2-minute window using irate) by the container's CPU quota, then multiplies by 100 to express it as a percentage. The query filters for all containers across all namespaces while excluding empty container names and Kubernetes internal "POD" containers. The results are grouped by container name, allowing operators to quickly identify which microservices are approaching or exceeding their CPU limits.

Microservices memory usage %¶

Displays memory usage as a percentage of the memory limit across containers.

Query A – Working Set Based:

(container_spec_memory_limit_bytes{origin_prometheus=~"",container =~".*",container !="",container!="POD",namespace=~".*"}) by (container) * 100

Query B – RSS Based (Alternative):

sum (container_memory_rss{origin_prometheus=~"",container =~".*",container !="",container!="POD",namespace=~".*"}) by (container)/ sum(container_spec_memory_limit_bytes{origin_prometheus=~"",container =~".*",container !="",container!="POD",namespace=~".*"}) by (container) * 100

The first query uses working set memory, which is more reflective of memory actively in use. The second uses RSS, which includes memory not currently in use. Both are normalized against memory limits to show utilization percentage.

Microservices network bandwidth¶

Tracks the inbound and outbound network traffic per container over time.

Queries:

sum(sum(irate(container_network_receive_bytes_total{origin_prometheus=~"",node=~"^.*$",namespace=~".*"}[2m])) by (pod)* on(pod) group_right kube_pod_container_info{origin_prometheus=~"",namespace=~".*",container =~".*"}) by(container)

sum(sum(irate(container_network_transmit_bytes_total{origin_prometheus=~"",node=~"^.*$",namespace=~".*"}[2m])) by (pod)* on(pod) group_right kube_pod_container_info{origin_prometheus=~"",namespace=~".*",container =~".*"}) by(container) *8

sum (rate (container_network_receive_bytes_total{origin_prometheus=~"",pod=~".*",image!="",name=~"^k8s_.*",node=~"^.*$",namespace=~".*",pod=~".*.*.*"}[2m])) by (pod)

- sum (rate (container_network_transmit_bytes_total{origin_prometheus=~"",pod=~".*",image!="",name=~"^k8s_.*",node=~"^.*$",namespace=~".*",pod=~".*.*.*"}[2m])) by (pod)

These PromQL queries measure network bandwidth usage for containers in a Kubernetes cluster by:

The first query aggregates inbound network traffic (received bytes) per container
The second query aggregates outbound network traffic (transmitted bytes) per container and multiplies by 8 (likely to convert to bits)
The third and fourth queries calculate net bandwidth (receive minus transmit rates) for specific pod patterns

Microservices CPU cores used¶

This panel displays total CPU cores used over time per microservice. It includes both current usage and resource limits.

CPU Usage (irate-based usage calculation):

sum(kube_pod_container_resource_limits{origin_prometheus=~"",resource="cpu", unit="core",container =~".*",container !="",container!="POD",namespace=~".*"}) by (container)

This PromQL query sums all CPU resource limits defined for containers across all namespaces in a Kubernetes cluster. It retrieves the kube_pod_container_resource_limits metric specifically for CPU resources measured in core units, then aggregates these values by container name. The query filters out empty containers and the Kubernetes internal "POD" containers while including all namespaces. The result provides a comprehensive view of the total CPU cores allocated as limits for each microservice container in the cluster.

CPU Limit (for comparison):

sum(irate(container_cpu_usage_seconds_total{origin_prometheus=~"",container =~".*",container !="",container!="POD",namespace=~".*"}[2m])) by (container)

This PromQL query calculates the total CPU usage per container across all namespaces in a Kubernetes cluster. It works by summing the instantaneous CPU usage rate (using the irate function over a 2-minute window) for each container, excluding empty containers and Kubernetes internal "POD" containers. The result is expressed in CPU cores per second, showing the actual computational resources being consumed by each microservice container at the time of measurement.

Microservices memory usage¶

This panel compares actual memory usage vs memory limits for each container.

Memory Usage (Working Set):

sum(container_memory_working_set_bytes{origin_prometheus=~".*", container =~".*", container!="" ,container!="POD", namespace=~".*"}) by (container)

This PromQL query calculates the total working set memory usage across all containers in the Kubernetes cluster, grouped by container name. It sums the container_memory_working_set_bytes metric, which represents the amount of memory actively being used by each container's processes. The query filters to include all containers except empty ones and the Kubernetes internal “POD" containers, spanning all namespaces. Working set memory is particularly important as it represents memory that cannot be reclaimed by the system without causing performance degradation, making it a critical metric for monitoring container memory health.

Memory Limit:

sum(container_spec_memory_limit_bytes{origin_prometheus=~".*", container =~".*", container!="" ,container!="POD", namespace=~".*"}) by (container)

This PromQL query sums up all memory limits defined for containers across the Kubernetes cluster, grouped by container name. It retrieves the container_spec_memory_limit_bytes metric, which represents the maximum memory allocation allowed for each container as specified in their resource configurations. The query filters to include all containers while excluding empty containers and the Kubernetes internal "POD" containers, spanning all namespaces. This metric is crucial for capacity planning and for calculating memory utilization percentages when compared against actual memory usage.

Microservices pod count¶

This panel counts the number of running pods per container within each namespace.

count(kube_pod_container_info{origin_prometheus=~".*", container =~".*", container!="" ,container!="POD", namespace=~".*"}) by (container, namespace)

This PromQL query calculates the rate of change in node startup duration times over a 5-minute window. It uses the rate() function applied to the kubelet_node_startup_duration_seconds A metric that measures the time it takes for nodes to initialize. This metric is valuable for identifying trends in node provisioning efficiency and detecting potential issues in the node bootstrap process. A consistently increasing rate could indicate degrading infrastructure performance, while flat or decreasing rates suggest stable or improving node startup times.