Observability Use Case: Resolving CPU Limit Breach on Cluster Resource¶
This document describes a real-world example of how users handle alerts in DataOS. It walks through each step from getting an alert to fixing the issue. The guide shows how the observability tools work together to help users find and solve issues quickly and effectively.
Teams alert¶
An observability Monitor in DataOS detected that a Cluster’s CPU usage had surpassed a critical threshold, automatically triggering an Incident. The Cluster is part of a Data Product powering a business use case. DataOS pairs Monitors with Pager Resources to route alerts to external channels; in this case, the Pager was configured with a Microsoft Teams webhook, so an alert notification appeared instantly in the team’s channel. The MS Teams message contained key details (monitor name, severity, timestamp, and description), informing the operators that the Cluster’s CPU limit had been breached. This integration of the DataOS Monitor and Pager ensured the team received real-time notification of the issue via Microsoft Teams, enabling a quick response.
Monitor and Pager setup¶
The user had configured the alerting using two Resources: a Monitor that tracked CPU utilization on a Cluster, and a Pager that sent alerts through Microsoft Teams when the Monitor detected an issue. Below is how these were configured and how they worked together to trigger the alert.
Configuration¶
Prior to this incident, the user had set up a Monitor manifest to watch the cluster’s CPU utilization.
- The monitor was scheduled to run periodically (e.g., every 2 minutes) and evaluate a Prometheus query against DataOS’s metrics store.
- In the manifest, the left-hand expression was defined as a PromQL query (`type: prom`) targeting the “thanos” cluster (DataOS’s internal metrics endpoint).
- This query measured the cluster’s current CPU usage (for example, as a percentage of its limit or in millicores). The right-hand side was a constant threshold (e.g., 80% or an equivalent value), and the operator was `greater_than`, so the condition would trigger if CPU usage stayed above the threshold.
- Alongside the Monitor, the user configured a Pager Resource to forward the incident as an alert to Microsoft Teams. The Pager’s manifest specified an output section with an MS Teams webhook URL.
- This tells DataOS to send any incident caught by this Pager to the given Teams channel. (Optional filtering rules can be set in the Pager via conditions, such as matching the incident name or severity, ensuring only relevant incidents trigger the alert.) With this setup in place, whenever the CPU monitor’s incident fired, the Pager would post a formatted alert message into Teams.
- The alert message included details like the monitor name and description, severity level, and timestamp.
- In this scenario, as soon as the Cluster’s CPU usage hit the threshold for the configured duration, the Monitor created an incident, which the Pager caught. An operator saw a Teams notification (e.g., “Alert: cpu-monitor01 triggered with severity `info` at 07:26:00. Description: Cluster CPU usage above 80%...”). This immediate alert gave the team a heads-up about potential Resource degradation.
In practice, the `ql` PromQL query would be formulated to retrieve the CPU usage metric for the specific Cluster (e.g., using labels or an ID for that Cluster). When the average CPU usage exceeded 80%, the Monitor would mark the condition as met and create an incident.
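For reference, the query used in this scenario (shown in full in the Monitor manifest under Steps) takes the following shape, expressing per-pod CPU usage as a percentage of the pod’s configured CPU limit. It is reformatted here with comments for readability:

```promql
# Numerator: CPU cores consumed by the pod, as a rate over the last 5 minutes
# Denominator: the pod's configured CPU limit; the ratio is scaled to a percentage
100 * (
  sum by (pod) (rate(container_cpu_usage_seconds_total{pod="mycluster-ss-0"}[5m]))
)
/
sum by (pod) (kube_pod_container_resource_limits{pod="mycluster-ss-0", resource="cpu"})
```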
Steps¶
- Executed the following command in the DataOS CLI to get the pod name corresponding to the Cluster that needs to be monitored.
```bash
dataos-ctl log -t cluster -w public -n mycluster
```

Output:

```
INFO[0000] 📃 log(public)...
INFO[0001] 📃 log(public)...complete

      NODE NAME    │ CONTAINER NAME │ ERROR
  ─────────────────┼────────────────┼────────
    mycluster-ss-0 │ mycluster      │
    # ^ pod name

-------------------LOGS-------------------
=========== Configuring core
 - Setting io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec >> /usr/trino/etc/catalog/core-site.xml
<configuration>
<property><name>io.compression.codecs</name><value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
</configuration>

----------- JVM Options: /usr/trino/etc/jvm.config
-server
-Xmx128G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
--enable-native-access=ALL-UNNAMED

Config: /usr/trino/etc/config.properties
```
- Created a Monitor Resource manifest file as shown below. This manifest defines the logic for comparing actual CPU usage against the pod's total CPU limit. The incident will be triggered if usage exceeds 80% of the limit.
```yaml
name: cpu-monitor01
description: Monitor for CPU usage of the perspectivedb-rest container
version: v1alpha
type: monitor
monitor:
  schedule: '*/2 * * * *'
  type: equation_monitor
  equation:
    leftExpression:
      queryCoefficient: 1
      queryConstant: 0
      query:
        type: prom
        ql: '100 * (sum by(pod) (rate(container_cpu_usage_seconds_total{pod="mycluster-ss-0"}[5m])))/sum by(pod) (kube_pod_container_resource_limits{pod="mycluster-ss-0", resource="cpu"})'
    rightExpression:
      queryCoefficient: 0
      queryConstant: 80
    operator: greater_than
  incident:
    type: prom
    name: cpualerts
    category: equation
    severity: info
    operator: greater_than
```
- Validated that the incident condition was configured correctly by test-running the Monitor from the DataOS CLI.
Output:
The Monitor ran successfully, but CPU usage was below 80% of the pod's limit at that time, so no incident was triggered.
```
INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0002] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0002] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0003] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:0):

🟧 monitor condition not met
```
- Executed the following command to apply the Monitor.
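The exact invocation is not reproduced in the original walkthrough; a typical form, assuming the Monitor manifest above was saved as `cpu-monitor.yaml` (an illustrative file name), would be:

```bash
# Apply the Monitor manifest to the public Workspace (file name is illustrative)
dataos-ctl apply -f cpu-monitor.yaml -w public
```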
- Verified the Monitor runtime. This step ensures that the Monitor has been successfully registered and is running as expected.
- Created a Pager Resource manifest file. This manifest configures the alert delivery path: it listens for the incident triggered by the Monitor and sends the alert to a Teams channel or any other webhook.
```yaml
name: pod-cpu-limit-pager
version: v1alpha
type: pager
description: Pager to alert when pod CPU usage exceeds 80% of its defined CPU limit
workspace: public
pager:
  conditions:
    - valueJqFilter: .properties.name
      operator: equals
      value: cpualerts
  output:
    webHook:
      url: {{MS_TEAMS_WEBHOOK_URL}}
      verb: post
      headers:
        'content-type': 'application/json'
      bodyTemplate: |
        {
          "title": "Pod CPU Usage Alert",
          "text": "CPU usage exceeded 80% of the defined CPU limit for pod mycluster-ss-0",
          "Pod": "{{ .Labels.pod }}",
          "Time": "{{ .CreateTime }}",
          "Severity": "{{ .Properties.severity }}"
        }
```
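Optionally, the Teams webhook can be sanity-checked before the Pager is applied (this step is not part of the original walkthrough; `MS_TEAMS_WEBHOOK_URL` stands for the same placeholder used in the manifest above):

```bash
# Post a minimal test message to the Teams incoming webhook; a 200 response and a message
# appearing in the channel confirm the URL is valid before it is wired into the Pager
curl -X POST "$MS_TEAMS_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"title": "Pager webhook test", "text": "Test alert from DataOS observability setup"}'
```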
- Once defined, applied the Pager Resource using the command below.
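A typical invocation, assuming the Pager manifest was saved as `pod-cpu-limit-pager.yaml` (an illustrative file name), would be:

```bash
# Apply the Pager manifest to the public Workspace (file name is illustrative)
dataos-ctl apply -f pod-cpu-limit-pager.yaml -w public
```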
- Get notified! When the CPU usage condition is met, the incident is triggered and the Pager sends the notification to the configured destination.
Investigate in Operations¶
Upon receiving the alert, the user turned to the Operations App to check the CPU usage of the Cluster named ‘mycluster01’. They focused on CPU usage vs. CPU limit for the affected Cluster. The Operations App displayed a table of the Cluster’s CPU consumption, confirming the alert’s details: CPU usage was more than 80% of the configured limit.
This made it clear that the Cluster’s CPU was saturated: the usage metric had reached the value of the limit and could not go any higher. This evidence pointed to the Cluster being CPU-starved and aligned with the alert: CPU usage remained above the 80% threshold consistently, triggering the incident.
Root cause: CPU limit¶
The root cause became evident: the Cluster’s CPU limit was set too low for its workload, causing it to max out under load. In DataOS, a Cluster Resource is backed by a certain amount of compute capacity (CPU and memory). In this case, the Cluster was configured (via its YAML) with a fixed CPU limit. During normal operation this might have been sufficient, but at the time of the incident, concurrent heavy queries or an unexpected spike in workload demanded more processing power than the defined limit. Since the container was capped at its CPU limit, it could not utilize more CPU even though the host had free capacity: containers cannot use more CPU than their configured limit.
In essence, the cluster was under-provisioned for the workload: it ran out of CPU headroom. The observability data ruled out other causes (no memory bottleneck or errors were observed) and pinpointed CPU exhaustion as the culprit. The high CPU usage incident was not a one-time glitch but a symptom of sustained load exceeding capacity. This analysis highlighted that the CPU limit configured for the cluster was too low, and the cluster needed more CPU to handle the demand.
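If further corroboration is ever needed, CPU throttling can be checked against the same metrics store with a query along these lines (a sketch, not part of the original investigation; it assumes the standard cAdvisor CFS throttling metrics are being scraped):

```promql
# Fraction of CFS scheduling periods in which the container was throttled over the last 5 minutes;
# values near 1 indicate the container is constantly hitting its CPU limit
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{pod="mycluster-ss-0"}[5m]))
/
sum by (pod) (rate(container_cpu_cfs_periods_total{pod="mycluster-ss-0"}[5m]))
```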
Remediation: Raise CPU limit¶
To fix the issue, the team adjusted the Cluster’s resource allocation. The user updated the Cluster’s YAML manifest to increase its CPU limit (and the corresponding request), for instance raising the CPU limit from 30m to 300m. The new configuration would be:
```yaml
# Resource meta section
name: mycluster01
version: v1
type: cluster
description: testing
tags:
  - cluster

# Cluster-specific section
cluster:
  compute: query-default
  type: minerva

  # Minerva-specific section
  minerva:
    replicas: 1
    resources:
      limits:          # updated CPU usage limit
        cpu: 300m
        memory: 600Mi
      requests:
        cpu: 50m
        memory: 100Mi
    debug:
      logLevel: INFO
      trinoLogLevel: ERROR
    depots:
      - address: dataos://stredshiftexternal
```
This change allocates a higher CPU usage limit to the Cluster, giving it more processing capacity.
After updating the manifest, the user re-applied it to update the running Cluster by executing the following command in the DataOS CLI:
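The specific command is not captured here; based on the apply log below, it would take this form, assuming the updated manifest was saved as `mycluster01.yaml` (an illustrative file name):

```bash
# Re-apply the updated Cluster manifest so the new CPU limit takes effect
dataos-ctl apply -f mycluster01.yaml -w public
```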
Output:

```
INFO[0000] 🛠 apply...
INFO[0000] 🔧 applying(public) mycluster01:v1:cluster...
INFO[0001] 🔧 applying(public) mycluster01:v1:cluster...updated
INFO[0001] 🛠 apply...complete
```
This command pushes the new configuration to DataOS, updating the Cluster Resource with the higher CPU limit (in Kubernetes terms, this updates the pod’s resource limits). The platform then provisions the additional CPU for the Cluster, and once the step completes, the Cluster Resource reflects the new CPU allocation, as shown below.
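A quick way to confirm the new allocation from the CLI is the generic `dataos-ctl get` pattern (a sketch; output columns may differ):

```bash
# Confirm the Cluster Resource now reflects the updated CPU limit
dataos-ctl get -t cluster -w public -n mycluster01
```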
Conclusion¶
This scenario highlights a meaningful use of DataOS observability. The Monitor and Pager alerting system gave early warning of a resource saturation issue, allowing the team to respond before a complete outage occurred. By leveraging DataOS’s observability features, the user ensured the platform’s reliability and performance were maintained. In summary, the Monitor/Pager combo identified a CPU bottleneck, guided the remedial action, and verified the outcome, exemplifying how DataOS observability empowers incident management and system tuning. The Cluster was restored swiftly, and stakeholders were confident knowing that the monitoring in place would catch similar issues in the future, demonstrating the value of observability in DataOS.