Alerts for High CPU Usage¶

Whenever a DataOS Resource is created, it runs as a single pod in the underlying Kubernetes cluster. These pods inherit resource definitions such as CPU and memory limits or requests from the Resource configurations. Observability in this context is applied at the pod level by monitoring pod metrics, considering a single pod per Resource, and this effectively helps in understanding the behavior and performance of the higher-level DataOS Resources that generated them.

CPU limit breach¶

This section outlines how to configure both a Monitor Resource to observe the incident condition and a Pager Resource to send alerts when the condition is met. This type of alert is useful when CPU limits are configured for the Resource. Exceeding the CPU limit can lead to throttling, affecting the Resource's performance. This alert helps in proactively detecting such behavior.

Execute the following command in DataOS CLI to get the pod name corresponding to the Resource that needs to be monitored.

dataos-ctl log -t ${{resource-type}} -n ${{resource-name}}

Example usage:

dataos-ctl log -t service -n perspectivedb-rest
# output
INFO[0000] 📃 log(public)...                             
INFO[0001] 📃 log(public)...complete                     

    NODE NAME    │ CONTAINER NAME │ ERROR  
─────────────────┼────────────────┼────────
  perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr │ perspectivedb-rest     │        
 # ^ pod name
-------------------LOGS-------------------
    Task executor: pool=0, active=0, queue=0
    Concurrency control: slots=4, available=3
    Reservations:
        (pending)
Query tasks:

2025-06-03T09:01:32.289Z    INFO    Notification Thread io.airlift.stats.JmxGcMonitor   Major GC: application 275994ms, stopped 61ms: 374.09MB -> 322.10MB
2025-06-03T09:01:58.045Z    DEBUG   task-executor-scheduler-0   io.trino.execution.executor.dedicated.ThreadPerDriverTaskExecutor   
Queue:
    Baseline weight: 0
    Groups:
    Task executor: pool=0, active=0, queue=0
    Concurrency control: slots=4, available=3
    Reservations:
        (pending)

Create a Monitor Resource manifest file as example below and replace the pod name with the pod name of the Resource which you want to monitor. This manifest defines the logic for comparing actual CPU usage against the pod's total CPU limit. The incident will be triggered if usage exceeds 80% of the limit.

name: cpu-monitor
description: Monitor for CPU usage of the perspectivedb-rest container
version: v1alpha
type: monitor
monitor:
  schedule: '*/2 * * * *'
  type: equation_monitor
  equation:
    leftExpression:
      queryCoefficient: 1
      queryConstant: 0     
      query:
        type: prom
        ql: '100 * (sum by(pod) (rate(container_cpu_usage_seconds_total{pod="perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr"}[5m])))/sum by(pod) (kube_pod_container_resource_limits{pod="perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr", resource="cpu"})'

    rightExpression:
      queryCoefficient: 0
      queryConstant: 80
    operator: greater_than
  incident:
    type: prom
    name: cpualerts
    category: equation
    severity: info
    operator: greater_than

🗣️ Before applying the Monitor manifest, make sure to replace the placeholder pod name (perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr) in the PromQL query with the actual pod name of the Resource you want to monitor.

Validate the incident condition if it is configured correctly by executing the command below.

dataos-ctl develop observability monitor equation -f /office/monitor/pod_cpu_limit_monitor.yaml

Expected output when the condition is not met:

The monitor ran successfully, but CPU usage was below 80% of the pod's limit. No incident is triggered.

bash
CopyEdit
INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0002] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0002] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0003] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:0): 🟧 monitor condition not met

Expected output when the condition is met:

The pod's CPU usage exceeded 80% of its CPU limit. The monitor triggered an incident.

bash
CopyEdit
INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0001] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0001] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0002] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:1): 🟩 monitor condition met

Run the following command to apply the Monitor.

dataos-ctl resource apply -f pod-cpu-limit-monitor.yaml

Verify the Monitor runtime. This step ensures that the Monitor has been successfully registered and is running as expected.
```
dataos-ctl get runtime -t monitor -w <your-workspace> -n pod-cpu-limit-monitor -r
```

Create a Pager Resource manifest file. This manifest configures the alert delivery path. It listens for the incident triggered by the Monitor and sends the alert to a Teams channel or any other webhook.

name: pod-cpu-limit-pager
version: v1alpha
type: pager
description: Pager to alert when pod CPU usage exceeds 80% of its defined CPU limit
workspace: <your-workspace>
pager:
  conditions:
    - valueJqFilter: .properties.name
      operator: equals
      value: cpu-limit-violation
  output:
    msTeams:
      webHookUrl: https://rubikdatasolutions.webhook.office.com/webhookb2/09239cd8-9d59-9621-9217305bf6e22bdde-3ec2-4392-78e9f35a44fb/IncomingWebhook/92dcd2acdaee4e6cac125ac4a729e48f/631bd149-c89d-4d3b-8979-8e364b419/V23AwNxCZx9fToWpqDSYeRkQefDZ-cPn74pY60
    email:
      emailTargets:
        - iamgroot@tmdc.io

Once defined, apply the Pager Resource using the command below.
```
dataos-ctl resource apply -f pod-cpu-limit-pager.yaml
```
Get notified! When the CPU usage condition is met, the incident is triggered, and the Pager sends the notification to the configured destination.

Usage exceeds request¶

This section outlines how to configure both a Monitor Resource to observe the incident condition and a Pager Resource to send alerts when the condition is met. This alert is useful when a pod consumes more CPU than what was originally requested, indicating possible scheduling pressure or a need to adjust resource allocation.

Create a Monitor Resource manifest file. This manifest defines the logic for comparing the total CPU usage of the pod against its total requested CPU. The incident will be triggered if usage exceeds 80% of the requested amount.

name: cpu-monitor
description: Monitor for CPU usage of the nilus-server container
version: v1alpha
type: monitor
monitor:
  schedule: '*/2 * * * *'
  type: equation_monitor
  equation:
    leftExpression:
      queryCoefficient: 1
      queryConstant: 0     
      query:
        type: prom
        ql: '100 * (sum by(pod) (rate(container_cpu_usage_seconds_total{pod="perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr"}[5m])))/sum by(pod) (kube_pod_container_resource_requests{pod="perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr", resource="cpu"})'

    rightExpression:
      queryCoefficient: 0
      queryConstant: 80
    operator: greater_than
  incident:
    type: prom
    name: cpualerts
    category: equation
    severity: info
    operator: greater_than

🗣️ Before applying the Monitor manifest, make sure to replace the placeholder pod name (perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr) in the PromQL query with the actual pod name of the Resource you want to monitor.

Validate the incident condition if it is configured correctly by executing the command below.

dataos-ctl develop observability monitor equation -f /office/monitor/pod_cpu_request_monitor.yaml

Expected output when the condition is not met:

The monitor runs successfully, but the CPU usage is below 80% of the pod’s request. No incident is triggered.

INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0002] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0002] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0003] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:0): 🟧 monitor condition not met

Expected output when the condition is met:

CPU usage crossed 80% of the total CPU requested for the pod. An incident is triggered.

INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0001] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0001] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0002] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:1): 🟩 monitor condition met

Run the following command to apply the Monitor.

dataos-ctl resource apply -f pod-cpu-request-monitor.yaml

Verify the Monitor runtime. This step ensures that the Monitor has been successfully registered and is running as expected.
```
dataos-ctl get runtime -t monitor -w <your-workspace> -n pod-cpu-request-monitor -r
```

Create a Pager Resource manifest file. This manifest configures the alert delivery path. It listens for the incident triggered by the Monitor and sends the alert to a Teams channel or any other webhook.

name: pod-cpu-request-pager
version: v1alpha
type: pager
description: Pager to alert when pod CPU usage exceeds 80% of its requested CPU
workspace: <your-workspace>
pager:
  conditions:
    - valueJqFilter: .properties.name
      operator: equals
      value: cpu-request-violation
  output:
    msTeams:
      webHookUrl: https://rubikdatasolutions.webhook.office.com/webhookb2/09239cd8-9d59-9621-9217305bf6e22bdde-3ec2-4392-78e9f35a44fb/IncomingWebhook/92dcd2acdaee4e6cac125ac4a729e48f/631bd149-c89d-4d3b-8979-8e364b419/V23AwNxCZx9fToWpqDSYeRkQefDZ-cPn74pY60
    email:
      emailTargets:
        - iamgroot@tmdc.io

Once defined, apply the Pager Resource using the command below.
```
dataos-ctl resource apply -f pod-cpu-request-pager.yaml
```
Get notified! When the CPU usage condition is met, the incident is triggered, and the Pager sends the notification to the configured destination.