# Alerts for High Memory Usage
Whenever a DataOS Resource is created, it runs as a single pod in the underlying Kubernetes cluster. These pods inherit resource definitions such as CPU and memory limits and requests from the Resource configuration. Observability is therefore applied at the pod level: because each Resource maps to a single pod, monitoring pod metrics effectively reveals the behavior and performance of the higher-level DataOS Resource that created it.
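For illustration, a minimal sketch of the kind of container `resources` block such a pod carries (the values below are invented examples; DataOS derives the real ones from the Resource manifest):

```yaml
# Illustrative resources block on the container of a pod backing a DataOS Resource.
# Values are examples only; the actual figures come from the Resource configuration.
resources:
  requests:
    cpu: 500m        # baseline CPU the scheduler reserves for the pod
    memory: 512Mi    # baseline memory reservation (compared in the "Usage exceeds request" monitor)
  limits:
    cpu: 1000m       # hard CPU ceiling
    memory: 1Gi      # hard memory ceiling (compared in the "Memory limit breach" monitor)
```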
## Memory limit breach
This section outlines how to configure both a Monitor Resource to observe memory usage at the pod level and a Pager Resource to send alerts when usage crosses a defined percentage of the pod's total memory limit. This alert helps detect when a pod is approaching its maximum memory allocation, which could lead to OOM (Out of Memory) kills or degraded performance.
Create a Monitor Resource manifest file. This manifest compares the total memory usage of the pod against its defined memory limit. An incident is triggered if usage exceeds 80% of the limit.
```yaml
name: pod-memory-limit-monitor
description: Monitor for memory usage of the pod perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr
version: v1alpha
type: monitor
monitor:
  schedule: '*/2 * * * *'
  type: equation_monitor
  equation:
    leftExpression:
      queryCoefficient: 1
      queryConstant: 0
      query:
        type: prom
        # replace the pod name with the actual pod name of the Resource
        ql: >
          sum by(pod) (
            container_memory_usage_bytes{pod="perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr"}
          )
          /
          sum by(pod) (
            kube_pod_container_resource_limits{pod="perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr", resource="memory"}
          )
    rightExpression:
      queryCoefficient: 1
      queryConstant: 0.8
      query:
        type: prom
        ql: 'scalar(1)'
    operator: greater_than
  incident:
    type: prom
    name: memory-limit-violation
    category: usage
    severity: high
```
Validate the incident condition by executing the command below.
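This is the same `dataos-ctl develop observability` invocation shown in the next section; the manifest path below is an assumed save location, so substitute wherever you stored the file:

```bash
# Dry-run the equation monitor against live metrics before applying it.
# The path is a placeholder for where you saved the manifest above.
dataos-ctl develop observability monitor equation -f /office/monitor/pod_memory_limit_monitor.yaml
```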
Expected output when the condition is not met:
The pod's memory usage is below 80% of the defined limit. No incident is triggered.
```
INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0002] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0002] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0003] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:0): 🟧 monitor condition not met
```

Expected output when the condition is met:
The pod's memory usage exceeds 80% of its limit. The monitor triggers an incident.
```
INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0001] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0001] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0002] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:1): 🟩 monitor condition met
```
Apply the Monitor using the command below.
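A minimal sketch of the apply step, assuming the manifest was saved to the path used above and that the Monitor belongs to your workspace:

```bash
# Register the Monitor Resource (path and workspace are placeholders).
dataos-ctl apply -f /office/monitor/pod_memory_limit_monitor.yaml -w <your-workspace>
```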
Verify the Monitor runtime to ensure it's registered and running properly.
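One way to check, assuming the Monitor name and workspace used in the manifest above:

```bash
# Inspect the Monitor's runtime status and recent runs.
dataos-ctl get runtime -t monitor -w <your-workspace> -n pod-memory-limit-monitor
```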
Create a Pager Resource manifest file. This configures where and how alerts should be delivered when memory usage crosses the threshold.
```yaml
name: pod-memory-limit-pager
version: v1alpha
type: pager
description: Pager to alert when pod memory usage exceeds 80% of its defined memory limit
workspace: <your-workspace>
pager:
  conditions:
    - valueJqFilter: .properties.name
      operator: equals
      value: memory-limit-violation
  output:
    msTeams:
      webHookUrl: https://rubikdatasolutions.webhook.office.com/webhookb2/IncomingWebhook/92dcd2
    email:
      emailTargets:
        - iamgroot@tmdc.io
```
Apply the Pager Resource using the command below.
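As with the Monitor, a sketch with a placeholder path and workspace:

```bash
# Register the Pager Resource in your workspace.
dataos-ctl apply -f /office/pager/pod_memory_limit_pager.yaml -w <your-workspace>
```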
Get notified! When the memory usage crosses the threshold, the incident is triggered, and the Pager sends the notification to the configured destination.
## Usage exceeds request
This section outlines how to configure a Monitor Resource to observe pod memory usage relative to its defined memory requests, and a Pager Resource to send alerts when usage crosses a defined percentage of the requested memory. This is useful for detecting pods that are exceeding their expected baseline allocation and may need to be scaled or reconfigured.
Create a Monitor Resource manifest file. This manifest compares the total memory usage of the pod against the requested memory values. An incident is triggered if memory usage exceeds 80% of the total requested memory.
```yaml
name: pod-memory-request-monitor
description: Monitor for memory usage of the pod perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr
version: v1alpha
type: monitor
monitor:
  schedule: '*/2 * * * *'
  type: equation_monitor
  equation:
    leftExpression:
      queryCoefficient: 1
      queryConstant: 0
      query:
        type: prom
        ql: >
          sum by(pod) (
            container_memory_usage_bytes{pod="perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr"}
          )
          /
          sum by(pod) (
            kube_pod_container_resource_requests{pod="perspectivedb-rest-yx2r-d-5b7bdb5648-p9tdr", resource="memory"}
          )
    rightExpression:
      queryCoefficient: 1
      queryConstant: 0.8
      query:
        type: prom
        ql: 'scalar(1)'
    operator: greater_than
  incident:
    type: prom
    name: memory-request-violation
    category: usage
    severity: warning
```
Validate the incident condition. Run the following command to test whether the condition works before applying the Monitor.
```bash
dataos-ctl develop observability monitor equation -f /office/monitor/pod_memory_request_monitor.yaml
```

Expected result when the condition is NOT met:
```
INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0002] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0002] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0003] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:0): 🟧 monitor condition not met
```

Expected result when the condition is met:
```
INFO[0000] 🔮 develop observability...
INFO[0000] 🔮 develop observability...monitor tcp-stream...starting
INFO[0001] 🔮 develop observability...monitor tcp-stream...running
INFO[0001] 🔮 develop observability...monitor tcp-stream...stopping
INFO[0001] 🔮 context cancelled, monitor tcp-stream is closing.
INFO[0002] 🔮 develop observability...complete

RESULT (maxRows: 10, totalRows:1): 🟩 monitor condition met
```
Apply the Monitor Resource.
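As before, a sketch with a placeholder workspace (the manifest path matches the validation command above):

```bash
dataos-ctl apply -f /office/monitor/pod_memory_request_monitor.yaml -w <your-workspace>
```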
Verify the Monitor runtime.
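Assuming the Monitor name and workspace from the manifest above:

```bash
dataos-ctl get runtime -t monitor -w <your-workspace> -n pod-memory-request-monitor
```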
Create a Pager Resource manifest file. This Pager listens for the memory request incident and sends alerts to a Microsoft Teams channel webhook and an email recipient.
```yaml
name: pod-memory-request-pager
version: v1alpha
type: pager
description: Pager to alert when pod memory usage exceeds requested memory
workspace: <your-workspace>
pager:
  conditions:
    - valueJqFilter: .properties.name
      operator: equals
      value: memory-request-violation
  output:
    msTeams:
      webHookUrl: https://rubikdatasolutions.webhook.office.com/webhookb2/09239cd8-9d59-9621-9217305bf6e22bdde-3ec2-4392-78e9f35a44fb/IncomingWebhook/92dcd2acdaee4e6cac125ac4a729e48f/631bd149-c89d-4d3b-8979-8e364b419/V23AwNxCZx9fToWpqDSYeRkQefDZ-cPn74pY60
    email:
      emailTargets:
        - iamgroot@tmdc.io
```
Apply the Pager Resource.
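Again with placeholder path and workspace:

```bash
dataos-ctl apply -f /office/pager/pod_memory_request_pager.yaml -w <your-workspace>
```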
Get notified! Once memory usage exceeds 80% of the requested value, an incident is triggered and an alert is sent to the configured destination.