Cluster management

This topic provides an overview of Clusters in DataOS, their role in optimizing data engineering and analytics tasks, and how they utilize computational resources through query engines like Themis and Minerva. It covers creating Clusters with supported query engine types for efficient data processing and resource management.

Scenario

Imagine that your team runs complex queries and performs analytics on Snowflake data, so you need a Cluster that references the Snowflake Depot and scales efficiently. With DataOS Clusters, you choose between Themis (elastic, scalable) and Minerva (static, predictable). You decide on Themis for its flexibility, configure a manifest file, and apply it via the DataOS CLI.

Once active, your team starts querying directly from the Workbench. Themis dynamically scales with your workload, ensuring fast, efficient queries and resource optimization. Mission accomplished!

Quick concepts

In DataOS, a Cluster is a Resource that combines the necessary computational resources and configurations for various data engineering and analytics tasks. It utilizes the Compute Resource for processing power.

In a Cluster, multiple provisioned machines work together to handle and execute queries efficiently. These machines collaborate through a query engine, which distributes tasks among them, optimizing performance and resource usage.

Types of query engines supported by a Cluster

DataOS supports two types of query engines, each designed to manage and process incoming queries by coordinating the workload across the machines within the Cluster. This collaboration enables the system to manage large datasets and complex queries with improved speed and reliability, leveraging the combined processing power of all machines in the Cluster.

Themis

Themis is an elastic, distributed SQL query engine optimized for fast querying across large datasets. It dynamically adjusts resources based on workload demands, making it ideal for high-concurrency and scalable enterprise environments. Themis supports a wider range of data sources and offers advanced security features, making it suitable for dynamic, large-scale applications.

Minerva

Minerva is a static, high-performance query engine designed for analytical and exploratory workloads. It allows seamless querying across various data sources using a unified SQL interface. Minerva Clusters handle memory, I/O, CPU, and long-running queries efficiently, making them ideal for predictable data analysis on large datasets.

Prerequisites

To begin creating a Cluster, ensure the following prerequisites are met:

You must have the tags or use cases required to create a Cluster assigned to you.
Ensure that the Depot you are going to target is active.
Decide which query engine suits your use case.
Make sure you have already configured the Compute Resource.

Creating a Themis Cluster

Follow the steps given below.

Configure a manifest file containing the following code and update the details accordingly.

```yaml
# Resource meta section
name: themiscluster
version: v1
type: cluster
description: We are using this cluster to check the monitor and pager stuff with the help of themis cluster.
tags:
  - cluster
  - themis

# Cluster-specific section
cluster:
  compute: query-default
  type: themis

  # Themis-specific section
  themis:
    depots:
      - address: dataos://icebase
```

| Attribute | Description |
| --- | --- |
| `name` | The name of the Resource. Here it is `themiscluster`. |
| `version` | Specifies the version of the Resource. Here it is `v1`. |
| `type` | Defines the type of Resource. The value `cluster` indicates a Cluster Resource. |
| `description` | A brief description of the Cluster. Here it states that the Cluster is used for monitoring and pager tasks with Themis. |
| `tags` | A list of tags used for categorization. The tags `cluster` and `themis` mark the Resource as a Cluster that uses Themis. |
| `compute` | Names the Compute Resource that supplies processing power to the Cluster. Here it is `query-default`. |
| `type` (under `cluster`) | Specifies the query engine of the Cluster. The value `themis` indicates that the Cluster uses the Themis query engine. |
| `depots` | A list of Depots associated with the Cluster. |
| `address` | Specifies the address of a Depot. Here it points to `dataos://icebase`. |

Apply the Cluster manifest by executing the following command in the terminal.
```
dataos-ctl apply -f /home/cluster/cluster.yaml
```

Verify the Cluster by executing the `dataos-ctl get -t cluster -w public -r` command in the terminal.

```
       NAME       | VERSION |  TYPE   | WORKSPACE | STATUS |  RUNTIME  |    OWNER     
------------------|---------|---------|-----------|--------|-----------|--------------
  themiscluster   | v1      | cluster | public    | active | running:1 | iamgroot
```

Once the Cluster is created, you can find it in Workbench.
You can add as many Depots as you need to a Cluster.
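
A Cluster section that targets more than one Depot could be sketched as follows. This fragment would replace the `cluster` section of the Themis manifest above. The second address, `dataos://snowflakedepot`, is a hypothetical Depot name used only for illustration; substitute the address of a Depot that is active in your environment.

```yaml
cluster:
  compute: query-default
  type: themis
  themis:
    depots:
      - address: dataos://icebase
      # Hypothetical Depot name, shown only as an example
      - address: dataos://snowflakedepot
```

Each entry under `depots` is a separate list item with its own `address`.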

Creating a Minerva Cluster

Configure a manifest file containing the following code.

```yaml
# Resource meta section
name: minervatest
version: v1
type: cluster
description: testing
tags:
  - cluster

# Cluster-specific section
cluster:
  compute: query-default
  type: minerva

  # Minerva-specific section
  minerva:
    replicas: 1
    resources:
      limits:
        cpu: 2000m
        memory: 4Gi
      requests:
        cpu: 2000m
        memory: 4Gi
    depots:
      - address: dataos://icebase
```

| Attribute | Description |
| --- | --- |
| `name` | The name of the Resource. Here it is `minervatest`. |
| `version` | Specifies the version of the Resource. Here it is `v1`. |
| `type` | Defines the type of Resource. The value `cluster` indicates a Cluster Resource. |
| `description` | A brief description of the Cluster. Here it states that the Cluster is used for testing. |
| `tags` | A list of tags used for categorization. The tag `cluster` marks the Resource as a Cluster. |
| `compute` | Names the Compute Resource that supplies processing power to the Cluster. Here it is `query-default`. |
| `type` (under `cluster`) | Specifies the query engine of the Cluster. The value `minerva` indicates that the Cluster uses the Minerva query engine. |
| `replicas` | Defines the number of replicas for the Cluster. Here, `1` specifies a single instance. |
| `resources` | Specifies the CPU and memory configuration for the Cluster, as both limits and requests. |
| `limits` | Defines the maximum resources the Cluster can consume. Here, the CPU limit is `2000m` (2 CPUs) and the memory limit is `4Gi`. |
| `requests` | Defines the resources requested by the Cluster. Here, `2000m` (2 CPUs) and `4Gi` of memory, matching the limits. |
| `debug` | Defines the logging configuration for the Cluster. It is not set in the manifest above. It takes a `logLevel` for general logs (for example, `INFO`) and a `trinoLogLevel` for Trino logs (for example, `ERROR`). |
| `depots` | A list of Depots associated with the Cluster. |
| `address` | Specifies the address of a Depot. Here it points to `dataos://icebase`, a Depot in DataOS. |
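
The `debug` attribute described in the table does not appear in the manifest above. A minimal sketch of how it might be set is shown below. It assumes that `debug` sits directly under the `minerva` section, next to `replicas` and `depots`; that placement is an assumption, so check it against the Cluster attribute reference for your DataOS version. The two log levels are the values mentioned in the table.

```yaml
cluster:
  compute: query-default
  type: minerva
  minerva:
    replicas: 1
    # Assumed placement: directly under `minerva`
    debug:
      logLevel: INFO        # level for general logs
      trinoLogLevel: ERROR  # level for Trino logs
    depots:
      - address: dataos://icebase
```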

Apply the Cluster manifest using the following command in the terminal.
```
dataos-ctl apply -f /home/cluster/minerva_cluster.yaml
```

Verify the Cluster by executing the `dataos-ctl get -t cluster -w public -r` command in the terminal.

```
       NAME       | VERSION |  TYPE   | WORKSPACE | STATUS |  RUNTIME  |    OWNER     
------------------|---------|---------|-----------|--------|-----------|--------------
  minervatest     | v1      | cluster | public    | active | running:1 | iamgroot
```

Once the Cluster is created, you can find it in Workbench.
You can add as many Depots as you need to a Cluster.
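
Because Minerva is static, its capacity is whatever the manifest declares. The fragment below is a sketch of one possible variation on the manifest above, with two replicas and requests set lower than limits. It assumes that the `resources` block follows the usual Kubernetes-style semantics, where `requests` may be less than or equal to `limits`. The numbers are illustrative, not sizing recommendations.

```yaml
cluster:
  compute: query-default
  type: minerva
  minerva:
    replicas: 2          # two instances instead of one
    resources:
      limits:            # maximum each instance may consume
        cpu: 2000m
        memory: 4Gi
      requests:          # amount requested up front (assumed <= limits)
        cpu: 1000m
        memory: 2Gi
    depots:
      - address: dataos://icebase
```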