Lakehouse¶

Lakehouse is a DataOS Resource that merges Apache Iceberg table format with cloud object storage, yielding a fully managed storage architecture that blends the strengths of data lakes and data warehouses. It enables a novel approach to system design, incorporating features typically found in data warehouses—such as the creation of tables with defined schemas, data manipulation using a variety of tools, and sophisticated data management capabilities—directly on top of cost-effective cloud storage in open formats.

How to create and manage a Lakehouse?

Learn how to create and manage a Lakehouse in DataOS.

Create and manage a Lakehouse

How to configure the manifest file of Lakehouse?

Discover how to configure the manifest file of a Lakehouse by adjusting its attributes.

Lakehouse attributes

How to manage datasets in a Lakehouse?

Explore the CLI commands for performing DDL/DML operations on datasets in a Lakehouse.

Managing datasets in Lakehouse

How to use a Lakehouse in DataOS?

Explore examples showcasing the usage of Lakehouse Resource in various scenarios.

Lakehouse usage recipes

Key Features of a Lakehouse¶

The DataOS Lakehouse integrates essential features of Relational Data Warehouses with the scalability and adaptability of data lakes. Here's an outline of its core features:

Decoupled Storage from Compute: The Lakehouse architecture decouples storage from computational resources, permitting independent scaling. This enables handling larger datasets and more simultaneous users efficiently.
ACID Transactions Support: Essential for data integrity during simultaneous accesses, ACID transaction support ensures consistent and reliable data amidst concurrent operations.
Versatile Workload Management: Designed to facilitate a range of tasks from analytics to machine learning, the Lakehouse serves as a unified repository, streamlining data management.
Flexible Computing Environments: Supports a diverse array of cloud-native storage and processing environments, including DataOS native stacks like Flare, Soda, etc.
Openness and Standardization: Embracing open file formats like Parquet ensures efficient data retrieval across various tools and platforms.
Branching Capabilities: Employs Iceberg's branching features to support schema versioning and experimentation, enabling safe testing and iteration without affecting live data.

Architecture of a Lakehouse¶

The DataOS Lakehouse architecture comprises several layers that together form a cohesive environment for data management. The layers are described below:

Storage: Acts as the foundational storage layer, interfacing with cloud storage services (e.g., GCS, ABFSS, WASBS, Amazon S3). When the Lakehouse manifest is applied, a Depot Resource is created that abstracts the storage connection details, while credentials are referenced securely through Instance Secrets. The Lakehouse storage uses Parquet to handle large datasets efficiently and the Iceberg format for table metadata management.
Metastore: Provides access to metadata about the stored data through the Iceberg REST metastore. It exposes Iceberg catalogs, e.g. Hadoop and Hive, via REST metastore interfaces.
Query Engine: Provides the computing environment for running data queries and analytics. It supports the Themis Query Engine which is provisioned through the Cluster Resource.

Together, these layers form the DataOS Lakehouse architecture, which serves not only as a repository for vast amounts of data but also as a powerful component for data analysis and insights.

How to create and manage a Lakehouse?¶

Prerequisites¶

Before proceeding with the Lakehouse creation, ensure the following prerequisites are met:

Object Storage account

Data developers need access to an object storage solution. Ensure you have the storage credentials ready with 'Storage Admin' access level. The following object storage solutions are supported:

Azure Blob File System Storage (ABFSS)
Windows Azure Storage Blob Service (WASBS)
Amazon Simple Storage Service (Amazon S3)
Google Cloud Storage (GCS)

Access level permission

To set up a Lakehouse in DataOS, besides possessing an object storage account with appropriate permissions, you also require specific tags or use-cases that authorize you to create and manage a Lakehouse within DataOS.

Creating a Lakehouse¶

Create Instance Secrets¶

Instance Secrets are vital for securely storing sensitive information like data source credentials. These Instance-secrets ensure that credentials are kept safe in the Heimdall vault, making them accessible throughout the DataOS instance without exposing them directly in your Lakehouse manifest file. Here’s how you can create Instance-secrets:

Steps to Create Instance Secrets

Prepare the manifest file for Instance-secret: You need to create a manifest file (YAML configuration file) that contains the source credentials for your chosen object storage solution (ABFSS, WASBS, Amazon S3, or GCS). This file should also specify the level of access control (read-only ‘r’ or read-write ‘rw’) that the Lakehouse will have over the object storage. A sample Instance-secret manifest is provided below:

Sample Instance Secret manifest file

name: depotsecret-r # Resource name (mandatory)
version: v1 # Manifest version (mandatory)
type: instance-secret # Resource-type (mandatory)
tags: # Tags (optional)
  - just for practice
description: instance secret configuration # Description of Resource (optional)
layer: user
instance-secret: # Instance Secret mapping (mandatory)
  type: key-value-properties # Type of Instance-secret (mandatory)
  acl: r # Access control list (mandatory)
  data: # Data section mapping (mandatory)
    username: iamgroot
    password: yourpassword
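
The sample above can also be produced programmatically, which helps when you need one secret per access level. The following is a minimal Python sketch, not part of DataOS: the function name `render_instance_secret` is hypothetical, and it only emits the fields shown in the sample, rejecting any `acl` other than `r` or `rw`.

```python
def render_instance_secret(name, acl, data):
    """Return Instance Secret manifest text shaped like the sample above.

    Hypothetical helper for illustration; it is not a DataOS API.
    """
    if acl not in ("r", "rw"):
        raise ValueError(f"acl must be 'r' or 'rw', got {acl!r}")
    lines = [
        f"name: {name}",
        "version: v1",
        "type: instance-secret",
        "layer: user",
        "instance-secret:",
        "  type: key-value-properties",
        f"  acl: {acl}",
        "  data:",
    ]
    # Values are written verbatim; quote them yourself if they contain
    # characters that are special in YAML.
    lines += [f"    {key}: {value}" for key, value in data.items()]
    return "\n".join(lines) + "\n"


manifest_text = render_instance_secret(
    "depotsecret-r", "r", {"username": "iamgroot", "password": "yourpassword"}
)
print(manifest_text)
```

Calling the function once with `acl="r"` and once with `acl="rw"` yields the read-only and read-write secrets that a Lakehouse storage section typically references.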

You can refer to the following link to get the templates for the Instance-secret manifests for object stores.

Apply the manifest file using the DataOS CLI: Once your manifest file is ready, apply it using the DataOS Command Line Interface (CLI) with the following command:

Command

dataos-ctl resource apply -f ${manifest-file-path} -w ${workspace}

Example

dataos-ctl resource apply -f data_product/instance_secret.yaml -w curriculum

Alternate command

Command

dataos-ctl apply -f ${manifest-file-path} -w ${workspace}

Example

dataos-ctl apply -f ../data_product/instance_secret.yaml -w curriculum
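
When the apply step is scripted, building the command as an argument list avoids shell-quoting mistakes in file paths. This is a minimal Python sketch under that assumption; `apply_command` is a hypothetical helper, and it only assembles the two command forms shown above without running them.

```python
import shlex


def apply_command(manifest_path, workspace, use_resource_subcommand=True):
    """Assemble the apply command shown above as an argument list."""
    base = ["dataos-ctl", "resource", "apply"] if use_resource_subcommand else ["dataos-ctl", "apply"]
    return base + ["-f", manifest_path, "-w", workspace]


cmd = apply_command("data_product/instance_secret.yaml", "curriculum")
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```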

Verify Instance-secret creation: To ensure that your Instance-secret has been successfully created, you can verify it in two ways:

Check for the name of the newly created Instance-secret in the list of Instance-secrets you have created, using the resource get command:
```
dataos-ctl resource get -t instance-secret
```
Alternatively, retrieve the list of all Instance-secrets created by all users in the DataOS instance by appending the -a flag:
```
dataos-ctl resource get -t instance-secret -a
```
You can also access the details of any created Instance-secret through the DataOS GUI in the Resource tab of the Operations app.

For more information about Instance-secret, refer to the documentation: Instance-secret.

Draft a Lakehouse manifest file¶

Once you have created the Instance-secrets, it's time to create a Lakehouse by applying the Lakehouse manifest file using the DataOS CLI. The Lakehouse manifest file is divided into several sections, each specifying a different aspect of the Lakehouse. The sections are provided below:

Resource meta section
Lakehouse-specific section
- Storage section
- Metastore section
- Query Engine section

A sample Lakehouse manifest file is provided below; the sections that make up the various parts of the manifest file are described after that.

Sample Lakehouse manifest file

# Resource-meta section (1)
name: alphaomega
version: v1alpha
type: lakehouse
tags:
  - Iceberg
  - Azure
description: Icebase depot of storage-type S3
owner: iamgroot
layer: user

# Lakehouse-specific section (2)
lakehouse:
  type: iceberg
  compute: runnable-default
  iceberg:

    # Storage section (3)
    storage:
      depotName: alphaomega
      type: s3
      s3:
        bucket: dataos-lakehouse   
        relativePath: /test
      secrets:
        - name: alphaomega-r
          keys:
            - alphaomega-r
          allkeys: true 
        - name: alphaomega-rw
          keys:
            - alphaomega-rw
          allkeys: true  

    # Metastore section (4)
    metastore:
      type: "iceberg-rest-catalog"

    # Query engine section (5)
    queryEngine:
      type: themis

(1) Resource meta section within a manifest file comprises metadata attributes universally applicable to all Resource-types. To learn more about how to configure attributes within this section, refer to the link: Attributes of Resource meta section.
(2) Lakehouse-specific section within a manifest file comprises attributes specific to the Lakehouse Resource. This section is further subdivided into the Storage, Metastore, and Query Engine sections. To learn more about how to configure attributes of the Lakehouse-specific section, refer to the link: Attributes of Lakehouse-specific section.
(3) Storage section comprises attributes for storage configuration.
(4) Metastore section comprises attributes for metastore configuration.
(5) Query Engine section comprises attributes for query engine configuration.
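
Because the storage section refers to Instance Secrets by name, it can help to confirm that every referenced secret exists before applying the manifest. The following is a minimal Python sketch, assuming the manifest has already been loaded into a dictionary shaped like the sample above; the helper `missing_secrets` and the `existing` set are hypothetical and not part of DataOS.

```python
def missing_secrets(manifest, existing):
    """Return referenced Instance Secret names that are absent from `existing`."""
    storage = manifest["lakehouse"]["iceberg"]["storage"]
    referenced = [ref["name"] for ref in storage.get("secrets", [])]
    return [name for name in referenced if name not in existing]


# Dictionary mirroring the storage part of the sample manifest.
manifest = {
    "lakehouse": {
        "type": "iceberg",
        "iceberg": {
            "storage": {
                "depotName": "alphaomega",
                "type": "s3",
                "secrets": [{"name": "alphaomega-r"}, {"name": "alphaomega-rw"}],
            }
        },
    }
}

# Suppose only the read-only secret has been created so far.
print(missing_secrets(manifest, {"alphaomega-r"}))
```

The `existing` set could be filled from the names listed by the resource get command for Instance Secrets.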

Resource meta section

This section serves as the header of the manifest file, defining the overall characteristics of the Lakehouse Resource you wish to create. It includes attributes common to all types of Resources in DataOS. These attributes help DataOS in identifying, categorizing, and managing the Resource within its ecosystem. The code block below describes the attributes of this section:

Syntax

# Resource-meta section
name: ${resource-name} # mandatory
version: v1alpha # mandatory
type: lakehouse # mandatory
tags: # optional
  - ${tag1}
  - ${tag2}
description: ${description} # optional
owner: ${userid-of-owner} # optional
layer: user # optional

Example

# Resource-meta section
name: lakehouse-s3 # mandatory
version: v1alpha # mandatory
type: lakehouse # mandatory
tags: # optional
  - lakehouse
  - s3
description: The manifest file for Lakehouse Resource # optional
owner: iamgroot # optional
layer: user # optional

Refer to the Attributes of Resource meta section for more information about the various attributes in the Resource meta section.

Lakehouse-specific section

Following the Resource meta section, the Lakehouse-specific section contains configurations unique to the Lakehouse Resource.

Syntax

lakehouse:
  type: ${lakehouse-type} # mandatory
  compute: ${compute} # mandatory
  runAsApiKey: ${dataos-apikey} # optional
  runAsUser: ${user-id} # optional
  iceberg: # mandatory
    storage: # mandatory
      # storage section attributes
    metaStore: # optional
      # metastore section attributes
    queryEngine: # optional
      # query engine section attributes

Example

lakehouse:
  type: iceberg # mandatory 
  compute: query-default # mandatory 
  runAsApiKey: abcdefghijklmnopqrstuvwxyz # optional
  runAsUser: iamgroot # optional
  iceberg: # mandatory
    storage: 
      # storage section attributes
    metaStore:
      # metastore section attributes
    queryEngine:
      # query engine section attributes

Attribute	Data Type	Default Value	Possible Value	Requirement
`lakehouse`	mapping	none	none	mandatory
`type`	string	none	iceberg	mandatory
`compute`	string	none	valid query-type Compute Resource name	mandatory
`runAsApiKey`	string	api key of user applying the Lakehouse	any valid DataOS apikey	optional
`runAsUser`	string	user-id of owner	user-id of use-case assignee	optional
`iceberg`	mapping	none	none	mandatory
`storage`	mapping	none	valid storage configuration	mandatory
`metaStore`	mapping	none	valid metastore configuration	optional
`queryEngine`	mapping	none	valid query engine configuration	optional
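
The requirement column of the table above can be turned into a quick pre-flight check. This is a minimal Python sketch, assuming the `lakehouse` mapping has been loaded into a dictionary; the function name `validate_lakehouse_section` is hypothetical, and it checks only the attributes the table marks as mandatory plus the single permitted `type` value.

```python
# Paths the table marks as mandatory, relative to the `lakehouse` mapping.
MANDATORY_PATHS = [("type",), ("compute",), ("iceberg",), ("iceberg", "storage")]


def validate_lakehouse_section(section):
    """Return a list of problems found in the `lakehouse` mapping."""
    problems = []
    for path in MANDATORY_PATHS:
        node = section
        for key in path:
            if not isinstance(node, dict) or key not in node:
                problems.append("missing mandatory attribute: " + ".".join(path))
                break
            node = node[key]
    if "type" in section and section["type"] != "iceberg":
        problems.append("type must be 'iceberg'")
    return problems


ok_section = {"type": "iceberg", "compute": "query-default", "iceberg": {"storage": {}}}
bad_section = {"type": "delta", "iceberg": {}}

print(validate_lakehouse_section(ok_section))
print(validate_lakehouse_section(bad_section))
```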

This section is divided into three separate sections, each critical to the Lakehouse’s functionality:

Storage section
Metastore section
Query engine section

Storage section

This section of the Lakehouse manifest file specifies the connection to the underlying object storage solution (e.g., ABFSS, WASBS, Amazon S3, GCS). Instance-secrets enable the secure reference of sensitive data within the manifest. The Storage section's configurations facilitate the creation of a Depot, abstracting the storage setup and ensuring secured data access in the object storage solution. This setup varies across source systems, as detailed for each storage type below:

ABFSS

To set up a Lakehouse on top of an ABFSS source system, configure the storage with type: abfss. The code block below shows the storage section configuration for ABFSS:

Syntax

storage:
  depotName: ${depot-name} # optional
  type: abfss # mandatory
  abfss: # optional
    account: ${abfss-account} # optional
    container: ${container} # optional
    endpointSuffix: ${endpoint-suffix} # optional
    format: ${format} # optional
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional 
      key: ${secret-key} # optional
      keys: # optional 
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional

Example

storage:
  depotName: abfsslakehouse
  type: "abfss"
  abfss:
    account: abfssstorage
    container: lake01
    relativePath: "/dataos"
    format: ICEBERG
    endpointSuffix: dfs.core.windows.net
  secrets:
    - name: abfsslakehouse-rw
      keys:
        - abfsslakehouse-rw
      allkeys: true    
    - name: abfsslakehouse-r
      keys:
        - abfsslakehouse-r
      allkeys: true

The table below summarizes the attributes of 'abfss' storage configuration:

Attribute	Data Type	Default Value	Possible Value	Requirement
`storage`	mapping	none	none	mandatory
`depotName`	string	${lakehouse-name}0 ${workspace}0 storage	A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except for hyphens/dashes, are not allowed. The maximum length is 48 characters.	optional
`type`	string	none	abfss	mandatory
`abfss`	mapping	none	none	optional
`account`	string	none	valid ABFSS account	optional
`container`	string	none	valid container name	optional
`endpointSuffix`	string	none	valid endpoint suffix	optional
`format`	string	Iceberg	Iceberg	optional
`icebergCatalogType`	string	hadoop	hadoop, hive	optional
`metastoreType`	string	iceberg-rest-catalog	iceberg-rest-catalog	optional
`metastoreUrl`	string	none	valid URL	optional
`relativePath`	string	none	valid relative path	optional
`secrets`	list of mappings	none	none	mandatory
`name`	string	none	valid Secret name	mandatory
`workspace`	string	none	valid Workspace name and must be less than '32' chars and conform to the following regex: `[a-z]([-a-z0-9]*[a-z0-9])?`	optional
`key`	string	none	valid key	optional
`keys`	list of strings	none	valid keys	optional
`allKeys`	boolean	false	true/false	optional
`consumptionType`	string	envVars	envVars, propFile	optional

Attributes of ABFSS storage configuration
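
The naming constraints for `depotName` and `workspace` in the table can be checked before a manifest is applied. This is a minimal Python sketch, assuming each regex must match the whole string; the function names are hypothetical. Note that the `depotName` pattern as written does not admit hyphens, even though the accompanying description mentions them, so this sketch follows the pattern literally.

```python
import re

DEPOT_NAME_PATTERN = re.compile(r"[a-z]([a-z0-9]*)")
WORKSPACE_PATTERN = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")


def valid_depot_name(name):
    """Pattern from the table, with a maximum length of 48 characters."""
    return len(name) <= 48 and DEPOT_NAME_PATTERN.fullmatch(name) is not None


def valid_workspace(name):
    """Pattern from the table, with a length below 32 characters."""
    return len(name) < 32 and WORKSPACE_PATTERN.fullmatch(name) is not None


for candidate in ("abfsslakehouse", "Abfss", "abfss-lakehouse"):
    print(candidate, valid_depot_name(candidate))

for candidate in ("curriculum", "public-", "a" * 32):
    print(candidate, valid_workspace(candidate))
```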

GCS

To set up a Lakehouse on top of a GCS source system, configure the storage with type: gcs. The code block below shows the storage section configuration for GCS:

Syntax

storage:
  depotName: ${depot-name} # optional
  type: gcs # mandatory
  gcs: # mandatory
    bucket: ${gcs-bucket} # mandatory
    format: ${format} # mandatory
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional 
      key: ${secret-key} # optional
      keys: # optional 
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional

Example

storage:
  depotName: gcslakehouse
  type: gcs
  gcs:
    bucket: gcsbucket
    relativePath: "/sanity"
    format: iceberg      
  secrets:
    - name: gcslakehouse-rw
      keys:
        - gcslakehouse-rw
      allkeys: true    
    - name: gcslakehouse-r
      keys:
        - gcslakehouse-r
      allkeys: true

The table below summarizes the attributes of 'gcs' storage configuration:

Attribute	Data Type	Default Value	Possible Value	Requirement
`storage`	mapping	none	none	mandatory
`depotName`	string	${lakehouse-name}0 ${workspace}0 storage	A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except for hyphens/dashes, are not allowed. The maximum length is 48 characters.	optional
`type`	string	none	gcs	mandatory
`gcs`	mapping	none	none	optional
`bucket`	string	none	valid GCS bucket name	optional
`format`	string	Iceberg	Iceberg	optional
`icebergCatalogType`	string	none	hadoop, hive	optional
`metastoreType`	string	iceberg-rest-catalog	iceberg-rest-catalog	optional
`metastoreUrl`	string	none	valid metastore URL	optional
`relativePath`	string	none	valid relative path	optional
`secrets`	list of mappings	none	none	mandatory
`name`	string	none	valid Secret name	mandatory
`workspace`	string	none	valid Workspace name and must be less than '32' chars and conform to the following regex: `[a-z]([-a-z0-9]*[a-z0-9])?`	optional
`key`	string	none	valid key	optional
`keys`	list of strings	none	valid keys	optional
`allKeys`	boolean	false	true/false	optional
`consumptionType`	string	envVars	envVars, propFile	optional

Attributes of GCS storage configuration

S3

To set up a Lakehouse on top of an S3 source system, configure the storage with type: s3. The code block below shows the storage section configuration for S3:

Syntax

storage:
  depotName: ${depot-name} # optional
  type: s3 # mandatory
  s3: # mandatory
    bucket: ${s3-bucket} # mandatory
    format: ${format} # mandatory
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
    scheme: ${scheme} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional 
      key: ${secret-key} # optional
      keys: # optional 
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional

Example

storage:
  depotName: s3test
  type: "s3"
  s3:
    bucket: lake001-dev
    relativePath: /sanitys3
  secrets:
    - name: s3test-rw
      keys:
        - s3test-rw
      allkeys: true    
    - name: s3test-r
      keys:
        - s3test-r
      allkeys: true

The table below summarizes the attributes of 's3' storage configuration:

Attribute	Data Type	Default Value	Possible Value	Requirement
`storage`	mapping	none	none	mandatory
`depotName`	string	${lakehouse-name}0 ${workspace}0 storage	A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except for hyphens/dashes, are not allowed. The maximum length is 48 characters.	optional
`type`	string	none	s3	mandatory
`s3`	mapping	none	none	optional
`bucket`	string	none	valid S3 bucket name	optional
`format`	string	Iceberg	Iceberg	optional
`icebergCatalogType`	string	none	hadoop, hive	optional
`metastoreType`	string	iceberg-rest-catalog	iceberg-rest-catalog	optional
`metastoreUrl`	string	none	valid URL	optional
`relativePath`	string	none	valid relative path	optional
`scheme`	string	none	valid scheme (e.g., s3://)	optional
`secrets`	list of mappings	none	none	mandatory
`name`	string	none	valid Secret name	mandatory
`workspace`	string	none	valid Workspace name and must be less than '32' chars and conform to the following regex: `[a-z]([-a-z0-9]*[a-z0-9])?`	optional
`key`	string	none	valid key	optional
`keys`	list of strings	none	valid keys	optional
`allKeys`	boolean	false	true/false	optional
`consumptionType`	string	envVars	envVars, propFile	optional

Attributes of S3 storage configuration

WASBS

To set up a Lakehouse on top of a WASBS source system, configure the storage with type: wasbs. The code block below shows the storage section configuration for WASBS:

Syntax

storage:
  depotName: ${depot-name} # optional
  type: wasbs # mandatory
  wasbs: # optional
    account: ${wasbs-account} # optional
    container: ${container} # optional
    endpointSuffix: ${endpoint-suffix} # optional
    format: ${format} # optional
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional 
      key: ${secret-key} # optional
      keys: # optional 
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional

Example

storage:
  depotName: wasbslakehouse
  type: "wasbs"
  wasbs:
    account: wasbsstorage
    container: lake01
    relativePath: "/dataos"
    format: ICEBERG
    endpointSuffix: dfs.core.windows.net
  secrets:
    - name: wasbslakehouse-rw
      keys:
        - wasbslakehouse-rw
      allkeys: true    
    - name: wasbslakehouse-r
      keys:
        - wasbslakehouse-r
      allkeys: true

The table below summarizes the attributes of 'wasbs' storage configuration:

Attribute	Data Type	Default Value	Possible Value	Requirement
`storage`	mapping	none	none	mandatory
`depotName`	string	${lakehouse-name}0 ${workspace}0 storage	A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except for hyphens/dashes, are not allowed. The maximum length is 48 characters.	optional
`type`	string	none	wasbs	mandatory
`wasbs`	mapping	none	none	optional
`account`	string	none	valid WASBS account	optional
`container`	string	none	valid container name	optional
`endpointSuffix`	string	none	valid endpoint suffix	optional
`format`	string	Iceberg	Iceberg	optional
`icebergCatalogType`	string	hadoop	hadoop, hive	optional
`metastoreType`	string	iceberg-rest-catalog	iceberg-rest-catalog	optional
`metastoreUrl`	string	none	valid URL	optional
`relativePath`	string	none	valid relative path	optional
`secrets`	list of mappings	none	none	mandatory
`name`	string	none	valid Secret name	mandatory
`workspace`	string	none	valid Workspace name and must be less than '32' chars and conform to the following regex: `[a-z]([-a-z0-9]*[a-z0-9])?`	optional
`key`	string	none	valid key	optional
`keys`	list of strings	none	valid keys	optional
`allKeys`	boolean	false	true/false	optional
`consumptionType`	string	envVars	envVars, propFile	optional

Attributes of WASBS storage configuration
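
Across the four storage types, the `type` value and the nested mapping are meant to agree (for example, `type: s3` alongside an `s3` mapping). This is a minimal Python sketch of that consistency check, assuming the storage section has been loaded into a dictionary; `check_storage` is a hypothetical helper, not a DataOS API.

```python
SUPPORTED_TYPES = {"abfss", "gcs", "s3", "wasbs"}


def check_storage(storage):
    """Return a list of problems found in a storage section dictionary."""
    problems = []
    storage_type = storage.get("type")
    if storage_type not in SUPPORTED_TYPES:
        problems.append(f"unsupported storage type: {storage_type!r}")
    # A mapping for a different storage type is most likely a copy-paste slip.
    for other in sorted(SUPPORTED_TYPES - {storage_type}):
        if other in storage:
            problems.append(f"'{other}' mapping present but type is {storage_type!r}")
    if not storage.get("secrets"):
        problems.append("no secrets referenced")
    return problems


good = {
    "depotName": "s3test",
    "type": "s3",
    "s3": {"bucket": "lake001-dev", "relativePath": "/sanitys3"},
    "secrets": [{"name": "s3test-rw"}, {"name": "s3test-r"}],
}
mismatched = {"type": "gcs", "s3": {"bucket": "lake001-dev"}}

print(check_storage(good))
print(check_storage(mismatched))
```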

Metastore section

This section outlines the metastore configuration, which manages metadata for the data stored in the Lakehouse storage. It includes the metastore service type and detailed setup instructions.

Configurations range from simple, requiring just the metastore type (e.g., iceberg-rest-catalog), to complex, incorporating additional features for enhanced scalability and performance. Advanced configurations may detail the number of replicas, autoscaling capabilities, and specific resource allocations.

Basic configuration

Syntax

lakehouse:
  metastore:
    type: ${metastore-type}

Example

lakehouse:
  metastore:
    type: iceberg-rest-catalog

The table below elucidates the basic configuration attributes of Metastore section:

Attribute	Data Type	Default Value	Possible Value	Requirement
`metastore`	mapping	none	none	optional
`type`	string	none	iceberg-rest-catalog	mandatory

Basic configuration attributes of Metastore section

Advanced configuration

Syntax

metastore:
  type: ${metastore-type} # mandatory
  replicas: ${number-of-replicas}
  autoScaling:
    enabled: ${enable-autoscaling}
    minReplicas: ${minimum-number-of-replicas}
    maxReplicas: ${maximum-number-of-replicas}
    targetMemoryUtilizationPercentage: ${target-memory-utilization-percentage}
    targetCPUUtilizationPercentage: ${target-cpu-utilization-percentage}
  resources:
    requests:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
    limits:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}

Example

metastore:
  type: iceberg-rest-catalog # mandatory
  replicas: 2
  autoScaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 4
    targetMemoryUtilizationPercentage: 60
    targetCPUUtilizationPercentage: 60
  resources:
    requests:
      cpu: 400m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi

The table below elucidates the advanced configuration attributes of Metastore section:

Attribute	Data Type	Default Value	Possible Value	Requirement
`metastore`	mapping	none	none	optional
`type`	string	none	iceberg-rest-catalog	mandatory
`replicas`	integer	none	any valid positive integer	optional
`autoScaling`	mapping	none	none	optional
`enabled`	boolean	false	true/false	optional
`minReplicas`	integer	none	any valid integer	optional
`maxReplicas`	integer	none	any valid integer greater than `minReplicas`	optional
`targetMemoryUtilizationPercentage`	integer	none	any valid percentage	optional
`targetCPUUtilizationPercentage`	integer	none	any valid percentage	optional
`resources`	mapping	none	none	optional
`requests`	mapping	none	none	optional
`limits`	mapping	none	none	optional
`cpu`	string	none	any valid resource amount	optional
`memory`	string	none	any valid resource amount	optional

Advanced configuration attributes of Metastore section
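
The autoscaling constraints in the table can be sanity-checked before applying the manifest. This is a minimal Python sketch, assuming the `autoScaling` mapping has been loaded into a dictionary; `check_autoscaling` is a hypothetical helper, and the accepted percentage range (above 0 and up to 100) is an assumption, since the table only says "any valid percentage".

```python
def check_autoscaling(auto):
    """Return a list of problems found in an autoScaling mapping."""
    problems = []
    min_replicas = auto.get("minReplicas")
    max_replicas = auto.get("maxReplicas")
    # The table asks for maxReplicas to be greater than minReplicas.
    if min_replicas is not None and max_replicas is not None and max_replicas <= min_replicas:
        problems.append("maxReplicas must be greater than minReplicas")
    for key in ("targetMemoryUtilizationPercentage", "targetCPUUtilizationPercentage"):
        value = auto.get(key)
        if value is not None and not 0 < value <= 100:
            problems.append(f"{key} must be above 0 and at most 100")
    return problems


example = {
    "enabled": True,
    "minReplicas": 2,
    "maxReplicas": 4,
    "targetMemoryUtilizationPercentage": 60,
    "targetCPUUtilizationPercentage": 60,
}
print(check_autoscaling(example))
print(check_autoscaling({"minReplicas": 4, "maxReplicas": 2, "targetCPUUtilizationPercentage": 150}))
```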

Query Engine section

The query engine section facilitates the creation of a Cluster Resource, enabling data queries against the Lakehouse storage. Currently, only the Themis query engine is supported.

Basic configurations might be adequate for standard use cases, outlining merely the type of query engine. For environments demanding more precise resource management, advanced configurations offer customization options, including specific CPU and memory requests and limits, to ensure the query engine operates efficiently within set resource constraints.

Basic configuration

Syntax

queryEngine:
  type: ${query-engine-type}

Example

queryEngine:
  type: themis

The table below elucidates the basic configuration attributes of Query Engine section:

Attribute	Data Type	Default Value	Possible Value	Requirement
`queryEngine`	mapping	none	none	optional
`type`	string	none	themis	mandatory

Basic configuration attributes of Query Engine section

Advanced configuration

Syntax

queryEngine:
  type: ${query-engine-type} # mandatory
  resources:
    requests:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
    limits:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
  themis:
    envs:
      ${environment-variables}
    themisConf:
      ${themis-configuration}
    spark:
      driver: 
        resources:
          requests:
            cpu: ${requested-cpu-resource}
            memory: ${requested-memory-resource}
          limits:
            cpu: ${requested-cpu-resource}
            memory: ${requested-memory-resource}
        instanceCount: ${instance-count} # mandatory
        maxInstanceCount: ${max-instance-count} # mandatory
      executor:
        resources:
          requests:
            cpu: ${requested-cpu-resource}
            memory: ${requested-memory-resource}
          limits:
            cpu: ${requested-cpu-resource}
            memory: ${requested-memory-resource}
        instanceCount: ${instance-count} # mandatory
        maxInstanceCount: ${max-instance-count} # mandatory
      sparkConf:
        ${spark-configuration}
  storageAcl: ${storage-acl} # mandatory

Example

queryEngine:
  type: themis # mandatory
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  themis:
    envs:
      alpha: beta
    themisConf:
      "kyuubi.frontend.thrift.binary.bind.host": "0.0.0.0"
      "kyuubi.frontend.thrift.binary.bind.port": "10101"
    spark:
      driver: 
        resources:
          requests:
            cpu: 400m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
        instanceCount: 2 # mandatory
        maxInstanceCount: 3 # mandatory
      executor:
        resources:
          requests:
            cpu: 400m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
        instanceCount: 2 # mandatory
        maxInstanceCount: 3 # mandatory
      sparkConf:
        spark.dynamicAllocation.enabled: true
  storageAcl: r # mandatory

The table below elucidates the advanced configuration attributes of Query Engine section:

Attribute	Data Type	Default Value	Possible Value	Requirement
`queryEngine`	mapping	none	none	mandatory
`type`	string	none	themis	mandatory
`resources`	mapping	none	none	optional
`requests`	mapping	none	none	optional
`cpu`	string	none	any valid CPU resource amount	optional
`memory`	string	none	any valid memory resource amount	optional
`limits`	mapping	none	none	optional
`cpu`	string	none	any valid CPU resource limit	optional
`memory`	string	none	any valid memory resource limit	optional
`themis`	mapping	none	none	optional
`envs`	mapping	none	none	optional
`themisConf`	mapping	none	none	optional
`spark`	mapping	none	none	mandatory
`driver`	mapping	none	none	mandatory
`memory`	string	none	any valid memory amount	mandatory
`cpu`	string	none	any valid CPU resource	mandatory
`executor`	mapping	none	none	mandatory
`memory`	string	none	any valid memory amount	mandatory
`cpu`	string	none	any valid CPU resource	mandatory
`instanceCount`	integer	none	any valid integer	mandatory
`maxInstanceCount`	integer	none	any valid integer	mandatory
`sparkConf`	mapping	none	none	optional

Advanced configuration attributes of Query Engine section
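
Values such as `1000m` for CPU and `2Gi` for memory follow Kubernetes-style quantity notation: the `m` suffix means thousandths of a CPU core, and `Gi` is a binary (1024-based) memory unit. This is a minimal Python sketch for checking that requests do not exceed limits, assuming that notation applies here; the helper names are hypothetical, and only a subset of suffixes is handled.

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def cpu_cores(quantity):
    """Convert a CPU quantity such as '1000m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)  # raises ValueError for memory-style values like '1Gi'


def memory_bytes(quantity):
    """Convert a memory quantity such as '2Gi' or a plain byte count to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # raises ValueError for CPU-style values like '400m'


def requests_within_limits(resources):
    """True when both CPU and memory requests are at or below their limits."""
    requests, limits = resources["requests"], resources["limits"]
    return (
        cpu_cores(requests["cpu"]) <= cpu_cores(limits["cpu"])
        and memory_bytes(requests["memory"]) <= memory_bytes(limits["memory"])
    )


resources = {
    "requests": {"cpu": "1000m", "memory": "2Gi"},
    "limits": {"cpu": "2000m", "memory": "4Gi"},
}
print(requests_within_limits(resources))
```

Because each parser rejects the other kind of unit, a manifest in which CPU and memory values have been swapped fails fast with a `ValueError` instead of being applied.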

Apply the Lakehouse manifest¶

After creating the manifest file for the Lakehouse Resource, it's time to apply it to instantiate the Resource-instance in the DataOS environment. To apply the Lakehouse manifest file, utilize the apply command.

Command

dataos-ctl apply -f ${manifest-file-path} -w ${workspace}

Example

dataos-ctl apply -f dataproducts/new-lakehouse.yaml -w curriculum

The links provided below showcase the process of creating Lakehouse for a particular data source:

Managing a Lakehouse¶

Verify Lakehouse Creation¶

To ensure that your Lakehouse has been successfully created, you can verify it in two ways:

Check for the name of the newly created Lakehouse in the list of Lakehouses created by you in a particular Workspace:

dataos-ctl get -t lakehouse -w ${workspace name}

Sample

dataos-ctl get -t lakehouse -w curriculum

Alternatively, retrieve the list of all Lakehouses created in the Workspace by appending the -a flag:

dataos-ctl get -t lakehouse -w ${workspace name} -a
# Sample
dataos-ctl get -t lakehouse -w curriculum -a

You can also access the details of any created Lakehouse through the DataOS GUI in the Resource tab of the Operations app.

Deleting a Lakehouse¶

Use the delete command to remove the specific Lakehouse Resource Instance from the DataOS environment. As shown below, there are three ways to delete a Lakehouse.

Method 1: Copy the Lakehouse name, version, Resource-type, and Workspace name from the output of the get command, separate them with '|', enclose the result in quotes, and pass it as a string to the delete command.

Command

dataos-ctl delete -i "${identifier string}"

Example

dataos-ctl delete -i "cnt-lakehouse-demo-01 | v1alpha | lakehouse | public"

Output

INFO[0000] 🗑 delete...
INFO[0001] 🗑 deleting(public) cnt-lakehouse-demo-01:v1alpha:lakehouse...
INFO[0003] 🗑 deleting(public) cnt-lakehouse-demo-01:v1alpha:lakehouse...deleted
INFO[0003] 🗑 delete...complete
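
The identifier string used by Method 1 is simply the four fields joined with ` | `. This is a minimal Python sketch for building it in a script; `delete_identifier` is a hypothetical helper name.

```python
def delete_identifier(name, version, resource_type, workspace):
    """Join the four fields in the order expected by `dataos-ctl delete -i`."""
    return " | ".join([name, version, resource_type, workspace])


identifier = delete_identifier("cnt-lakehouse-demo-01", "v1alpha", "lakehouse", "public")
print(f'dataos-ctl delete -i "{identifier}"')
```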

Method 2: Specify the path of the YAML file and use the delete command.

Command

dataos-ctl delete -f ${manifest-file-path}

Example

dataos-ctl delete -f /home/desktop/connect-city/config_v1alpha.yaml

Output

INFO[0000] 🗑 delete...
INFO[0000] 🗑 deleting(public) cnt-lakehouse-demo-010:v1alpha:lakehouse...
INFO[0001] 🗑 deleting(public) cnt-lakehouse-demo-010:v1alpha:lakehouse...deleted
INFO[0001] 🗑 delete...complete

Method 3: Specify the Workspace, Resource-type, and Lakehouse name in the delete command.

Command

dataos-ctl delete -w ${workspace} -t lakehouse -n ${lakehouse name}

Example

dataos-ctl delete -w public -t lakehouse -n cnt-city-demo-010

Output

INFO[0000] 🗑 delete...
INFO[0000] 🗑 deleting(public) cnt-city-demo-010:v1alpha:lakehouse...
INFO[0001] 🗑 deleting(public) cnt-city-demo-010:v1alpha:lakehouse...deleted
INFO[0001] 🗑 delete...complete
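
In automation, the captured output of the delete command can be checked for the final `delete...complete` line shown in the outputs above. This is a minimal Python sketch, assuming the log format stays as shown in this document; `delete_completed` is a hypothetical helper, and a production script would more reliably depend on the command's exit status.

```python
def delete_completed(output):
    """True when the last non-empty log line ends with 'delete...complete'."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return bool(lines) and lines[-1].endswith("delete...complete")


# Log text copied from the Method 1 output above.
sample_output = """\
INFO[0000] 🗑 delete...
INFO[0001] 🗑 deleting(public) cnt-lakehouse-demo-01:v1alpha:lakehouse...
INFO[0003] 🗑 deleting(public) cnt-lakehouse-demo-01:v1alpha:lakehouse...deleted
INFO[0003] 🗑 delete...complete
"""

print(delete_completed(sample_output))
```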

How to configure the manifest file of Lakehouse?¶

The Attributes of Lakehouse manifest define the key properties and configurations that can be used to specify and customize Lakehouse Resources within a manifest file. These attributes allow data developers to define the structure and behavior of their Lakehouse Resources. For comprehensive information on each attribute and its usage, please refer to the link: Attributes of Lakehouse manifest.

How to manage Lakehouse Resource and datasets using CLI?¶

This section provides a comprehensive guide for managing Lakehouse Resource and inspecting datasets stored in Lakehouse storage. Utilizing the dataset command, users can perform a wide array of Data Definition Language (DDL)-related tasks, streamlining operations such as adding or removing columns, editing dataset metadata, and listing snapshots, among others. To learn more about these commands, refer to the link: Lakehouse Command Reference.
