
Lakehouse

Lakehouse is a DataOS Resource that merges Apache Iceberg table format with cloud object storage, yielding a fully managed storage architecture that blends the strengths of data lakes and data warehouses. It enables a novel approach to system design, incorporating features typically found in data warehouses—such as the creation of tables with defined schemas, data manipulation using a variety of tools, and sophisticated data management capabilities—directly on top of cost-effective cloud storage in open formats.

  • How to create and manage a Lakehouse?


    Learn how to create and manage a Lakehouse in DataOS.

    Create and manage a Lakehouse

  • How to configure the manifest file of Lakehouse?


    Discover how to configure the manifest file of a Lakehouse by adjusting its attributes.

    Lakehouse attributes

  • How to manage datasets in a Lakehouse?


    Various CLI commands related to performing DDL/DML operations on datasets in a Lakehouse.

    Managing datasets in Lakehouse

  • How to use a Lakehouse in DataOS?


    Explore examples showcasing the usage of Lakehouse Resource in various scenarios.

    Lakehouse usage recipes

Key Features of a Lakehouse

The DataOS Lakehouse integrates essential features of Relational Data Warehouses with the scalability and adaptability of data lakes. Here's an outline of its core features:

  • Decoupled Storage from Compute: The Lakehouse architecture decouples storage from computational resources, permitting independent scaling. This enables handling larger datasets and more simultaneous users efficiently.
  • ACID Transactions Support: Essential for data integrity during simultaneous accesses, ACID transaction support ensures consistent and reliable data amidst concurrent operations.
  • Versatile Workload Management: Designed to facilitate a range of tasks from analytics to machine learning, the Lakehouse serves as a unified repository, streamlining data management.
  • Flexible Computing Environments: Supports a diverse array of cloud-native storage and processing environments, including DataOS native stacks such as Flare and Soda.
  • Openness and Standardization: Embracing open file formats like Parquet ensures efficient data retrieval across various tools and platforms.
  • Branching Capabilities: Employs Iceberg's branching features to support schema versioning and experimentation, enabling safe testing and iteration without affecting live data.

Architecture of a Lakehouse

DataOS Lakehouse architecture comprises several layers that together form a cohesive environment for data management. The layers are described below:

  • Storage: Acts as the foundational storage layer, interfacing with cloud object storage services (e.g., GCS, ABFSS, WASBS, Amazon S3). Applying a Lakehouse creates a Depot Resource that abstracts the storage connection details, while credentials are securely referenced using Instance Secrets. Lakehouse storage uses Parquet for efficiently handling large datasets and the Iceberg format for table metadata management.
  • Metastore: Provides access to metadata about the stored data through the Iceberg REST metastore. It exposes Iceberg catalogs (e.g., Hadoop and Hive) via REST metastore interfaces, enabling metadata management.
  • Query Engine: Provides the computing environment for running data queries and analytics. It supports the Themis query engine, which is provisioned through a Cluster Resource.

Together, these layers form the DataOS Lakehouse architecture, making it not only a repository for vast amounts of data but also a powerful component for data analysis and insights.
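
These layers map directly onto the sections of a Lakehouse manifest. The minimal sketch below uses placeholder values and is condensed from the full sample later on this page, only to illustrate the mapping:

name: my-lakehouse            # placeholder name
version: v1alpha
type: lakehouse
lakehouse:
  type: iceberg
  compute: runnable-default
  iceberg:
    storage:                  # Storage layer: cloud object store (Parquet data, Iceberg tables)
      type: s3
    metastore:                # Metastore layer: Iceberg REST catalog
      type: iceberg-rest-catalog
    queryEngine:              # Query Engine layer: Themis, provisioned via a Cluster
      type: themis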

How to create and manage a Lakehouse?

Prerequisites

Before proceeding with the Lakehouse creation, ensure the following prerequisites are met:

Object Storage account

Data developers need access to an object storage solution. Ensure you have the storage credentials ready with 'Storage Admin' access level. The following object storage solutions are supported:

  • Azure Blob File System Storage (ABFSS)
  • Windows Azure Storage Blob Service (WASBS)
  • Amazon Simple Storage Service (Amazon S3)
  • Google Cloud Storage (GCS)

Access level permission

To set up a Lakehouse in DataOS, besides possessing an object storage account with appropriate permissions, you also require specific tags or use-cases that authorize you to create and manage a Lakehouse within DataOS.
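
If you are unsure which tags or use-cases you hold, one way to check (assuming dataos-ctl is installed and you are logged in) is the user get command; if a required tag is missing, ask a DataOS operator to grant it:

dataos-ctl user get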

Creating a Lakehouse

Create Instance Secrets

Instance Secrets are vital for securely storing sensitive information like data source credentials. These Instance-secrets ensure that credentials are kept safe in the Heimdall vault, making them accessible throughout the DataOS instance without exposing them directly in your Lakehouse manifest file. Here’s how you can create Instance-secrets:

Steps to Create Instance Secrets

  • Prepare the manifest file for Instance-secret: You need to create a manifest file (YAML configuration file) that contains the source credentials for your chosen object storage solution (ABFSS, WASBS, Amazon S3, or GCS). This file should also specify the level of access control (read-only ‘r’ or read-write ‘rw’) that the Lakehouse will have over the object storage. A sample Instance-secret manifest is provided below:

    Sample Instance Secret manifest file
    name: depotsecret-r # Resource name (mandatory)
    version: v1 # Manifest version (mandatory)
    type: instance-secret # Resource-type (mandatory)
    tags: # Tags (optional)
      - just for practice
    description: instance secret configuration # Description of Resource (optional)
    layer: user
    instance-secret: # Instance Secret mapping (mandatory)
      type: key-value-properties # Type of Instance-secret (mandatory)
      acl: r # Access control list (mandatory)
      data: # Data section mapping (mandatory)
        username: iamgroot
        password: yourpassword
    

    You can refer to the following link to get the templates for the Instance-secret manifests for object stores.

  • Applying the Manifest file using DataOS CLI: Once your manifest file is ready, you can apply it using the DataOS Command Line Interface (CLI) with the following command:

    dataos-ctl resource apply -f ${manifest-file-path} -w ${workspace}
    
    dataos-ctl resource apply -f data_product/instance_secret.yaml -w curriculum
    

    Alternate command

    dataos-ctl apply -f ${manifest-file-path} -w ${workspace}
    
    dataos-ctl apply -f ../data_product/instance_secret.yaml -w curriculum
    
  • Verify Instance-secret creation: To ensure that your Instance-secret has been successfully created, you can verify it in two ways:

    Check the name of the newly created Instance-secret in the list of Instance-secrets created by you, using the resource get command:

    dataos-ctl resource get -t instance-secret
    

    Alternatively, retrieve the list of all Instance-secrets created by all users in a DataOS instance by appending the -a flag:

    dataos-ctl resource get -t instance-secret -a
    

    You can also access the details of any created Instance-secret through the DataOS GUI in the Resource tab of the Operations app.

For more information about Instance-secret, refer to the documentation: Instance-secret.
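
For object stores, the data section typically carries the store's credential keys rather than a username/password pair. The sketch below shows a read-only Instance-secret named to match the alphaomega Lakehouse sample later on this page; the key names under data are illustrative assumptions for Amazon S3, so use the object-store templates linked above for the exact fields. A read-write counterpart is identical except for acl: rw and an -rw suffix in the name.

name: alphaomega-r # referred to by the Lakehouse sample below
version: v1
type: instance-secret
description: read-only credentials for the Lakehouse object store
layer: user
instance-secret:
  type: key-value-properties
  acl: r
  data: # illustrative key names; consult the object-store templates for the exact fields
    accesskeyid: ${access-key-id}
    awssecretaccesskey: ${secret-access-key}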

Draft a Lakehouse manifest file

Once you have created the Instance-secrets, it's time to create a Lakehouse by applying the Lakehouse manifest file using the DataOS CLI. The Lakehouse manifest file is divided into several sections, each responsible for specifying a different aspect of the Lakehouse. The sections are provided below:

  • Resource meta section
  • Lakehouse-specific section
    • Storage section
    • Metastore section
    • Query Engine section

A sample Lakehouse manifest file is provided below; the sections that make up the various parts of the manifest file are described after that.

Sample Lakehouse manifest file
# Resource-meta section (1)
name: alphaomega
version: v1alpha
type: lakehouse
tags:
  - Iceberg
  - S3
description: Lakehouse of storage-type S3
owner: iamgroot
layer: user

# Lakehouse-specific section (2)
lakehouse:
  type: iceberg
  compute: runnable-default
  iceberg:

    # Storage section (3)
    storage:
      depotName: alphaomega
      type: s3
      s3:
        bucket: dataos-lakehouse   
        relativePath: /test
      secrets:
        - name: alphaomega-r
          keys:
            - alphaomega-r
          allkeys: true 
        - name: alphaomega-rw
          keys:
            - alphaomega-rw
          allkeys: true  

    # Metastore section (4)
    metastore:
      type: "iceberg-rest-catalog"

    # Query engine section (5)
    queryEngine:
      type: themis
  1. Resource meta section within a manifest file comprises metadata attributes universally applicable to all Resource-types. To learn more about how to configure attributes within this section, refer to the link: Attributes of Resource meta section.

  2. Lakehouse-specific section within a manifest file comprises attributes specific to the Lakehouse Resource. This section is further subdivided into the Storage, Metastore, and Query Engine sections. To learn more about how to configure attributes of the Lakehouse-specific section, refer to the link: Attributes of Lakehouse-specific section.

  3. Storage section comprises attributes for storage configuration.

  4. Metastore section comprises attributes for metastore configuration.

  5. Query Engine section comprises attributes for query engine configuration.

Resource meta section

This section serves as the header of the manifest file, defining the overall characteristics of the Lakehouse Resource you wish to create. It includes attributes common to all types of Resources in DataOS. These attributes help DataOS in identifying, categorizing, and managing the Resource within its ecosystem. The code block below describes the attributes of this section:

# Resource-meta section
name: ${resource-name} # mandatory
version: v1alpha # mandatory
type: lakehouse # mandatory
tags: # optional
  - ${tag1}
  - ${tag2}
description: ${description} # optional
owner: ${userid-of-owner} # optional
layer: user # optional

Sample

# Resource-meta section
name: lakehouse-s3 # mandatory
version: v1alpha # mandatory
type: lakehouse # mandatory
tags: # optional
  - lakehouse
  - s3
description: The manifest file for Lakehouse Resource # optional
owner: iamgroot # optional
layer: user # optional

Refer to the Attributes of Resource meta section for more information about the various attributes in the Resource meta section.

Lakehouse-specific section

Following the Resource meta section, the Lakehouse-specific section contains configurations unique to the Lakehouse Resource.

lakehouse:
  type: ${lakehouse-type} # mandatory 
  compute: ${compute} # mandatory 
  runAsApiKey: ${dataos-apikey} # optional
  runAsUser: ${user-id} # optional
  iceberg: # mandatory
    storage: 
      # storage section attributes
    metaStore: 
      # metastore section attributes
    queryEngine:
      # query engine section attributes

Sample

lakehouse:
  type: iceberg # mandatory 
  compute: query-default # mandatory 
  runAsApiKey: abcdefghijklmnopqrstuvwxyz # optional
  runAsUser: iamgroot # optional
  iceberg: # mandatory
    storage: 
      # storage section attributes
    metaStore:
      # metastore section attributes
    queryEngine:
      # query engine section attributes
| Attribute   | Data Type | Default Value                          | Possible Value                         | Requirement |
|-------------|-----------|----------------------------------------|----------------------------------------|-------------|
| lakehouse   | mapping   | none                                   | none                                   | mandatory   |
| type        | string    | none                                   | iceberg                                | mandatory   |
| compute     | string    | none                                   | valid query-type Compute Resource name | mandatory   |
| runAsApiKey | string    | api key of user applying the Lakehouse | any valid DataOS apikey                | optional    |
| runAsUser   | string    | user-id of owner                       | user-id of use-case assignee           | optional    |
| iceberg     | mapping   | none                                   | none                                   | mandatory   |
| storage     | mapping   | none                                   | valid storage configuration            | mandatory   |
| metaStore   | mapping   | none                                   | valid metastore configuration          | optional    |
| queryEngine | mapping   | none                                   | valid query engine configuration       | optional    |

This section is divided into three separate sections, each critical to the Lakehouse’s functionality:

  • Storage section
  • Metastore section
  • Query engine section

Storage section

This section of the Lakehouse manifest file specifies the connection to the underlying object storage solution (e.g., ABFSS, WASBS, Amazon S3, GCS). Instance-secrets enable the secure reference of sensitive data within the manifest. The Storage section's configurations facilitate the creation of a Depot, abstracting the storage setup and ensuring secure data access in the object storage solution. This setup varies across source systems, as detailed in the sections below:

To set up a Lakehouse on top of an ABFSS source system, you need to configure the storage section with type: abfss. The code block below elucidates the storage section configuration for ABFSS:

storage:
  depotName: ${depot-name} # optional
  type: abfss # mandatory
  abfss: # optional
    account: ${abfss-account} # optional
    container: ${container} # optional
    endpointSuffix: ${endpoint-suffix} # optional
    format: ${format} # optional
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional 
      key: ${secret-key} # optional
      keys: # optional 
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional

Sample

storage:
  depotName: abfsslakehouse
  type: "abfss"
  abfss:
    account: abfssstorage
    container: lake01
    relativePath: "/dataos"
    format: ICEBERG
    endpointSuffix: dfs.core.windows.net
  secrets:
    - name: abfsslakehouse-rw
      keys:
        - abfsslakehouse-rw
      allkeys: true    
    - name: abfsslakehouse-r
      keys:
        - abfsslakehouse-r
      allkeys: true 

The table below summarizes the attributes of 'abfss' storage configuration:

| Attribute          | Data Type       | Default Value                          | Possible Value                                                                                                                               | Requirement |
|--------------------|-----------------|----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| storage            | mapping         | none                                   | none                                                                                                                                         | mandatory   |
| depotName          | string          | ${lakehouse-name}0${workspace}0storage | a valid string matching the regex [a-z]([a-z0-9]*); special characters, except hyphens/dashes, are not allowed; maximum length 48 characters | optional    |
| type               | string          | none                                   | abfss                                                                                                                                        | mandatory   |
| abfss              | mapping         | none                                   | none                                                                                                                                         | optional    |
| account            | string          | none                                   | valid ABFSS account                                                                                                                          | optional    |
| container          | string          | none                                   | valid container name                                                                                                                         | optional    |
| endpointSuffix     | string          | none                                   | valid endpoint suffix                                                                                                                        | optional    |
| format             | string          | Iceberg                                | Iceberg                                                                                                                                      | optional    |
| icebergCatalogType | string          | hadoop                                 | hadoop, hive                                                                                                                                 | optional    |
| metastoreType      | string          | iceberg-rest-catalog                   | iceberg-rest-catalog                                                                                                                         | optional    |
| metastoreUrl       | string          | none                                   | valid URL                                                                                                                                    | optional    |
| relativePath       | string          | none                                   | valid relative path                                                                                                                          | optional    |
| secrets            | mapping         | none                                   | none                                                                                                                                         | mandatory   |
| name               | string          | none                                   | valid Secret name                                                                                                                            | mandatory   |
| workspace          | string          | none                                   | valid Workspace name; less than 32 chars, conforming to the regex [a-z]([-a-z0-9]*[a-z0-9])?                                                 | optional    |
| key                | string          | none                                   | valid key                                                                                                                                    | optional    |
| keys               | list of strings | none                                   | valid keys                                                                                                                                   | optional    |
| allKeys            | boolean         | false                                  | true/false                                                                                                                                   | optional    |
| consumptionType    | string          | envVars                                | envVars, propFile                                                                                                                            | optional    |

Attributes of ABFSS storage configuration

To set up a Lakehouse on top of a GCS source system, you need to configure the storage section with type: gcs. The code block below elucidates the storage section configuration for GCS:

storage:
  depotName: ${depot-name} # optional
  type: gcs # mandatory
  gcs: # mandatory
    bucket: ${gcs-bucket} # mandatory
    format: ${format} # mandatory
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional 
      key: ${secret-key} # optional
      keys: # optional 
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional

Sample

storage:
  depotName: gcslakehouse
  type: gcs
  gcs:
    bucket: gcsbucket
    relativePath: "/sanity"
    format: iceberg      
  secrets:
    - name: gcslakehouse-rw
      keys:
        - gcslakehouse-rw
      allkeys: true    
    - name: gcslakehouse-r
      keys:
        - gcslakehouse-r
      allkeys: true 

The table below summarizes the attributes of 'gcs' storage configuration:

| Attribute          | Data Type       | Default Value                          | Possible Value                                                                                                                               | Requirement |
|--------------------|-----------------|----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| storage            | mapping         | none                                   | none                                                                                                                                         | mandatory   |
| depotName          | string          | ${lakehouse-name}0${workspace}0storage | a valid string matching the regex [a-z]([a-z0-9]*); special characters, except hyphens/dashes, are not allowed; maximum length 48 characters | optional    |
| type               | string          | none                                   | gcs                                                                                                                                          | mandatory   |
| gcs                | mapping         | none                                   | none                                                                                                                                         | optional    |
| bucket             | string          | none                                   | valid GCS bucket name                                                                                                                        | optional    |
| format             | string          | Iceberg                                | Iceberg                                                                                                                                      | optional    |
| icebergCatalogType | string          | none                                   | hadoop, hive                                                                                                                                 | optional    |
| metastoreType      | string          | iceberg-rest-catalog                   | iceberg-rest-catalog                                                                                                                         | optional    |
| metastoreUrl       | string          | none                                   | valid metastore URL                                                                                                                          | optional    |
| relativePath       | string          | none                                   | valid relative path                                                                                                                          | optional    |
| secrets            | mapping         | none                                   | none                                                                                                                                         | mandatory   |
| name               | string          | none                                   | valid Secret name                                                                                                                            | mandatory   |
| workspace          | string          | none                                   | valid Workspace name; less than 32 chars, conforming to the regex [a-z]([-a-z0-9]*[a-z0-9])?                                                 | optional    |
| key                | string          | none                                   | valid key                                                                                                                                    | optional    |
| keys               | list of strings | none                                   | valid keys                                                                                                                                   | optional    |
| allKeys            | boolean         | false                                  | true/false                                                                                                                                   | optional    |
| consumptionType    | string          | envVars                                | envVars, propFile                                                                                                                            | optional    |

Attributes of GCS storage configuration

To set up a Lakehouse on top of an S3 source system, you need to configure the storage section with type: s3. The code block below elucidates the storage section configuration for S3:

storage:
  depotName: ${depot-name} # optional
  type: s3 # mandatory
  s3: # mandatory
    bucket: ${s3-bucket} # mandatory
    format: ${format} # mandatory
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
    scheme: ${scheme} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional 
      key: ${secret-key} # optional
      keys: # optional 
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional

Sample

storage:
  depotName: s3test
  type: "s3"
  s3:
    bucket: lake001-dev
    relativePath: /sanitys3       
  secrets:
    - name: s3test-rw
      keys:
        - s3test-rw
      allkeys: true    
    - name: s3test-r
      keys:
        - s3test-r
      allkeys: true 

The table below summarizes the attributes of 's3' storage configuration:

| Attribute          | Data Type       | Default Value                          | Possible Value                                                                                                                               | Requirement |
|--------------------|-----------------|----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| storage            | mapping         | none                                   | none                                                                                                                                         | mandatory   |
| depotName          | string          | ${lakehouse-name}0${workspace}0storage | a valid string matching the regex [a-z]([a-z0-9]*); special characters, except hyphens/dashes, are not allowed; maximum length 48 characters | optional    |
| type               | string          | none                                   | s3                                                                                                                                           | mandatory   |
| s3                 | mapping         | none                                   | none                                                                                                                                         | optional    |
| bucket             | string          | none                                   | valid S3 bucket name                                                                                                                         | optional    |
| format             | string          | Iceberg                                | Iceberg                                                                                                                                      | optional    |
| icebergCatalogType | string          | none                                   | hadoop, hive                                                                                                                                 | optional    |
| metastoreType      | string          | iceberg-rest-catalog                   | iceberg-rest-catalog                                                                                                                         | optional    |
| metastoreUrl       | string          | none                                   | valid URL                                                                                                                                    | optional    |
| relativePath       | string          | none                                   | valid relative path                                                                                                                          | optional    |
| scheme             | string          | none                                   | valid scheme (e.g., s3://)                                                                                                                   | optional    |
| secrets            | mapping         | none                                   | none                                                                                                                                         | mandatory   |
| name               | string          | none                                   | valid Secret name                                                                                                                            | mandatory   |
| workspace          | string          | none                                   | valid Workspace name; less than 32 chars, conforming to the regex [a-z]([-a-z0-9]*[a-z0-9])?                                                 | optional    |
| key                | string          | none                                   | valid key                                                                                                                                    | optional    |
| keys               | list of strings | none                                   | valid keys                                                                                                                                   | optional    |
| allKeys            | boolean         | false                                  | true/false                                                                                                                                   | optional    |
| consumptionType    | string          | envVars                                | envVars, propFile                                                                                                                            | optional    |

Attributes of S3 storage configuration

To set up a Lakehouse on top of a WASBS source system, you need to configure the storage section with type: wasbs. The code block below elucidates the storage section configuration for WASBS:

storage:
  depotName: ${depot-name} # optional
  type: wasbs # mandatory
  wasbs: # optional
    account: ${wasbs-account} # optional
    container: ${container} # optional
    endpointSuffix: ${endpoint-suffix} # optional
    format: ${format} # optional
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional 
      key: ${secret-key} # optional
      keys: # optional 
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional

Sample

storage:
  depotName: wasbslakehouse
  type: "wasbs"
  wasbs:
    account: wasbsstorage
    container: lake01
    relativePath: "/dataos"
    format: ICEBERG
    endpointSuffix: dfs.core.windows.net
  secrets:
    - name: wasbslakehouse-rw
      keys:
        - wasbslakehouse-rw
      allkeys: true    
    - name: wasbslakehouse-r
      keys:
        - wasbslakehouse-r
      allkeys: true 

The table below summarizes the attributes of 'wasbs' storage configuration:

| Attribute          | Data Type       | Default Value                          | Possible Value                                                                                                                               | Requirement |
|--------------------|-----------------|----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| storage            | mapping         | none                                   | none                                                                                                                                         | mandatory   |
| depotName          | string          | ${lakehouse-name}0${workspace}0storage | a valid string matching the regex [a-z]([a-z0-9]*); special characters, except hyphens/dashes, are not allowed; maximum length 48 characters | optional    |
| type               | string          | none                                   | wasbs                                                                                                                                        | mandatory   |
| wasbs              | mapping         | none                                   | none                                                                                                                                         | optional    |
| account            | string          | none                                   | valid WASBS account                                                                                                                          | optional    |
| container          | string          | none                                   | valid container name                                                                                                                         | optional    |
| endpointSuffix     | string          | none                                   | valid endpoint suffix                                                                                                                        | optional    |
| format             | string          | Iceberg                                | Iceberg                                                                                                                                      | optional    |
| icebergCatalogType | string          | hadoop                                 | hadoop, hive                                                                                                                                 | optional    |
| metastoreType      | string          | iceberg-rest-catalog                   | iceberg-rest-catalog                                                                                                                         | optional    |
| metastoreUrl       | string          | none                                   | valid URL                                                                                                                                    | optional    |
| relativePath       | string          | none                                   | valid relative path                                                                                                                          | optional    |
| secrets            | mapping         | none                                   | none                                                                                                                                         | mandatory   |
| name               | string          | none                                   | valid Secret name                                                                                                                            | mandatory   |
| workspace          | string          | none                                   | valid Workspace name; less than 32 chars, conforming to the regex [a-z]([-a-z0-9]*[a-z0-9])?                                                 | optional    |
| key                | string          | none                                   | valid key                                                                                                                                    | optional    |
| keys               | list of strings | none                                   | valid keys                                                                                                                                   | optional    |
| allKeys            | boolean         | false                                  | true/false                                                                                                                                   | optional    |
| consumptionType    | string          | envVars                                | envVars, propFile                                                                                                                            | optional    |

Attributes of WASBS storage configuration

Metastore section

This section outlines the metastore configuration, which manages metadata for the data stored in the Lakehouse storage. It includes the metastore service type and detailed setup instructions.

Configurations range from simple, requiring just the metastore type (e.g., iceberg-rest-catalog), to complex, incorporating additional features for enhanced scalability and performance. Advanced configurations may detail the number of replicas, autoscaling capabilities, and specific resource allocations.

lakehouse:
  metastore:
    type: ${metastore-type}

Sample

lakehouse:
  metastore:
    type: iceberg-rest-catalog

The table below elucidates the basic configuration attributes of the Metastore section:

| Attribute | Data Type | Default Value | Possible Value       | Requirement |
|-----------|-----------|---------------|----------------------|-------------|
| metastore | mapping   | none          | none                 | optional    |
| type      | string    | none          | iceberg-rest-catalog | mandatory   |

Basic configuration attributes of Metastore section

metastore:
  type: ${metastore-type} # mandatory
  replicas: ${number-of-replicas}
  autoScaling:
    enabled: ${enable-autoscaling}
    minReplicas: ${minimum-number-of-replicas}
    maxReplicas: ${maximum-number-of-replicas}
    targetMemoryUtilizationPercentage: ${target-memory-utilization-percentage}
    targetCPUUtilizationPercentage: ${target-cpu-utilization-percentage}
  resources:
    requests:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
    limits:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}

Sample

metastore:
  type: iceberg-rest-catalog # mandatory
  replicas: 2
  autoScaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 4
    targetMemoryUtilizationPercentage: 60
    targetCPUUtilizationPercentage: 60
  resources:
    requests:
      cpu: 400m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi

The table below elucidates the advanced configuration attributes of the Metastore section:

| Attribute                         | Data Type | Default Value | Possible Value                             | Requirement |
|-----------------------------------|-----------|---------------|--------------------------------------------|-------------|
| metastore                         | mapping   | none          | none                                       | optional    |
| type                              | string    | none          | iceberg-rest-catalog                       | mandatory   |
| replicas                          | integer   | none          | any valid positive integer                 | optional    |
| autoScaling                       | mapping   | none          | none                                       | optional    |
| enabled                           | boolean   | false         | true/false                                 | optional    |
| minReplicas                       | integer   | none          | any valid integer                          | optional    |
| maxReplicas                       | integer   | none          | any valid integer greater than minReplicas | optional    |
| targetMemoryUtilizationPercentage | integer   | none          | any valid percentage                       | optional    |
| targetCPUUtilizationPercentage    | integer   | none          | any valid percentage                       | optional    |
| resources                         | mapping   | none          | none                                       | optional    |
| requests                          | mapping   | none          | none                                       | optional    |
| limits                            | mapping   | none          | none                                       | optional    |
| cpu                               | string    | none          | any valid resource amount                  | optional    |
| memory                            | string    | none          | any valid resource amount                  | optional    |

Advanced configuration attributes of Metastore section

Query Engine section

The query engine section facilitates the creation of a Cluster Resource, enabling data queries against the Lakehouse storage. Currently, only the Themis query engine is supported.

Basic configurations might be adequate for standard use cases, outlining merely the type of query engine. For environments demanding more precise resource management, advanced configurations offer customization options, including specific CPU and memory requests and limits, to ensure the query engine operates efficiently within set resource constraints.

queryEngine:
  type: ${query-engine-type}

Sample

queryEngine:
  type: themis

The table below elucidates the basic configuration attributes of the Query Engine section:

| Attribute   | Data Type | Default Value | Possible Value | Requirement |
|-------------|-----------|---------------|----------------|-------------|
| queryEngine | mapping   | none          | none           | optional    |
| type        | string    | none          | themis         | mandatory   |

Basic configuration attributes of Query Engine section

queryEngine:
  type: ${query-engine-type} # mandatory
  resources:
    requests:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
    limits:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
  themis:
    envs:
      ${environment-variables}
    themisConf:
      ${themis-configuration}
    spark:
      driver: 
        resources:
          requests:
            cpu: ${requested-cpu-resource}
            memory: ${requested-memory-resource}
          limits:
            cpu: ${requested-cpu-resource}
            memory: ${requested-memory-resource}
        instanceCount: ${instance-count} # mandatory
        maxInstanceCount: ${max-instance-count} # mandatory
      executor:
        resources:
          requests:
            cpu: ${requested-cpu-resource}
            memory: ${requested-memory-resource}
          limits:
            cpu: ${requested-cpu-resource}
            memory: ${requested-memory-resource}
        instanceCount: ${instance-count} # mandatory
        maxInstanceCount: ${max-instance-count} # mandatory
      sparkConf:
        ${spark-configuration}
  storageAcl: ${storage-acl} # mandatory

Sample

queryEngine:
  type: themis # mandatory
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  themis:
    envs:
      alpha: beta
    themisConf:
      "kyuubi.frontend.thrift.binary.bind.host": "0.0.0.0"
      "kyuubi.frontend.thrift.binary.bind.port": "10101"
    spark:
      driver: 
        resources:
          requests:
            cpu: 400m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
        instanceCount: 2 # mandatory
        maxInstanceCount: 3 # mandatory
      executor:
        resources:
          requests:
            cpu: 400m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
        instanceCount: 2 # mandatory
        maxInstanceCount: 3 # mandatory
      sparkConf:
        spark.dynamicAllocation.enabled: true
  storageAcl: r # mandatory

The table below elucidates the advanced configuration attributes of the Query Engine section:

| Attribute        | Data Type | Default Value | Possible Value                   | Requirement |
|------------------|-----------|---------------|----------------------------------|-------------|
| queryEngine      | mapping   | none          | none                             | mandatory   |
| type             | string    | none          | themis                           | mandatory   |
| resources        | mapping   | none          | none                             | optional    |
| requests         | mapping   | none          | none                             | optional    |
| cpu              | string    | none          | any valid CPU resource amount    | optional    |
| memory           | string    | none          | any valid memory resource amount | optional    |
| limits           | mapping   | none          | none                             | optional    |
| cpu              | string    | none          | any valid CPU resource limit     | optional    |
| memory           | string    | none          | any valid memory resource limit  | optional    |
| themis           | mapping   | none          | none                             | optional    |
| envs             | mapping   | none          | none                             | optional    |
| themisConf       | mapping   | none          | none                             | optional    |
| spark            | mapping   | none          | none                             | mandatory   |
| driver           | mapping   | none          | none                             | mandatory   |
| memory           | string    | none          | any valid memory amount          | mandatory   |
| cpu              | string    | none          | any valid CPU resource           | mandatory   |
| executor         | mapping   | none          | none                             | mandatory   |
| memory           | string    | none          | any valid memory amount          | mandatory   |
| cpu              | string    | none          | any valid CPU resource           | mandatory   |
| instanceCount    | integer   | none          | any valid integer                | mandatory   |
| maxInstanceCount | integer   | none          | any valid integer                | mandatory   |
| sparkConf        | mapping   | none          | none                             | optional    |

Advanced configuration attributes of Query Engine section

Apply the Lakehouse manifest

After creating the manifest file for the Lakehouse Resource, it's time to apply it to instantiate the Resource-instance in the DataOS environment. To apply the Lakehouse manifest file, utilize the apply command.

dataos-ctl apply -f ${manifest-file-path} -w ${workspace}

Sample

dataos-ctl apply -f dataproducts/new-lakehouse.yaml -w curriculum


Managing a Lakehouse

Verify Lakehouse Creation

To ensure that your Lakehouse has been successfully created, you can verify it in two ways:

Check the name of the newly created Lakehouse in the list of lakehouses created by you in a particular Workspace:

dataos-ctl get -t lakehouse -w ${workspace}

Sample

dataos-ctl get -t lakehouse -w curriculum

Alternatively, retrieve the list of all Lakehouses created in the Workspace by appending -a flag:

dataos-ctl get -t lakehouse -w ${workspace} -a
# Sample
dataos-ctl get -t lakehouse -w curriculum -a

You can also access the details of any created Lakehouse through the DataOS GUI in the Resource tab of the Operations app.

Deleting a Lakehouse

Use the delete command to remove the specific Lakehouse Resource Instance from the DataOS environment. As shown below, there are three ways to delete a Lakehouse.

Method 1: Copy the Lakehouse name, version, Resource-type, and Workspace name from the output of the get command, separated by '|' and enclosed within quotes, and use the resulting string in the delete command.

Command

dataos-ctl delete -i "${identifier string}"

Example

dataos-ctl delete -i "cnt-lakehouse-demo-01 | v1alpha | lakehouse | public"

Output

INFO[0000] 🗑 delete...
INFO[0001] 🗑 deleting(public) cnt-lakehouse-demo-01:v1alpha:lakehouse...
INFO[0003] 🗑 deleting(public) cnt-lakehouse-demo-01:v1alpha:lakehouse...deleted
INFO[0003] 🗑 delete...complete

Method 2: Specify the path of the YAML file and use the delete command.

Command

dataos-ctl delete -f ${manifest-file-path}

Example

dataos-ctl delete -f /home/desktop/connect-city/config_v1alpha.yaml

Output

INFO[0000] 🗑 delete...
INFO[0000] 🗑 deleting(public) cnt-lakehouse-demo-010:v1alpha:lakehouse...
INFO[0001] 🗑 deleting(public) cnt-lakehouse-demo-010:v1alpha:lakehouse...deleted
INFO[0001] 🗑 delete...complete

Method 3: Specify the Workspace, Resource-type, and Lakehouse name in the delete command.

Command

dataos-ctl delete -w ${workspace} -t lakehouse -n ${lakehouse name}

Example

dataos-ctl delete -w public -t lakehouse -n cnt-product-demo-01

Output

INFO[0000] 🗑 delete...
INFO[0000] 🗑 deleting(public) cnt-product-demo-01:v1alpha:lakehouse...
INFO[0001] 🗑 deleting(public) cnt-product-demo-01:v1alpha:lakehouse...deleted
INFO[0001] 🗑 delete...complete
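
After deletion, you can re-run the get command from the verification step to confirm that the Lakehouse no longer appears in the Workspace:

dataos-ctl get -t lakehouse -w public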

How to configure the manifest file of Lakehouse?

The Attributes of Lakehouse manifest define the key properties and configurations that can be used to specify and customize Lakehouse Resources within a manifest file. These attributes allow data developers to define the structure and behavior of their Lakehouse Resources. For comprehensive information on each attribute and its usage, please refer to the link: Attributes of Lakehouse manifest.

How to manage Lakehouse Resource and datasets using CLI?

This section provides a comprehensive guide for managing Lakehouse Resource and inspecting datasets stored in Lakehouse storage. Utilizing the dataset command, users can perform a wide array of Data Definition Language (DDL)-related tasks, streamlining operations such as adding or removing columns, editing dataset metadata, and listing snapshots, among others. To learn more about these commands, refer to the link: Lakehouse Command Reference.
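
As an illustration, listing the snapshots of a dataset addressed by its Uniform Data Link (UDL) might look like the sketch below; the depot, collection, and dataset names are placeholders, and the exact subcommands and flags should be confirmed in the Lakehouse Command Reference:

dataos-ctl dataset snapshots -a dataos://lakehouse:retail/city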

How to use a Lakehouse in DataOS?