Lakehouse¶
Lakehouse is a DataOS Resource that combines the Apache Iceberg table format with cloud object storage, yielding a fully managed storage architecture that blends the strengths of data lakes and data warehouses. It enables a novel approach to system design, incorporating features typically found in data warehouses, such as the creation of tables with defined schemas, data manipulation using a variety of tools, and sophisticated data management capabilities, directly on top of cost-effective cloud storage in open formats.
- **How to create and manage a Lakehouse?** Learn how to create and manage a Lakehouse in DataOS.
- **How to configure the manifest file of Lakehouse?** Discover how to configure the manifest file of a Lakehouse by adjusting its attributes.
- **How to manage datasets in a Lakehouse?** Various CLI commands related to performing DDL/DML operations on datasets in a Lakehouse.
- **How to use a Lakehouse in DataOS?** Explore examples showcasing the usage of the Lakehouse Resource in various scenarios.
Key Features of a Lakehouse¶
The DataOS Lakehouse integrates essential features of Relational Data Warehouses with the scalability and adaptability of data lakes. Here's an outline of its core features:
- Decoupled Storage from Compute: The Lakehouse architecture decouples storage from computational resources, permitting independent scaling. This enables handling larger datasets and more simultaneous users efficiently.
- ACID Transactions Support: Essential for data integrity during simultaneous accesses, ACID transaction support ensures consistent and reliable data amidst concurrent operations.
- Versatile Workload Management: Designed to facilitate a range of tasks from analytics to machine learning, the Lakehouse serves as a unified repository, streamlining data management.
- Flexible Computing Environments: Supports a diverse array of cloud-native storage and processing environments, including DataOS native stacks like Flare, Soda, etc.
- Openness and Standardization: Embracing open file formats like Parquet ensures efficient data retrieval across various tools and platforms.
- Branching Capabilities: Employs Iceberg's branching features to support schema versioning and experimentation, enabling safe testing and iteration without affecting live data.
Architecture of a Lakehouse¶
DataOS Lakehouse architecture comprises several layers that come together to form a cohesive environment for data management. The layers are described below:
- Storage: Acts as the foundational storage layer, interfacing with cloud object storage services (e.g., GCS, ABFSS, WASBS, Amazon S3). It abstracts the storage connection details by creating a Depot Resource when the Lakehouse is applied, while the credentials are securely referenced using Instance Secrets. Lakehouse storage uses Parquet for efficiently handling large datasets and the Iceberg format for table metadata management.
- Metastore: Facilitates access to metadata related to the stored data through the Iceberg REST metastore. It exposes Iceberg catalogs (e.g., Hadoop and Hive) via REST metastore interfaces, thereby enabling metadata management.
- Query Engine: Provides the computing environment for running data queries and analytics. It supports the Themis Query Engine which is provisioned through the Cluster Resource.
Together, these layers form the DataOS Lakehouse architecture, ensuring it serves not only as a repository for vast amounts of data but also as a powerful component for data analysis and insights.
How to create and manage a Lakehouse?¶
Prerequisites¶
Before proceeding with the Lakehouse creation, ensure the following prerequisites are met:
Object Storage account
Data developers need access to an object storage solution. Ensure you have the storage credentials ready with 'Storage Admin' access level. The following object storage solutions are supported:
- Azure Blob File System Storage (ABFSS)
- Windows Azure Storage Blob Service (WASBS)
- Amazon Simple Storage Service (Amazon S3)
- Google Cloud Storage (GCS)
Access level permission
To set up a Lakehouse in DataOS, besides possessing an object storage account with appropriate permissions, you also require specific tags or use-cases that authorize you to create and manage a Lakehouse within DataOS.
Creating a Lakehouse¶
Create Instance Secrets¶
Instance Secrets are vital for securely storing sensitive information like data source credentials. These Instance-secrets ensure that credentials are kept safe in the Heimdall vault, making them accessible throughout the DataOS instance without exposing them directly in your Lakehouse manifest file. Here’s how you can create Instance-secrets:
Steps to Create Instance Secrets
- Prepare the manifest file for Instance-secret: You need to create a manifest file (YAML configuration file) that contains the source credentials for your chosen object storage solution (ABFSS, WASBS, Amazon S3, or GCS). This file should also specify the level of access control (read-only `r` or read-write `rw`) that the Lakehouse will have over the object storage. A sample Instance-secret manifest is provided below:
Sample Instance Secret manifest file
```yaml
name: depotsecret-r # Resource name (mandatory)
version: v1 # Manifest version (mandatory)
type: instance-secret # Resource-type (mandatory)
tags: # Tags (optional)
  - just for practice
description: instance secret configuration # Description of Resource (optional)
layer: user
instance-secret: # Instance Secret mapping (mandatory)
  type: key-value-properties # Type of Instance-secret (mandatory)
  acl: r # Access control list (mandatory)
  data: # Data section mapping (mandatory)
    username: iamgroot
    password: yourpassword
```
You can refer to the following link to get the templates for the Instance-secret manifests for object stores.
- Apply the manifest file using the DataOS CLI: Once your manifest file is ready, apply it using the DataOS Command Line Interface (CLI) `apply` command, as sketched below.
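The commands below are a minimal sketch, assuming the standard `dataos-ctl apply` syntax; the manifest path is a placeholder:

```bash
# apply the Instance-secret manifest
dataos-ctl apply -f ${path-to-instance-secret-manifest}

# alternate long-form of the same command
dataos-ctl resource apply -f ${path-to-instance-secret-manifest}
```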
- Verify Instance-secret creation: To ensure that your Instance-secret has been successfully created, check its name in the list of Instance-secrets created by you using the `resource get` command, or retrieve the list of Instance-secrets created by all users in the DataOS instance by appending the `-a` flag; a minimal sketch of both commands follows below. You can also access the details of any created Instance-secret through the DataOS GUI in the Resource tab of the Operations app.
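A minimal sketch of the verification commands, assuming the `resource get` syntax referenced above:

```bash
# list the Instance-secrets created by you
dataos-ctl resource get -t instance-secret

# list the Instance-secrets created by all users in the DataOS instance
dataos-ctl resource get -t instance-secret -a
```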
For more information about Instance-secret, refer to the documentation: Instance-secret.
Draft a Lakehouse manifest file¶
Once you have created the Instance-secrets, it is time to create a Lakehouse by applying the Lakehouse manifest file using the DataOS CLI. The Lakehouse manifest file is divided into several sections, each responsible for specifying a different aspect of the Lakehouse. The sections are listed below:
- Resource meta section
- Lakehouse-specific section
- Storage section
- Metastore section
- Query Engine section
A sample Lakehouse manifest file is provided below; the sections that make up the various parts of the manifest file are described after that.
Sample Lakehouse manifest file
```yaml
# Resource-meta section (1)
name: alphaomega
version: v1alpha
type: lakehouse
tags:
  - Iceberg
  - S3
description: Lakehouse depot of storage-type S3
owner: iamgroot
layer: user

# Lakehouse-specific section (2)
lakehouse:
  type: iceberg
  compute: runnable-default
  iceberg:
    # Storage section (3)
    storage:
      depotName: alphaomega
      type: s3
      s3:
        bucket: dataos-lakehouse
        relativePath: /test
      secrets:
        - name: alphaomega-r
          keys:
            - alphaomega-r
          allKeys: true
        - name: alphaomega-rw
          keys:
            - alphaomega-rw
          allKeys: true
    # Metastore section (4)
    metastore:
      type: "iceberg-rest-catalog"
    # Query engine section (5)
    queryEngine:
      type: themis
```
1. Resource meta section within a manifest file comprises metadata attributes universally applicable to all Resource-types. To learn more about how to configure attributes within this section, refer to the link: Attributes of Resource meta section.
2. Lakehouse-specific section within a manifest file comprises attributes specific to the Lakehouse Resource. This section is further subdivided into the Storage, Metastore, and Query Engine sections. To learn more about how to configure attributes of the Lakehouse-specific section, refer to the link: Attributes of Lakehouse-specific section.
3. Storage section comprises attributes for storage configuration.
4. Metastore section comprises attributes for metastore configuration.
5. Query Engine section comprises attributes for query engine configuration.
Resource meta section
This section serves as the header of the manifest file, defining the overall characteristics of the Lakehouse Resource you wish to create. It includes attributes common to all types of Resources in DataOS. These attributes help DataOS in identifying, categorizing, and managing the Resource within its ecosystem. The code block below describes the attributes of this section:
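A minimal template for this section, mirroring the fields used in the sample manifest above (all values are placeholders):

```yaml
name: ${resource-name} # Resource name (mandatory)
version: v1alpha # Manifest version (mandatory)
type: lakehouse # Resource-type (mandatory)
tags: # Tags (optional)
  - ${tag1}
  - ${tag2}
description: ${description} # Resource description (optional)
owner: ${user-id} # Resource owner (optional)
layer: user # DataOS layer (optional)
```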
Refer to the Attributes of Resource meta section for more information about the various attributes in the Resource meta section.
Lakehouse-specific section
Following the Resource meta section, the Lakehouse-specific section contains configurations unique to the Lakehouse Resource.
```yaml
lakehouse:
  type: ${lakehouse-type} # mandatory
  compute: ${compute} # mandatory
  runAsApiKey: ${dataos-apikey} # optional
  runAsUser: ${user-id} # optional
  iceberg: # mandatory
    storage:
      # storage section attributes
    metaStore:
      # metastore section attributes
    queryEngine:
      # query engine section attributes
```

For example:

```yaml
lakehouse:
  type: iceberg # mandatory
  compute: query-default # mandatory
  runAsApiKey: abcdefghijklmnopqrstuvwxyz # optional
  runAsUser: iamgroot # optional
  iceberg: # mandatory
    storage:
      # storage section attributes
    metaStore:
      # metastore section attributes
    queryEngine:
      # query engine section attributes
```
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `lakehouse` | mapping | none | none | mandatory |
| `type` | string | none | iceberg | mandatory |
| `compute` | string | none | valid query-type Compute Resource name | mandatory |
| `runAsApiKey` | string | apikey of the user applying the Lakehouse | any valid DataOS apikey | optional |
| `runAsUser` | string | user-id of the owner | user-id of use-case assignee | optional |
| `iceberg` | mapping | none | none | mandatory |
| `storage` | mapping | none | valid storage configuration | mandatory |
| `metaStore` | mapping | none | valid metastore configuration | optional |
| `queryEngine` | mapping | none | valid query engine configuration | optional |
This section is further divided into three subsections, each critical to the Lakehouse's functionality:
- Storage section
- Metastore section
- Query engine section
Storage section
This section of the Lakehouse manifest file specifies the connection to the underlying object storage solution (e.g., ABFSS, WASBS, Amazon S3, GCS). Instance-secrets enable the secure reference of sensitive data within the manifest. The Storage section's configurations facilitate the creation of a Depot, abstracting the storage setup and ensuring secured data access in the object storage solution. This setup varies across different source systems, as detailed in the tabs below:
To set up a Lakehouse on top of an ABFSS source system, configure the `storage` section with `type: abfss`. The code block below elucidates the storage section configuration for ABFSS:
```yaml
storage:
  depotName: ${depot-name} # optional
  type: abfss # mandatory
  abfss: # optional
    account: ${abfss-account} # optional
    container: ${container} # optional
    endpointSuffix: ${endpoint-suffix}
    format: ${format} # optional
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional
      key: ${secret-key} # optional
      keys: # optional
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional
```
An example configuration is shown below:

```yaml
storage:
  type: "abfss"
  depotName: abfsslakehouse
  abfss:
    account: abfssstorage
    container: lake01
    relativePath: "/dataos"
    format: ICEBERG
    endpointSuffix: dfs.core.windows.net
  secrets:
    - name: abfsslakehouse-rw
      keys:
        - abfsslakehouse-rw
      allKeys: true
    - name: abfsslakehouse-r
      keys:
        - abfsslakehouse-r
      allKeys: true
```
The table below summarizes the attributes of 'abfss' storage configuration:
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `storage` | mapping | none | none | mandatory |
| `depotName` | string | `${lakehouse-name}0${workspace}0storage` | A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except hyphens/dashes, are not allowed. The maximum length is 48 characters. | optional |
| `type` | string | none | abfss | mandatory |
| `abfss` | mapping | none | none | optional |
| `account` | string | none | valid ABFSS account | optional |
| `container` | string | none | valid container name | optional |
| `endpointSuffix` | string | none | valid endpoint suffix | optional |
| `format` | string | Iceberg | Iceberg | optional |
| `icebergCatalogType` | string | hadoop | hadoop, hive | optional |
| `metastoreType` | string | iceberg-rest-catalog | iceberg-rest-catalog | optional |
| `metastoreUrl` | string | none | valid URL | optional |
| `relativePath` | string | none | valid relative path | optional |
| `secrets` | mapping | none | none | mandatory |
| `name` | string | none | valid Secret name | mandatory |
| `workspace` | string | none | valid Workspace name; must be less than 32 characters and conform to the regex `[a-z]([-a-z0-9]*[a-z0-9])?` | optional |
| `key` | string | none | valid key | optional |
| `keys` | list of strings | none | valid keys | optional |
| `allKeys` | boolean | false | true/false | optional |
| `consumptionType` | string | envVars | envVars, propFile | optional |
Attributes of ABFSS storage configuration
To set up a Lakehouse on top of a GCS source system, configure the `storage` section with `type: gcs`. The code block below elucidates the storage section configuration for GCS:
```yaml
storage:
  depotName: ${depot-name} # optional
  type: gcs # mandatory
  gcs: # mandatory
    bucket: ${gcs-bucket} # mandatory
    format: ${format} # mandatory
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional
      key: ${secret-key} # optional
      keys: # optional
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional
```
The table below summarizes the attributes of 'gcs' storage configuration:
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `storage` | mapping | none | none | mandatory |
| `depotName` | string | `${lakehouse-name}0${workspace}0storage` | A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except hyphens/dashes, are not allowed. The maximum length is 48 characters. | optional |
| `type` | string | none | gcs | mandatory |
| `gcs` | mapping | none | none | optional |
| `bucket` | string | none | valid GCS bucket name | optional |
| `format` | string | Iceberg | Iceberg | optional |
| `icebergCatalogType` | string | none | hadoop, hive | optional |
| `metastoreType` | string | iceberg-rest-catalog | iceberg-rest-catalog | optional |
| `metastoreUrl` | string | none | valid metastore URL | optional |
| `relativePath` | string | none | valid relative path | optional |
| `secrets` | mapping | none | none | mandatory |
| `name` | string | none | valid Secret name | mandatory |
| `workspace` | string | none | valid Workspace name; must be less than 32 characters and conform to the regex `[a-z]([-a-z0-9]*[a-z0-9])?` | optional |
| `key` | string | none | valid key | optional |
| `keys` | list of strings | none | valid keys | optional |
| `allKeys` | boolean | false | true/false | optional |
| `consumptionType` | string | envVars | envVars, propFile | optional |
Attributes of GCS storage configuration
To set up a Lakehouse on top of an S3 source system, configure the `storage` section with `type: s3`. The code block below elucidates the storage section configuration for S3:
```yaml
storage:
  depotName: ${depot-name} # optional
  type: s3 # mandatory
  s3: # mandatory
    bucket: ${s3-bucket} # mandatory
    format: ${format} # mandatory
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
    scheme: ${scheme} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional
      key: ${secret-key} # optional
      keys: # optional
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional
```
The table below summarizes the attributes of 's3' storage configuration:
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `storage` | mapping | none | none | mandatory |
| `depotName` | string | `${lakehouse-name}0${workspace}0storage` | A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except hyphens/dashes, are not allowed. The maximum length is 48 characters. | optional |
| `type` | string | none | s3 | mandatory |
| `s3` | mapping | none | none | optional |
| `bucket` | string | none | valid S3 bucket name | optional |
| `format` | string | Iceberg | Iceberg | optional |
| `icebergCatalogType` | string | none | hadoop, hive | optional |
| `metastoreType` | string | iceberg-rest-catalog | iceberg-rest-catalog | optional |
| `metastoreUrl` | string | none | valid URL | optional |
| `relativePath` | string | none | valid relative path | optional |
| `scheme` | string | none | valid scheme (e.g., s3://) | optional |
| `secrets` | mapping | none | none | mandatory |
| `name` | string | none | valid Secret name | mandatory |
| `workspace` | string | none | valid Workspace name; must be less than 32 characters and conform to the regex `[a-z]([-a-z0-9]*[a-z0-9])?` | optional |
| `key` | string | none | valid key | optional |
| `keys` | list of strings | none | valid keys | optional |
| `allKeys` | boolean | false | true/false | optional |
| `consumptionType` | string | envVars | envVars, propFile | optional |
Attributes of S3 storage configuration
To set up a Lakehouse on top of a WASBS source system, configure the `storage` section with `type: wasbs`. The code block below elucidates the storage section configuration for WASBS:
```yaml
storage:
  depotName: ${depot-name} # optional
  type: wasbs # mandatory
  wasbs: # optional
    account: ${wasbs-account} # optional
    container: ${container} # optional
    endpointSuffix: ${endpoint-suffix}
    format: ${format} # optional
    icebergCatalogType: ${iceberg-catalog-type} # optional
    metastoreType: ${metastore-type} # optional
    metastoreUrl: ${metastore-url} # optional
    relativePath: ${relative-path} # optional
  secrets:
    - name: ${referred-secret-name} # mandatory
      workspace: ${secret-workspace} # optional
      key: ${secret-key} # optional
      keys: # optional
        - ${key1}
        - ${key2}
      allKeys: ${all-keys-or-not} # optional
      consumptionType: ${consumption-type} # optional
```
An example configuration is shown below:

```yaml
storage:
  type: "wasbs"
  depotName: wasbslakehouse
  wasbs:
    account: wasbsstorage
    container: lake01
    relativePath: "/dataos"
    format: ICEBERG
    endpointSuffix: dfs.core.windows.net
  secrets:
    - name: wasbslakehouse-rw
      keys:
        - wasbslakehouse-rw
      allKeys: true
    - name: wasbslakehouse-r
      keys:
        - wasbslakehouse-r
      allKeys: true
```
The table below summarizes the attributes of 'wasbs' storage configuration:
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `storage` | mapping | none | none | mandatory |
| `depotName` | string | `${lakehouse-name}0${workspace}0storage` | A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except hyphens/dashes, are not allowed. The maximum length is 48 characters. | optional |
| `type` | string | none | wasbs | mandatory |
| `wasbs` | mapping | none | none | optional |
| `account` | string | none | valid WASBS account | optional |
| `container` | string | none | valid container name | optional |
| `endpointSuffix` | string | none | valid endpoint suffix | optional |
| `format` | string | Iceberg | Iceberg | optional |
| `icebergCatalogType` | string | hadoop | hadoop, hive | optional |
| `metastoreType` | string | iceberg-rest-catalog | iceberg-rest-catalog | optional |
| `metastoreUrl` | string | none | valid URL | optional |
| `relativePath` | string | none | valid relative path | optional |
| `secrets` | mapping | none | none | mandatory |
| `name` | string | none | valid Secret name | mandatory |
| `workspace` | string | none | valid Workspace name; must be less than 32 characters and conform to the regex `[a-z]([-a-z0-9]*[a-z0-9])?` | optional |
| `key` | string | none | valid key | optional |
| `keys` | list of strings | none | valid keys | optional |
| `allKeys` | boolean | false | true/false | optional |
| `consumptionType` | string | envVars | envVars, propFile | optional |
Attributes of WASBS storage configuration
Metastore section
This section outlines the metastore configuration, which manages metadata for the data stored in the Lakehouse storage. It includes the metastore service type and detailed setup instructions.
Configurations range from simple, requiring just the metastore type (e.g., `iceberg-rest-catalog`), to complex, incorporating additional features for enhanced scalability and performance. Advanced configurations may detail the number of replicas, autoscaling capabilities, and specific resource allocations.
The table below elucidates the basic configuration attributes of Metastore section:
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `metastore` | mapping | none | none | optional |
| `type` | string | none | iceberg-rest-catalog | mandatory |
Basic configuration attributes of Metastore section
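A basic configuration is a minimal sketch that sets only the metastore type, mirroring the sample Lakehouse manifest above; the template that follows shows the advanced form with replicas, autoscaling, and resource settings:

```yaml
metastore:
  type: iceberg-rest-catalog
```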
```yaml
metastore:
  type: ${metastore-type} # mandatory
  replicas: ${number-of-replicas}
  autoScaling:
    enabled: ${enable-autoscaling}
    minReplicas: ${minimum-number-of-replicas}
    maxReplicas: ${maximum-number-of-replicas}
    targetMemoryUtilizationPercentage: ${target-memory-utilization-percentage}
    targetCPUUtilizationPercentage: ${target-cpu-utilization-percentage}
  resources:
    requests:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
    limits:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
```
The table below elucidates the advanced configuration attributes of the Metastore section:
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `metastore` | mapping | none | none | optional |
| `type` | string | none | iceberg-rest-catalog | mandatory |
| `replicas` | integer | none | any valid positive integer | optional |
| `autoScaling` | mapping | none | none | optional |
| `enabled` | boolean | false | true/false | optional |
| `minReplicas` | integer | none | any valid integer | optional |
| `maxReplicas` | integer | none | any valid integer greater than `minReplicas` | optional |
| `targetMemoryUtilizationPercentage` | integer | none | any valid percentage | optional |
| `targetCPUUtilizationPercentage` | integer | none | any valid percentage | optional |
| `resources` | mapping | none | none | optional |
| `requests` | mapping | none | none | optional |
| `limits` | mapping | none | none | optional |
| `cpu` | string | none | any valid resource amount | optional |
| `memory` | string | none | any valid resource amount | optional |
Advanced configuration attributes of Metastore section
Query Engine section
The query engine section facilitates the creation of a Cluster Resource, enabling data queries against the Lakehouse storage. Currently, only the Themis query engine is supported.
Basic configurations might be adequate for standard use cases, outlining merely the type of query engine. For environments demanding more precise resource management, advanced configurations offer customization options, including specific CPU and memory requests and limits, to ensure the query engine operates efficiently within set resource constraints.
The table below elucidates the basic configuration attributes of the Query Engine section:
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `queryEngine` | mapping | none | none | optional |
| `type` | string | none | themis | mandatory |
Basic configuration attributes of Query Engine section
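A basic configuration is a minimal sketch that sets only the query engine type, mirroring the sample Lakehouse manifest above; the template that follows shows the advanced form with resource, Themis, and Spark settings:

```yaml
queryEngine:
  type: themis
```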
```yaml
queryEngine:
  type: ${query-engine-type} # mandatory
  resources:
    requests:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
    limits:
      cpu: ${requested-cpu-resource}
      memory: ${requested-memory-resource}
  themis:
    envs:
      ${environment-variables}
    themisConf:
      ${themis-configuration}
  spark:
    driver:
      resources:
        requests:
          cpu: ${requested-cpu-resource}
          memory: ${requested-memory-resource}
        limits:
          cpu: ${requested-cpu-resource}
          memory: ${requested-memory-resource}
      instanceCount: ${instance-count} # mandatory
      maxInstanceCount: ${max-instance-count} # mandatory
    executor:
      resources:
        requests:
          cpu: ${requested-cpu-resource}
          memory: ${requested-memory-resource}
        limits:
          cpu: ${requested-cpu-resource}
          memory: ${requested-memory-resource}
      instanceCount: ${instance-count} # mandatory
      maxInstanceCount: ${max-instance-count} # mandatory
    sparkConf:
      ${spark-configuration}
  storageAcl: ${storage-acl} # mandatory
```
An example configuration of the Query Engine section is shown below:

```yaml
queryEngine:
  type: themis # mandatory
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  themis:
    envs:
      alpha: beta
    themisConf:
      "kyuubi.frontend.thrift.binary.bind.host": "0.0.0.0"
      "kyuubi.frontend.thrift.binary.bind.port": "10101"
  spark:
    driver:
      resources:
        requests:
          cpu: 400m
          memory: 1Gi
        limits:
          cpu: 1000m
          memory: 2Gi
      instanceCount: 2 # mandatory
      maxInstanceCount: 3 # mandatory
    executor:
      resources:
        requests:
          cpu: 400m
          memory: 1Gi
        limits:
          cpu: 1000m
          memory: 2Gi
      instanceCount: 2 # mandatory
      maxInstanceCount: 3 # mandatory
    sparkConf:
      spark.dynamicAllocation.enabled: true
  storageAcl: r # mandatory
```
The table below elucidates the advanced configuration attributes of Query Engine section:
| Attribute | Data Type | Default Value | Possible Value | Requirement |
|---|---|---|---|---|
| `queryEngine` | mapping | none | none | mandatory |
| `type` | string | none | themis | mandatory |
| `resources` | mapping | none | none | optional |
| `requests` | mapping | none | none | optional |
| `cpu` | string | none | any valid CPU resource amount | optional |
| `memory` | string | none | any valid memory resource amount | optional |
| `limits` | mapping | none | none | optional |
| `cpu` | string | none | any valid CPU resource limit | optional |
| `memory` | string | none | any valid memory resource limit | optional |
| `themis` | mapping | none | none | optional |
| `envs` | mapping | none | none | optional |
| `themisConf` | mapping | none | none | optional |
| `spark` | mapping | none | none | mandatory |
| `driver` | mapping | none | none | mandatory |
| `memory` | string | none | any valid memory amount | mandatory |
| `cpu` | string | none | any valid CPU resource | mandatory |
| `executor` | mapping | none | none | mandatory |
| `memory` | string | none | any valid memory amount | mandatory |
| `cpu` | string | none | any valid CPU resource | mandatory |
| `instanceCount` | integer | none | any valid integer | mandatory |
| `maxInstanceCount` | integer | none | any valid integer | mandatory |
| `sparkConf` | mapping | none | none | optional |
Advanced configuration attributes of Query Engine section
Apply the Lakehouse manifest¶
After creating the manifest file for the Lakehouse Resource, it's time to apply it to instantiate the Resource-instance in the DataOS environment. To apply the Lakehouse manifest file, utilize the `apply` command, as sketched below.
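A minimal sketch of the apply command, assuming the standard `dataos-ctl apply` syntax; the manifest path and Workspace name are placeholders:

```bash
dataos-ctl apply -f ${path-to-lakehouse-manifest} -w ${workspace-name}
```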
The links provided below showcase the process of creating Lakehouse for a particular data source:
- How to create a Lakehouse on ABFSS data source?
- How to create a Lakehouse on WASBS data source?
- How to create a Lakehouse on S3 data source?
- How to create a Lakehouse on GCS data source?
Managing a Lakehouse¶
Verify Lakehouse Creation¶
To ensure that your Lakehouse has been successfully created, you can verify it in two ways:
Check the name of the newly created Lakehouse in the list of Lakehouses created by you in a particular Workspace:

```bash
dataos-ctl get -t lakehouse -w ${workspace-name}

# Sample
dataos-ctl get -t lakehouse -w curriculum
```

Alternatively, retrieve the list of all Lakehouses created in the Workspace by appending the `-a` flag:

```bash
dataos-ctl get -t lakehouse -w ${workspace-name} -a

# Sample
dataos-ctl get -t lakehouse -w curriculum -a
```
You can also access the details of any created Lakehouse through the DataOS GUI in the Resource tab of the Operations app.
Deleting a Lakehouse¶
Use the `delete` command to remove the specific Lakehouse Resource-instance from the DataOS environment. As shown below, there are three ways to delete a Lakehouse.
Method 1: Copy the Lakehouse name, version, Resource-type, and Workspace name from the output of the `get` command, separate them with '|', enclose the resulting string in quotes, and pass it to the `delete` command.
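A minimal sketch of this method, assuming the standard `dataos-ctl delete -i` syntax and using the Lakehouse shown in the output below:

```bash
dataos-ctl delete -i "cnt-lakehouse-demo-01 | v1alpha | lakehouse | public"
```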
Output:
```
INFO[0000] 🗑 delete...
INFO[0001] 🗑 deleting(public) cnt-lakehouse-demo-01:v1alpha:lakehouse...
INFO[0003] 🗑 deleting(public) cnt-lakehouse-demo-01:v1alpha:lakehouse...deleted
INFO[0003] 🗑 delete...complete
```
Method 2: Specify the path of the YAML file and use the `delete` command.
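A minimal sketch of this method, assuming the standard `dataos-ctl delete -f` syntax; the manifest path is a placeholder:

```bash
dataos-ctl delete -f ${path-to-lakehouse-manifest}
```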
Output:
```
INFO[0000] 🗑 delete...
INFO[0000] 🗑 deleting(public) cnt-lakehouse-demo-010:v1alpha:lakehouse...
INFO[0001] 🗑 deleting(public) cnt-lakehouse-demo-010:v1alpha:lakehouse...deleted
INFO[0001] 🗑 delete...complete
```
Method 3: Specify the Workspace, Resource-type, and Lakehouse name in the `delete` command.
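A minimal sketch of this method, assuming the flag-based `dataos-ctl delete` syntax and using the Lakehouse shown in the output below:

```bash
dataos-ctl delete -w public -t lakehouse -n cnt-city-demo-010
```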
Output:
```
INFO[0000] 🗑 delete...
INFO[0000] 🗑 deleting(public) cnt-city-demo-010:v1alpha:lakehouse...
INFO[0001] 🗑 deleting(public) cnt-city-demo-010:v1alpha:lakehouse...deleted
INFO[0001] 🗑 delete...complete
```
How to configure the manifest file of Lakehouse?¶
The Attributes of Lakehouse manifest define the key properties and configurations that can be used to specify and customize Lakehouse Resources within a manifest file. These attributes allow data developers to define the structure and behavior of their Lakehouse Resources. For comprehensive information on each attribute and its usage, please refer to the link: Attributes of Lakehouse manifest.
How to manage Lakehouse Resource and datasets using CLI?¶
This section provides a comprehensive guide for managing the Lakehouse Resource and inspecting datasets stored in Lakehouse storage. Utilizing the `dataset` command, users can perform a wide array of Data Definition Language (DDL)-related tasks, streamlining operations such as adding or removing columns, editing dataset metadata, and listing snapshots, among others. To learn more about these commands, refer to the link: Lakehouse Command Reference.
How to use a Lakehouse in DataOS?¶
- How to ensure high data quality in Lakehouse Storage using the Write-Audit-Publish pattern?
- Iceberg Metadata Tables in Lakehouse
- How to use Iceberg metadata tables to extract insights in Lakehouse storage?
- How to create, fetch, and drop dataset in a Lakehouse using CLI commands?
- How to perform Iceberg dataset maintenance in a Lakehouse using CLI commands?
- How to perform partitioning on Lakehouse datasets using CLI commands?
- How to perform schema evolution on Lakehouse datasets using CLI commands?
- How to manipulate table properties of Lakehouse datasets using CLI commands?