Delta Lake¶
Nilus supports Delta Lake as a batch ingestion source, enabling you to bring structured datasets from Azure Blob Storage, Amazon S3, or Google Cloud Storage (GCS) into the DataOS Lakehouse or other supported destinations.
Info
Delta Lake is only supported via DataOS Depot. Direct connection URIs are not supported.
Prerequisites¶
The following are the requirements for enabling Batch Data Movement from Delta Lake:
Storage Prerequisites¶
The following are the storage requirements for different cloud providers:
- Azure Blob Storage: Existing container, proper RBAC permissions, optional custom endpoint configuration, storage firewall rules
- Amazon S3: Existing bucket, IAM roles/policies, optional VPC endpoint configuration, bucket versioning recommended
- Google Cloud Storage: Existing bucket, IAM roles, optional VPC service controls, optional CMEK for encryption
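As a quick sanity check before wiring up a Depot, you can confirm that the target container or bucket exists and is reachable. The commands below are illustrative only and assume the relevant cloud CLI is installed and authenticated; bucket, container, and account names are placeholders.
# Amazon S3: list the bucket contents (assumes the AWS CLI is configured with the same credentials)
aws s3 ls s3://<bucket-name>/<optional-path>/
# Azure Blob Storage: confirm the container exists on the storage account
az storage container show --name <container-name> --account-name <storage-account-name>
# Google Cloud Storage: list the bucket
gcloud storage ls gs://<bucket-name>/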
Required Parameters¶
The following are the parameters needed:
- `container`/`bucket`: Storage container (Azure) or bucket (S3/GCS)
- `path`: Optional relative path inside the container/bucket
- `collection`: Collection name
- `dataset`: Dataset name
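For illustration, the sketch below shows how these parameters might map onto a Delta table location and onto the `source-table` option used later in the workflow. The path layout and names here are assumptions for this example, not a fixed convention:
# Assumed example: a Delta table stored at s3://retail-data/bronze/analytics/events
bucket: retail-data        # container (Azure) or bucket (S3/GCS)
path: bronze               # optional relative path inside the bucket
collection: analytics      # collection name
dataset: events            # dataset name
# Referenced in the Nilus source options as: source-table: analytics.events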
Authentication¶
The following are the authentication methods for each storage type:
- Azure Blob Storage: Storage Account Name + Key (optional: endpoint suffix)
- Amazon S3: AWS Access Key ID + Secret Key + Region (optional: custom endpoint)
- Google Cloud Storage: HMAC credentials (Access Key + Secret Key) or Service Account JSON
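These credentials are not embedded in the Depot manifest itself; they are referenced through DataOS secrets (see the `secrets` section of the Depot manifest below). A minimal sketch of an Instance Secret holding S3 credentials is shown here; the key names under `data` are illustrative placeholders, so confirm the exact keys and secret naming with your DataOS Administrator or Operator.
name: ${{s3-instance-secret-name}}-rw
version: v1
type: instance-secret
layer: user
instance-secret:
  type: key-value-properties
  acl: rw
  data:                                  # illustrative key names; confirm with your DataOS Operator
    accesskeyid: ${{aws-access-key-id}}
    secretkey: ${{aws-secret-access-key}}
    region: ${{aws-region}}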
Pre-created Depot¶
A Depot must exist in DataOS with read-write access. To check for an existing Depot, go to the Metis UI in DataOS or use the following command:
dataos-ctl resource get -t depot -a
# Expected Output
INFO[0000] 🔍 get...
INFO[0000] 🔍 get...complete
| NAME | VERSION | TYPE | STATUS | OWNER |
| ------------ | ------- | ----- | ------ | -------- |
| synapsedepot | v2alpha | depot | active | usertest |
If the Depot is not created, use the following manifest configuration template to create the Delta Lake Depot:
Delta Lake Depot Manifest
The following manifest is for a Delta Lake Depot on Amazon S3:
name: ${{depot-name}}
version: v2alpha
type: depot
description: ${{description}}
tags:
  - ${{tag1}}
owner: ${{owner-name}}
layer: user
depot:
  type: S3
  external: ${{true}}
  secrets:
    - name: ${{s3-instance-secret-name}}-r
      allkeys: true
    - name: ${{s3-instance-secret-name}}-rw
      allkeys: true
  s3:
    scheme: ${{s3a}}
    bucket: ${{project-name}}
    relativePath: ${{relative-path}}
    format: delta
    region: ${{us-gov-east-1}}
    endpoint: ${{s3.us-gov-east-1.amazonaws.com}}
Info
- Update variables such as `name`, `owner`, `compute`, `scheme`, `bucket`, `format`, `region`, etc., and contact the DataOS Administrator or Operator to obtain the appropriate secret name.
- DataOS supports the Delta table format on the following object storage Depots: Amazon S3, Azure Blob File System Secure (ABFSS), Google Cloud Storage (GCS), and Windows Azure Storage Blob Service (WASBS).
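For Azure Blob Storage, the equivalent Depot uses the ABFSS type. The manifest below is a sketch that mirrors the S3 example above; the attribute names under `abfss` (account, container, relativePath, format, endpointSuffix) are assumed from the standard DataOS ABFSS Depot spec and should be verified for your DataOS version.
name: ${{depot-name}}
version: v2alpha
type: depot
description: ${{description}}
owner: ${{owner-name}}
layer: user
depot:
  type: ABFSS
  external: ${{true}}
  secrets:
    - name: ${{abfss-instance-secret-name}}-r
      allkeys: true
    - name: ${{abfss-instance-secret-name}}-rw
      allkeys: true
  abfss:
    account: ${{storage-account-name}}
    container: ${{container-name}}
    relativePath: ${{relative-path}}
    format: delta
    endpointSuffix: ${{dfs.core.windows.net}}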
Sample Workflow Config¶
name: nb-deltalake-test-01
version: v1
type: workflow
tags:
  - workflow
  - nilus-batch
description: Nilus Batch Workflow Sample for Delta Lake (S3) to DataOS Lakehouse
workspace: research
workflow:
  dag:
    - name: nb-job-01
      spec:
        stack: nilus:1.0
        compute: runnable-default
        logLevel: INFO
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
        stackSpec:
          source:
            address: dataos://delta_depot
            options:
              source-table: analytics.events
          sink:
            address: dataos://testawslh
            options:
              dest-table: delta_retail.batch_events
              incremental-strategy: replace
              aws_region: us-west-2
Info
Ensure that all placeholder values and required fields (e.g., connection addresses, `compute`, and access credentials) are properly updated before applying the configuration to a DataOS workspace.
Deploy the manifest file using the following command:
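# Apply the Workflow manifest (replace the placeholders with your file path and Workspace;
# exact flag usage may vary by dataos-ctl version)
dataos-ctl apply -f <path-to-workflow-manifest.yaml> -w <workspace-name>
# Optionally verify the Workflow after applying
dataos-ctl resource get -t workflow -w <workspace-name>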
Supported Attribute Details¶
Nilus supports the following source options for Delta Lake:
| Option | Required | Description |
| ------------ | -------- | ---------------------------- |
| `source-table` | Yes | Format: `collection.dataset` |