# GCP-backed DataOS Lakehouse
The DataOS Lakehouse (GCP-backed) provides a secure, scalable, and cloud-native data storage and analytics layer built on Google Cloud Storage (GCS), using Apache Iceberg or Delta Lake as table formats.
Nilus supports the Lakehouse as both a source and a sink, enabling efficient batch data movement to and from the DataOS ecosystem.
Connections to the GCP Lakehouse are managed exclusively through DataOS Depot, which centralizes authentication and credentials and exposes the connection as a UDL address (e.g., `dataos://<depot-name>`).
## Prerequisites
The following are required to enable Batch Data Movement with a GCP-backed DataOS Lakehouse:
### Environment Variables
For a GCP-backed Lakehouse, the following environment variables must be set (via the Depot or directly in the workflow):
| Variable | Description |
|---|---|
| `TYPE` | Must be set to `GCS` |
| `DESTINATION_BUCKET` | GCS URL in the format `gs://<bucket>/<path>` |
| `GCS_CLIENT_EMAIL` | Service account email |
| `GCS_PROJECT_ID` | GCP project ID |
| `GCS_PRIVATE_KEY` | Service account private key |
| `GCS_JSON_KEY_FILE_PATH` | Path to the service account JSON key file |
| `METASTORE_URL` | (Optional) External metastore URL |
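These variables are typically supplied by the configured Depot; when set directly in a workflow instead, an `envs` block might look like the following sketch (bucket, project, and service-account values are placeholders):

```yaml
envs:
  TYPE: GCS
  DESTINATION_BUCKET: gs://example-lakehouse/warehouse     # placeholder bucket/path
  GCS_CLIENT_EMAIL: nilus-sa@example-project.iam.gserviceaccount.com
  GCS_PROJECT_ID: example-project
  GCS_JSON_KEY_FILE_PATH: /etc/secrets/gcs-key.json        # mounted key file
```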
**Info:** Contact the DataOS Administrator or Operator to obtain the configured Depot UDL.
### Authentication Method
Nilus supports authentication via HMAC credentials for GCP:
**HMAC Credentials (Hash-based Message Authentication Code)**

- `GCS_ACCESS_KEY_ID`: HMAC key ID
- `GCS_SECRET_ACCESS_KEY`: HMAC secret
- Useful for S3-compatible access scenarios.
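HMAC keys are issued against a GCP service account. A minimal sketch of generating a key pair with `gsutil` (the service-account email is a placeholder):

```bash
# Prints an access key ID and secret, which map to GCS_ACCESS_KEY_ID and
# GCS_SECRET_ACCESS_KEY. The secret is shown only once; store it securely.
gsutil hmac create nilus-sa@example-project.iam.gserviceaccount.com
```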
### Required GCP Setup
- **GCS Bucket**
    - Create the target bucket.
    - Configure access control (IAM roles, ACLs).
    - Enable versioning and lifecycle management as needed.
- **Service Account**
    - Create a service account.
    - Generate a JSON key file.
    - Assign the required roles:
        - `roles/storage.objectViewer`
        - `roles/storage.objectCreator`
        - `roles/storage.admin` (if managing bucket metadata)
- **Security**
    - Configure IAM policies.
    - Rotate keys regularly.
    - Enable audit logging for storage operations.
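These steps can be scripted with the `gcloud` CLI. A minimal sketch under assumed names (`example-project`, `example-lakehouse`, and `nilus-sa` are placeholders):

```bash
# Create the target bucket.
gcloud storage buckets create gs://example-lakehouse --project=example-project

# Create the service account and generate a JSON key file.
gcloud iam service-accounts create nilus-sa --project=example-project
gcloud iam service-accounts keys create gcs-key.json \
  --iam-account=nilus-sa@example-project.iam.gserviceaccount.com

# Grant the required roles (scope them to the bucket instead, if preferred).
for role in roles/storage.objectViewer roles/storage.objectCreator; do
  gcloud projects add-iam-policy-binding example-project \
    --member="serviceAccount:nilus-sa@example-project.iam.gserviceaccount.com" \
    --role="$role"
done
```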
## Sample Workflow Config
```yaml
name: lakehouse-gcp-to-pg
version: v1
type: workflow
tags:
  - workflow
  - nilus-batch
description: Nilus Batch Service Sample
# workspace: public
workflow:
  dag:
    - name: gcp-pg
      spec:
        stack: nilus:1.0
        compute: runnable-default
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
        logLevel: Info
        envs:
          PAGE_SIZE: 50000
          LOADER_FILE_SIZE: 50000000
        stackSpec:
          source:
            address: dataos://gcslakesrc
            options:
              source-table: "sandbox1.validation_data_types"
              sql-exclude-columns: __metadata
          sink:
            address: dataos://ncdcpostgres3
            options:
              dest-table: varun_testing.data_type_validation
              incremental-strategy: replace
```
**Info:** Ensure that all placeholder values and required fields (e.g., connection addresses and access credentials) are properly updated before applying the configuration to a DataOS workspace.
Deploy the manifest file using the following command:
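Assuming the standard DataOS CLI, a typical invocation looks like this (the manifest filename and workspace are placeholders):

```bash
# Apply the workflow manifest to the target workspace.
dataos-ctl apply -f lakehouse-gcp-to-pg.yaml -w public
```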
## Supported Attribute Details
Nilus supports the following source options for the GCP-backed DataOS Lakehouse:

| Option | Required | Description |
|---|---|---|
| `source-table` | Yes | Table name (`schema.table`) |
| `sql-exclude-columns` | Optional | Columns to exclude (e.g., the metadata column `__metadata`) |
| `staging-bucket` | Optional | GCS bucket for staging operations |
| `metastore_url` | Optional | External metastore URL |
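Put together, a source block using these options might look like the following sketch; the depot address is from the sample above, while the table and staging bucket are placeholders:

```yaml
source:
  address: dataos://gcslakesrc              # Depot UDL for the GCP-backed Lakehouse
  options:
    source-table: "sales.orders"            # placeholder schema.table
    sql-exclude-columns: __metadata         # drop the internal metadata column
    staging-bucket: gs://example-staging/tmp  # optional staging location
```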