GCP-backed DataOS Lakehouse

The DataOS Lakehouse (GCP-backed) provides a secure, scalable, and cloud-native data storage and analytics layer built on Google Cloud Storage (GCS), using Apache Iceberg or Delta Lake as table formats.

Nilus supports the Lakehouse as both a source and a sink, enabling efficient batch data movement to and from the DataOS ecosystem.

Connections to the GCP-backed Lakehouse are managed exclusively through DataOS Depot, which centralizes authentication and credential management and exposes the Lakehouse through a UDL of the form:

dataos://<lakehouse-name>
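
For example, a Lakehouse Depot named gcslakesrc is addressed as dataos://gcslakesrc; this UDL appears as the source address in the sample workflow below.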

Prerequisites

The following are required to enable Batch Data Movement with a GCP-backed DataOS Lakehouse:

Environment Variables

For GCP-backed Lakehouse, the following environment variables must be set (via Depot or workflow):

| Variable | Description |
| --- | --- |
| TYPE | Must be set to GCS |
| DESTINATION_BUCKET | GCS URL in the format gs://<bucket>/<path> |
| GCS_CLIENT_EMAIL | Service account email |
| GCS_PROJECT_ID | GCP project ID |
| GCS_PRIVATE_KEY | Service account private key |
| GCS_JSON_KEY_FILE_PATH | Path to the service account JSON key file |
| METASTORE_URL | (Optional) External metastore URL |
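
As an illustration, these variables can be supplied under the envs section of a Nilus workflow. All values below are placeholders, and secrets such as GCS_PRIVATE_KEY should be injected through DataOS secret management rather than written in plain text:

envs:
  TYPE: GCS
  DESTINATION_BUCKET: gs://my-lakehouse-bucket/lakehouse
  GCS_CLIENT_EMAIL: nilus-sa@my-project.iam.gserviceaccount.com
  GCS_PROJECT_ID: my-project
  GCS_JSON_KEY_FILE_PATH: /etc/dataos/secrets/gcs-key.json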

Info

Contact the DataOS Administrator or Operator to obtain the configured Depot UDL.

Authentication Method

Nilus authenticates to GCP via HMAC (Hash-based Message Authentication Code) credentials:

  1. GCS_ACCESS_KEY_ID: HMAC key ID
  2. GCS_SECRET_ACCESS_KEY: HMAC secret

HMAC credentials are particularly useful for S3-compatible access scenarios.
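
As a minimal sketch, HMAC credentials for a service account can be generated with the gsutil CLI (the service-account email is a placeholder):

gsutil hmac create nilus-sa@my-project.iam.gserviceaccount.com

The command prints an access ID and a secret, which map to GCS_ACCESS_KEY_ID and GCS_SECRET_ACCESS_KEY respectively.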

Required GCP Setup

  1. GCS Bucket
    1. Create the target bucket.
    2. Configure access control (IAM roles, ACLs).
    3. Enable versioning and lifecycle management as needed.
  2. Service Account
    1. Create a service account.
    2. Generate a JSON key file.
    3. Assign the required roles:
      • roles/storage.objectViewer
      • roles/storage.objectCreator
      • roles/storage.admin (if managing bucket metadata)
  3. Security
    1. Configure IAM policies.
    2. Rotate keys regularly.
    3. Enable audit logging for storage operations.
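
A minimal sketch of this setup using the gcloud CLI, with placeholder project, bucket, and service-account names:

# 1. Create the target bucket
gcloud storage buckets create gs://my-lakehouse-bucket --project=my-project

# 2. Create a service account and generate a JSON key
gcloud iam service-accounts create nilus-sa --project=my-project
gcloud iam service-accounts keys create gcs-key.json \
  --iam-account=nilus-sa@my-project.iam.gserviceaccount.com

# 3. Assign the required roles
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:nilus-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:nilus-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectCreator"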

Sample Workflow Config

name: lakehouse-gcp-to-pg
version: v1
type: workflow
tags:
  - workflow
  - nilus-batch
description: Nilus Batch Service Sample
# workspace: public
workflow:
  dag:
    - name: gcp-pg
      spec:
        stack: nilus:1.0
        compute: runnable-default
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
        logLevel: Info
        envs:
          PAGE_SIZE: 50000
          LOADER_FILE_SIZE: 50000000
        stackSpec:
          source:
            address: dataos://gcslakesrc
            options:
              source-table: "sandbox1.validation_data_types"
              sql-exclude-columns: __metadata 
          sink:
            address: dataos://ncdcpostgres3
            options:
              dest-table: varun_testing.data_type_validation
              incremental-strategy: replace

Info

Ensure that all placeholder values and required fields (e.g., connection addresses and access credentials) are properly updated before applying the configuration to a DataOS workspace.

Deploy the manifest file using the following command:

dataos-ctl resource apply -f ${{path to the Nilus Workflow YAML}}
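
Once applied, the Workflow's run status can be checked from the CLI. Assuming the get subcommand mirrors apply (flags may differ across CLI versions), a status check might look like:

dataos-ctl resource get -t workflow -w public -n lakehouse-gcp-to-pg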

Supported Attribute Details

Nilus supports the following source options for GCP-backed DataOS Lakehouse:

| Option | Required | Description |
| --- | --- | --- |
| source-table | Yes | Table name (schema.table) |
| sql-exclude-columns | No | Columns to exclude (e.g., the metadata column __metadata) |
| staging-bucket | No | GCS bucket for staging operations |
| metastore_url | No | External metastore URL |
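
For illustration, a source block combining these options might look like the following (the Depot name gcslakesrc comes from the sample workflow above; the table and staging bucket are placeholders):

source:
  address: dataos://gcslakesrc
  options:
    source-table: "sales.orders"
    sql-exclude-columns: __metadata
    staging-bucket: gs://my-staging-bucket/nilus-staging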

Core Concepts

  • Storage Formats
    • Iceberg: schema evolution, partition pruning, time travel.
    • Delta Lake: ACID transactions, schema enforcement, version history.
    • Parquet: columnar file format underpinning both.
  • Path Format
    gs://<bucket>/<relativePath>/<collection>/<dataset>
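
For example, a dataset orders in a collection sales, stored in the bucket my-lakehouse-bucket under the relative path lakehouse, would resolve to gs://my-lakehouse-bucket/lakehouse/sales/orders (all names here are placeholders).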