Nilus Configurations

CDC Service Config

For Change Data Capture (CDC), Nilus is orchestrated via the Service resource. The example below demonstrates a service configured to monitor a MongoDB collection and write the captured change events to the DataOS Lakehouse.


Example CDC Manifest Configuration

name: ${{service-name}}                                    # Service identifier
version: v1                                                # Version of the service
type: service                                              # Defines the resource type
tags:                                                      # Classification tags
  - ${{tag}}
  - ${{tag}}
description: Nilus CDC Service for MongoDB                 # Description of the service
workspace: public                                          # Workspace where the service is deployed

service:                                                   # Service specification block
  servicePort: 9010                                        # Service port
  replicas: 1                                              # Number of replicas
  logLevel: INFO                                           # Logging level
  compute: ${{query-default}}                              # Compute type
  resources:                                               # Resource requests (optional but recommended)
    requests:
      cpu: 100m                                            # Requested CPU
      memory: 128Mi                                        # Requested memory
  stack: nilus:3.0                                         # Nilus stack version
  stackSpec:                                               # Stack specification
    source:                                                # Source configuration block
      address: ${{source_depot_address/UDL}}               # Source depot address/UDL
      options:                                             # Source-specific options
        engine: debezium                                   # Required CDC engine; used for streaming changes
        collection.include.list: "retail.products"         # MongoDB collections to include
        topic.prefix: "cdc_changelog"                      # Required topic prefix for CDC stream
        max-table-nesting: "0"                             # Optional; prevents unnesting of nested documents
        transforms.unwrap.array.encoding: array            # Optional; preserves arrays in sink as-is
    sink:                                                  # Sink configuration for CDC output
      address: ${{sink_depot_address/UDL}}                 # Sink depot address
      options:                                             # Sink-specific options
        dest-table: mdb_test_001                           # Destination table name in the sink depot
        incremental-strategy: append                       # Append-only strategy for streaming writes

CDC Configuration Attributes

1. Metadata

| Field | Description |
| --- | --- |
| name | Unique service identifier |
| version | Configuration version |
| type | Must be service |
| tags | Classification tags |
| description | Describes the service |
| workspace | Namespace for the service |

2. Service Specification

| Field | Description |
| --- | --- |
| servicePort | Internal port exposed by the service |
| replicas | Number of instances to run |
| logLevel | Logging verbosity level |
| compute | Compute profile for workload placement |
| resources.requests | Guaranteed compute resources |
| resources.limits | Maximum compute resources allowed |
| stack | Specifies the stack to use. Check all available Nilus stacks in the Operations App. |
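
The example manifest sets requests only; limits can be declared alongside them when a hard ceiling is needed. A minimal sketch with illustrative values:

resources:
  requests:          # Guaranteed resources
    cpu: 100m
    memory: 128Mi
  limits:            # Maximum allowed; values here are illustrative
    cpu: 500m
    memory: 512Mi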

3. Stack Specification

3.1 Source

The source section defines how to connect to the source system (MongoDB in this example).

source:
  address: dataos://testingmongocdc
  options:
    engine: debezium
    collection.include.list: "sample.unnest"
    topic.prefix: "cdc_changelog"
    max-table-nesting: "0"
    transforms.unwrap.array.encoding: array

Address

The address can be either of the following (both forms are sketched below):

  • A Depot path (as shown above)
  • A connection string (for direct connections)
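
The connection string below is a hypothetical MongoDB URI; the actual scheme and parameters depend on your source system:

# Depot path (as in the example above)
address: dataos://testingmongocdc

# Direct connection string (hypothetical URI; adjust to your source)
address: "mongodb://{MONGO_USER}:{MONGO_PASSWORD}@mongo-host:27017"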

Info

When sourcing from a Depot, no dataosSecrets are required.

Options

| Option | Description | Required |
| --- | --- | --- |
| engine | Must be debezium to enable CDC processing | Yes |
| collection.include.list | List of MongoDB collections to monitor (database.collection, e.g. retail.products) | Yes |
| topic.prefix | Prefix for CDC topics; appended to the final dataset name in the sink | Yes |
| max-table-nesting | Degree of JSON nesting to unnest (MongoDB-specific). Accepts string digits: "0", "1", etc.; "0" means no unnesting, while higher values control recursive flattening. | Optional |
| transforms.unwrap.array.encoding | Controls encoding for array elements | Optional |
| Other source-specific options | Vary depending on the source database. Refer to the Nilus documentation or your DataOS contact. | As applicable |
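
To make max-table-nesting concrete, a brief sketch (the sample document and the flattening behavior in the comments are illustrative; exact generated column names are Nilus-internal):

options:
  # Given a source document such as {"name": "pen", "dims": {"w": 1, "h": 10}},
  # "0" keeps dims as a single nested column in the sink (no unnesting).
  max-table-nesting: "0"
  # A value of "1" (illustrative) would instead flatten the first level of
  # dims into separate columns; higher values recurse deeper.
  # max-table-nesting: "1"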

3.2 Sink

The sink section defines where the captured change data will be written.

sink:
  address: dataos://testinglh
  options:
    dest-table: mdb_test
    incremental-strategy: append
    aws_region: us-west-2

Options

| Field | Description |
| --- | --- |
| address | Target Lakehouse address |
| dest-table | Schema and table to write change records to |
| incremental-strategy | Defines the write mode; append is common for CDC |
| Additional options (e.g., aws_region) | Sink-specific configurations; depend on the Lakehouse setup |

Batch Workflow Config

For batch data movement, Nilus is orchestrated using the Workflow resource. The example below demonstrates a workflow configured to ingest data from Salesforce and load it into the DataOS Lakehouse.


Example Batch Manifest Configuration

name: salesforce-account-wf-test                     # Workflow identifier
version: v1                                          # Workflow version
type: workflow                                       # Defines the resource type
tags:                                                # Classification tags
  - salesforce_account
  - nilus_batch
  - client_lakehouse

workflow:                                            # Workflow specification block
  # schedule:                                        # Optional: Workflow schedule
  #   cron: '55 08 * * *'                            # Run every day at 08:55 UTC
  #   endOn: 2025-12-31T23:59:45Z                    # Optional end time for the schedule
  #   concurrencyPolicy: Forbid                     # Prevent concurrent runs

  dag:                                               # Directed Acyclic Graph definition
    - name: sf-ac                                    # DAG node name
      spec:                                          # Node specification
        stack: nilus:1.0                             # Nilus stack version
        compute: runnable-default                    # Compute profile for execution
        resources:                                   # Resource requests (optional but recommended)
          requests:
            cpu: 100m                                # Requested CPU
            memory: 128Mi                            # Requested memory
        logLevel: INFO                               # Logging verbosity level

        dataosSecrets:                               # Secrets for source connectivity
          - name: salesforce-sandbox                 # DataOS secret name
            allKeys: true                            # Mount all secret keys
            consumptionType: envVars                 # Inject secrets as environment variables

        stackSpec:                                   # Stack specification
          source:                                    # Source configuration
            address: "salesforce://?username={SALESFORCE_USERNAME}&password={SALESFORCE_PASSWORD}&token={SALESFORCE_TOKEN}&domain={SALESFORCE_DOMAIN}"
                                                      # Salesforce connection string
            options:                                 # Source-specific options
              source-table: "account"                # Salesforce object/table to ingest

          sink:                                      # Sink configuration
            address: dataos://client                 # Target Lakehouse address
            options:                                 # Sink-specific options
              dest-table: dev.accounts_test           # Destination schema and table
              incremental-strategy: append            # Append new data

Batch Configuration Attributes

1. Metadata

| Field | Description |
| --- | --- |
| name | Unique name of the workflow |
| version | Workflow version identifier |
| type | Must be workflow |
| tags | Categorization tags for search and organization |

2. Workflow Definition

2.1 Schedule (Optional)

Defines when and how often the workflow runs.

schedule:
  cron: '55 08 * * *'
  endOn: 2025-12-31T23:59:45Z
  concurrencyPolicy: Forbid

Info

If not defined, the workflow must be triggered manually.


2.2 DAG (Directed Acyclic Graph)

Defines the processing steps in the workflow.

dag:
  - name: sf-ac
    spec:
      ...

2.3 Stack

Specifies the stack to use. Check all available Nilus stacks in the Operations App.

stack: nilus:1.0

2.4 Compute

Defines which compute profile to use. Check all available computes in the Operations App.

compute: runnable-default

2.5 Resources

Specifies resource requests (optional, but recommended):

resources:
  requests:
    cpu: 100m
    memory: 128Mi

2.6 logLevel

Controls the logging level (optional):

logLevel: <INFO/DEBUG/ERROR>

2.7 dataosSecrets (Needed only when not working with Depot)

Defines secret references required for source connectivity (if applicable):

dataosSecrets:
  - name: salesforce-sandbox
    allKeys: true
    consumptionType: envVars

Info

If using a Depot as the source, this section is not required.
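
For reference, a hypothetical sketch of the Secret resource referenced above. The exact Secret manifest schema varies by DataOS version; treat the field names below as assumptions and verify against your environment:

name: salesforce-sandbox                 # Referenced from dataosSecrets above
version: v1
type: secret
workspace: public
secret:
  type: key-value                        # Assumed secret type
  acl: r
  data:                                  # Keys injected as environment variables
    SALESFORCE_USERNAME: ${{username}}
    SALESFORCE_PASSWORD: ${{password}}
    SALESFORCE_TOKEN: ${{token}}
    SALESFORCE_DOMAIN: ${{domain}}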


3. Stack Specification

3.1 Source

Defines the source connection for ingestion.

source:
  address: "salesforce://?username={SALESFORCE_USERNAME}&password={SALESFORCE_PASSWORD}&token={SALESFORCE_TOKEN}&domain={SALESFORCE_DOMAIN}"
  options:
    source-table: "account"

Options

| Field Name | Type | Required? | Description |
| --- | --- | --- | --- |
| source-table | string | Yes | Source table/entity. Use schema.table, or prefix with query: for SQL. |
| primary-key | string | No | Primary key for deduplication. Comma-separated for multiple keys. |
| incremental-key | string | No | Column used for incremental loads. |
| interval-start | string | No | Start of time range (ISO 8601). |
| interval-end | string | No | End of time range (ISO 8601). |
| type-hints | object | No | Destination type overrides. |
| page-size | int | No | Rows per page fetched. Default: 50000. |
| extract-parallelism | int | No | Parallel extract workers. Default: 5. |
| sql-reflection-level | enum | No | none, fast, or full (default). |
| sql-limit | int | No | Maximum rows to read. |
| sql-exclude-columns | list[string] | No | Columns to exclude. |
| yield-limit | int | No | Maximum pages yielded. |
| mask | object | No | Column masking rules. |
| max-table-nesting | int | No | Maximum nesting depth. Default: 0. |

Info

Default behavior for NULL columns:

  • SQL/schema-known sources preserve null columns.
  • Semi-structured sources materialize columns only when values appear, unless overridden via type-hints.
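
A hedged sketch combining several of the optional fields above (the column names, interval bounds, and the type-hints mapping format are illustrative assumptions, not confirmed Nilus syntax):

source:
  address: "salesforce://?username={SALESFORCE_USERNAME}&password={SALESFORCE_PASSWORD}&token={SALESFORCE_TOKEN}&domain={SALESFORCE_DOMAIN}"
  options:
    source-table: "account"
    primary-key: "id"                       # Deduplication key (illustrative)
    incremental-key: "last_modified_date"   # Drives incremental loads (illustrative)
    interval-start: "2025-01-01T00:00:00Z"  # ISO 8601 time range bounds
    interval-end: "2025-06-30T23:59:59Z"
    type-hints:                             # Hypothetical mapping format
      annual_revenue: double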

3.2 Sink

Defines the target where data will be written.

sink:
  address: dataos://demolakehouse
  options:
    dest-table: retail.accounts_test
    incremental-strategy: append

Options

| Field Name | Type | Required? | Description |
| --- | --- | --- | --- |
| dest-table | string | Yes | Destination table (schema.table). |
| incremental-strategy | enum | Yes | append, replace, or merge. |
| aws_region | string | No | Override AWS region. |
| partition-by | string | No | Partition column. |
| cluster-by | string | No | Cluster/sort column. |
| full-refresh | bool | No | Reload all data. |
| staging_bucket | string | No | Temporary staging bucket. |
| loader-file-size | int | No | Target rows per output file. |
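
Since merge updates existing rows rather than only appending, it is typically paired with a primary-key on the source side. A minimal sketch (names are illustrative; confirm the exact option pairing in the Nilus documentation):

source:
  options:
    source-table: "account"
    primary-key: "id"              # Key for matching existing rows (illustrative)
sink:
  address: dataos://demolakehouse
  options:
    dest-table: dev.accounts_test
    incremental-strategy: merge    # Upsert semantics instead of append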