Quick Start

This page helps you onboard quickly and begin moving data with Nilus using either Batch Ingestion or Change Data Capture (CDC). Whether you are syncing large historical datasets or streaming real-time updates, Nilus makes data movement into the DataOS Lakehouse and other destinations simple, secure, and scalable.

Change Data Capture (CDC)

CDC identifies and streams row-level changes (inserts, updates, and deletes) from a source database to a target system, keeping the target system up to date without requiring full reloads.
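
For illustration, when a single document changes, Nilus emits a change event rather than re-reading the whole collection. A simplified, Debezium-style event payload might look like the following (fields abridged; the exact envelope depends on the connector configuration):

{
  "op": "u",
  "ts_ms": 1717000000000,
  "source": { "db": "retail", "collection": "products" },
  "after": "{\"_id\": 101, \"price\": 19.99}"
}

Here, op encodes the operation (c for create, u for update, d for delete), and for MongoDB sources the after field carries the new document state as an extended-JSON string.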

Info

This Quick Start example uses MongoDB as the source system and DataOS Lakehouse as the destination system. More source and destination options are covered in the connector-specific documentation.

Prerequisites

The following requirements must be met before CDC can run:

MongoDB Replica Set

  • MongoDB must run as a replica set, even for single-node deployments.
  • Nilus CDC for MongoDB relies on the oplog.rs collection, which is only available in replica sets.
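
For a local or single-node deployment, a minimal initiation sketch looks like the following (assumes mongod was started with --replSet rs0; adjust the host and port to match your environment):

# Start mongod with a replica set name
mongod --replSet rs0 --port 27017 --dbpath /data/db

# Then, in mongosh, initiate the single-node replica set
rs.initiate({
  _id: "rs0",
  members: [{ _id: 0, host: "localhost:27017" }]
})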

Enable oplog Access

  • Nilus uses MongoDB's oplog.rs to capture changes.
  • Reading the oplog requires a user with read access to both the business databases and MongoDB's internal system databases (local and config). If such a user does not exist, create one as follows:

    db.createUser({
      user: "debezium",
      pwd: "dbz",
      roles: [
        { role: "read", db: "your_app_db" },      // Read target database
        { role: "read", db: "local" },            // Read oplog
        { role: "read", db: "config" },           // Read cluster configuration
        { role: "readAnyDatabase", db: "admin" }, // Optional: discovery
        { role: "clusterMonitor", db: "admin" }   // Recommended: monitoring
      ]
    })
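
To sanity-check that the new user can actually read the oplog, authenticate as that user in mongosh and fetch a single entry (an illustrative check, not required by Nilus):

// Run in mongosh, authenticated as the CDC user
db.getSiblingDB("local").oplog.rs.findOne()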
    

Info

Grant only the roles required for your environment to follow the principle of least privilege.

Pre-created MongoDB Depot

A Depot must exist in DataOS with read-write access. To check for existing Depots, open the Metis UI in DataOS or run the following command:

dataos-ctl resource get -t depot -a

# Expected Output
INFO[0000] 🔍 get...
INFO[0000] 🔍 get...complete
|    NAME      | VERSION | TYPE  | STATUS | OWNER    |
| ------------ | ------- | ----- | ------ | -------- |
| mongodbdepot | v2alpha | depot | active | usertest |

If the Depot does not exist, create one using the following manifest template:

MongoDB Depot Manifest
name: ${{depot-name}}
version: v2alpha
type: depot
tags:
  - ${{tag1}}
  - ${{tag2}}
layer: user
depot:
  type: mongodb                                 
  description: ${{description}}
  compute: ${{runnable-default}}
  mongodb:                                          
    subprotocol: ${{"mongodb+srv"}}
    nodes: ${{["clusterabc.ezlggfy.mongodb.net"]}}
  external: ${{true}}
  secrets:
    - name: ${{instance-secret-name}}-r
      allkeys: ${{true}}

    - name: ${{instance-secret-name}}-rw
      allkeys: ${{true}}

Info

Update variables such as name, owner, compute, layer, etc., and contact the DataOS Administrator or Operator to obtain the appropriate secret name.
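
Once the placeholders are filled in, apply the manifest with the standard command (file name illustrative):

dataos-ctl resource apply -f mongodb-depot.yaml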

CDC Manifest Configuration

The following manifest defines a Nilus CDC service that captures changes from a MongoDB source and persists them into the DataOS Lakehouse (S3 Iceberg).

name: ${{service-name}}                                    # Service identifier
version: v1                                                # Version of the service
type: service                                              # Defines the resource type
tags:                                                      # Classification tags
  - ${{tag}}
  - ${{tag}}
description: Nilus CDC Service for MongoDB                 # Description of the service
workspace: public                                          # Workspace where the service is deployed

service:                                                   # Service specification block
  servicePort: 9010                                        # Service port
  replicas: 1                                              # Number of replicas
  logLevel: INFO                                           # Logging level
  compute: ${{query-default}}                              # Compute profile
  stack: nilus:3.0                                         # Nilus stack version
  stackSpec:                                               # Stack specification
    source:                                                # Source configuration block
      address: ${{source_depot_address/UDL}}               # Source depot address/UDL
      options:                                             # Source-specific options
        engine: debezium                                   # Required CDC engine; used for streaming changes
        collection.include.list: "retail.products"         # MongoDB collections to include
        topic.prefix: "cdc_changelog"                      # Required topic prefix for CDC stream
        max-table-nesting: "0"                             # Optional; prevents unnesting of nested documents
        transforms.unwrap.array.encoding: array            # Optional; preserves arrays in sink as-is
    sink:                                                  # Sink configuration for CDC output
      address: ${{sink_depot_address/UDL}}                 # Sink depot address
      options:                                             # Sink-specific options
        dest-table: mdb_test_001                           # Destination table name in the sink depot
        incremental-strategy: append                       # Append-only strategy for streaming writes

Info

Ensure that all placeholder values and required fields (e.g., connection addresses and access credentials) are properly updated before applying the configuration to a DataOS workspace.

The above sample manifest file is deployed using the following command:

dataos-ctl resource apply -f ${{path to the Nilus Service YAML}}

The MongoDB host used in the CDC service YAML must exactly match the host defined during replica set initiation.
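
The configured host can be verified against the replica set configuration in mongosh (illustrative check):

// Lists the hosts registered in the replica set configuration
rs.conf().members.map(m => m.host)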

CDC Attribute Details

This section outlines the key attributes of the Nilus CDC Service.

Metadata Fields

| Field Name  | Description                |
| ----------- | -------------------------- |
| name        | Unique service identifier  |
| version     | Configuration version      |
| type        | Must be service            |
| tags        | Classification tags        |
| description | Describes the service      |
| workspace   | Namespace for the service  |

Service Specification Fields

| Field Name         | Description                                |
| ------------------ | ------------------------------------------ |
| servicePort        | Internal port exposed by the service       |
| replicas           | Number of instances to run                 |
| logLevel           | Logging verbosity level                    |
| compute            | Compute profile for workload placement     |
| resources.requests | Guaranteed compute resources               |
| resources.limits   | Maximum compute resources allowed          |
| stack              | Stack and version to use (e.g., nilus:3.0) |

Source Configuration Fields

| Field Name | Description | Required |
| ---------- | ----------- | -------- |
| address | Address of the source. It can point to a Depot, or directly to a connection string (the latter requires a secret). | Yes |
| engine | Must be debezium to enable CDC processing. | Yes |
| collection.include.list | List of collections to monitor, in database.collection format (MongoDB example: retail.products). | Yes |
| table.include.list | List of tables to monitor, in schema.table format (SQL Server example: sandbox.customers). | Yes |
| topic.prefix | Prefix for CDC topics; appended to the final dataset name in the sink. | Yes |
| max-table-nesting | Degree of JSON unnesting (MongoDB-specific). Accepts a string of digits ("0", "1", etc.); defaults to "0" if unset. "0" means no unnesting (nested fields are left as-is); higher values recursively flatten nested documents (see the sketch below). | Optional |
| transforms.unwrap.array.encoding | Controls the encoding for array elements. | Optional |
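
As a purely hypothetical illustration of unnesting (actual flattened column names depend on the engine), consider how a nested source document might land in the sink:

// Source document in retail.products
{ "_id": 101, "name": "mug", "dims": { "h": 9, "w": 8 } }

// max-table-nesting: "0" -> dims is kept as a single nested/JSON column
// max-table-nesting: "1" -> dims is flattened into top-level columns such as dims.h and dims.w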

Sink Configuration Fields

| Field Name | Description | Required |
| ---------- | ----------- | -------- |
| address | Target address (DataOS Lakehouse in this example) | Yes |
| dest-table | Schema to write change records into; the table name is derived from topic.prefix | Yes |
| incremental-strategy | Defines the write mode; append is preferred for CDC | Yes |

Batch Ingestion

Batch ingestion transfers data from sources to the destination system on a scheduled basis (hourly, daily, or weekly).

Info

This Quick Start example uses MongoDB as the source system and DataOS Lakehouse as the destination system. More source and destination options are covered in the connector-specific documentation.

Prerequisites

The following requirements must be met to enable batch data movement from MongoDB:

Database User Permissions

The connection user must have at least read privileges on the source collection:

{
  "role": "read",
  "db": "<database_name>",
  "collection": "<collection_name>"
}
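
A minimal sketch for creating such a user in mongosh (the user name, password, and database below are placeholders; MongoDB's built-in read role is granted at the database level):

use admin
db.createUser({
  user: "nilus_batch",   // placeholder user name
  pwd: "changeme",       // placeholder password
  roles: [
    { role: "read", db: "retail" }   // read access to the source database
  ]
})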

Pre-created MongoDB Depot

A Depot must exist in DataOS with read-write access. To check for existing Depots, open the Metis UI in DataOS or run the following command:

dataos-ctl resource get -t depot -a

# Expected Output
INFO[0000] 🔍 get...
INFO[0000] 🔍 get...complete
|    NAME      | VERSION | TYPE  | STATUS | OWNER    |
| ------------ | ------- | ----- | ------ | -------- |
| mongodbdepot | v2alpha | depot | active | usertest |

If the Depot does not exist, create one using the following manifest template:

MongoDB Depot Manifest
name: ${{depot-name}}
version: v2alpha
type: depot
tags:
  - ${{tag1}}
  - ${{tag2}}
layer: user
depot:
  type: mongodb                                 
  description: ${{description}}
  compute: ${{runnable-default}}
  mongodb:                                          
    subprotocol: ${{"mongodb+srv"}}
    nodes: ${{["clusterabc.ezlggfy.mongodb.net"]}}
  external: ${{true}}
  secrets:
    - name: ${{instance-secret-name}}-r
      allkeys: ${{true}}

    - name: ${{instance-secret-name}}-rw
      allkeys: ${{true}}

Info

Update variables such as name, owner, compute, layer, etc., and contact the DataOS Administrator or Operator to obtain the appropriate secret name.

Batch Manifest Configuration

The following manifest defines a Nilus Batch Workflow that transfers data from a MongoDB source into the DataOS Lakehouse (S3 Iceberg).

name: nb-mdb-test-01
version: v1
type: workflow
tags:
  - workflow
  - nilus-batch
description: Nilus Batch Workflow Sample for MongoDB to S3 Lakehouse
workspace: research
workflow:
  dag:
    - name: nb-job-01
      spec:
        stack: nilus:1.0
        compute: runnable-default
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
        logLevel: INFO
        stackSpec:
          source:
            address: dataos://mongodbdepot
            options:
              source-table: "retail.customer"
          sink:
            address: dataos://testinglh
            options:
              dest-table: mdb_retail.batch_customer_1
              incremental-strategy: replace

Info

Ensure that all placeholder values and required fields (e.g., connection addresses and access credentials) are properly updated before applying the configuration to a DataOS workspace.

The above sample manifest file is deployed using the following command:

dataos-ctl resource apply -f ${{path to the Nilus Workflow YAML}}

Batch Attribute Details

This section outlines the key attributes of the Nilus Batch Workflow.

Metadata Fields

| Field Name | Description                                     |
| ---------- | ----------------------------------------------- |
| name       | Unique name of the batch workflow               |
| version    | Batch workflow version identifier               |
| type       | Must be workflow                                |
| tags       | Categorization tags for search and organization |

Workflow Specification Fields

| Field Name | Description |
| ---------- | ----------- |
| schedule | Defines the frequency and timing of workflow runs. Optional; if not specified, the workflow is triggered only once at run time (see the sketch below). |
| dag | Directed Acyclic Graph specifying the sequence of processing steps in the workflow. |
| stack | Identifies the Nilus stack to be used. Available versions can be viewed in the Operations App. |
| compute | Defines the compute profile to be used. Available profiles can be viewed in the Operations App. |
| resources | Specifies the resources allocated to the batch workflow. Optional. |
| logLevel | Defines the level of logging applied to the workflow. |
| dataosSecrets | Points to the Secret resource that stores credentials required to connect to a source. Applicable only to non-depotable sources; not required when the source supports a Depot. |
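
For recurring runs, a schedule block can be added at the top of the workflow spec. A minimal sketch, assuming the standard DataOS Workflow schedule fields (verify the exact syntax for your DataOS version):

workflow:
  schedule:
    cron: '0 2 * * *'   # run daily at 02:00
  dag:
    - name: nb-job-01   # job spec as shown in the manifest above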

Source Specification Fields

| Field Name | Description | Required |
| ---------- | ----------- | -------- |
| address | Address of the source. It can point to a Depot, or directly to a connection string (the latter requires a secret). | Yes |
| source-table | Name of the source table/entity from which data is extracted (retail.customer in the example above). | Yes |
| primary-key | Primary key of the source table. | No |
| incremental-key | Key used for incremental data ingestion (see the sketch below). | No |
| interval-start | Start of the interval for data ingestion. | No |
| interval-end | End of the interval for data ingestion. | No |
| columns | List of columns to include (as a string). | No |
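
As an illustrative sketch (option values are hypothetical), the incremental options might be combined in a source block as follows:

source:
  address: dataos://mongodbdepot
  options:
    source-table: "retail.customer"
    incremental-key: "updated_at"            # ingest only rows changed since the last run
    interval-start: "2024-01-01T00:00:00Z"   # optional explicit window start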

Sink Configuration Fields

| Field Name | Description | Required |
| ---------- | ----------- | -------- |
| address | Target address (DataOS Lakehouse in this example) | Yes |
| dest-table | Destination table for the ingested records, defined in schema.table_name format | Yes |
| incremental-strategy | Data ingestion strategy (append or replace) | Yes |