Quick Start¶
This page will help you onboard rapidly and begin moving data with Nilus using either Batch Ingestion or Change Data Capture (CDC). Whether you're syncing large historical datasets or streaming real-time updates, Nilus makes data movement into DataOS Lakehouse and other destinations simple, secure, and scalable.
Change Data Capture (CDC)¶
CDC identifies and streams row-level changes (inserts, updates, and deletes) from a source database to a target system, keeping the target system up to date without requiring full reloads.
Info
This Quick Start example uses MongoDB as the source system and DataOS Lakehouse as the destination system. Other supported source and destination options are covered in their respective pages.
Prerequisites¶
The following requirements must be met for CDC to work:
MongoDB Replica Set
- MongoDB must run as a replica set, even for single-node deployments.
- Nilus CDC for MongoDB relies on the `oplog.rs` collection, which is only available in replica sets.
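If the source currently runs as a standalone `mongod`, a minimal single-node replica set can be initiated from the MongoDB shell as sketched below; the replica set name, host, and port are illustrative placeholders:

```javascript
// Assumes mongod was started with a replica set name, for example:
//   mongod --replSet rs0 --bind_ip_all
// Initiate a single-node replica set (host and port are placeholders):
rs.initiate({
  _id: "rs0",
  members: [{ _id: 0, host: "localhost:27017" }]
})

// Confirm the node became PRIMARY and the oplog collection exists:
rs.status()
db.getSiblingDB("local").oplog.rs.stats()
```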
Enable Oplog Access

- Nilus uses MongoDB's `oplog.rs` to capture changes.
- Nilus requires a user with `read` access to the business data and internal system databases in order to read the oplog. If such a user does not exist, create one in MongoDB:

```javascript
db.createUser({
  user: "debezium",
  pwd: "dbz",
  roles: [
    { role: "read", db: "your_app_db" },       // Read target database
    { role: "read", db: "local" },             // Read oplog
    { role: "read", db: "config" },            // Read cluster configuration
    { role: "readAnyDatabase", db: "admin" },  // Optional: discovery
    { role: "clusterMonitor", db: "admin" }    // Recommended: monitoring
  ]
})
```
Info
Grant only the roles required for your environment to follow the principle of least privilege.
Pre-created MongoDB Depot
A Depot must exist in DataOS with read-write access. To verify that the Depot exists, check the Metis UI of DataOS or use the following command:
```bash
dataos-ctl resource get -t depot -a

# Expected Output
INFO[0000] 🔍 get...
INFO[0000] 🔍 get...complete

| NAME         | VERSION | TYPE  | STATUS | OWNER    |
| ------------ | ------- | ----- | ------ | -------- |
| mongodbdepot | v2alpha | depot | active | usertest |
```
If the Depot does not exist, create it using the following manifest template:
MongoDB Depot Manifest
```yaml
name: ${{depot-name}}
version: v2alpha
type: depot
tags:
  - ${{tag1}}
  - ${{tag2}}
layer: user
depot:
  type: mongodb
  description: ${{description}}
  compute: ${{runnable-default}}
  mongodb:
    subprotocol: ${{"mongodb+srv"}}
    nodes: ${{["clusterabc.ezlggfy.mongodb.net"]}}
  external: ${{true}}
  secrets:
    - name: ${{instance-secret-name}}-r
      allkeys: ${{true}}
    - name: ${{instance-secret-name}}-rw
      allkeys: ${{true}}
```
Info
Update variables such as `name`, `owner`, `compute`, `layer`, etc., and contact the DataOS Administrator or Operator to obtain the appropriate secret name.
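Once the placeholders are updated, the Depot manifest can be applied with the DataOS CLI. A typical invocation (the manifest path is a placeholder):

```bash
# Apply the Depot manifest (path is a placeholder)
dataos-ctl apply -f /path/to/mongodb-depot.yaml
```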
CDC Manifest Configuration¶
The following manifest defines a Nilus CDC service that captures changes from a MongoDB source and persists them into the DataOS Lakehouse (S3 Iceberg).
```yaml
name: ${{service-name}}                            # Service identifier
version: v1                                        # Version of the service
type: service                                      # Defines the resource type
tags:                                              # Classification tags
  - ${{tag}}
  - ${{tag}}
description: Nilus CDC Service for MongoDB         # Description of the service
workspace: public                                  # Workspace where the service is deployed
service:                                           # Service specification block
  servicePort: 9010                                # Service port
  replicas: 1                                      # Number of replicas
  logLevel: INFO                                   # Logging level
  compute: ${{query-default}}                      # Compute profile
  stack: nilus:3.0                                 # Nilus stack version
  stackSpec:                                       # Stack specification
    source:                                        # Source configuration block
      address: ${{source_depot_address/UDL}}       # Source depot address/UDL
      options:                                     # Source-specific options
        engine: debezium                           # Required CDC engine; used for streaming changes
        collection.include.list: "retail.products" # MongoDB collections to include
        topic.prefix: "cdc_changelog"              # Required topic prefix for the CDC stream
        max-table-nesting: "0"                     # Optional; prevents unnesting of nested documents
        transforms.unwrap.array.encoding: array    # Optional; preserves arrays in the sink as-is
    sink:                                          # Sink configuration for CDC output
      address: ${{sink_depot_address/UDL}}         # Sink depot address
      options:                                     # Sink-specific options
        dest-table: mdb_test_001                   # Destination table name in the sink depot
        incremental-strategy: append               # Append-only strategy for streaming writes
```
Info
Ensure that all placeholder values and required fields (e.g., connection addresses, slot names, and access credentials) are properly updated before applying the configuration to a DataOS workspace.
The sample manifest is applied to a DataOS workspace using the DataOS CLI. A typical invocation, with the manifest path as a placeholder and the workspace matching the manifest above:
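```bash
dataos-ctl apply -f /path/to/nilus-cdc-service.yaml -w public

# Verify that the service is running (workspace matches the manifest above)
dataos-ctl resource get -t service -w public
```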
The MongoDB host used in the CDC service YAML must exactly match the host defined during replica set initiation.
CDC Attribute Details¶
This section outlines the necessary attributes of the Nilus CDC Service.
Attribute Details
Metadata Fields
| Field Name | Description |
|---|---|
| `name` | Unique service identifier |
| `version` | Configuration version |
| `type` | Must be `service` |
| `tags` | Classification tags |
| `description` | Describes the service |
| `workspace` | Namespace for the service |
Service Specification Fields
| Field Name | Description |
|---|---|
| `servicePort` | Internal port exposed by the service |
| `replicas` | Number of instances to run |
| `logLevel` | Logging verbosity level |
| `compute` | Compute profile for workload placement |
| `resources.requests` | Guaranteed compute resources |
| `resources.limits` | Maximum compute resources allowed |
| `stack` | Specifies the stack to use, with version |
Source Configuration Fields
| Field Name | Description | Required |
|---|---|---|
| `address` | Address of the source. Can be connected using a Depot, or directly using a connection string (which requires a secret) | Yes |
| `engine` | Must be `debezium` to enable CDC processing | Yes |
| `collection.include.list` | List of database collections to monitor, in `database.collection` format (MongoDB example: `retail.products`) | Yes |
| `table.include.list` | List of database tables to monitor (SQL Server example: `sandbox.customers`) | Yes |
| `topic.prefix` | Prefix for CDC topics; appended to the final dataset name in the sink | Yes |
| `max-table-nesting` | Degree of JSON nesting to unnest (MongoDB-specific). Accepts a string representation of digits: `"0"`, `"1"`, etc. Default is `0` if no value is set. `"0"` means no unnesting (nested fields are left as-is); higher values control recursive flattening of nested documents | Optional |
| `transforms.unwrap.array.encoding` | Controls encoding for array elements | Optional |
Sink Configuration Fields
| Field Name | Description | Required |
|---|---|---|
| `address` | Target address (DataOS Lakehouse in this example) | Yes |
| `dest-table` | Schema to write change records to; the table name is derived from `topic.prefix` | Yes |
| `incremental-strategy` | Defines the write mode; `append` is preferred for CDC | Yes |
Batch Ingestion¶
Batch ingestion transfers data from sources to the destination system on a scheduled basis (hourly, daily, or weekly).
Info
This Quick Start example uses MongoDB as the source system and DataOS Lakehouse as the destination system. Other supported source and destination options are covered in their respective pages.
Prerequisites¶
The following requirements must be met to enable batch data movement from MongoDB:
Database User Permissions
The connection user must have at least `read` privileges on the source collection.
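A minimal sketch of creating such a read-only user in the MongoDB shell (the user, password, and database names are placeholders, not values required by Nilus):

```javascript
// Hypothetical read-only user for batch ingestion; adjust names for your deployment.
db.getSiblingDB("retail").createUser({
  user: "nilus_batch_reader",
  pwd: "change_me",
  roles: [
    { role: "read", db: "retail" }   // read access to the source database and collections
  ]
})
```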
Pre-created MongoDB Depot
A Depot must exist in DataOS with read-write access. To verify that the Depot exists, check the Metis UI of DataOS or use the following command:
```bash
dataos-ctl resource get -t depot -a

# Expected Output
INFO[0000] 🔍 get...
INFO[0000] 🔍 get...complete

| NAME         | VERSION | TYPE  | STATUS | OWNER    |
| ------------ | ------- | ----- | ------ | -------- |
| mongodbdepot | v2alpha | depot | active | usertest |
```
If the Depot does not exist, create it using the following manifest template:
MongoDB Depot Manifest
```yaml
name: ${{depot-name}}
version: v2alpha
type: depot
tags:
  - ${{tag1}}
  - ${{tag2}}
layer: user
depot:
  type: mongodb
  description: ${{description}}
  compute: ${{runnable-default}}
  mongodb:
    subprotocol: ${{"mongodb+srv"}}
    nodes: ${{["clusterabc.ezlggfy.mongodb.net"]}}
  external: ${{true}}
  secrets:
    - name: ${{instance-secret-name}}-r
      allkeys: ${{true}}
    - name: ${{instance-secret-name}}-rw
      allkeys: ${{true}}
```
Info
Update variables such as `name`, `owner`, `compute`, `layer`, etc., and contact the DataOS Administrator or Operator to obtain the appropriate secret name.
Batch Manifest Configuration¶
The following manifest defines a Nilus Batch Workflow that transfers data from a MongoDB source into the DataOS Lakehouse (S3 Iceberg).
```yaml
name: nb-mdb-test-01
version: v1
type: workflow
tags:
  - workflow
  - nilus-batch
description: Nilus Batch Workflow Sample for MongoDB to S3 Lakehouse
workspace: research
workflow:
  dag:
    - name: nb-job-01
      spec:
        stack: nilus:1.0
        compute: runnable-default
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
        logLevel: INFO
        stackSpec:
          source:
            address: dataos://mongodbdepot
            options:
              source-table: "retail.customer"
          sink:
            address: dataos://testinglh
            options:
              dest-table: mdb_retail.batch_customer_1
              incremental-strategy: replace
```
Info
Ensure that all placeholder values and required fields (e.g., connection addresses, slot names, and access credentials) are properly updated before applying the configuration to a DataOS workspace.
The sample workflow manifest is applied to a DataOS workspace using the DataOS CLI. A typical invocation, with the manifest path as a placeholder and the workspace matching the manifest above:
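```bash
dataos-ctl apply -f /path/to/nilus-batch-workflow.yaml -w research

# Check the workflow status (workspace matches the manifest above)
dataos-ctl resource get -t workflow -w research
```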
Batch Attribute Details¶
This section outlines the necessary attributes of the Nilus Batch Workflow.
Attribute Details
Metadata Fields
| Field Name | Description |
|---|---|
| `name` | Unique name of the batch workflow |
| `version` | Batch workflow version identifier |
| `type` | Must be `workflow` |
| `tags` | Categorization tags for search and organization |
Workflow Specification Fields
| Field Name | Description |
|---|---|
| `schedule` | Defines the frequency and schedule for workflow runs. Optional; if not specified, the workflow is triggered only once at run time |
| `dag` | Directed Acyclic Graph specifying the sequence of processing steps in the workflow |
| `stack` | Identifies the Nilus stack to be used. Available versions can be viewed in the Operations App |
| `compute` | Defines the compute profile to be used. Available profiles can be viewed in the Operations App |
| `resources` | Specifies the resources allocated to the batch workflow. Optional |
| `logLevel` | Defines the level of logging applied to the workflow |
| `dataosSecrets` | Points to the Secret resource that stores credentials required to connect to a source. Applicable for non-depot sources; if the source supports a Depot, this parameter is not required |
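If runs need to happen on a recurring basis, a `schedule` block can be added under `workflow`. The sketch below illustrates the idea only; the exact field names (such as `cron`) are assumptions and should be verified against the Workflow resource reference:

```yaml
workflow:
  schedule:
    cron: "0 2 * * *"   # assumed field name; run daily at 02:00
  dag:
    - name: nb-job-01
      # ...same job spec as in the manifest above
```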
Source Specification Fields
| Field Name | Description | Required |
|---|---|---|
| `address` | Address of the source. Can be connected using a Depot, or directly using a connection string (which requires a secret) | Yes |
| `source-table` | Name of the source table/entity from which data is extracted (the MongoDB collection `retail.customer` in the example above) | Yes |
| `primary-key` | Primary key of the source table | No |
| `incremental-key` | Key used for incremental data ingestion | No |
| `interval-start` | Start of the interval for data ingestion | No |
| `interval-end` | End of the interval for data ingestion | No |
| `columns` | List of columns to be included (as a string) | No |
Sink Configuration Fields
| Field Name | Description | Required |
|---|---|---|
| `address` | Target address (DataOS Lakehouse in this example) | Yes |
| `dest-table` | Schema and table to write records to, defined in `schema.table_name` format | Yes |
| `incremental-strategy` | Data ingestion strategy (`append` or `replace`) | Yes |
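To illustrate how the optional source fields combine with the sink strategy, the following sketch adapts the batch example above for an incremental load; the incremental column and interval values are placeholders:

```yaml
source:
  address: dataos://mongodbdepot
  options:
    source-table: "retail.customer"
    incremental-key: "updated_at"            # placeholder column used to pick up new or changed records
    interval-start: "2024-01-01T00:00:00Z"   # placeholder interval bounds
    interval-end: "2024-02-01T00:00:00Z"
sink:
  address: dataos://testinglh
  options:
    dest-table: mdb_retail.batch_customer_1
    incremental-strategy: append             # append new records instead of replacing the table
```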