MongoDB

MongoDB is a popular open-source NoSQL database known for its flexibility and scalability. Nilus supports MongoDB as a batch ingestion source, allowing users to efficiently move data into the DataOS Lakehouse or other supported destinations.

Nilus connects to MongoDB through DataOS Depot, which provides a managed, secure way to store and reuse connection configurations. In Depot:

  • The configuration uses the dataos:// URI scheme
  • Authentication and SSL/TLS are handled by the Depot service
  • Secrets and connection details are centrally managed

Prerequisites

The following are the requirements for enabling Batch Data Movement from MongoDB:

Database User Permissions

The connection user must have at least read privileges on the source collection:

{
  "role": "read",
  "db": "<database_name>",
  "collection": "<collection_name>"
}

Pre-created MongoDB Depot

A Depot must exist in DataOS with read-write access. To verify that the Depot exists, check the Metis UI of DataOS or use the following command:

dataos-ctl resource get -t depot -a

# Expected Output
INFO[0000] 🔍 get...
INFO[0000] 🔍 get...complete
|    NAME      | VERSION | TYPE  | STATUS | OWNER    |
| ------------ | ------- | ----- | ------ | -------- |
| mongodbdepot | v2alpha | depot | active | usertest |

If the Depot does not exist, use the following manifest template to create the MongoDB Depot:

MongoDB Depot Manifest
name: ${{depot-name}}
version: v2alpha
type: depot
tags:
    - ${{tag1}}
    - ${{tag2}}
layer: user
depot:
  type: mongodb                                 
  description: ${{description}}
  compute: ${{runnable-default}}
  mongodb:                                          
    subprotocol: ${{"mongodb+srv"}}
    nodes: ${{["clusterabc.ezlggfy.mongodb.net"]}}
  external: ${{true}}
  secrets:
    - name: ${{instance-secret-name}}-r
      allkeys: ${{true}}

    - name: ${{instance-secret-name}}-rw
      allkeys: ${{true}}

Info

Update variables such as name, owner, compute, and layer, and contact the DataOS Administrator or Operator to obtain the appropriate secret name.
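
For reference, a filled-in version of the template might look like the following. All values here (Depot name, cluster nodes, and secret names) are hypothetical and must be replaced with your own:

# Hypothetical example values; substitute your own Depot name, nodes, and secrets
name: mongodbdepot
version: v2alpha
type: depot
tags:
    - mongodb
    - nilus
layer: user
depot:
  type: mongodb
  description: MongoDB Depot for Nilus batch ingestion
  compute: runnable-default
  mongodb:
    subprotocol: mongodb+srv
    nodes:
      - clusterabc.ezlggfy.mongodb.net
  external: true
  secrets:
    - name: mongodbdepot-r
      allkeys: true

    - name: mongodbdepot-rw
      allkeys: true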

Sample Workflow Config

name: nb-mdb-test-01
version: v1
type: workflow
tags:
    - workflow
    - nilus-batch
description: Nilus Batch Workflow Sample for MongoDB to S3 Lakehouse
workspace: research
workflow:
  dag:
    - name: nb-job-01
      spec:
        stack: nilus:1.0
        compute: runnable-default
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
        logLevel: INFO
        stackSpec:
          source:
            address: dataos://mongodbdepot
            options:
              source-table: "retail.customer"
          sink:
            address: dataos://testinglh
            options:
              dest-table: mdb_retail.batch_customer_1
              incremental-strategy: replace
              aws_region: us-west-2

Info

Ensure that all placeholder values and required fields (e.g., connection addresses and access credentials) are properly updated before applying the configuration to a DataOS workspace.

Deploy the manifest file using the following command:

dataos-ctl resource apply -f ${{path to the Nilus Workflow YAML}}
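
After the Workflow is applied, its status can be checked from the Metis UI or with the same get pattern used for the Depot above (adjust the resource type and workspace name to match your setup):

dataos-ctl resource get -t workflow -w research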

Supported Attribute Details

Nilus supports the following source options for MongoDB:

| Option           | Required | Description                                                              |
| ---------------- | -------- | ------------------------------------------------------------------------ |
| source-table     | Yes      | Format: database.collection or database.collection:[aggregation_pipeline] |
| filter_          | No       | MongoDB filter document to apply                                         |
| projection       | No       | Fields to include/exclude in the result                                  |
| chunk_size       | No       | Number of documents to load in each batch (default: 10000)               |
| parallel         | No       | Enable parallel loading (default: false)                                 |
| data_item_format | No       | Format for loaded data (object or arrow)                                 |
| incremental-key  | No       | Field used for incremental batch ingestion                               |
| interval-start   | No       | Optional lower bound timestamp for incremental ingestion                 |
| interval-end     | No       | Optional upper bound timestamp for incremental ingestion                 |
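
As an illustration, several of these options can be combined in the source block of the Workflow stackSpec shown earlier. The filter and projection values below are hypothetical, and the exact encoding of the filter_ and projection documents may vary:

source:
  address: dataos://mongodbdepot
  options:
    source-table: "retail.customer"
    # hypothetical filter and projection documents
    filter_: '{"status": "active"}'
    projection: '{"customer_id": 1, "status": 1, "updated_at": 1}'
    chunk_size: 5000
    parallel: true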

Info

Nilus supports incremental batch ingestion by using a field (e.g., updated_at) to identify new or updated documents.

  • Field must be indexed for performance
  • Field must be consistently present in documents
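
A minimal sketch of the incremental options, assuming an indexed updated_at field (the field name and timestamp bounds are illustrative):

source:
  address: dataos://mongodbdepot
  options:
    source-table: "retail.customer"
    incremental-key: updated_at
    interval-start: "2024-01-01T00:00:00Z"
    interval-end: "2024-06-30T23:59:59Z"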

Batch ingestion can be driven by MongoDB aggregation pipelines, enabling complex transformations before loading.
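
For example, a pipeline can be appended to the source-table value using the database.collection:[aggregation_pipeline] format listed above (the pipeline stages shown here are illustrative):

source:
  address: dataos://mongodbdepot
  options:
    # database.collection:[aggregation_pipeline] format
    source-table: 'retail.customer:[{"$match": {"status": "active"}}, {"$project": {"customer_id": 1, "updated_at": 1}}]'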