Best Practices¶
Following the given recommendations helps ensure stable behavior, avoids runtime warnings, and supports efficient execution of data pipelines.
1. Identifier Normalization¶
Purpose: Avoid unexpected transformations due to automatic normalization of identifiers.
- Use underscores (
_
) instead of hyphens (-) in all custom identifiers, such as:dataset_name
destination_table
topic_prefix
- Hyphens are automatically converted to underscores during runtime, which can affect:
- Database schema names
- Folder paths
- This transformation does not alter the original configuration in the pipeline manifest.
Example Warning:
[WARNING]|_normalize_identifier:243|Due to normalization dataset name got changed from 'A-test-01' to 'a_test_01'...
Recommendation:
Avoid using hyphens in identifiers to prevent normalization and potential naming mismatches.
2. Replication Slot Management (PostgreSQL CDC Service)¶
Purpose: Prevent conflicts and ensure consistent replication behavior in Change Data Capture Service.
- Each CDC Service must be assigned a unique replication slot.
- Reusing a replication slot across services may result in:
- Replication state conflicts
- Duplicate or inconsistent data
- Service errors
Example Error:
Recommendation:
Define a dedicated replication slot per CDC Service to maintain accurate state tracking and avoid replication overlap.
3. Defining Incremental and Primary Keys in Batch Workflow¶
Purpose: Enable accurate delta processing and maintain data integrity in Nilus batch workflows.
- Explicitly declare both
primary-key
andincremental-key
undersource.options
. - These keys determine:
- How data changes are identified between runs
- Uniqueness and deduplication logic
Example Manifest:
name: nilus-batch-pg-test
version: v1
type: workflow
tags:
- workflow
- nilus-batch
description: Nilus Batch Service Sample
workspace: public
workflow:
schedule:
cron: '*/4 * * * *'
concurrencyPolicy: Allow
dag:
- name: batch-job-01
spec:
stack: nilus:1.0
compute: runnable-default
resources:
requests:
cpu: 100m
memory: 128Mi
logLevel: INFO
stackSpec:
source:
address: dataos://ncdcpostgres
options:
source-table: ecom.product
primary-key: "id"
incremental-key: "id"
sink:
address: dataos://testinglh
options:
dest-table: B_007_batch.ecom_product_test
incremental-strategy: append
Recommendation:
Always specify both primary-key
and incremental-key
. Omitting these keys can result in faulty delta detection and incorrect replication.
4. Specifying Heartbeat Settings in PostgreSQL CDC Service¶
Purpose: Maintain real-time change data capture (CDC) synchronization and regularly check the health of replication streams.
- Always define both
heartbeat.interval.ms
andtopic.heartbeat.prefix
undersource.options
. - These properties help maintain reliable CDC operations by:
- Detecting connectivity issues between source and sink
- Emitting regular heartbeat messages even during periods of low change volume
Example Manifest:
name: ${{service-name}} # Service identifier
version: v1 # Version of the service
type: service # Defines the resource type
tags: # Classification tags
- ${{tag}}
- ${{tag}}
description: Nilus CDC Service for Postgres description # Description of the service
workspace: public # Workspace where the service is deployed
service: # Service specification block
servicePort: 9010 # Service port
replicas: 1 # Number of replicas
logLevel: INFO # Logging level
compute: ${{query-default}} # Compute profile
persistentVolume: # Persistent volume configuration
name: ${{ncdc-vo1-01}} # Volume name (multiple options commented)
directory: ${{nilus_01}} # Target directory within the volume
stack: nilus:3.0 # Nilus stack version
stackSpec: # Stack specification
source: # Source configuration block
address: dataos://postgresdepot # Source depot address/UDL
options: # Source-specific options
engine: debezium # Required for CDC; not used for batch ingestion
table.include.list: "public.customers" # Tables to include from source
topic.prefix: "cdc_changelog" # Required topic prefix, can be customized
slot.name: "test3" # Required replication slot name, must be unique
heartbeat.interval.ms: 60000 # Required heartbeat interval (ms)
topic.heartbeat.prefix: "nilus_heartbeat" # Required heartbeat topic prefix
sink: # Sink configuration block
address: dataos://testinghouse # Sink DataOS Lakehouse address
options: # Sink-specific options
dest-table: pgdb_lenovo_test_004 # Destination table name in sink
incremental-strategy: append # Append mode for CDC write strategy
aws_region: us-west-2 # AWS region for S3-backed DataOS Lakehouse
Recommendation:
Always include heartbeat.interval.ms
and topic.heartbeat.prefix
in CDC for PostgreSQL. Their absence can lead to undetected pipeline failures or broken replication streams.