Creating your first data pipeline

In this topic, you navigate through the process of building a batch data pipeline using DataOS Resources. By the end of this topic, you'll be able to create and manage efficient batch data pipelines in your own DataOS environment.

Scenario

As a Data Engineer or Data Product Developer, you are presented with a new challenge: your team needs a data pipeline to transfer data from one data source to another. This task involves orchestrating data ingestion, processing, and writing into a single, seamless pipeline.

Steps to follow

  1. Identify data sources: Determine the specific input and output data sources for the pipeline.
  2. Create Depots: Ensure Depots connect the pipeline's data sources to DataOS.
  3. Select the right DataOS Resource: Map the pipeline requirements to the most appropriate DataOS Resource.
  4. Choose the right Stack: Select the ideal Stack to handle the pipeline’s data processing needs.
  5. Create a manifest file: Configure the pipeline using a YAML manifest file.
  6. Apply the manifest file: Finally, activate the pipeline using the DataOS CLI.

Step 1: Identifying the data sources

Begin by identifying the input (Data Source A) and output (Data Source B) for the pipeline. Within DataOS, this involves understanding the characteristics of each data source:

  • Data source A: A PostgreSQL database containing raw transactional data.
  • Data source B: An S3 bucket for storing processed data in Parquet format.
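
Within DataOS, every source connected through a Depot is addressed with a Uniform Data Link (UDL) of the form dataos://[depot]:[collection]/[dataset]. The PostgreSQL address below comes from the example manifest later in this topic; the S3 address is hypothetical and depends on how your Depots are named.

dataos://[depot]:[collection]/[dataset]

dataos://sanitypostgres:public/postgres_write_12
dataos://yours3depot:processed/transactions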

Step 2: Creating Depots

In DataOS, pipelines can only be created for data sources connected through Depots. Ensure that Depots are built on top of both data sources; if they are not, refer to the Data Source Connectivity Module to establish Depots on the source systems.
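
If a Depot for the PostgreSQL source does not exist yet, a manifest along the following lines is a reasonable starting point. Treat this as a minimal sketch: the Depot name, host, database, and credential handling are assumptions, and the exact connector fields for your source are documented in the Data Source Connectivity Module.

version: v1
name: sanitypostgres                # Depot name later referenced as dataos://sanitypostgres
type: depot
layer: user
depot:
  type: POSTGRESQL                  # connector type (assumed); confirm against the connectivity docs
  description: Raw transactional data in PostgreSQL
  external: true
  connectionSecret:                 # inline credentials shown for brevity; prefer referencing Secrets
    - acl: rw
      type: key-value-properties
      data:
        username: <postgres-username>
        password: <postgres-password>
  spec:
    host: <postgres-host>           # hypothetical connection details
    port: 5432
    database: transactions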

Step 3: Identifying the right DataOS Resource

Review the three primary DataOS Resources used for building pipelines in DataOS—Workflow, Service, and Worker—to determine which fits your use case.

| Characteristic | Workflow | Service | Worker |
| --- | --- | --- | --- |
| Overview | Orchestrates sequences of tasks that terminate upon completion. | A continuous process that serves API requests. | Executes specific tasks indefinitely. |
| Execution Model | Batch processing using DAGs. | API-driven execution. | Continuous task execution. |
| Ideal Use Case | Batch data processing pipelines and scheduled jobs. | Real-time data retrieval or user interaction. | Event-driven or real-time analytics. |

Given the requirements for a batch pipeline, you can select the Workflow Resource, as it is designed for orchestrating multi-step data processing tasks.
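
Selecting the Workflow Resource means the pipeline is expressed as a directed acyclic graph (DAG) of jobs under a workflow section. The skeleton below is only a sketch to show the shape of a multi-step DAG; the job names are illustrative, and the complete, working manifest follows in Step 5.

version: v1
name: my-batch-pipeline             # hypothetical Workflow name
type: workflow
workflow:
  dag:
    - name: ingest-data             # first job in the DAG
      spec:
        stack: flare:6.0
        compute: runnable-default
    - name: transform-data          # runs only after ingest-data completes
      dependencies:
        - ingest-data
      spec:
        stack: flare:6.0
        compute: runnable-default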

Step 4: Identifying the right Stack

DataOS provides several pre-defined stacks to handle various processing needs. Based on the requirement, you need to select the appropriate processing Stack.

| Stack | Purpose |
| --- | --- |
| Scanner | Extracting metadata from source systems. |
| Flare | Batch data processing (ETL). |
| Benthos | Stream processing. |
| Soda | Data quality checks. |
| CLI Stack | Automated CLI command execution. |

For the given scenario, you can choose the Flare Stack for its robust batch data processing capabilities. The Flare Stack enables you to efficiently read, process, and write data.

Step 5: Creating the manifest file

After deciding upon the suitable processing Stack, you need to draft the manifest file to configure the pipeline. Specify the Workflow Resource, define the input and output data sources, and integrate the Flare Stack.

version: v1
name: postgres-read-01
type: workflow
tags:
  - bq
  - dataset
description: This workflow reads data from PostgreSQL and writes it to Icebase
title: Read Write Postgres
workflow:
  dag:
    - name: read-postgres
      title: Read Postgres
      description: This job reads data from PostgreSQL
      spec:
        tags:
          - Connect
        stack: flare:6.0
        compute: runnable-default

        flare:
          job:
            explain: true
            inputs:
              - name: city_connect
                dataset: dataos://sanitypostgres:public/postgres_write_12
                options:
                  driver: org.postgresql.Driver

            logLevel: INFO
            outputs:
              - name: input
                dataset: dataos://icebase:smoketest/postgres_read_12?acl=rw
                format: Iceberg
                options:
                  saveMode: overwrite
                description: Dataset read from the PostgreSQL source
                tags:
                  - Connect
                title: Postgres Dataset
            steps:
              - sequence:
                  - name: input
                    doc: Pick all columns from city_connect, limited to the first 10 rows.
                    sql: SELECT * FROM city_connect limit 10 

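The example manifest writes its output to the Icebase depot in Iceberg format. If Data Source B is an S3 bucket exposed through its own Depot, the output block can instead point at that Depot and request Parquet files; the Depot, collection, and dataset names below are hypothetical, and the supported format values are listed in the Flare Stack documentation.

            outputs:
              - name: input
                dataset: dataos://yours3depot:processed/transactions?acl=rw   # hypothetical S3-backed Depot
                format: parquet
                options:
                  saveMode: overwrite
                title: Processed Transactions
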
Step 6: Applying the manifest file

With the manifest file complete, use the DataOS CLI to deploy the pipeline:

dataos-ctl apply -f batch-data-pipeline.yaml

Verify the pipeline’s activation by checking the status of the Workflow Resource:

dataos-ctl get -t workflow -w public
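
To follow the run itself rather than just the Resource record, you can also inspect the Workflow's runtime. The command below reflects common dataos-ctl usage, but flags can vary across DataOS versions, so verify against your CLI's help output.

dataos-ctl get runtime -w public -t workflow -n postgres-read-01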

By the end of this process, you have successfully created a batch data pipeline that automates the transfer of data from PostgreSQL to S3. Ready to take on your next data pipeline challenge? Follow the next steps and start building your own workflows in DataOS!

Next steps

You are now equipped to handle batch data pipelines efficiently. As you move forward, you can explore additional features and capabilities in DataOS to enhance pipeline robustness and scalability: