Workflow

The Workflow in DataOS serves as a Resource for orchestrating data processing tasks with dependencies. It enables the creation of complex data workflows by defining a hierarchy of jobs based on a dependency mechanism. To learn more about Workflows, refer to the following link: Core Concepts.

Workflow in the Data Product Lifecycle

Workflows are integral to the transformation phase in the Data Product Lifecycle. They are particularly useful when your transformation involves:

  • Definite Execution: a sequence of tasks, jobs, or processes that operates on batch data and terminates upon successful completion or failure. For example, a Workflow moving data from point A to point B.

  • Execution Processes: processing data in discrete chunks, either in parallel or in a given order, as a sequence of jobs within a DAG (see the dependency sketch after this list).

  • Independent Processing: performing data transformation, ingestion, or syndication, or automating job execution based on a cron expression.
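
Within a DAG, the order of execution is declared through dependencies between jobs. The snippet below is a minimal sketch of a two-job DAG in which the second job runs only after the first completes; it assumes the dependencies attribute on a job, and the names, stack version, and empty stackSpec are illustrative placeholders.

version: v1
name: wf-sample-dag
type: workflow
description: Illustrative two-job Workflow
workflow:
  title: Sample DAG
  dag:
    - name: extract-job               # runs first
      spec:
        stack: flare:6.0
        compute: runnable-default
        stackSpec: {}                 # Flare job definition (inputs, outputs, steps)
    - name: load-job                  # runs only after extract-job succeeds
      dependencies:
        - extract-job
      spec:
        stack: flare:6.0
        compute: runnable-default
        stackSpec: {}                 # Flare job definition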

Structure of Workflow manifest

workflow_manifest_structure.yml
# Resource Section
# This is an Icebase-to-Icebase workflow, so its inputs and outputs are specified differently.
version: v1 
name: wf-tmdc-01 
type: workflow 
tags:
  - Connect
  - City
description: The job ingests city data from dropzone into raw zone

# Workflow-specific Section
workflow:
  title: Connect City
  dag: 

# Job 1 Specific Section
    - name: wf-job1 # Job 1 name
      title: City Dimension Ingester
      description: The job ingests city data from dropzone into raw zone
      spec:
        tags:
          - Connect
          - City
        stack: flare:6.0 # The job is executed on the Flare Stack, so it is a Flare Job
        compute: runnable-default

        # Flare Stack-specific Section
        stackSpec:
          driver:
            coreLimit: 1100m
            cores: 1
            memory: 1048m
          job:
            explain: true  # The job section contains explain, logLevel, inputs, outputs, and steps
            logLevel: INFO

            inputs:
              - name: city_connect
                query: |
                  SELECT
                    *,
                    date_format(NOW(), 'yyyyMMddHHmm') AS version1,
                    NOW() AS ts_city1
                  FROM
                    icebase.retail.city
                # dataset: dataos://icebase:retail/city
                # format: Iceberg
                options: 
                  SSL: "true"
                  driver: "io.trino.jdbc.TrinoDriver"
                  cluster: "system"

                # schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc # schemaPath is not necessary for Icebase-to-Icebase workflows

            outputs:
              - name: city_connect
                dataset: dataos://icebase:retail/city01?acl=rw
                format: Iceberg
                description: City data ingested from retail city
                tags:
                  - retail
                  - city
                options:
                  saveMode: overwrite

First Steps

A Workflow Resource in DataOS can be created by applying the manifest file using the DataOS CLI. To learn more about this process, navigate to the link: First steps.
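
As a quick illustration, the commands below sketch how a Workflow manifest might be applied and inspected from the DataOS CLI; the file name and the public Workspace are assumptions for this example.

# Apply the Workflow manifest to a Workspace (the 'public' Workspace is assumed here)
dataos-ctl apply -f workflow_manifest_structure.yml -w public

# Check the status of Workflows in the same Workspace
dataos-ctl get -t workflow -w public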

Configuration

Workflows can be configured to autoscale to match varying workload demands, reference pre-defined Secrets and Volumes, and more. DataOS supports two types of Workflows: single-run and scheduled Workflows, each with its own YAML syntax. The specific configurations may vary depending on the use case. For a detailed breakdown of the configuration options and attributes, please refer to the documentation: Attributes of Workflow manifest.
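
For reference, the snippet below is a minimal sketch of the scheduled variant, assuming a schedule section with a cron attribute under workflow; the cron expression, names, and empty stackSpec are placeholders.

version: v1
name: wf-scheduled-sample
type: workflow
description: Illustrative scheduled Workflow
workflow:
  title: Scheduled Sample
  schedule:
    cron: '*/30 * * * *'              # placeholder: run every 30 minutes
  dag:
    - name: sample-job
      spec:
        stack: flare:6.0
        compute: runnable-default
        stackSpec: {}                 # Flare job definition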

Recipes

Workflows orchestrate Stacks to accomplish myriad tasks. Below are some recipes to help you configure and utilize Workflows effectively: