Workflow¶
The Workflow in DataOS serves as a Resource for orchestrating data processing tasks with dependencies. It enables the creation of complex data pipelines by defining a hierarchy of jobs linked through a dependency mechanism. To learn more about Workflows, refer to Core Concepts.
Workflow in the Data Product Lifecycle¶
Workflows are integral to the transformation phase in the Data Product Lifecycle. They are particularly useful when your transformation involves:
- Definite Execution: Sequences of tasks, jobs, or processes that operate on batch data and terminate upon successful completion or failure. For example, a Workflow that moves data from point A to point B.
- Execution Processes: Processing data in discrete chunks, in parallel or in a given ordered sequence of jobs within a DAG (Directed Acyclic Graph).
- Independent Processing: Performing data transformation, ingestion, or syndication, or automating job execution based on a cron expression.
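Ordering within a DAG is expressed through job dependencies. The fragment below is a minimal sketch of two jobs where the second runs only after the first succeeds; the job names and the `dependencies` wiring shown here are illustrative placeholders, not part of the manifest documented on this page.

```yaml
workflow:
  dag:
    - name: ingest-city            # hypothetical first job
      spec:
        stack: flare:5.0
        compute: runnable-default
    - name: profile-city           # hypothetical second job
      dependencies:
        - ingest-city              # runs only after ingest-city completes successfully
      spec:
        stack: flare:5.0
        compute: runnable-default
```

Jobs with no dependency between them are free to run in parallel; the dependency list is what turns a flat set of jobs into an ordered graph.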
Structure of Workflow manifest¶
```yaml
# Resource Section
# This is an Icebase-to-Icebase workflow, so its inputs and outputs are specified differently.
version: v1
name: wf-tmdc-01
type: workflow
tags:
  - Connect
  - City
description: The job ingests city data from dropzone into raw zone

# Workflow-specific Section
workflow:
  title: Connect City
  dag:
    # Job 1 Specific Section
    - name: wf-job1 # Job 1 name
      title: City Dimension Ingester
      description: The job ingests city data from dropzone into raw zone
      spec:
        tags:
          - Connect
          - City
        stack: flare:5.0 # The job is executed on the Flare Stack, so it's a Flare Job
        compute: runnable-default

        # Flare Stack-specific Section
        stackSpec:
          driver:
            coreLimit: 1100m
            cores: 1
            memory: 1048m
          job:
            explain: true # The job section contains explain, log-level, inputs, outputs, and steps
            logLevel: INFO
            inputs:
              - name: city_connect
                query: |
                  SELECT
                    *,
                    date_format(NOW(), 'yyyyMMddHHmm') AS version1,
                    NOW() AS ts_city1
                  FROM
                    icebase.retail.city
                # dataset: dataos://icebase:retail/city
                # format: Iceberg
                options:
                  SSL: "true"
                  driver: "io.trino.jdbc.TrinoDriver"
                  cluster: "system"
                # schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc # schema path is not necessary for Icebase-to-Icebase
            outputs:
              - name: city_connect
                dataset: dataos://icebase:retail/city01?acl=rw
                format: Iceberg
                description: City data ingested from retail city
                tags:
                  - retail
                  - city
                options:
                  saveMode: overwrite
```
First Steps¶
A Workflow Resource in DataOS is created by applying the manifest file using the DataOS CLI. To learn more about this process, refer to First Steps.
Configuration¶
Workflows can be configured to autoscale and match varying workload demands, reference pre-defined Secrets and Volumes, and more. DataOS supports two types of Workflows: single-run and scheduled Workflows, each with its own YAML syntax. The specific configurations may vary depending on the use case. For a detailed breakdown of the configuration options and attributes, please refer to the documentation: Attributes of Workflow manifest.
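As a hedged sketch of the scheduled variant, the fragment below adds a `schedule` section under `workflow`. The cron expression, `concurrencyPolicy` value, and job name here are illustrative assumptions; consult the Attributes of Workflow manifest documentation for the exact fields supported by your DataOS version.

```yaml
workflow:
  schedule:
    cron: '*/10 * * * *'        # illustrative: trigger every 10 minutes
    concurrencyPolicy: Forbid   # assumed: skip a run if the previous one is still active
  dag:
    - name: scheduled-job       # hypothetical job name
      spec:
        stack: flare:5.0
        compute: runnable-default
```

A single-run Workflow simply omits the `schedule` section and executes its DAG once when applied.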
Recipes¶
Workflows orchestrate Stacks to accomplish myriad tasks. Below are some recipes to help you configure and utilize Workflows effectively: