
Single-Run Workflow

A Single-run Workflow represents a one-time execution of a sequence of jobs. It does not include a schedule section.

Illustration of a Single-run Workflow
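
Before the full example, here is a minimal sketch (with hypothetical resource and job names) of what makes a Workflow single-run: the workflow section carries only a dag. Adding a schedule block with a cron expression at the same level would turn it into a Scheduled Workflow instead.

single-run-skeleton.yml
version: v1
name: wf-sample-single-run # hypothetical name
type: workflow
workflow:
  # No schedule section here; a Scheduled Workflow would add
  # a schedule block (with a cron expression) at this level.
  dag:
    - name: sample-job # hypothetical job name
      spec:
        stack: flare:6.0
        compute: runnable-default
        stackSpec: {} # stack-specific configuration goes here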

Code Snippet

The following code snippet illustrates a Workflow with two jobs. The first job runs on the Flare Stack; it reads city data from the lakehouse depot in batch mode and writes it to the destination Depot. Once the first job completes its execution, the second job starts and performs profiling on the ingested data using the same Flare Stack.
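
The execution order described above is enforced through the dependencies field on the second job; the excerpt below (taken from the full manifest that follows) shows how the profiling job declares its dependency on the ingestion job.

    - name: lakehouse-city-profiling
      spec:
        # ...profiling job spec, see the full manifest below...
      dependencies:
        - city-data-ingestion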

Sample Single Run Workflow manifest
workflow.yml
version: v1
name: wf-city
type: workflow
owner: iamgroot
tags:
  - Tier.Gold
  - mpulse.altamed
description: The "wf-city" Workflow is a data pipeline focused on ingesting city data from lakehouse to lakehouse for the AltaMed healthcare provider. It involves stages such as data ingestion, transformation, and profiling.
workflow:
  title: City Data Pipeline
  dag: 
    - name: city-data-ingestion
      title: City Dimension Ingester
      description: The job ingests city data from dropzone into raw zone
      spec:
        tags:
          - Connect
          - City
        stack: flare:5.0 # The job gets executed upon the Flare Stack, so it's a Flare Job
        compute: runnable-default

        # Flare Stack-specific Section
        stackSpec:
          driver:
            coreLimit: 1100m
            cores: 1
            memory: 1048m
          job:
            explain: true  # The job section contains explain, log-level, inputs, outputs, and steps
            logLevel: INFO

            inputs:
              - name: city_connect
                query: SELECT
                        *,
                        date_format (NOW(), 'yyyyMMddHHmm') AS version1,
                        NOW() AS ts_city1
                      FROM
                        lakehouse.retail.city 
                options: 
                  SSL: "true"
                  driver: "io.trino.jdbc.TrinoDriver"
                  cluster: "system"

            outputs:
              - name: city_connect
                dataset: dataos://lakehouse:retail/city01?acl=rw
                format: Iceberg
                description: City data ingested from retail city
                tags:
                  - retail
                  - city
                options:
                  saveMode: overwrite

    - name: lakehouse-city-profiling
      title: Profiler City01
      description: The job performs profiling on city01 data
      spec:
        envs:
          DISABLE_RAW_FILE_SYSTEM_PERMISSION_SET: "true"
        tags:
          - Fides
          - City
          - has_profile
        stack: flare:5.0
        compute: runnable-default
        title: City Profile

        persistentVolume: # Define Persistent Volume
          name: persistent-v
          directory: fides 
        stackSpec:
          driver:
            coreLimit: 2400m
            cores: 2
            memory: 3072m
          executor:
            coreLimit: 2400m
            cores: 2
            instances: 1
            memory: 4096m
          job:
            explain: true
            logLevel: WARN

            inputs:
              - name: profile_city
                dataset: dataos://lakehouse:retail/city01 # Dataset Name
                format: iceberg

            profile:
              # basic | intermediate | advanced
              level: basic
              filters:
                - type: expression
                  expression: "state_code='AL'" # Filter Expression

          sparkConf:
            - spark.sql.shuffle.partitions: 10
            - spark.default.parallelism: 10

      dependencies:
        - city-data-ingestion

Here are its workflow and profiling manifests for a hands-on exercise (the commands to apply them follow after the manifests):

Sample Workflows

workflow.yml
# Resource Section
# This is a lakehouse-to-lakehouse workflow, hence its way of specifying inputs and outputs differs.
version: v1 
name: wf-tmdc-01 
type: workflow 
tags:
  - Connect
  - City
description: The job ingests city data from dropzone into raw zone

# Workflow-specific Section
workflow:
  dag: 

  # Job 1 Specific Section
    - name: wf-job1 # Job 1 name
      spec:
        tags:
          - Connect
          - City
        stack: flare:6.0 # The job gets executed upon the Flare Stack, so it's a Flare Job
        compute: runnable-default

        # Flare Stack-specific Section
        stackSpec:
          driver:
            coreLimit: 1100m
            cores: 1
            memory: 1048m
          job:
            explain: true  # The job section contains explain, log-level, inputs, outputs, and steps
            logLevel: INFO

            inputs:
              - name: city_connect
                query: SELECT
                        *,
                        date_format (NOW(), 'yyyyMMddHHmm') AS version1,
                        NOW() AS ts_city1
                      FROM
                        lakehouse.retail.city 
                #   dataset: dataos://lakehouse:retail/city
                #   format: Iceberg
                options: 
                  SSL: "true"
                  driver: "io.trino.jdbc.TrinoDriver"
                  cluster: "system"

                #   schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc #schema path is not necessary for lakehouse to lakehouse

            outputs:
              - name: city_connect
                dataset: dataos://lakehouse:retail/city01?acl=rw
                format: Iceberg
                description: City data ingested from retail city
                tags:
                  - retail
                  - city
                options:
                  saveMode: overwrite
profiling.yaml
version: v1
name: profiler-raw-city-01
type: workflow
tags:
  - Fides
  - City
  - has_profile
description: The job performs profiling on top of city data
workflow:
  title: City Profiler
  dag:
    - name: profiling-city01
      spec:
        envs:
          DISABLE_RAW_FILE_SYSTEM_PERMISSION_SET: "true"
        tags:
          - Fides
          - City
          - has_profile
        stack: flare:5.0
        compute: runnable-default

        persistentVolume: # Define Persistent Volume
          name: persistent-v
          directory: fides 
        stackSpec:
          driver:
            coreLimit: 2400m
            cores: 2
            memory: 3072m
          executor:
            coreLimit: 2400m
            cores: 2
            instances: 1
            memory: 4096m
          job:
            explain: true
            logLevel: WARN

            inputs:
              - name: profile_city
                dataset: dataos://lakehouse:retail/city01 # Dataset Name
                format: iceberg

            profile:
              # basic | intermediate | advanced
              level: basic
              filters:
                - type: expression
                  expression: "state_code='AL'" # Filter Expression

          sparkConf:
            - spark.sql.shuffle.partitions: 10
            - spark.default.parallelism: 10
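
Once saved, the two manifests can be applied one after the other with the DataOS CLI, for example dataos-ctl apply -f workflow.yml -w public followed by dataos-ctl apply -f profiling.yaml -w public (the public Workspace is assumed here), and the resulting Workflow runs can be listed with dataos-ctl get -t workflow -w public. The profiling Workflow should only be applied after the ingestion Workflow has written the dataos://lakehouse:retail/city01 dataset it profiles.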