Attributes of Workflow YAML Configuration

Structure of a Workflow manifest

workflow_manifest_reference.yml
name: ${resource_name} # Name of the Resource (mandatory)
version: v1beta # Manifest version of the Resource (mandatory)
type: workflow # Type of Resource (mandatory)
tags: # Tags for categorizing the Resource (optional)
  - ${tag_example_1} 
  - ${tag_example_2} 
description: ${resource_description} # Description (optional)
owner: ${resource_owner} # Owner of the Resource (optional, default value: user-id of user deploying the resource)
layer: ${resource_layer} # DataOS Layer (optional, default value: user)
workflow:
  title: ${title of workflow}
  schedule: 
    cron: ${*/10 * * * *}
    concurrencyPolicy: ${Allow}
    endOn: ${2022-01-01T23:40:45Z}
    timezone: ${Asia/Kolkata}

  dag: 
    - name: ${job1-name}
      description: ${description}
      title: ${title of job}
      tags:
        - ${tag1}
        - ${tag2}
      gcWhenComplete: true
      spec: 
        stack: ${flare:5.0}
        logLevel: ${INFO}
        configs: 
          ${alpha: beta}
        envs: 
          ${random: delta}
        secrets: 
          - ${mysecret}
        dataosSecrets:
          - name: ${mysecret}
            workspace: ${curriculum}
            key: ${newone}
            keys:
              - ${newone}
              - ${oldone}
            allKeys: ${true}
            consumptionType: ${envVars}
        dataosVolumes: 
          - name: ${myVolume}
            directory: ${/file}
            readOnly: ${true}
            subPath: ${/random}
        tempVolume: ${abcd}
        persistentVolume:
          name: ${myVolume}
          directory: ${/file}
          readOnly: ${true}
          subPath: ${/random}
        compute: ${compute resource name}
        resources:
          requests:
            cpu: ${100m}
            memory: ${100Gi}
          limits:
            cpu: ${100m}
            memory: ${100Gi}
        dryRun: ${true}
        runAsApiKey: ${abcdefghijklmnopqrstuvwxyz}
        runAsUser: ${iamgroot}
        topology:
          - name: ${abcd} 
            type: ${efgh} 
            doc: ${abcd efgh}
            properties: 
              ${alpha: random}
            dependencies: 
              - ${abc}
        file: ${abcd}
        retry: 
          count: ${2} 
          strategy: ${"OnTransientError"}
          duration: ${10s} 
          maxDuration: ${1m} 

Configuration

Resource meta section

This section serves as the header of the manifest file, defining the overall characteristics of the Workflow Resource you wish to create. It includes attributes common to all types of Resources in DataOS, which help DataOS identify, categorize, and manage the Resource within its ecosystem. To learn about the attributes of this section, refer to the following link: Attributes of Resource meta section.
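
For instance, a Resource meta section might look like the following sketch (all values are illustrative placeholders):

name: my-workflow # Name of the Resource (mandatory)
version: v1beta # Manifest version of the Resource (mandatory)
type: workflow # Type of Resource (mandatory)
tags: # Tags for categorizing the Resource (optional)
  - example-tag
description: An example Workflow # Description (optional)
owner: iamgroot # Owner of the Resource (optional)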

Workflow-specific Section

This section comprises attributes specific to the Workflow Resource. The attributes within the section are listed below:

workflow

Description: the workflow section encompasses the attributes that define the Workflow Resource, including its schedule and DAG.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| mapping | mandatory | none | none |

Example Usage:

workflow: 
  schedule: 
    cron: '*/10 * * * *' 
  dag: 
    {} # List of Jobs
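
In context, a minimal Workflow manifest combines the Resource meta section with the workflow section. The sketch below is illustrative; the Resource name and job are placeholders:

name: my-first-workflow
version: v1beta
type: workflow
workflow: 
  dag: 
    - name: sample-job
      spec: 
        stack: flare:5.0 
        compute: runnable-default 
        stackSpec: 
          {} # Flare Stack-specific attributes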

title

Description: Title of Workflow

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | any string |

Example Usage:

title: Quality Assessment Workflow 

schedule

Description: the schedule section encompasses the attributes for scheduling the Workflow, such as the cron expression and concurrency policy.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| mapping | optional (mandatory for Scheduled Workflows) | none | none |

Example Usage:

schedule: 
  cron: '*/10 * * * *' 
  concurrencyPolicy: Forbid 

cron

Description: the cron attribute encompasses the cron expression, a string of space-separated sub-expressions (fields) that together specify the details of the schedule.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional (mandatory for Scheduled Workflows) | none | any valid cron expression |

Additional Details: the cron expression consists of values separated by white space; make sure there are no formatting issues.
Example Usage:

cron: '*/10 * * * *' 

concurrencyPolicy

Description: the concurrencyPolicy attribute determines how concurrent Workflow runs, created by a scheduled Workflow, are handled.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | Allow | Allow/Forbid/Replace |

Additional Details:

  • concurrencyPolicy: Forbid - When the concurrencyPolicy is set to Forbid, the Schedule/Cron Workflow strictly prohibits concurrent runs. In this scenario, if it is time for a new Workflow run and the previous Workflow run is still in progress, the cron Workflow will skip the new Workflow run altogether.
  • concurrencyPolicy: Allow - On the other hand, setting the concurrencyPolicy to Allow enables the Schedule/Cron Workflow to accommodate concurrent executions. If it is time for a new Workflow run and the previous Workflow run has not completed yet, the cron Workflow will proceed with the new Workflow run concurrently.
  • concurrencyPolicy: Replace - When the concurrencyPolicy is set to Replace, the Schedule/Cron Workflow handles concurrent executions by replacing the currently running Workflow run with a new one if it is time for the next Workflow run and the previous one is still in progress.

Example Usage:

concurrencyPolicy: Replace 
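
Taken together, a schedule section for a scheduled Workflow might look like the following sketch (endOn and timezone are documented below; all values are illustrative):

schedule: 
  cron: '*/10 * * * *'        # run every 10 minutes
  concurrencyPolicy: Forbid   # skip a new run while the previous one is in progress
  endOn: 2022-01-01T23:40:45Z # stop scheduling runs after this time
  timezone: Asia/Kolkata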

endOn

Description: endOn terminates the scheduled Workflow run at the specified time, even if the last Workflow run triggered before the threshold time is not yet complete.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | any time provided in ISO 8601 format |

Example Usage:

The timestamp 2022-01-01T23:30:45Z follows the ISO 8601 format:

  • Date: 2022-01-01 (YYYY-MM-DD)
  • T: separator indicating the start of the time portion of the datetime string.
  • Time: 23:30:45 (hh:mm:ss)
  • Z: indicates the time is in Coordinated Universal Time (UTC), also known as Zulu time.

It represents January 1, 2022, at 23:30:45 UTC.

endOn: 2022-01-01T23:30:45Z 

timezone

Description: Time zone for scheduling the workflow.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | mandatory | none | Asia/Kolkata, America/Los_Angeles, etc. |

Example Usage:

timezone: Asia/Kolkata

dag

Description: DAG is a Directed Acyclic Graph, a conceptual representation of a sequence of jobs (or activities). These jobs in a DAG are executed in the order of dependencies between them.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| list of mappings | mandatory | none | none |

Additional Details: there should be at least one job within a DAG.
Example Usage:

dag: 
  - name: profiling-job 
    spec: 
      stack: flare:5.0 
      compute: runnable-default 
      stackSpec: 
        {} # Flare Stack-specific attributes

name

Description: name of the Job

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | mandatory | none | any string conforming to the regex [a-z0-9]([-a-z0-9]*[a-z0-9]) with a length of 48 characters or less |

Example Usage:

name: flare-ingestion-job 


title

Description: title of Job

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | any string |

Example Usage:

title: Profiling Job 


description

Description: text describing the Job

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | any string |

Example Usage:

description: The job ingests customer data 

tags

Description: tags associated with the Job.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| list of strings | optional | none | valid tags |

Example Usage:

tags:
  - tag1
  - tag2

gcWhenComplete

Description: indicates whether to garbage collect the Job's runtime resources once it completes.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| boolean | optional | none | true or false |

Example Usage:

gcWhenComplete: true

spec

Description: Specs of the Job.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| mapping | mandatory | none | none |

Example Usage:

spec: 
  stack: flare:5.0 
  compute: runnable-default 
  stackSpec: 
    {} # Flare Stack specific configurations

stack

Description: The name and version of the Stack Resource which the Workflow orchestrates.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | mandatory | none | flare, toolBox, scanner, dataos-ctl, soda+python, steampipestack |

Additional Details: To know more about each stack, go to Stack.

Example Usage:

stack: flare:5.0

logLevel

Description: the log level for the Workflow classifies log entries in terms of severity, which helps filter logs during search and controls the amount of information in the logs.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | INFO | INFO, WARN, DEBUG, ERROR |

Additional Details: 

  • INFO: Designates informational messages that highlight the progress of the Workflow.

  • WARN: Designates potentially harmful situations.

  • DEBUG: Designates fine-grained informational events that are most useful while debugging.

  • ERROR: Designates error events that might still allow the workflow to continue running.

Example Usage:

logLevel: DEBUG

configs

Description: additional optional configuration for the Workflow.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| mapping | optional | none | key-value configurations |

Example Usage:

configs:
  key1: value1
  key2: value2

envs

Description: environment variables for the Workflow.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| mapping | optional | none | key-value configurations |

Example Usage:

envs:
  DEPOT_SERVICE_URL: http://depotservice-api.depot.svc.cluster.local:8000/ds/
  HTTP_CONNECT_TIMEOUT_MS: 60000
  HTTP_SOCKET_TIMEOUT_MS: 60000

Additional Details:

  • DEPOT_SERVICE_URL: Specifies the base URL for the Depot Service API. This is the endpoint that the service interacts with for managing Depots.
  • HTTP_CONNECT_TIMEOUT_MS: Defines the connection timeout for HTTP requests, in milliseconds. If a connection to a remote server cannot be established within this timeframe (60 seconds in this case), the request will timeout. This ensures that the workload does not hang indefinitely while attempting to connect.
  • HTTP_SOCKET_TIMEOUT_MS: Sets the socket timeout for HTTP requests, in milliseconds. This controls the maximum time that the service will wait for data after a connection has been established. If data is not received from the connected server within this period (60 seconds), the request will timeout. This helps prevent long delays in response handling when waiting for data transfer.
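
In context, these environment variables sit under a job's spec. The sketch below is illustrative; the job name and stack are placeholders, and the values are the ones discussed above:

dag: 
  - name: sync-job
    spec: 
      stack: flare:5.0 
      compute: runnable-default 
      envs: 
        DEPOT_SERVICE_URL: http://depotservice-api.depot.svc.cluster.local:8000/ds/
        HTTP_CONNECT_TIMEOUT_MS: 60000
        HTTP_SOCKET_TIMEOUT_MS: 60000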

secrets

Description: list of secrets associated with the Workflow.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| list of strings | optional | none | none |

Example Usage:

secrets:
  - mysecret

dataosSecrets

Description: list of DataOS Secrets associated with the Workflow. Each DataOS Secret is a mapping containing various attributes.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| list of mappings | optional | none | none |

Example Usage:

dataosSecrets:
  - name: mysecret
    workspace: curriculum
    key: newone
    keys:
      - newone
      - oldone
    allKeys: true
    consumptionType: envVars

dataosVolumes

Description: list of DataOS Volumes associated with the Workflow. Each DataOS Volume is a mapping containing various attributes.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| list of mappings | optional | none | none |

Example Usage:

dataosVolumes:
  - name: myVolume
    directory: /file
    readOnly: true
    subPath: /random

tempVolume

Description: The temporary volume of the Workflow.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | any valid Volume name |

Example Usage:

tempVolume: abcd

persistentVolume

Description: configuration for the persistent volume associated with the Workflow.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| mapping | optional | none | none |

Example Usage:

persistentVolume:
  name: myVolume
  directory: /file
  readOnly: true
  subPath: /random

compute

Description: the name of the Compute Resource for the Workflow.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | mandatory | none | valid runnable-type Compute Resource name |

Example Usage:

compute: runnable-default

resources

Description: Resource requests and limits for the Workflow. This includes CPU and memory specifications.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| mapping | optional | none | none |

Example Usage:

resources:
  requests:
    cpu: 100m
    memory: 100Gi
  limits:
    cpu: 100m
    memory: 100Gi

dryRun

Description: Indicates whether the workflow is in dry run mode. When enabled, the dryRun property deploys the Workflow to the cluster without submitting it.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| boolean | optional | true | true or false |

Example Usage:

dryRun: true

runAsUser

Description: when the runAsUser attribute is configured with the UserID of the use-case assignee, it grants the authority to perform operations on behalf of that user.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | UserID of the Use Case Assignee |

Example Usage:

runAsUser: iamgroot 


runAsApiKey

Description: The runAsApiKey attribute allows a user to assume another user's identity by providing the latter's API key.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | mandatory | none | any valid API key |

Additional Details: The apikey can be obtained by executing the following command from the CLI:

dataos-ctl user apikey get

If no apikey is available, the command below can be run to create a new one:

dataos-ctl user apikey create -n ${{name of the apikey}} -d ${{duration for the apikey to live}}
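
For instance, with an illustrative name and duration (assuming the CLI accepts duration strings such as 24h):

dataos-ctl user apikey create -n myapikey -d 24h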

Example Usage:

runAsApiKey: abcdefghijklmnopqrstuvwxyz

topology

Description: The topology attribute is used to define the topology of the Workflow. It specifies the elements and dependencies within the Workflow's topology.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| list of mappings | mandatory | none | list of topology element definitions |

Example Usage:

topology:
  - name: random            # mandatory
    type: alpha             # mandatory
    doc: new                # Documentation for the element
    properties:
      random: lost          # Custom properties for the element
    dependencies:
      - new1
      - new2

file

Description: attribute for specifying the file path for a Workflow YAML

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | none |

Example Usage:

file: workflow/new/random.yaml

retry

Description: configuration for retrying failed jobs.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| mapping | optional | none | none |

Example Usage:

retry: 
  count: 2 
  strategy: "OnFailure" 

count

Description: the number of times a failed job is retried.

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| integer | optional | none | any positive integer |

Example Usage:

count: 2 

strategy

Description: strategies to choose which job failures to retry

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | Always/OnFailure/OnError/OnTransientError |

Additional Details:
  • Always - Retry all failed steps.
  • OnFailure - Retry steps whose main container is marked as failed in Kubernetes (this is the default).
  • OnError - Retry steps that encounter errors or whose init or wait containers fail.
  • OnTransientError - Retry steps that encounter errors defined as transient or errors matching the TRANSIENT_ERROR_PATTERN environment variable.

Example Usage:

strategy: "OnTransientError" 

dependencies

Description: specifies the dependency between jobs/Workflows

| Data Type | Requirement | Default Value | Possible Value |
| --- | --- | --- | --- |
| string | optional | none | none |

Example Usage:

dependencies: job2
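
For example, a DAG in which a profiling job runs only after an ingestion job completes might look like the following sketch (job names are illustrative):

dag: 
  - name: ingestion-job
    spec: 
      stack: flare:5.0 
      compute: runnable-default 
      stackSpec: 
        {} # Flare Stack-specific attributes
  - name: profiling-job
    dependencies: ingestion-job # runs after ingestion-job completes
    spec: 
      stack: flare:5.0 
      compute: runnable-default 
      stackSpec: 
        {} # Flare Stack-specific attributes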