Retrying Failed Jobs within Workflows¶

This documentation provides information on retrying failed Jobs within Workflows, offering various strategies to handle failures effectively.

To apply a retry strategy to a Job within a Workflow, use the following YAML configuration:

CommandCommand

retry:
  count: {{retry count}}
  strategy: {{retry strategy}} # Possible Strategies: Always/OnFailure/OnError/OnTransientError

retry:
  count: 2
  strategy: OnFailure # Possible Strategies: Always/OnFailure/OnError/OnTransientError

Retry Strategies¶

To counter the scenario of failed job within a Workflow, following retry strategies can be employed:

`OnFailure`¶

This strategy involves retrying steps whose main container is marked as failed in Kubernetes. It is the default strategy when no other option is specified.

`Always`¶

With this strategy, all steps that encounter failure will be retried.

`OnError`¶

Retry steps that encounter errors or whose init or wait containers fail.

`OnTransientError`¶

This strategy retries steps that encounter errors defined as transient or errors matching the TRANSIENT_ERROR_PATTERN environment variable.

Examples¶

Below are two examples demonstrating the use of retry strategies in Workflow configurations.

Click here to view example manifest

Example 1

# Resource Section
name: demo-retry
version: v1
type: workflow
tags:
  - Flare
description: Ingest data into Raw depot

# Workflow-specific Section
workflow:
  title: Demo Ingest Pipeline
  dag:

# Job 1 specific Section
    - name: connect-customer
      file: flare/connect-customer/config_v1.yaml
      retry: # Retry configuration
        count: 2
        strategy: "OnFailure"

# Job 2 specific Section
    - name: connect-customer-dt
      file: flare/connect-customer/dataos-tool_v1.yaml
      dependencies:
        - connect-customer

Example 2

# Resource Section
name: c360-daggy
version: v1
type: workflow
tags:
- Flare
description: Ingest data into Raw depot

# Workflow-specific Section
workflow:
  title: TWT Demo Ingest Pipeline
  dag:

# Job 1 specific Section
  - name: connect-customer
    file: flare/connect-customer/configv1.yaml
    retry: # Retry configuration
      count: 2
      strategy: "OnTransientError"

# Job 2 specific Section
  - name: connect-customer-dt
    file: flare/connect-customer/dataos-tool_v1.yaml
    dependencies:
      - connect-customer

Retrying Failed Jobs within Workflows¶

Retry Strategies¶

OnFailure¶

Always¶

OnError¶

OnTransientError¶

Examples¶

`OnFailure`¶

`Always`¶

`OnError`¶

`OnTransientError`¶