Retrying Failed Jobs within Workflows¶
This documentation provides information on retrying failed Jobs within Workflows, offering various strategies to handle failures effectively.
To apply a retry strategy to a Job within a Workflow, use the following YAML configuration:
Retry Strategies¶
To counter the scenario of failed job within a Workflow, following retry strategies can be employed:
OnFailure
¶
This strategy involves retrying steps whose main container is marked as failed in Kubernetes. It is the default strategy when no other option is specified.
Always
¶
With this strategy, all steps that encounter failure will be retried.
OnError
¶
Retry steps that encounter errors or whose init or wait containers fail.
OnTransientError
¶
This strategy retries steps that encounter errors defined as transient or errors matching the TRANSIENT_ERROR_PATTERN
environment variable.
Examples¶
Below are two examples demonstrating the use of retry strategies in Workflow configurations.
Click here to view example manifest
Example 1# Resource Section
name: demo-retry
version: v1
type: workflow
tags:
- Flare
description: Ingest data into Raw depot
# Workflow-specific Section
workflow:
title: Demo Ingest Pipeline
dag:
# Job 1 specific Section
- name: connect-customer
file: flare/connect-customer/config_v1.yaml
retry: # Retry configuration
count: 2
strategy: "OnFailure"
# Job 2 specific Section
- name: connect-customer-dt
file: flare/connect-customer/dataos-tool_v1.yaml
dependencies:
- connect-customer
# Resource Section
name: c360-daggy
version: v1
type: workflow
tags:
- Flare
description: Ingest data into Raw depot
# Workflow-specific Section
workflow:
title: TWT Demo Ingest Pipeline
dag:
# Job 1 specific Section
- name: connect-customer
file: flare/connect-customer/configv1.yaml
retry: # Retry configuration
count: 2
strategy: "OnTransientError"
# Job 2 specific Section
- name: connect-customer-dt
file: flare/connect-customer/dataos-tool_v1.yaml
dependencies:
- connect-customer