Object Storage Configurations
To execute Flare Jobs on object storage depots such as Amazon S3, Azure ABFSS, Azure WASBS, and Google Cloud Storage, a corresponding depot must first be created. If the required depot has not yet been created, refer to the documentation on creating depots.
Depots created on top of supported object stores enable uniform interaction across platforms, including Azure Blob File System, Google Cloud Storage, and Amazon S3.
To run a Flare Job, the following information is required:
- The Uniform Data Locator (UDL) address of the input dataset (for read operations) or the output dataset (for write operations), as broken down in the sketch below.
- The file format of the associated data.
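As an illustrative aid (the depot, collection, and dataset names are taken from the examples later in this section), a UDL address breaks down as follows:
# Anatomy of a UDL address (illustrative):
#
#   dataos://[depot]:[collection]/[dataset]
#
# Example used in the read configuration below:
#
#   dataos://thirdparty01:sampledata/avro
#   depot:      thirdparty01
#   collection: sampledata
#   dataset:    avro   (the dataset path; here it points to Avro data)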
Common Configurations
Read Configuration
For reading the data, you need to configure the name, dataset, and format properties in the inputs section of the YAML. For instance, if your dataset name is city_connect, the UDL address of the dataset stored in Azure Blob Storage is dataos://thirdparty01:sampledata/avro, and the file format is avro, then the inputs section will be as follows:
inputs:
  - name: city_connect                              # name of the dataset
    dataset: dataos://thirdparty01:sampledata/avro  # address of the input dataset
    format: avro                                    # file format: avro, csv, json, orc, parquet, txt, xlsx, xml
Your Flare Jobs can read from multiple data sources. In such a scenario, you have to provide an array of data source definitions as shown below.
inputs:
  - name: sample_csv                                # name of the dataset
    dataset: dataos://thirdparty01:none/sample_city.csv               # address of the input dataset
    format: csv                                     # file format
    schemaPath: dataos://thirdparty01:default/schemas/avsc/city.avsc  # schema path
    schemaType:                                     # schema type
    options:                                        # additional options
      key1: value1                                  # data source-specific options
      key2: value2
  - name: sample_states                             # name of the dataset
    dataset: dataos://thirdparty01:none/states      # address of the input dataset
    format: csv                                     # file format
    # inline schema definition
    schema: "{\"type\":\"struct\",\"fields\":[{\"name\":\"country_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"country_id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"latitude\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"longitude\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}"
  - name: transaction_abfss                         # name of the dataset
    dataset: dataos://abfss01:default/transactions_abfss_01           # address of the input dataset
    format: avro                                    # file format
    options:                                        # additional options
      key1: value1                                  # data source-specific options
      key2: value2
  - name: input_customers                           # name of the dataset
    dataset: dataos://lakehouse:retail/customer     # address of the input dataset
    format: iceberg                                 # file format
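The options map carries key-value pairs that are specific to the data source or file format being read. As a hedged illustration only (the header and inferSchema keys below are assumptions based on typical Spark-style CSV readers, not values taken from this documentation), a concrete CSV input entry might look like this:
inputs:
  - name: sample_csv
    dataset: dataos://thirdparty01:none/sample_city.csv
    format: csv
    options:                  # assumed to be forwarded to the underlying format reader
      header: "true"          # hypothetical option: treat the first row as column names
      inferSchema: "true"     # hypothetical option: infer column types from the data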
Sample Read configuration YAML
Let’s take a case scenario where the dataset is stored in the Azure Blob File System (ABFSS) and you have to read data from the source, perform some transformation steps, and write it to the Lakehouse, which is a managed depot within DataOS. The read config YAML will be as follows:
version: v1
name: sanity-read-azure
type: workflow
tags:
  - Sanity
  - Azure
title: Sanity read from Azure
description: |
  The purpose of this workflow is to verify if we are able to read different
  file formats from azure abfss or not.
workflow:
  dag:
    - name: sanity-read-az-job
      title: Sanity read files from azure abfss
      description: |
        The purpose of this job is to verify if we are able to read different
        file formats from azure abfss or not.
      spec:
        tags:
          - Sanity
          - Abfss
        stack: flare:6.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            logLevel: INFO
            showPreviewLines: 2
            inputs:
              - name: a_city_csv
                dataset: dataos://sanityazure:sanity/azure_write_csv_14?acl=rw
                format: csv
              - name: a_city_json
                dataset: dataos://sanityazure:sanity/azure_write_json
                format: json
              - name: a_city_parquet
                dataset: dataos://sanityazure:sanity/azure_write_parquet
                format: parquet
            outputs:
              # csv
              - name: finalDf_csv
                dataset: dataos://lakehouse:smoketest/azure_read_csv_14?acl=rw
                format: iceberg
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - CSV
                title: Azure csv read sanity
                description: Azure csv read sanity
              # json
              - name: finalDf_json
                dataset: dataos://lakehouse:sanity/azure_read_json?acl=rw
                format: json
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - JSON
                title: Azure json read sanity
                description: Azure json read sanity
              # parquet
              - name: finalDf_parquet
                dataset: dataos://lakehouse:sanity/azure_read_parquet?acl=rw
                format: parquet
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - Parquet
                title: Azure parquet read sanity
                description: Azure parquet read sanity
            steps:
              - sequence:
                  - name: finalDf_csv
                    sql: SELECT * FROM a_city_csv LIMIT 10
                    functions:
                      - name: drop
                        columns:
                          - "__metadata_dataos_run_mapper_id"
Write Configuration
Note
The ?acl=rw suffix in the UDL indicates that the Access Control List (ACL) is configured with read-write permissions. The address of the output dataset can also be specified using the format dataos://[depot]:[collection]?acl=rw. The system will automatically append the name of the output dataset to this address.
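To make the two addressing styles concrete, the following sketch shows the same output written with a fully qualified dataset address and with a collection-level address (the depot, collection, and output names are reused from the snippet below):
outputs:
  # Option 1: fully qualified address (depot, collection, and dataset spelled out)
  - name: output01
    dataset: dataos://thirdparty01:sampledata/output01?acl=rw
    format: avro

  # Option 2: collection-level address; the system appends the output name (output01) automatically
  # - name: output01
  #   dataset: dataos://thirdparty01:sampledata?acl=rw
  #   format: avro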
For writing the data to a depot on an object store, you need to configure the name, dataset, and format properties in the outputs section of the YAML. For instance, if your dataset is to be stored at the UDL address dataos://thirdparty01:sampledata under the name output01 and the file format is avro, then the outputs section will be as follows:
outputs:
  - name: output01                                    # output name
    dataset: dataos://thirdparty01:sampledata?acl=rw  # address where the output is to be stored
    format: avro                                      # file format: avro, csv, json, orc, parquet, txt, xlsx, xml
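The write samples in this section additionally set options such as saveMode and partitionBy. As a hedged sketch (only overwrite appears in the samples here; treating append as the incremental alternative is an assumption), the snippet above could be extended like this:
outputs:
  - name: output01
    dataset: dataos://thirdparty01:sampledata?acl=rw
    format: avro
    options:
      saveMode: overwrite       # as used in the samples below; append is assumed to be the incremental alternative
      partitionBy:              # partition the written data by one or more columns
        - version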
Sample Write configuration YAML
Let’s take a case scenario where the output dataset is to be stored in an Azure Blob File System (ABFSS) depot, and you have to read data from a source depot within DataOS. The write config YAML will be as follows:
version: v1
name: azure-write-01-hive
type: workflow
tags:
  - Sanity
  - Azure
title: Sanity write to azure
description: |
  The purpose of this job is to verify if we are able to write different
  file formats into azure or wasbs or not.
workflow:
  dag:
    - name: azure-write-01-hive
      title: Sanity write files to azure
      description: |
        The purpose of this job is to verify if we are able to write different
        file formats into azure or wasbs or not.
      spec:
        tags:
          - Sanity
          - Azure
        stack: flare:6.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            logLevel: INFO
            showPreviewLines: 2
            inputs:
              - name: sanity_city_input
                dataset: dataos://thirdparty01:none/city?acl=rw
                format: csv
                schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc
            steps:
              - sequence:
                  - name: cities
                    doc: Pick all columns from cities and add version as yyyyMMddHHmm formatted timestamp.
                    sql: |
                      SELECT
                        *,
                        date_format (now(), 'yyyyMMddHHmm') AS version,
                        now() AS ts_city
                      FROM
                        sanity_city_input limit 10
            outputs:
              - name: cities
                dataset: dataos://azurehiveiceberg:hivetest/azure_hive_iceberg_write_12?acl=rw
                format: iceberg
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - CSV
                title: Azure csv sanity
                description: Azure csv sanity
Schema Configurations
This section describes schema configuration strategies used to manage and customize schemas for supported data sources within the Flare stack. For implementation guidance, refer to the Schema Configurations documentation.
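As a condensed sketch drawn from the multi-input example earlier in this section (the addresses are reused from that example, and the inline schema is truncated to a single field for brevity), a schema can be supplied either by reference or inline:
inputs:
  # Option 1: reference an external Avro schema (.avsc) file
  - name: sample_csv
    dataset: dataos://thirdparty01:none/sample_city.csv
    format: csv
    schemaPath: dataos://thirdparty01:default/schemas/avsc/city.avsc

  # Option 2: embed a Spark-style JSON struct schema inline
  - name: sample_states
    dataset: dataos://thirdparty01:none/states
    format: csv
    schema: "{\"type\":\"struct\",\"fields\":[{\"name\":\"country_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}"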
Data Formats Configurations
For detailed information on all supported formats, see Source Configurations by Data Formats. The following list provides format-specific configuration references for integrating various data sources with the Flare stack:
- AVRO – Describes how to configure AVRO files for source ingestion.
- CSV – Covers options for parsing and validating CSV-formatted input.
- Iceberg – Provides guidance on configuring Apache Iceberg table formats.
- JSON – Explains how to manage nested structures and data typing for JSON input.
- ORC – Details parameter settings for optimized ingestion of ORC files.
- Parquet – Outlines best practices for reading schema-aware Parquet data.
- Text – Defines configuration options for plain text data sources.
- XLSX – Specifies how to configure Excel spreadsheet (XLSX) ingestion.
- XML – Provides details on parsing and validating structured XML input.