Object Storage Depots
To execute Flare Jobs on top of object storage depots such as Amazon S3, Azure ABFSS, Azure WASBS, and Google Cloud Storage, you first need to create a depot. If you have already created a depot, continue reading.
By creating depots on top of object stores, you can interact with all supported storages, i.e., Azure Blob File System, Google Cloud Storage, and Amazon S3, in a uniform way. To run a Flare Job, all you need is the UDL address of the input or output dataset (for the reading and writing scenarios, respectively). Apart from this, you also need the file format of the data.
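A UDL address generally follows the pattern dataos://[depot]:[collection]/[dataset]; the breakdown below is only an orientation sketch based on the addresses used in this section.

# dataos://[depot]:[collection]/[dataset]
#   depot      -> thirdparty01   # the object storage depot
#   collection -> sampledata     # grouping within the depot
#   dataset    -> avro           # the dataset to read or write
dataset: dataos://thirdparty01:sampledata/avro
format: avro   # file format of the data at this address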
Common Configurations
Read Config
For reading the data, we need to configure the name, dataset, and format properties in the inputs section of the YAML. For instance, if your dataset name is city_connect, the UDL address of the dataset stored in Azure Blob Storage is dataos://thirdparty01:sampledata/avro, and the file format is avro, then the inputs section will be as follows:
inputs:
  - name: city_connect # name of the dataset
    dataset: dataos://thirdparty01:sampledata/avro # address of the input dataset
    format: avro # file format: avro, csv, json, orc, parquet, txt, xlsx, xml
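For orientation, the sketch below (with illustrative workflow, job, and step names) shows where this inputs section sits inside a complete Flare workflow; the full read and write samples later in this section follow the same layout.

version: v1
name: read-city-connect            # illustrative workflow name
type: workflow
workflow:
  dag:
    - name: read-city-connect-job  # illustrative job name
      spec:
        stack: flare:6.0
        compute: runnable-default
        flare:
          job:
            inputs:
              - name: city_connect
                dataset: dataos://thirdparty01:sampledata/avro
                format: avro
            steps:
              - sequence:
                  - name: cities   # illustrative step name
                    sql: SELECT * FROM city_connect LIMIT 10
            outputs:
              - name: cities
                dataset: dataos://icebase:retail?acl=rw   # illustrative output address
                format: iceberg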
Your Flare Jobs can read from multiple data sources. In such a scenario, you have to provide an array of data source definitions as shown below.
inputs:
  - name: sample_csv # name of the dataset
    dataset: dataos://thirdparty01:none/sample_city.csv # address of the input dataset
    format: csv # file format
    schemaPath: dataos://thirdparty01:default/schemas/avsc/city.avsc # schema path
    schemaType: # schema type
    options: # additional options
      key1: value1 # data source-specific options
      key2: value2
  - name: sample_states # name of the dataset
    dataset: dataos://thirdparty01:none/states # address of the input dataset
    format: csv # file format
    schema: "{\"type\":\"struct\",\"fields\":[{\"name\":\"country_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"country_id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"latitude\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"longitude\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}" # schema
  - name: transaction_abfss # name of the dataset
    dataset: dataos://abfss01:default/transactions_abfss_01 # address of the input dataset
    format: avro # file format
    options: # additional options
      key1: value1 # data source-specific options
      key2: value2
  - name: input_customers # name of the dataset
    dataset: dataos://icebase:retail/customer # address of the input dataset
    format: iceberg # file format
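The options block is passed through to the underlying reader, so the valid keys depend on the data source and format. As a hedged illustration only (header and inferSchema are standard Spark CSV reader options, assumed here to be accepted for a CSV input), the placeholder keys above could look like this:

- name: sample_csv_with_options   # illustrative input name
  dataset: dataos://thirdparty01:none/sample_city.csv
  format: csv
  options:
    header: true        # assumed Spark CSV option: first row holds column names
    inferSchema: true   # assumed Spark CSV option: infer column types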
Sample Read configuration YAML
Let’s take a case scenario where the dataset is stored in Azure Blob File System (ABFSS), and you have to read data from the source, perform some transformation steps, and write it to Icebase, a managed depot within DataOS. The read config YAML will be as follows:
version: v1
name: sanity-read-azure
type: workflow
tags:
  - Sanity
  - Azure
title: Sanity read from Azure
description: |
  The purpose of this workflow is to verify if we are able to read different
  file formats from azure abfss or not.
workflow:
  dag:
    - name: sanity-read-az-job
      title: Sanity read files from azure abfss
      description: |
        The purpose of this job is to verify if we are able to read different
        file formats from azure abfss or not.
      spec:
        tags:
          - Sanity
          - Abfss
        stack: flare:6.0
        compute: runnable-default
        flare:
          job:
            explain: true
            logLevel: INFO
            showPreviewLines: 2
            inputs:
              - name: a_city_csv
                dataset: dataos://sanityazure:sanity/azure_write_csv_14?acl=rw
                format: csv
              # - name: a_city_json
              #   dataset: dataos://sanityazure:sanity/azure_write_json
              #   format: json
              # - name: a_city_parquet
              #   dataset: dataos://sanityazure:sanity/azure_write_parquet
              #   format: parquet
            outputs:
              # csv
              - name: finalDf_csv
                dataset: dataos://icebase:smoketest/azure_read_csv_14?acl=rw
                format: iceberg
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - CSV
                title: Azure csv read sanity
                description: Azure csv read sanity
              # # json
              # - name: finalDf_json
              #   dataset: dataos://icebase:sanity/azure_read_json?acl=rw
              #   format: json
              #   options:
              #     saveMode: overwrite
              #     partitionBy:
              #       - version
              #   tags:
              #     - Sanity
              #     - Azure
              #     - JSON
              #   title: Azure json read sanity
              #   description: Azure json read sanity
              # # parquet
              # - name: finalDf_parquet
              #   dataset: dataos://icebase:sanity/azure_read_parquet?acl=rw
              #   format: parquet
              #   options:
              #     saveMode: overwrite
              #     partitionBy:
              #       - version
              #   tags:
              #     - Sanity
              #     - Azure
              #     - Parquet
              #   title: Azure parquet read sanity
              #   description: Azure parquet read sanity
            steps:
              - sequence:
                  - name: finalDf_csv
                    sql: SELECT * FROM a_city_csv LIMIT 10
                    functions:
                      - name: drop
                        columns:
                          - "__metadata_dataos_run_mapper_id"
Write Config
Note: the ?acl=rw suffix after the UDL signifies an Access Control List with Read Write access. You can also specify the address of the output dataset in the format dataos://[depot]:[collection]?acl=rw; the name of the output dataset will automatically get appended to it.
For writing the data to a depot on an object store, we need to configure the name, dataset, and format properties in the outputs section of the YAML. For instance, if your dataset is to be stored at the UDL address dataos://thirdparty01:sampledata with the name output01, and the file format is avro, then the outputs section will be as follows:
outputs:
  - name: output01 # output name
    dataset: dataos://thirdparty01:sampledata?acl=rw # address where the output is to be stored
    format: avro # file format: avro, csv, json, orc, parquet, txt, xlsx, xml
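Beyond the three required properties, an output usually also carries an options block (for example saveMode, and partitionBy for Iceberg outputs) along with descriptive metadata such as tags, title, and description, as the sample write configuration below shows. A condensed sketch with illustrative values:

outputs:
  - name: output01
    dataset: dataos://icebase:sampledata?acl=rw   # illustrative Iceberg output address
    format: iceberg
    options:
      saveMode: overwrite   # overwrite or append
      partitionBy:
        - version
    tags:
      - Sample
    title: Sample iceberg write
    description: Illustrative output with options and metadata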
Sample Write configuration YAML
Let’s take a case scenario where the output dataset is to be stored in an Azure Blob File System (ABFSS) depot, and you have to read data from a depot within DataOS. The write config YAML will be as follows:
version: v1
name: azure-write-01-hive
type: workflow
tags:
  - Sanity
  - Azure
title: Sanity write to azure
description: |
  The purpose of this job is to verify if we are able to write different
  file formats into azure or wasbs or not.
workflow:
  dag:
    - name: azure-write-01-hive
      title: Sanity write files to azure
      description: |
        The purpose of this job is to verify if we are able to write different
        file formats into azure or wasbs or not.
      spec:
        tags:
          - Sanity
          - Azure
        stack: flare:6.0
        compute: runnable-default
        flare:
          job:
            explain: true
            logLevel: INFO
            showPreviewLines: 2
            inputs:
              - name: sanity_city_input
                dataset: dataos://thirdparty01:none/city?acl=rw
                format: csv
                schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc
            steps:
              - sequence:
                  - name: cities
                    doc: Pick all columns from cities and add version as yyyyMMddHHmm formatted
                      timestamp.
                    sql: |
                      SELECT
                        *,
                        date_format (now(), 'yyyyMMddHHmm') AS version,
                        now() AS ts_city
                      FROM
                        sanity_city_input limit 10
            outputs:
              - name: cities
                dataset: dataos://azurehiveiceberg:hivetest/azure_hive_iceberg_write_12?acl=rw
                format: iceberg
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - CSV
                title: Azure csv sanity
                description: Azure csv sanity
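The sample above writes in Iceberg format; writing a plain file format directly to an object storage depot uses the same outputs structure with one of the file formats listed earlier. A brief sketch, assuming the sanityazure depot from the read sample and a parquet output (names are illustrative):

outputs:
  - name: finalDf_parquet
    dataset: dataos://sanityazure:sanity/azure_write_parquet?acl=rw
    format: parquet
    options:
      saveMode: overwrite   # as in the samples above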
Advanced Configurations
Data Format Configurations
This section provides comprehensive information on advanced source configurations for the different data formats supported by DataOS’ Flare stack. Refer to the link below to know more.
Source Configurations by Data Formats
Schema Configurations
This section covers schema configurations, explaining how to manage and customize schemas for the various data sources used in Flare. Refer to the link below.
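Before following the link, note that the read examples earlier in this section already illustrate the two common ways of supplying a schema: pointing schemaPath at an external Avro (.avsc) schema file, or embedding a Spark struct definition as a JSON string in the schema property. A condensed sketch reusing addresses from those examples (schema string trimmed for brevity):

inputs:
  # Option 1: external Avro schema file
  - name: sample_csv
    dataset: dataos://thirdparty01:none/sample_city.csv
    format: csv
    schemaPath: dataos://thirdparty01:default/schemas/avsc/city.avsc
  # Option 2: inline Spark struct schema (JSON string, trimmed here)
  - name: sample_states
    dataset: dataos://thirdparty01:none/states
    format: csv
    schema: "{\"type\":\"struct\",\"fields\":[{\"name\":\"country_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}"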