
Object Storage Depots

To execute Flare Jobs on top of object storage depots, such as Amazon S3, Azure ABFSS, Azure WASBS, and Google Cloud Storage, you first need to create a depot. If you have already created a depot, continue reading.

Creating depots on top of object stores lets you interact with all supported storages, i.e., Azure Blob File System, Google Cloud Storage, and Amazon S3, in a uniform way. To run a Flare Job, all you need is the UDL address of the input or output dataset, for the reading and writing scenarios respectively, along with the file format of the data.
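A UDL address follows the pattern dataos://[depot]:[collection]/[dataset]. For example, in the address dataos://thirdparty01:sampledata/avro used in the read example below, thirdparty01 is the depot, sampledata is the collection, and avro is the dataset.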

Common Configurations

Read Config

For reading the data, we need to configure the name, dataset, and format properties in the inputs section of the YAML. For instance, if your dataset name is city_connect, the UDL address of the dataset stored in Azure Blob Storage is dataos://thirdparty01:sampledata/avro, and the file format is avro, then the inputs section will be as follows:

inputs:
  - name: city_connect # name of the dataset
    dataset: dataos://thirdparty01:sampledata/avro # address of the input dataset
    format: avro # file format: avro, csv, json, orc, parquet, txt, xlsx, xml

Your Flare Jobs can read from multiple data sources. In such a scenario, you have to provide an array of data source definitions as shown below.

inputs:  
  - name: sample_csv # name of the dataset
    dataset: dataos://thirdparty01:none/sample_city.csv # address of the input dataset
    format: csv # file format
    schemaPath: dataos://thirdparty01:default/schemas/avsc/city.avsc # schema path
    schemaType: # schema type
    options: # additional options
      key1: value1 # data source-specific options
      key2: value2

  - name: sample_states # name of the dataset
    dataset: dataos://thirdparty01:none/states # address of the input dataset
    format: csv # file format
    schema: "{\"type\":\"struct\",\"fields\":[{\"name\":\"country_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"country_id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"latitude\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"longitude\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}" # schema

  - name: transaction_abfss # name of the dataset
    dataset: dataos://abfss01:default/transactions_abfss_01 # address of the input dataset
    format: avro # file format
    options: # additional options
      key1: value1 # data source-specific options
      key2: value2

  - name: input_customers # name of the dataset
    dataset: dataos://icebase:retail/customer # address of the input dataset
    format: iceberg # file format

Sample Read configuration YAML

Let’s take a scenario where the dataset is stored in the Azure Blob File System (ABFSS) and you have to read data from the source, perform some transformation steps, and write it to Icebase, a managed depot within DataOS. The read config YAML will be as follows:

object_storage_depots_read.yml
version: v1
name: sanity-read-azure
type: workflow
tags:
- Sanity
- Azure
title: Sanity read from Azure
description: |
  The purpose of this workflow is to verify if we are able to read different
  file formats from azure abfss or not. 

workflow:
  dag:
  - name: sanity-read-az-job
    title: Sanity read files from azure abfss
    description: |
      The purpose of this job is to verify if we are able to read different
      file formats from azure abfss or not. 
    spec:
      tags:
      - Sanity
      - Abfss
      stack: flare:6.0
      compute: runnable-default
      flare:
        job:
          explain: true
          logLevel: INFO
          showPreviewLines: 2
          inputs:
          - name: a_city_csv
            dataset: dataos://sanityazure:sanity/azure_write_csv_14?acl=rw
            format: csv
          # - name: a_city_json
          #   dataset: dataos://sanityazure:sanity/azure_write_json
          #   format: json
          # - name: a_city_parquet
          #   dataset: dataos://sanityazure:sanity/azure_write_parquet
          #   format: parquet


          outputs:
            # csv
            - name: finalDf_csv
              dataset: dataos://icebase:smoketest/azure_read_csv_14?acl=rw
              format: iceberg
              options:
                saveMode: overwrite
                partitionBy:
                  - version
              tags:
                - Sanity
                - Azure
                - CSV
              title: Azure csv read sanity
              description: Azure csv read sanity
            # # json
            # - name: finalDf_json
            #   dataset: dataos://icebase:sanity/azure_read_json?acl=rw
            #   format: json
            #   options:
            #     saveMode: overwrite
            #     partitionBy:
            #       - version
            #   tags:
            #     - Sanity
            #     - Azure
            #     - JSON
            #   title: Azure json read sanity
            #   description: Azure json read sanity  
            # # parquet
            # - name: finalDf_parquet
            #   dataset: dataos://icebase:sanity/azure_read_parquet?acl=rw
            #   format: parquet
            #   options:
            #     saveMode: overwrite
            #     partitionBy:
            #       - version
            #   tags:
            #     - Sanity
            #     - Azure
            #     - Parquet
            #   title: Azure parquet read sanity
            #   description: Azure parquet read sanity
          steps:
            - sequence:
                - name: finalDf_csv
                  sql: SELECT * FROM a_city_csv LIMIT 10
                  functions:
                    - name: drop
                      columns:
                        - "__metadata_dataos_run_mapper_id"

Write Config

Note: The ?acl=rw after the UDL signifies an Access Control List with Read Write access. You can also specify the address of the output dataset in the format dataos://[depot]:[collection]?acl=rw; the name of the output dataset will automatically get appended to it.

For writing the data to a depot on an object store, we need to configure the name, dataset, and format properties in the outputs section of the YAML. For instance, if your dataset is to be stored at the UDL address dataos://thirdparty01:sampledata under the name output01 and the file format is avro, then the outputs section will be as follows:

outputs:
  - name: output01 # output name
    dataset: dataos://thirdparty01:sampledata?acl=rw # address where the output is to be stored
    format: avro # file format: avro, csv, json, orc, parquet, txt, xlsx, xml

Sample Write configuration YAML

Let’s take a scenario where the output dataset is to be stored in an Azure Blob File System (ABFSS) depot, and you have to read data from another depot within DataOS. The write config YAML will be as follows:

object_storage_depots_write.yml
version: v1
name: azure-write-01-hive
type: workflow
tags:
- Sanity
- Azure
title: Sanity write to azure 
description: |
  The purpose of this workflow is to verify if we are able to write different
  file formats into azure or wasbs or not.

workflow:
  dag:
  - name: azure-write-01-hive
    title: Sanity write files to azure 
    description: |
      The purpose of this job is to verify if we are able to write different
      file formats into azure or wasbs or not.
    spec:
      tags:
      - Sanity
      - Azure
      stack: flare:6.0
      compute: runnable-default
      flare:
        job:
          explain: true
          logLevel: INFO
          showPreviewLines: 2
          inputs:
            - name: sanity_city_input
              dataset: dataos://thirdparty01:none/city?acl=rw
              format: csv
              schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc

          steps:
            - sequence:
                - name: cities
                  doc: Pick all columns from cities and add version as yyyyMMddHHmm formatted
                    timestamp.
                  sql: |
                    SELECT
                      *,
                      date_format (now(), 'yyyyMMddHHmm') AS version,
                      now() AS ts_city
                    FROM
                      sanity_city_input limit 10

          outputs:
            - name: cities
              dataset: dataos://azurehiveiceberg:hivetest/azure_hive_iceberg_write_12?acl=rw
              format: iceberg
              options:
                saveMode: overwrite
                partitionBy:
                  - version
              tags:
                - Sanity
                - Azure
                - CSV
              title: Azure csv sanity
              description: Azure csv sanity

Advanced Configurations

Data Format Configurations

This section provides information on how to specify advanced source configurations for different data formats when using DataOS’ Flare stack.

Data source properties are documented for the following formats:

- AVRO
- CSV
- Iceberg
- JSON
- ORC
- Parquet
- Text
- XLSX
- XML

Refer to the link below to learn more.

Source Configurations by Data Formats
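As an illustration, format-specific properties go under the options key of an input or output definition. The sketch below is illustrative and assumes CSV reader options such as header and inferSchema are supported; refer to the linked page for the exact keys available for each format.

inputs:
  - name: sample_csv # name of the dataset
    dataset: dataos://thirdparty01:none/sample_city.csv # address of the input dataset
    format: csv # file format
    options: # format-specific reader options (illustrative)
      header: true # treat the first row as column names
      inferSchema: true # infer column data types from the data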

Schema Configurations

This section covers schema configurations, offering guidance on managing and customizing schemas for various data sources in Flare. Refer to the link below.

Schema Configurations
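As shown in the read examples earlier on this page, a schema can be supplied either by reference, via schemaPath pointing to an .avsc schema file stored in a depot, or inline, via the schema property carrying a JSON struct definition. A condensed sketch:

inputs:
  - name: sample_csv # schema supplied by reference
    dataset: dataos://thirdparty01:none/sample_city.csv
    format: csv
    schemaPath: dataos://thirdparty01:default/schemas/avsc/city.avsc # path to the .avsc schema file

  - name: sample_states # schema supplied inline
    dataset: dataos://thirdparty01:none/states
    format: csv
    schema: "{\"type\":\"struct\",\"fields\":[{\"name\":\"country_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}" # struct trimmed to one field for brevity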
