Object Storage Configurations
To execute Flare Jobs on object storage depots such as Amazon S3, Azure ABFSS, Azure WASBS, and Google Cloud Storage, a corresponding depot must first be created. If the required depot has not yet been created, refer to the documentation on creating depots.
Depots created on top of supported object stores enable uniform interaction across platforms, including Azure Blob File System, Google Cloud Storage, and Amazon S3.
To run a Flare Job, the following information is required:
- The Uniform Data Locator (UDL) address of the input dataset (for read operations) or the output dataset (for write operations), as broken down in the sketch below.
- The file format of the associated data.
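As an illustrative aid (the depot, collection, and dataset names are taken from the examples later in this section), a UDL address breaks down as follows:
# Anatomy of a UDL address (illustrative):
#
#   dataos://[depot]:[collection]/[dataset]
#
# Example used in the read configuration below:
#
#   dataos://thirdparty01:sampledata/avro
#   depot:      thirdparty01
#   collection: sampledata
#   dataset:    avro   (the dataset path; here it points to Avro data)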
Common Configurations
Read Configuration
For reading the data, you need to configure the name, dataset, and format properties in the inputs section of the YAML. For instance, if your dataset name is city_connect, the UDL address of the dataset stored in Azure Blob Storage is dataos://thirdparty01:sampledata/avro, and the file format is avro, then the inputs section will be as follows:
inputs:
  - name: city_connect                              # name of the dataset
    dataset: dataos://thirdparty01:sampledata/avro  # address of the input dataset
    format: avro                                    # file format: avro, csv, json, orc, parquet, txt, xlsx, xml
Your Flare Jobs can read from multiple data sources. In such a scenario, you have to provide an array of data source definitions as shown below.
inputs:
  - name: sample_csv                                # name of the dataset
    dataset: dataos://thirdparty01:none/sample_city.csv               # address of the input dataset
    format: csv                                     # file format
    schemaPath: dataos://thirdparty01:default/schemas/avsc/city.avsc  # schema path
    schemaType:                                     # schema type
    options:                                        # additional options
      key1: value1                                  # data source-specific options
      key2: value2
  - name: sample_states                             # name of the dataset
    dataset: dataos://thirdparty01:none/states      # address of the input dataset
    format: csv                                     # file format
    # inline schema definition
    schema: "{\"type\":\"struct\",\"fields\":[{\"name\":\"country_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"country_id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"latitude\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"longitude\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}"
  - name: transaction_abfss                         # name of the dataset
    dataset: dataos://abfss01:default/transactions_abfss_01           # address of the input dataset
    format: avro                                    # file format
    options:                                        # additional options
      key1: value1                                  # data source-specific options
      key2: value2
  - name: input_customers                           # name of the dataset
    dataset: dataos://lakehouse:retail/customer     # address of the input dataset
    format: iceberg                                 # file format
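The options map carries key-value pairs that are specific to the data source or file format being read. As a hedged illustration only (the header and inferSchema keys below are assumptions based on typical Spark-style CSV readers, not values taken from this documentation), a concrete CSV input entry might look like this:
inputs:
  - name: sample_csv
    dataset: dataos://thirdparty01:none/sample_city.csv
    format: csv
    options:                  # assumed to be forwarded to the underlying format reader
      header: "true"          # hypothetical option: treat the first row as column names
      inferSchema: "true"     # hypothetical option: infer column types from the data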
Sample Read configuration YAML
Let’s take a case scenario where the dataset is stored in the Azure Blob File System (ABFSS) and you have to read data from the source, perform some transformation steps, and write it to the Lakehouse, which is a managed depot within DataOS. The read config YAML will be as follows:
version: v1
name: sanity-read-azure
type: workflow
tags:
  - Sanity
  - Azure
title: Sanity read from Azure
description: |
  The purpose of this workflow is to verify if we are able to read different
  file formats from azure abfss or not.
workflow:
  dag:
    - name: sanity-read-az-job
      title: Sanity read files from azure abfss
      description: |
        The purpose of this job is to verify if we are able to read different
        file formats from azure abfss or not.
      spec:
        tags:
          - Sanity
          - Abfss
        stack: flare:6.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            logLevel: INFO
            showPreviewLines: 2
            inputs:
              - name: a_city_csv
                dataset: dataos://sanityazure:sanity/azure_write_csv_14?acl=rw
                format: csv
              - name: a_city_json
                dataset: dataos://sanityazure:sanity/azure_write_json
                format: json
              - name: a_city_parquet
                dataset: dataos://sanityazure:sanity/azure_write_parquet
                format: parquet
            outputs:
              # csv
              - name: finalDf_csv
                dataset: dataos://lakehouse:smoketest/azure_read_csv_14?acl=rw
                format: iceberg
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - CSV
                title: Azure csv read sanity
                description: Azure csv read sanity
              # json
              - name: finalDf_json
                dataset: dataos://lakehouse:sanity/azure_read_json?acl=rw
                format: json
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - JSON
                title: Azure json read sanity
                description: Azure json read sanity
              # parquet
              - name: finalDf_parquet
                dataset: dataos://lakehouse:sanity/azure_read_parquet?acl=rw
                format: parquet
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - Parquet
                title: Azure parquet read sanity
                description: Azure parquet read sanity
            steps:
              - sequence:
                  - name: finalDf_csv
                    sql: SELECT * FROM a_city_csv LIMIT 10
                    functions:
                      - name: drop
                        columns:
                          - "__metadata_dataos_run_mapper_id"
Write Configuration
Note
The ?acl=rw suffix in the UDL indicates that the Access Control List (ACL) is configured with read-write permissions. The address of the output dataset can also be specified using the format dataos://[depot]:[collection]?acl=rw. The system will automatically append the name of the output dataset to this address.
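To make the two addressing styles concrete, the following sketch shows the same output written with a fully qualified dataset address and with a collection-level address (the depot, collection, and output names are reused from the snippet below):
outputs:
  # Option 1: fully qualified address (depot, collection, and dataset spelled out)
  - name: output01
    dataset: dataos://thirdparty01:sampledata/output01?acl=rw
    format: avro

  # Option 2: collection-level address; the system appends the output name (output01) automatically
  # - name: output01
  #   dataset: dataos://thirdparty01:sampledata?acl=rw
  #   format: avro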
For writing the data to a depot on an object store, you need to configure the name, dataset, and format properties in the outputs section of the YAML. For instance, if your dataset is to be stored at the UDL address dataos://thirdparty01:sampledata under the name output01 and the file format is avro, then the outputs section will be as follows:
outputs:
  - name: output01                                    # output name
    dataset: dataos://thirdparty01:sampledata?acl=rw  # address where the output is to be stored
    format: avro                                      # file format: avro, csv, json, orc, parquet, txt, xlsx, xml
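The write samples in this section additionally set options such as saveMode and partitionBy. As a hedged sketch (only overwrite appears in the samples here; treating append as the incremental alternative is an assumption), the snippet above could be extended like this:
outputs:
  - name: output01
    dataset: dataos://thirdparty01:sampledata?acl=rw
    format: avro
    options:
      saveMode: overwrite       # as used in the samples below; append is assumed to be the incremental alternative
      partitionBy:              # partition the written data by one or more columns
        - version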
Sample Write configuration YAML
Let’s take a case scenario where the output dataset is to be stored in an Azure Blob File System (ABFSS) depot, and you have to read data from a source depot within DataOS. The write config YAML will be as follows:
version: v1
name: azure-write-01-hive
type: workflow
tags:
  - Sanity
  - Azure
title: Sanity write to azure
description: |
  The purpose of this job is to verify if we are able to write different
  file formats into azure or wasbs or not.
workflow:
  dag:
    - name: azure-write-01-hive
      title: Sanity write files to azure
      description: |
        The purpose of this job is to verify if we are able to write different
        file formats into azure or wasbs or not.
      spec:
        tags:
          - Sanity
          - Azure
        stack: flare:6.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            logLevel: INFO
            showPreviewLines: 2
            inputs:
              - name: sanity_city_input
                dataset: dataos://thirdparty01:none/city?acl=rw
                format: csv
                schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc
            steps:
              - sequence:
                  - name: cities
                    doc: Pick all columns from cities and add version as yyyyMMddHHmm formatted timestamp.
                    sql: |
                      SELECT
                        *,
                        date_format (now(), 'yyyyMMddHHmm') AS version,
                        now() AS ts_city
                      FROM
                        sanity_city_input limit 10
            outputs:
              - name: cities
                dataset: dataos://azurehiveiceberg:hivetest/azure_hive_iceberg_write_12?acl=rw
                format: iceberg
                options:
                  saveMode: overwrite
                  partitionBy:
                    - version
                tags:
                  - Sanity
                  - Azure
                  - CSV
                title: Azure csv sanity
                description: Azure csv sanity
Schema Configurations
This section describes schema configuration strategies used to manage and customize schemas for supported data sources within the Flare stack. For implementation guidance, refer to the Schema Configurations documentation.
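As a condensed sketch drawn from the multi-input example earlier in this section (the addresses are reused from that example, and the inline schema is truncated to a single field for brevity), a schema can be supplied either by reference or inline:
inputs:
  # Option 1: reference an external Avro schema (.avsc) file
  - name: sample_csv
    dataset: dataos://thirdparty01:none/sample_city.csv
    format: csv
    schemaPath: dataos://thirdparty01:default/schemas/avsc/city.avsc

  # Option 2: embed a Spark-style JSON struct schema inline
  - name: sample_states
    dataset: dataos://thirdparty01:none/states
    format: csv
    schema: "{\"type\":\"struct\",\"fields\":[{\"name\":\"country_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}"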
Data Formats Configurations
For detailed information on all supported formats, see Source Configurations by Data Formats. The following list provides format-specific configuration references for integrating various data sources with the Flare stack:
- AVRO – Describes how to configure AVRO files for source ingestion.
- CSV – Covers options for parsing and validating CSV-formatted input.
- Iceberg – Provides guidance on configuring Apache Iceberg table formats.
- JSON – Explains how to manage nested structures and data typing for JSON input.
- ORC – Details parameter settings for optimized ingestion of ORC files.
- Parquet – Outlines best practices for reading schema-aware Parquet data.
- Text – Defines configuration options for plain text data sources.
- XLSX – Specifies how to configure Excel spreadsheet (XLSX) ingestion.
- XML – Provides details on parsing and validating structured XML input.