Source Configurations by Data Formats
This page provides instructions for configuring how Flare reads and writes data in each supported format: AVRO, CSV, Iceberg, JSON, ORC, Parquet, text, XLSX, and XML.
AVRO
The AVRO format is a highly efficient and flexible storage format designed for Hadoop. It serves as a serialization platform, enabling seamless data exchange between programs written in different languages. AVRO is particularly well-suited for handling big data and offers robust support for evolving schemas.
To read AVRO data in Flare, you can use the following YAML configuration:
inputs:
  - name: sample_avro
    dataset: dataos://thirdparty01:sampledata/avro
    format: avro
    schemaPath: dataos://thirdparty01:sampledata/sample_avro.avsc
    options:
      avroSchema: none
      datetimeRebaseMode: EXCEPTION/CORRECTED/LEGACY
      positionalFieldMatching: true/false
Note
For a more comprehensive understanding of AVRO options and their usage, refer to the Avro options documentation.
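Writing AVRO data uses the same outputs structure shown in the Iceberg section below. The following is a minimal sketch, assuming an AVRO-capable target depot; the output name and dataset address are hypothetical:
outputs:
  - name: sample_avro_output # hypothetical output name
    dataset: dataos://thirdparty01:sampledata/avro_output?acl=rw # hypothetical target address
    format: avro
    options:
      saveMode: append # or overwrite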
CSV
Flare supports reading/writing files or directories of files in CSV format. You have the flexibility to customize the behavior of the CSV reader by providing various options. These options allow you to control aspects such as header handling, delimiter character, character set, schema inference, and more.
Consider the following YAML configuration to define the options when reading from a CSV file:
inputs:
  - name: sample_csv
    dataset: /datadir/data/sample_csv.csv # complete file path
    format: csv
    options:
      header: false/true
      inferSchema: true
      delimiter: " "
      enforceSchema: false
      timestampFormat: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
      mode: PERMISSIVE/DROPMALFORMED/FAILFAST
  - name: multiple_csvdata
    dataset: /datadir/data # complete folder path
    format: csv
    options:
      header: false/true
      inferSchema: true
Note
For a detailed understanding of each option available for CSV files, please refer to the CSV options documentation.
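Writing CSV follows the same outputs pattern used by the Iceberg example below. The following is a minimal sketch, assuming writer options such as header are passed through like their reader counterparts; the output name and dataset address are hypothetical:
outputs:
  - name: sample_csv_output # hypothetical output name
    dataset: dataos://thirdparty01:sampledata/csv_output?acl=rw # hypothetical target address
    format: csv
    options:
      saveMode: overwrite
      header: true # assumes writer options mirror the reader options above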
Iceberg
Flare supports reading/writing data in the Iceberg format, an open table format designed for large-scale analytic datasets. With Flare, you can read, write, and partition Iceberg tables.
To write an Iceberg file using Flare, you can use the following YAML configuration as an example:
outputs:
  - name: ny_taxi_ts
    dataset: dataos://lakehouse:sample/ny_taxi_iceberg?acl=rw
    format: Iceberg
    options:
      saveMode: append
      iceberg:
        partitionSpec:
          - type: identity
            column: trip_duration
Note
For more information on the available options when working with Iceberg files, refer to the Apache Iceberg documentation.
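Reading an Iceberg dataset mirrors the inputs structure used for the other formats on this page. A minimal sketch, reusing the dataset address from the write example above; the input name is hypothetical:
inputs:
  - name: ny_taxi_input # hypothetical input name
    dataset: dataos://lakehouse:sample/ny_taxi_iceberg
    format: Iceberg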
JSON
Flare supports automatic schema inference for JSON files, allowing you to load them as DataOS datasets. Each line in the JSON file should contain a separate, self-contained, valid JSON object; for example, a file whose lines are {"id": 1} and {"id": 2} is read as two rows. For a multi-line JSON file, enable the multiLine option to handle it appropriately.
To read from a JSON file and specify various options, utilize the following YAML configuration:
inputs:
  - name: sample_json
    dataset: dataos://thirdparty01:sampledata/json
    format: json
    options:
      primitivesAsString: true/false
      prefersDecimal: true/false
      allowComments: true/false
      mode: PERMISSIVE/DROPMALFORMED/FAILFAST
      timestampFormat: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
      dateFormat: yyyy-MM-dd
      multiLine: true/false
Note
For a comprehensive understanding of the available options for JSON files, please refer to the JSON options documentation.
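Writing JSON follows the same outputs structure. A minimal sketch with a hypothetical output name and dataset address:
outputs:
  - name: sample_json_output # hypothetical output name
    dataset: dataos://thirdparty01:sampledata/json_output?acl=rw # hypothetical target address
    format: json
    options:
      saveMode: append # or overwrite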
ORC
Flare supports reading/writing ORC (Optimized Row Columnar) files, which is a column-oriented data storage format within the Apache Hadoop ecosystem. ORC is designed to optimize performance and compression for big data processing workloads. With Flare, you can seamlessly read data from ORC files into your data workflows.
To read an ORC file using Flare, you can use the following YAML configuration:
inputs:
  - name: sample_orc
    dataset: /datadir/data/sample_orc.orc
    format: orc
    options:
      mergeSchema: false/true # sets whether schemas collected from all ORC part-files should be merged
Note
For more details on the available options when working with ORC files, refer to the ORC options documentation.
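Writing ORC follows the same outputs structure. A minimal sketch with a hypothetical output name and dataset address:
outputs:
  - name: sample_orc_output # hypothetical output name
    dataset: dataos://thirdparty01:sampledata/orc_output?acl=rw # hypothetical target address
    format: orc
    options:
      saveMode: append # or overwrite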
Parquet
Flare supports reading and writing the Parquet format, a widely adopted columnar storage format compatible with various data processing systems. When working with Parquet files, Flare automatically preserves the original data schema, ensuring data integrity throughout the processing pipeline.
Parquet also enables schema evolution, allowing users to progressively add columns to an existing schema as needed. This means that multiple Parquet files can have different but mutually compatible schemas. Flare's Parquet data source effortlessly detects and merges schemas from these files, simplifying schema management.
When reading Parquet files, you can customize the behavior using the following configuration settings:
inputs:
  - name: sample_parquet
    dataset: dataos://thirdparty01:sampledata/parquet
    format: parquet
    schemaPath: dataos://thirdparty01:sampledata/sample_avro.avsc
    options:
      mergeSchema: false/true
      datetimeRebaseMode: EXCEPTION/CORRECTED/LEGACY
Note
For detailed information on the available options for Parquet files, please refer to the Parquet options documentation.
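Since Parquet is supported for writing as well, the outputs side follows the same pattern. A minimal sketch with a hypothetical output name and dataset address:
outputs:
  - name: sample_parquet_output # hypothetical output name
    dataset: dataos://thirdparty01:sampledata/parquet_output?acl=rw # hypothetical target address
    format: parquet
    options:
      saveMode: overwrite # or append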
Text
Flare supports reading text files or directories of text files into DataOS datasets. When reading a text file, each line becomes a row in a single string column named "value". You can customize the reading behavior by specifying various options in YAML, including the line separator.
To read a text file using Flare, you can use the following YAML configuration:
inputs:
  - name: sample_txt
    dataset: dataos://thirdparty01:sampledata/sample.txt
    format: text
    options:
      wholetext: true
      lineSep: "\n" # one of \r, \r\n, \n
  - name: sample_txt_files
    dataset: dataos://thirdparty01:sampledata/txt
    format: text
    options:
      wholetext: true
      lineSep: "\n" # one of \r, \r\n, \n
Note
For more details on the available options when working with text files, refer to the Text options documentation.
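Once loaded, the dataset exposes each line through the default "value" column. The snippet below is a schematic sketch of referencing that column in a downstream SQL step; the steps layout is an assumption about the enclosing Flare job, and the step name is hypothetical:
steps:
  - sequence:
      - name: extract_lines # hypothetical step name
        sql: SELECT value FROM sample_txt # each row holds one line of the file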
XLSX
Flare supports the creation of datasets by reading data files in XLSX format. You can customize the behavior of reading XLSX files by configuring various options such as file location, sheet names, cell range, workbook password, and more. Flare also enables the reading of multiple XLSX files stored in a folder.
To read an XLSX file using Flare, you can use the following YAML configuration:
inputs:
  - name: sample_xlsx
    dataset: dataos://thirdparty01:sampledata/xlsx/returns.xlsx # pass the complete file path
    format: xlsx
    options:
      sheetName: aa
      header: false/true
      workbookPassword: password
      inferSchema: true # whether column types should be inferred when reading the file
  - name: sample_xlsx_files
    dataset: dataos://thirdparty01:sampledata/xlsx # pass the complete path of the folder where the files are kept
    format: xlsx
Note
For a more detailed understanding of the available options when working with XLSX files, refer to the Spark Excel options documentation.
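The cell-range customization mentioned above is exposed by spark-excel through its dataAddress option. A sketch, assuming Flare forwards this option to the reader; the sheet name and range are hypothetical:
inputs:
  - name: sample_xlsx_range # hypothetical input name
    dataset: dataos://thirdparty01:sampledata/xlsx/returns.xlsx
    format: xlsx
    options:
      dataAddress: "'Sheet1'!A1:C35" # spark-excel cell-range option; sheet name and range are hypothetical
      header: true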
XML
Flare provides support for reading/writing XML files, allowing you to leverage data stored in XML format. You have the flexibility to customize the reading behavior by specifying various options, such as handling corrupt records, validating XML against an XSD file, excluding attributes, and more.
To read an XML file using Flare, you can use the following YAML configuration as an example:
inputs:
  - name: sample_xml
    dataset: dataos://thirdparty01:sampledata/xml/sample.xml
    format: xml
    schemaPath: dataos://thirdparty01:none/schemas/avsc/csv.avsc
    options:
      path: Location of files
      excludeAttribute: false
      inferSchema: true
      columnNameOfCorruptRecord: _corrupt_record
      attributePrefix: _
      valueTag: _VALUE
      charset: 'UTF-8'
      ignoreNamespace: false
      timestampFormat: UTC
      dateFormat: ISO_DATE
Note
For a detailed understanding of the available options when working with XML files, refer to the Spark XML options documentation.
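The XSD validation mentioned above is exposed by spark-xml through its rowValidationXSDPath option. A sketch, assuming Flare forwards spark-xml options to the reader; the row tag and XSD location are hypothetical:
inputs:
  - name: sample_xml_validated # hypothetical input name
    dataset: dataos://thirdparty01:sampledata/xml/sample.xml
    format: xml
    options:
      rowTag: record # spark-xml option naming the XML element treated as a row; the tag is hypothetical
      rowValidationXSDPath: dataos://thirdparty01:sampledata/xml/sample.xsd # hypothetical XSD path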