Google BigQuery Depots
The Flare stack enables the execution of jobs on top of a Google BigQuery data source. To do so, you first need to create a BigQuery depot. If you have already created one, you can proceed with the following instructions; otherwise, please refer to the BigQuery Depot documentation.
Read Configuration
To read data using the Flare stack, the `name`, `dataset`, and `format` properties in the `inputs` section of the YAML configuration need to be configured. Here's an example configuration for the `inputs` section:
inputs:
  - name: city_connect                      # name of the dataset
    dataset: dataos://bqdepot:state/city    # address of the input dataset
    format: bigquery                        # format
Sample Read configuration YAML
Consider a scenario where the dataset named `city` is stored in a BigQuery Depot at the address `dataos://bqdepot:state/city`, and you need to read this data from the source BigQuery depot, perform some transformation steps, and write it to Icebase, a managed depot within DataOS. The read config YAML will be as follows:

name: bq-read-01
version: v1
type: workflow
tags:
  - bq
  - City
description: This job reads data from a BigQuery depot and writes to Icebase
title: Read bq
workflow:
  dag:
    - name: city-read-bq-01
      title: City read bq
      description: This job reads data from a BigQuery depot and writes to Icebase
      spec:
        tags:
          - Connect
          - City
        stack: flare:5.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            inputs:
              - name: city_connect
                dataset: dataos://bqdepot:state/city
                format: bigquery
            logLevel: INFO
            outputs:
              - name: finalDf
                dataset: dataos://icebase:sanity/city?acl=rw
                format: iceberg
                options:
                  saveMode: overwrite
            steps:
              - sequence:
                  - name: finalDf
                    sql: SELECT * FROM city_connect
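If you need to push additional connector settings down to the read (for example, a row filter), something along the following lines may work. This is a hedged sketch rather than a documented Flare example: it assumes that `options`/`extraOptions` on inputs are passed through to the underlying Spark BigQuery connector in the same way they are on outputs, and that the connector's `filter` read option is honored.

inputs:
  - name: city_connect
    dataset: dataos://bqdepot:state/city
    format: bigquery
    options:
      extraOptions:
        filter: "state_code = 'AL'"   # assumed pass-through of the connector's filter read option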
Write Configuration
To write data to a BigQuery depot, we need to configure the `name`, `dataset`, and `format` properties in the `outputs` section of the YAML. For instance, if your dataset is to be stored at the UDL address `dataos://yakdevbq:dev/city_bq?acl=rw` under the name `finalDf` and in the `bigquery` format, then the `outputs` section will be as follows:
outputs:
  - name: finalDf
    dataset: dataos://yakdevbq:dev/city_bq?acl=rw
    format: bigquery
    options:
      saveMode: append
      bigquery:
        temporaryBucket: tmdc-development-new
Sample Write configuration YAML
Let's take a case scenario where the output dataset is to be stored in a BigQuery depot and the input data is read from another depot within DataOS (a CSV dataset from a thirdparty depot in this example). The write config YAML will be as follows:
version: v1
name: bq-read-write-02
type: workflow
tags:
  - bq
  - City
title: Write bq
workflow:
  dag:
    - name: city-write-bq-02
      title: City write bq
      description: This job reads data from a thirdparty depot and writes to BigQuery
      spec:
        tags:
          - Connect
          - City
        stack: flare:5.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            inputs:
              - name: city_connect
                dataset: dataos://thirdparty01:none/city
                format: csv
                schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc
            logLevel: INFO
            outputs:
              - name: finalDf
                dataset: dataos://yakdevbq:dev/city_bq?acl=rw
                format: bigquery
                options:
                  saveMode: append
                  bigquery:
                    temporaryBucket: tmdc-development-new
            steps:
              - sequence:
                  - name: finalDf
                    sql: SELECT * FROM city_connect
Methods for Writing Data to BigQuery
Writing data to BigQuery can be achieved using two methods: Direct and Indirect.
Direct Write Method
The direct method writes data directly into a BigQuery table without using a temporary bucket. This method uses the BigQuery Storage Write API. To enable it, set `writeMethod` to `direct`:
outputs:
  - dataset: dataos://sanitybigquery:dev/bigquery_table?acl=rw
    format: bigquery
    name: finalDf
    options:
      saveMode: append
      extraOptions:
        writeMethod: direct
Indirect Write Method
In the indirect method, data is first written to a GCS bucket and then loaded into BigQuery via a load operation. A GCS bucket must be configured to indicate the temporary data location. Data is temporarily stored in the `parquet`, `orc`, or `avro` format, with `parquet` being the default.
Parquet does not support Map data types. If using Flare to write data, which adds a `__metadata` column of Map type, use `avro` as the `intermediateFormat`.
outputs:
  - dataset: dataos://sanitybigquery:dev/bigquery_write_nabeel_103?acl=rw
    format: bigquery
    name: finalDf
    options:
      bigquery:
        temporaryBucket: tmdc-development-new
        persistentBucket: tmdc-development-new
      saveMode: append
      extraOptions:
        temporaryGcsBucket: tmdc-development-new
        persistentGcsBucket: tmdc-development-new
        intermediateFormat: avro
There are two ways to specify the bucket: either in `bigquery` or in `extraOptions`. For the `indirect` write method, however, it has to be specified in `extraOptions`.
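For instance, a minimal indirect-write output could carry only the `extraOptions` bucket. This is an illustrative sketch based on the attributes below; the bucket name is a placeholder:

outputs:
  - name: finalDf
    dataset: dataos://yakdevbq:dev/city_bq?acl=rw
    format: bigquery
    options:
      saveMode: append
      extraOptions:
        writeMethod: indirect                     # default value, shown here for clarity
        temporaryGcsBucket: my-temporary-bucket   # placeholder bucket name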
Attributes Description
Attribute | Description | Usage
---|---|---
`writeMethod` | Controls the method used to write data to BigQuery. Available values are `direct`, which uses the BigQuery Storage Write API, and `indirect`, which first writes the data to GCS and then triggers a BigQuery load operation. (Optional, defaults to `indirect`.) | Write
`temporaryGcsBucket` | The GCS bucket that temporarily holds the data before it is loaded into BigQuery. Not supported by the direct write method. | Write
`persistentGcsBucket` | The GCS bucket that holds the data before it is loaded into BigQuery. If set, the data won't be deleted after it is written to BigQuery. Not supported by the direct write method. | Write
`intermediateFormat` | The format of the data before it is loaded into BigQuery: either `parquet`, `orc`, or `avro`. Defaults to `parquet`. Supported only for the indirect write method. | Write