
Google BigQuery Depots

Flare stack enables execution of jobs on top of a Google BigQuery data source. To do so, you first need to create a BigQuery depot. If you have already created a depot, you can proceed with the following instructions. Otherwise, please refer to the BigQuery Depot documentation.
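
For reference, a BigQuery depot is created by applying a depot manifest with the DataOS CLI. The snippet below is only an illustrative sketch (the depot name, GCP project, and exact field layout are assumptions); refer to the BigQuery Depot documentation for the authoritative manifest fields.

name: bqdepot
version: v1
type: depot
layer: user
depot:
  type: BIGQUERY               # depot type for Google BigQuery
  description: BigQuery depot used by Flare jobs
  external: true
  spec:
    project: my-gcp-project    # assumed GCP project ID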

Read Configuration

For reading data using the Flare stack, the `name`, `dataset`, and `format` properties in the `inputs` section of the YAML configuration need to be configured. Here's an example configuration for the `inputs` section:

inputs:
  - name: city_connect # name of the dataset
    dataset: dataos://bqdepot:state/city # address of the input dataset
    format: bigquery # format
Sample Read configuration YAML

Consider a scenario where a dataset named `city` is stored in a BigQuery depot at the address `dataos://bqdepot:state/city`, and you need to read this data from the source BigQuery depot, perform some transformation steps, and write it to Icebase, a managed depot within DataOS. The read config YAML will be as follows:
bigquery_depot_read.yml
name: bq-read-01
version: v1
type: workflow
tags:
  - bq
  - City
description: This job reads data from a BigQuery depot and writes to Icebase
title: Read bq
workflow:
  dag:
    - name: city-read-bq-01
      title: City read bq
      description: This job reads data from a BigQuery depot and writes to Icebase
      spec:
        tags:
          - Connect
          - City
        stack: flare:5.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            inputs:
              - name: city
                dataset: dataos://bqdepot:state/city
                format: bigquery
            logLevel: INFO
            outputs:
              - name: finalDf
                dataset: dataos://icebase:sanity/city?acl=rw
                format: iceberg
                options:
                  saveMode: overwrite
            steps:
              - sequence:
                  - name: finalDf
                    sql: SELECT * FROM city_connect
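
Once the manifest is ready, the workflow can be applied with the DataOS CLI. A minimal usage sketch (the workspace name `public` is an assumption):

dataos-ctl apply -f bigquery_depot_read.yml -w public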

Write Configuration

For writing data to a BigQuery depot, we need to configure the `name`, `dataset`, and `format` properties in the `outputs` section of the YAML. For instance, if the dataset is to be stored at the UDL address `dataos://yakdevbq:dev/city_bq?acl=rw` under the name `finalDf` in the `bigquery` format, then the `outputs` section will be as follows:

outputs:
  - name: finalDf
    dataset: dataos://yakdevbq:dev/city_bq?acl=rw
    format: bigquery
    options:
      saveMode: append
      bigquery:
        temporaryBucket: tmdc-development-new

Sample Write configuration YAML

Let’s take a case scenario where the output dataset is to be stored in a BigQuery depot, and the data is read from another depot within DataOS (in this example, a CSV dataset from a third-party depot). The write config YAML will be as follows:

bigquery_depot_write.yml
version: v1
name: bq-read-write-02
type: workflow
tags:
  - bq
  - City
title: Write bq
workflow:
  dag:
    - name: city-write-bq-02
      title: City write bq
      description: This job reads data from a source depot and writes to BigQuery
      spec:
        tags:
          - Connect
          - City
        stack: flare:5.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            inputs:
              - name: city_connect
                dataset: dataos://thirdparty01:none/city
                format: csv
                schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc
            logLevel: INFO
            outputs:
              - name: finalDf
                dataset: dataos://yakdevbq:dev/city_bq?acl=rw
                format: bigquery
                options:
                  saveMode: append
                  bigquery:
                    temporaryBucket: tmdc-development-new
            steps:
              - sequence:
                  - name: finalDf
                    sql: SELECT * FROM city_connect
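
The write workflow is applied the same way, and its run status can be checked afterwards. A sketch of the commands (the workspace `public` is assumed, and CLI flags may differ slightly across versions):

dataos-ctl apply -f bigquery_depot_write.yml -w public
dataos-ctl -t workflow -w public -n bq-read-write-02 get runtime -r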

Methods for Writing Data to BigQuery

Writing data to BigQuery can be achieved using two methods: Direct and Indirect.

Direct Write Method

The direct method writes data directly into a BigQuery table without using a temporary bucket. This method uses the BigQuery Storage Write API. To enable it, set `writeMethod` to `direct`:

outputs:
  - dataset: dataos://sanitybigquery:dev/bigquery_table?acl=rw
    format: bigquery
    name: finalDf
    options:
      saveMode: append
      extraOptions:
        writeMethod: direct

Indirect Write Method

In the indirect method, data is first written to a GCS bucket and then loaded into BigQuery via a load operation. A GCS bucket must be configured to indicate the temporary data location. Data is temporarily stored using parquet, orc, or avro formats, with parquet being the default.

Parquet does not support the Map data type. Since Flare adds a `__metadata` column of Map type when writing data, use `avro` as the `intermediateFormat` in such cases.

outputs:
  - dataset: dataos://sanitybigquery:dev/bigquery_write_nabeel_103?acl=rw
    format: bigquery
    name: finalDf
    options:
      bigquery:
        temporaryBucket: tmdc-development-new
        persistentBucket: tmdc-development-new
      saveMode: append
      extraOptions:
        temporaryGcsBucket: tmdc-development-new
        persistentGcsBucket: tmdc-development-new
        intermediateFormat: avro

The bucket can be specified in two places: under `bigquery` or under `extraOptions`. For the indirect write method, however, it must be specified under `extraOptions`.

Attributes Description

| Attribute | Description | Usage |
| --- | --- | --- |
| `writeMethod` | Controls how data is written to BigQuery. Available values are `direct` (uses the BigQuery Storage Write API) and `indirect` (writes data first to GCS and then triggers a BigQuery load operation). Optional, defaults to `indirect`. | Write |
| `temporaryGcsBucket` | The GCS bucket that temporarily holds the data before it is loaded into BigQuery. Not supported by the direct write method. | Write |
| `persistentGcsBucket` | The GCS bucket that holds the data before it is loaded into BigQuery. If set, the data is not deleted after writing to BigQuery. Not supported by the direct write method. | Write |
| `intermediateFormat` | The format of the data before it is loaded into BigQuery: `parquet`, `orc`, or `avro`. Defaults to `parquet`. Supported only for the indirect write method. | Write |