Data Toolbox¶
The Data Toolbox Stack, or simply the Toolbox Stack, updates dataset metadata in Icebase depots. When data is ingested into Icebase using the Flare Stack, the metadata of the ingested datasets must be registered with Metis before the data can be queried using Workbench. The Data Toolbox provides the set_version action for data stored in Icebase, the DataOS internal storage, which uses the Iceberg format. Metis keeps track of an Iceberg table by storing a reference to its latest metadata file. Using the set_version action, you can update the metadata version to the latest or to any specific version.
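As a conceptual illustration only, the idea of pointing a table reference at "latest or a specific version" can be modelled in a few lines of Python. This is not how the Toolbox Stack is implemented. The function name `resolve_version` is hypothetical, and the file-name pattern `v<N>.metadata.json` (optionally `v<N>.gz.metadata.json`) is an assumption about how the Iceberg metadata files are named in a given depot.

```python
import re

# Assumed naming pattern for Iceberg metadata files (an assumption, not taken
# from the Toolbox documentation): v1.metadata.json, v2.gz.metadata.json, ...
_VERSION = re.compile(r"^v(\d+)(?:\.gz)?\.metadata\.json$")


def resolve_version(requested, metadata_files):
    """Return the metadata file that a set_version-style request would select.

    `requested` is either the string "latest" or the name of a specific file.
    """
    versions = {}
    for name in metadata_files:
        match = _VERSION.match(name)
        if match:
            versions[int(match.group(1))] = name
    if not versions:
        raise ValueError("no versioned metadata files found")
    if requested == "latest":
        # Compare version numbers numerically so that v10 sorts after v2.
        return versions[max(versions)]
    if requested in versions.values():
        return requested
    raise ValueError(f"unknown metadata version: {requested}")


files = ["v1.gz.metadata.json", "v2.gz.metadata.json", "v10.gz.metadata.json"]
print(resolve_version("latest", files))
print(resolve_version("v2.gz.metadata.json", files))
```

The numeric comparison is the point of the sketch: a plain string sort would rank `v2` above `v10` and select the wrong file as the latest one.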
Syntax of Data Toolbox YAML Configuration¶
Performing Data Toolbox Actions¶
You can define a Data Toolbox action either as a separate workflow or as a job within the DAG of an existing workflow.
Toolbox Action as a separate Workflow¶
To perform a Data Toolbox action as a separate workflow, follow the steps below:
Create a YAML file for Workflow¶
If you have already ingested data, you can create a separate workflow for the Toolbox action, as shown in the sample workflow below. To learn more about workflows, refer to the Workflow documentation.
Define a Job that executes upon Toolbox Stack¶
Within the DAG, define a job that runs on the Toolbox Stack. To learn more about the properties of a job that runs on the Toolbox Stack, refer to the Toolbox Stack documentation.
Sample Toolbox Workflow
version: v1
name: dataos-toolbox-workflow
type: workflow
workflow:
  dag:
    - name: dataos-toolbox-city-01
      spec:
        stack: toolbox
        compute: runnable-default
        stackSpec:
          dataset: dataos://icebase:sample/city?acl=rw
          action:
            name: set_version
            value: latest
Field | Data Type | Default Value | Possible Value | Requirement |
---|---|---|---|---|
stackSpec | object | none | none | mandatory |
dataset | string | none | any valid Iceberg dataset UDL address | mandatory |
action | object | none | none | mandatory |
name | string | none | set_version | mandatory |
value | string | none | latest or any other specific metadata version | mandatory |
To learn more about the fields in the Toolbox-specific section of the YAML configuration, refer to the Toolbox configuration reference.
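The rules in the table can be checked before a workflow is applied. The following is a minimal sketch, assuming the stackSpec section has already been loaded into a Python dictionary. The function name `validate_stack_spec` is hypothetical, and the check that a dataset address starts with `dataos://` is an assumption based on the addresses used on this page, not a complete UDL validation.

```python
def validate_stack_spec(stack_spec):
    """Return a list of problems found in a Toolbox stackSpec (empty if none)."""
    if not isinstance(stack_spec, dict):
        return ["stackSpec must be an object (mapping)"]

    problems = []

    # dataset: mandatory string. The dataos:// prefix check is an assumption
    # drawn from the sample addresses, not a full UDL parser.
    dataset = stack_spec.get("dataset")
    if not isinstance(dataset, str) or not dataset.startswith("dataos://"):
        problems.append("dataset must be a string UDL address starting with dataos://")

    # action: mandatory object holding name and value.
    action = stack_spec.get("action")
    if not isinstance(action, dict):
        problems.append("action must be an object (mapping)")
        return problems

    # name: set_version is the only value listed in the table.
    if action.get("name") != "set_version":
        problems.append("action.name must be set_version")

    # value: latest or a specific metadata version, so any non-empty string.
    value = action.get("value")
    if not isinstance(value, str) or not value:
        problems.append("action.value must be a non-empty string such as latest")

    return problems


good = {
    "dataset": "dataos://icebase:sample/city?acl=rw",
    "action": {"name": "set_version", "value": "latest"},
}
bad = {
    "dataset": "dataos://icebase:sample/city?acl=rw",
    "action": {"name": "set_version"},  # value is missing
}

print(validate_stack_spec(good))
print(validate_stack_spec(bad))
```

Returning a list of problems rather than raising on the first one lets a caller report every missing field in a single pass.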
Apply the Workflow using CLI¶
Use the apply command to apply the workflow using the CLI:
dataos-ctl apply -f <path/file-name> -w <workspace> # The workspace defaults to public, so the -w flag is optional
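If the command is assembled from a script, the optional workspace flag can be handled as in the sketch below. It only builds and prints the command shown above and does not run it. The helper name `build_apply_command` and the file name `toolbox-workflow.yaml` are hypothetical.

```python
import shlex


def build_apply_command(manifest_path, workspace=None):
    """Build the dataos-ctl apply command as an argument list.

    The -w flag is added only when a workspace is given; when it is left out,
    the CLI falls back to the default workspace (public).
    """
    command = ["dataos-ctl", "apply", "-f", manifest_path]
    if workspace is not None:
        command += ["-w", workspace]
    return command


# Print the commands instead of executing them.
print(shlex.join(build_apply_command("toolbox-workflow.yaml")))
print(shlex.join(build_apply_command("toolbox-workflow.yaml", workspace="public")))
```

Keeping the command as a list avoids quoting problems with paths that contain spaces; `shlex.join` is used here only to display it.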
Toolbox Action in a Job within the Flare Workflow¶
Instead of creating a separate workflow for Toolbox, you can add the Toolbox action to the ingestion workflow itself. The DAG then contains two jobs: the first runs on the Flare Stack and ingests the data, and the second runs on the Toolbox Stack and depends on the first job, so it starts only after the first job has run.
Sample Toolbox Stack Action in a Job within a Workflow
version: v1 # Version
name: wf-sample-002 # Workflow Name
type: workflow # Resource type
tags: # Tags
  - Con
  - CONNECT
description: The job ingests data using Flare and registers the metadata using Toolbox Stack # Description of the Workflow
workflow: # Workflow Section
  title: Connect City # Title
  dag: # Directed Acyclic Graph
    # Job 1 runs on the Flare Stack: this job ingests city data
    - name: wf-sample-job-001
      title: City Data Ingester
      description: The job ingests city data
      spec:
        tags:
          - Connect
          - City
        stack: flare:5.0
        compute: runnable-default
        stackSpec:
          job:
            explain: true
            logLevel: INFO
            inputs:
              - name: city_connect
                dataset: dataos://thirdparty01:none/city
                format: csv
                schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc
            outputs:
              - name: cities
                dataset: dataos://icebase:retail/city01?acl=rw
                format: Iceberg
                description: City data ingested from external csv
                options:
                  saveMode: append
                  sort:
                    mode: partition
                    columns:
                      - name: version
                        order: desc
                  iceberg:
                    properties:
                      write.format.default: parquet
                      write.metadata.compression-codec: gzip
                    partitionSpec:
                      - type: identity
                        column: version
            steps:
              - sequence:
                  - name: cities
                    doc: Pick all columns from cities and add version as yyyyMMddHHmm formatted timestamp.
                    sql: |
                      SELECT
                        *,
                        date_format(now(), 'yyyyMMddHHmm') AS version,
                        now() AS ts_city
                      FROM
                        city_connect
    # Job 2 runs on the Toolbox Stack: this job registers the latest version of the metadata with Metis
    - name: data-tool-job-001
      spec:
        stack: toolbox
        compute: runnable-default
        stackSpec:
          dataset: dataos://icebase:retail/city01
          action:
            name: set_version
            value: latest
      dependencies: wf-sample-job-001
Once you have defined the workflow, you can apply it using the CLI.
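Before applying a workflow with dependent jobs, it can help to confirm that every dependency names a job that exists in the DAG and that the jobs can be ordered. The sketch below assumes the dag section has been loaded into a list of Python dictionaries. The function name `job_run_order` is hypothetical, and it accepts `dependencies` either as a single job name, as in the sample above, or as a list of names.

```python
def job_run_order(dag):
    """Return job names in an order that respects each job's dependencies.

    Raises ValueError if a dependency names an unknown job or if the
    dependencies form a cycle.
    """
    deps = {}
    for job in dag:
        raw = job.get("dependencies", [])
        # Accept either a single job name or a list of job names.
        deps[job["name"]] = [raw] if isinstance(raw, str) else list(raw)

    for name, needed in deps.items():
        for dep in needed:
            if dep not in deps:
                raise ValueError(f"{name} depends on unknown job {dep}")

    order = []
    remaining = dict(deps)
    while remaining:
        # A job is ready once all of its dependencies are already scheduled.
        ready = [n for n, needed in remaining.items() if all(d in order for d in needed)]
        if not ready:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        for name in ready:
            order.append(name)
            del remaining[name]
    return order


# The two jobs from the sample workflow, listed out of order on purpose.
dag = [
    {"name": "data-tool-job-001", "dependencies": "wf-sample-job-001"},
    {"name": "wf-sample-job-001"},
]
print(job_run_order(dag))
```

For the sample workflow this places the Flare ingestion job ahead of the Toolbox job regardless of the order in which the jobs are listed.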