
DBT

dbt (data build tool) is a declarative Stack within DataOS. Acting as the transformation layer in ELT (Extract, Load, Transform), dbt focuses on optimizing and refining data that has already been loaded into the database. It centralizes and modularizes analytics code, adopting a structured approach akin to software engineering practices, and lets teams collaborate on and version data models before deploying them to production. dbt connects to the data platform and executes SQL within the warehouse to perform transformations, giving you and your team a single source of truth for metrics, insights, and business terminology.

Prerequisites

  • Basic to intermediate SQL: proficiency in utilizing the WHERE and GROUP BY clauses is essential.

  • Basics of data warehouses: a foundational understanding of data engineering is highly beneficial. It doesn't need to be extensive, but grasping key concepts is important.

How to create and use a dbt Workflow?

Create a Workflow manifest

dbt operates as a Stack that can be orchestrated through a Workflow Resource. A Workflow is a DataOS Resource that represents a Directed Acyclic Graph (DAG) of jobs, where each job is executed using a Stack. It acts as an orchestrator for a Stack.

A dbt Workflow manifest is composed of two sections: the Workflow-specific section and the dbt stackSpec section.

Configure the DBT Workflow-specific section

The Workflow-specific section contains configurations specific to the Workflow Resource.

The code snippet provided below shows a sample Workflow manifest.

syntax_resource_dbt_meta_section.yaml
name: ${dbt-workflowname}
version: v1
type: workflow
tags:
  - ${tag1}
description: ${DBT Sample Workflow}
workspace: sandbox
workflow:
  dag:
    - name: ${dbt-jobname}
      title: ${title}
      spec:
        stack: dbt+python:1.0
        compute: ${compute-name}
        resources:
          requests:
            cpu: 1000m
            memory: 250Mi
          limits:
            cpu: 1000m
            memory: 250Mi
        logLevel: ${log-level}
        stackSpec:
          #stack specific section
sample_resource_dbt_meta_section.yaml
name: dbt-workflow-v01
version: v1
type: workflow
tags:
  - tag1
description: DBT Sample Workflow
workspace: sandbox
workflow:
  dag:
    - name: dbt-workflow-job-v2
      title: dbt Sample Test Jobs
      spec:
        stack: dbt+python:1.0
        compute: runnable-default  # name of the Compute Resource; runnable-default is the typical default, adjust for your environment
        resources:
          requests:
            cpu: 1000m
            memory: 250Mi
          limits:
            cpu: 1000m
            memory: 250Mi
        logLevel: INFO
        stackSpec:
           #stack specific section

Declare the configuration for DBT stackSpec section

The Workflow Resource comprises a stackSpec section (a mapping) containing the attributes of the Stack to be orchestrated. Within the dbt Stack context, the stackSpec includes several configuration files.

The manifest below shows a sample structure of the DBT stackSpec Section:

syntax_dbt_stackspec_section.yaml
stackSpec:
  profiles:
    my_new_project:
      target: ${dev}
      outputs:
        dev:
          type: ${data platform type}
          #database-specific connection details

  dbt_project:
    name: ${project_name}
    version: '1.0.0'
    config-version: 2
    model-paths: [${path of the model}]
    profile: ${project_name}
    target-path: "target"  #directory which will store compiled SQL files
    clean-targets:         #directories to be removed by `dbt clean`
      - "target"
      - "dbt_packages"
    models:
      ${modelname}:
        +materialized: view

  dbt_packages:
    packages:
      - git: "https://github.com/dataos/dbt101.git"
sample_dbt_stackspec_section.yaml
stackSpec:
  profiles:
    dbtlearnnew:
      target: dev
      outputs:
        dev:
          type: snowflake
          account: wylos-aidb-aq8123  #account_id
          user: iamgroot
          password: iamgroot@123
          database: SAMPLEDB
          warehouse: COMPUTE_WH
          schema: PUBLIC
          role: DBTADMIN
          threads: 4
  dbt_project:
    name: 'dbtlearnnew'
    version: '1.0.0'
    config-version: 2
    model-paths: ["models"]
    profile: 'dbtlearnnew'
    target-path: "target"  #directory which will store compiled SQL files
    clean-targets:         #directories to be removed by `dbt clean`
      - "target"
      - "dbt_packages"
    models:
      dbtlearnnew:
        +materialized: view
  dbt_packages:
    packages:
      - git: "https://github.com/dataos/dbt101.git"

Stack spec section

The dbt stackSpec section comprises the following key sections: profiles, dbt_project, and dbt_packages.

Each of these sections is a mapping and comprises several section-specific attributes. The subsequent parts provide details on the necessary configurations.

Profiles section

The profiles.yml file is a configuration file that specifies connection details for target databases, i.e., the information dbt needs to connect to the data warehouse/platform.

profiles:
  <profile-name>:
    target: dev
    outputs:
      dev:
        type: <bigquery | postgres | redshift | snowflake >
        threads: 4
        #database-specific connection details

profiles:
  my_new_project:
    target: dev
    outputs:
      dev:
        type: snowflake 
        threads: 4
        #datawarehouse-specific connection details

The following table provides a comprehensive overview of the various attributes within the Profile-specific Section.

Common Attributes

| Attribute | Data Type | Default Value | Possible Value     | Requirement |
|-----------|-----------|---------------|--------------------|-------------|
| profile   | mapping   | none          | my_project         | mandatory   |
| target    | string    | none          | dev                | mandatory   |
| outputs   | mapping   | none          | dev                | mandatory   |
| type      | string    | none          | bigquery, postgres | mandatory   |
| threads   | string    | none          | 1-4                | mandatory   |
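The sample above targets Snowflake, but any type listed in the table can be used. As an illustration, a minimal sketch of the database-specific connection details for a postgres output follows; the host, credentials, and database names are hypothetical placeholders, and the keys follow dbt's standard postgres profile configuration.

profiles:
  my_new_project:
    target: dev
    outputs:
      dev:
        type: postgres
        host: postgres.example.com   #hypothetical host
        port: 5432
        user: analytics_user         #hypothetical credentials
        password: analytics_password
        dbname: analyticsdb
        schema: public
        threads: 4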

Project section

The dbt_project.yml file serves as the primary configuration file; it informs dbt about the context of your project and how to transform your data.

It is crucial for DRY (Don't Repeat Yourself) analytics code. Essentially, it acts as the repository for project-wide default configurations, and all objects will inherit from it unless overridden at the model level.

syntax_dbt_project_section.yaml
#Project names should contain only lowercase characters and underscores.
name: ${project_name}
version: ${project_version}
config-version: 2

#This setting configures which "profile" dbt uses for this project.
profile: ${profile_name}

#These configurations specify where dbt should look for different types of files.
#The `model-paths` config, for example, states that models in this project can be
#found in the "models/" directory. You probably won't need to change these!
model-paths: [${model_paths}]

#Configuring models
models:
  ${project_name}:
    ${model_group}:
      #Config indicated by + and applies to all files under models/${model_group}/
      +materialized: ${materialization}
sample_dbt_project_section.yaml
name: training_dbt
version: 1.0.0
config-version: 2

profile: training_dbt

#These configurations specify where dbt should look for different types of files.
#The model-paths config, for example, states that models in this project can be
#found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]

#Configuring models
models:
  training_dbt:
    example:
      #Config indicated by + and applies to all files under models/example/
      +materialized: view

Attributes of project in stackSpec section

| Attribute           | Data Type            | Default Value | Possible Value                                                        | Requirement |
|---------------------|----------------------|---------------|-----------------------------------------------------------------------|-------------|
| name                | string               | none          | Your project's name in snake case                                     | mandatory   |
| version             | version              | none          | Version of your project                                                | mandatory   |
| require-dbt-version | version range        | none          | Restrict your project to only work with a range of dbt Core versions  | optional    |
| profile             | string               | none          | The profile dbt uses to connect to your data platform                 | mandatory   |
| model-paths         | list of strings      | none          | Directories where your model and source files live                    | mandatory   |
| seed-paths          | list of strings      | none          | Directories where your seed files live                                | optional    |
| test-paths          | list of strings      | none          | Directories where your test files live                                | optional    |
| analysis-paths      | list of strings      | none          | Directories where your analyses live                                  | optional    |
| macro-paths         | list of strings      | none          | Directories where your macros live                                    | optional    |
| snapshot-paths      | list of strings      | none          | Directories where your snapshots live                                 | optional    |
| docs-paths          | list of strings      | none          | Directories where your docs blocks live                               | optional    |
| clean-targets       | list of strings      | none          | List of clean targets for the project                                 | optional    |
| models              | list of dictionaries | none          | List of model configurations                                          | mandatory   |
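As noted above, all models inherit the project-wide defaults (such as +materialized: view) unless overridden at the model level. A minimal sketch of such an override, using dbt's config() macro at the top of a hypothetical model file, is shown below.

-- models/example/orders_snapshot_table.sql (hypothetical model name)
-- Overrides the project-wide +materialized: view default for this model only
{{ config(materialized='table') }}

select * from {{ ref('some_upstream_model') }}  -- hypothetical upstream model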

dbt_packages section

The dbt_packages section is designed to specify the dbt projects that you wish to incorporate as dependencies. These projects, referred to as dbt packages, can be seamlessly integrated into your own dbt project, fostering modularity and code sharing.

dbt_packages:
  packages:
  - git: <GIT_REPO_URL>
dbt_packages:
  packages:
    - git: "https://github.com/iamgroot/dbt-101.git"

Apply the manifest using CLI

Use the apply command to apply the Workflow manifest using the DataOS CLI.

dataos-ctl resource apply -f ${yaml-file-path} -w ${workspace-name}
#or
dataos-ctl apply -f ${yaml-file-path} -w ${workspace-name}
dataos-ctl resource apply -f dbt.yaml -w sandbox # By default the workspace is public
#or
dataos-ctl apply -f dbt.yaml -w sandbox

Get Status of the Workflow

dataos-ctl resource get -t workflow -w ${workspace-name} -n ${workflow-name}
dataos-ctl resource get -t workflow -w sandbox -n dbt-workflow-v01
#Expected Output
INFO[0000] 🔍 get...
INFO[0001] 🔍 get...complete

          NAME        | VERSION |   TYPE   | WORKSPACE | STATUS | RUNTIME |   OWNER
----------------------|---------|----------|-----------|--------|---------|-------------
  dbt-workflow-v01   |   v1    | workflow |   sandbox  | active | running |   tmdc

To check this information for all users in a specific Workspace, add the -a flag to the command as shown below.

dataos-ctl resource get -t workflow -w ${workspace-name} -a
dataos-ctl resource get -t workflow -w sandbox -a
#Expected Output
INFO[0000] 🔍 get...
INFO[0001] 🔍 get...complete

          NAME           | VERSION |   TYPE   | WORKSPACE | STATUS |  RUNTIME  |       OWNER
-------------------------|---------|----------|-----------|--------|-----------|--------------------
  dbt-workflow-v01       | v1      | workflow | sandbox    | active | succeeded | wonderwoman
  cnt-product-demo-01    | v1      | workflow | sandbox    | active | running   | tmdc
  cnt-product-demo-01-01 | v1      | workflow | sandbox    | active | failed    | otheruser
  cnt-city-demo-01001    | v1      | workflow | sandbox    | active | succeeded | iamgroot

Get Runtime Information

To refresh or see updates on the Workflow progress, add the -r flag to the [get runtime](/interfaces/cli/command_reference/#get-runtime) command:

dataos-ctl -i get runtime " cnt-product-demo-01 | v1 | workflow | sandbox" -r

Press Ctrl + C to exit.

For any additional flags, use help by appending -h with the respective command.
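For example, to list the flags supported by the get command:

dataos-ctl resource get -h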

Case study

Summary

This dbt project aims to transform and prepare data through various stages, starting with cleaning and renaming in the staging phase, moving to intermediate transformations, building a core data layer, creating non-volatile datasets for mapping, and finally performing detailed analytics. The use cases highlight the specific objectives of each transformation, contributing to a comprehensive and well-organized data analytics workflow.

Staging Folder

stg_jaffle_shop_customers.sql
select
    id as customer_id,
    concat(first_name, ' ', last_name) as customer_name,
    current_timestamp() as dp_load_date
from {{ source('jaffle_shop', 'customers') }}
stg_jaffle_shop_orders.sql
-- Renaming columns for clarity

select
    id as order_id,
    user_id as customer_id,
    order_date,
    status,
    _etl_loaded_at
from {{ source('jaffle_shop', 'orders') }}
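Both staging models reference {{ source(...) }}, which assumes a source definition exists somewhere in the project. A minimal sketch of such a definition follows; the file name, database, and schema are hypothetical and must match where the raw jaffle_shop tables actually live.

sources.yml
version: 2

sources:
  - name: jaffle_shop
    database: SAMPLEDB        #hypothetical database holding the raw data
    schema: raw_jaffle_shop   #hypothetical schema
    tables:
      - name: customers
      - name: orders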

Intermediate Folder

Payment Type Analysis:

  • Analyze payment data to understand the distribution of cash and credit payments.

Total Payment Amount Calculation:

  • Calculate the total amount paid for orders with a "success" payment status.

  • int_payment_type_amount.sql:

    • Create a CTE (order_payments) based on stg_stripe_payment.
    • Summarize payment amounts by payment type and order status.
-- int_payment_type_amount.sql
with order_payments as (
    select * from {{ ref('stg_stripe_payment') }}
)
select
    order_id,
    sum(
        case
        when payment_type = 'cash' and
            status = 'success'
        then amount
        else 0
        end
    ) as cash_amount,
    sum(
        case
        when payment_type = 'credit' and
            status = 'success'
        then amount
        else 0
        end
    ) as credit_amount,
    sum(case
        when status = 'success'
        then amount
        end
    ) as total_amount
from order_payments
group by 1

Marts/Core Folder

  1. Customer Dimension Creation:
    • Create a dimensional view of customers, incorporating information from the staged customer data.
  2. Customer Data Exploration:
    • Use the dim_customers view for exploratory analysis and understanding customer attributes.
  3. dim_customers.sql:
    • Transformation Objective:
      • Create a CTE (customers) based on the stg_jaffle_shop_customers.
      • Select all columns from the customers CTE.
dim_customers.sql
with customers as (
    select * from {{ ref('stg_jaffle_shop_customers') }}
)
select * from customers

Order Fact Creation:

  • Create a fact table (fact_orders) representing detailed order information, including payment types and amounts.
  • Order Completion Flag:
    • Introduce a flag (is_order_completed) indicating whether an order is completed or not based on its status.
fact_orders.sql
-- fact_table
with orders as (
    select * from {{ ref('stg_jaffle_shop_orders' )}}
),
payment_type_orders as (
    select * from {{ ref('in_payment_type_amount_per_order' )}}
)
select
    ord.order_id,
    ord.customer_id,
    ord.order_date,
    pto.cash_amount,
    pto.credit_amount,
    pto.total_amount,
    case
    when status = 'completed'
    then 1
    else 0
    end as is_order_completed
from orders as ord
left join payment_type_orders as pto ON ord.order_id = pto.order_id
  1. fact_orders.sql:
    • Transformation Objective:
      • Create CTEs (orders and payment_type_orders) based on the stg_jaffle_shop_orders and in_payment_type_amount_per_order.
      • Perform final transformations, including joining relevant CTEs and calculating additional fields.

Possible Business Use Cases:

  1. Customer Paid Amount Aggregation:
    • Aggregate the total paid amount for each customer based on completed orders.
  2. Customer Classification:
    • Classify customers into different categories (Regular, Bronze, Silver, Gold) based on their total paid amounts.
customer_range_based_on_total_paid_amount.sql
with fact_orders as (
    select * from {{ ref('fact_orders')}}
),
dim_customers as (
    select * from {{ ref('dim_customers' )}}
),
total_amount_per_customer_on_orders_complete as (
    select
        cust.customer_id,
        cust.customer_name,
        SUM(total_amount) as global_paid_amount
    from fact_orders as ord
    left join dim_customers as cust ON ord.customer_id = cust.customer_id
    where ord.is_order_completed = 1
    group by cust.customer_id, customer_name
),
customer_range_per_paid_amount as (
    select * from {{ ref('seed_customer_range_per_paid_amount' )}}
)
select
    tac.customer_id,
    tac.customer_name,
    tac.global_paid_amount,
    crp.classification
from total_amount_per_customer_on_orders_complete as tac
left join customer_range_per_paid_amount as crp
on tac.global_paid_amount >= crp.min_range
and tac.global_paid_amount <= crp.max_range

Seeds Folder

Objective: Create non-volatile datasets for mapping and enriching data.

Use Case: In the seeds folder, we store small datasets to map values.

  1. seed_customer_range_per_paid_amount.csv:
    • Transformation Objective:
      • Define ranges for customer classifications based on total paid amounts.
      • Ranges include Regular, Bronze, Silver, and Gold classifications.

Example:

seed_customer_range_per_paid_amount.csv
min_range,max_range,classification
0,9.999,Regular
10,29.999,Bronze
30,49.999,Silver
50,9999999,Gold
Run the dbt seed command to materialize the CSV file into a table in the data platform, as sketched below.
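A minimal sketch of the seed invocation, assuming dbt is run from the project root (within the Stack, this step is handled as part of the job run):

#Load every seed CSV in the project into the target schema
dbt seed

#Or load only this particular seed
dbt seed --select seed_customer_range_per_paid_amount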

Analyses Folder

Objective: Perform detailed analytics and generate insights for reporting.

Use Case: In the analyses folder, we create SQL queries to analyze and derive insights.

  1. customer_range_based_on_total_paid_amount.sql:
    • Transformation Objective:
      • Create CTEs based on fact_orders, dim_customers, total_amount_per_customer_on_orders_complete, and customer_range_per_paid_amount.
      • Analyze and categorize customers based on their total paid amounts.

Example:

customer_range_based_on_total_paid_amount.sql
-- Classify each customer into a range based on completed orders and the total amount paid.

with fact_orders as (
    select * from {{ ref('fact_orders') }}
),
dim_customers as (
    select * from {{ ref('dim_customers') }}
),
total_amount_per_customer_on_orders_complete as (
    select
        cust.customer_id,
        cust.customer_name,
        sum(total_amount) as global_paid_amount
    from fact_orders as ord
    left join dim_customers as cust on ord.customer_id = cust.customer_id
    where ord.is_order_completed = 1
    group by cust.customer_id, customer_name
),
customer_range_per_paid_amount as (
    select * from {{ ref('seed_customer_range_per_paid_amount') }}
)
select
    tac.customer_id,
    tac.customer_name,
    tac.global_paid_amount,
    crp.classification
from total_amount_per_customer_on_orders_complete as tac
left join customer_range_per_paid_amount as crp
    on tac.global_paid_amount >= crp.min_range
    and tac.global_paid_amount <= crp.max_range

The result gives, for each customer, the total amount paid and its corresponding classification range.
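Note that files in the analyses folder are compiled but never materialized by dbt. A minimal sketch of inspecting the rendered query, assuming dbt is run from the project root:

#Render all models and analyses to plain SQL without executing them
dbt compile

#The rendered analysis can then be found under the configured target-path, e.g.
#target/compiled/<project_name>/analyses/customer_range_based_on_total_paid_amount.sql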