DBT Stack¶
dbt (data build tool) is a declarative Stack within DataOS. As the transformation layer in ELT (Extract, Load, Transform), dbt refines data that has already been loaded into the warehouse. It centralizes and modularizes analytics code, adopting a structured approach akin to software engineering: teams can collaborate on data models and version them before deploying to production. dbt connects to the data platform and executes SQL within the warehouse to perform transformations, establishing a single source of truth for metrics, insights, and business terminology.
Prerequisites
- Basic to intermediate SQL: proficiency with the WHERE and GROUP BY clauses is essential (see the illustrative query after this list).
- Basics of data warehouses: a foundational understanding of data engineering is highly beneficial. It doesn't need to be extensive, but grasping the key concepts is important.
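For instance, the following query is at the level this guide assumes; the orders table here is purely illustrative:

select
    customer_id,
    sum(amount) as total_amount
from orders
where status = 'success'
group by customer_id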
How to create and use a dbt Workflow¶
Create a Workflow manifest¶
dbt operates as a Stack that can be orchestrated through a Workflow Resource. A Workflow is a DataOS Resource that represents a Directed Acyclic Graph (DAG) of jobs, where each job is executed using a Stack. It acts as an orchestrator for a Stack.
A dbt Workflow manifest is composed of two sections: a Workflow-specific section and a dbt stackSpec section, described below.
Configure the dbt Workflow-specific section¶
The Workflow-specific section contains configurations specific to the Workflow Resource.
The snippets below show the structure of a Workflow manifest with placeholder values, followed by a complete example.
name: ${dbt-workflowname}
version: v1
type: workflow
tags:
  - ${tag1}
description: ${DBT Sample Workflow}
workspace: sandbox
workflow:
  dag:
    - name: ${dbt-jobname}
      title: ${title}
      spec:
        stack: dbt+python:1.0
        compute: ${compute-name}
        resources:
          requests:
            cpu: 1000m
            memory: 250Mi
          limits:
            cpu: 1000m
            memory: 250Mi
        logLevel: ${log-level}
        stackSpec:
          # stack-specific section
name: dbt-workflow-v01
version: v1
type: workflow
tags:
  - tag1
description: DBT Sample Workflow
workspace: sandbox
workflow:
  dag:
    - name: dbt-workflow-job-v2
      title: dbt Sample Test Jobs
      spec:
        stack: dbt+python:1.0
        compute: runnable-default  # assumes the default runnable Compute; replace with your Compute Resource
        resources:
          requests:
            cpu: 1000m
            memory: 250Mi
          limits:
            cpu: 1000m
            memory: 250Mi
        logLevel: INFO
        stackSpec:
          # stack-specific section
Declare the configuration for dbt stackSpec section¶
The Workflow Resource comprises a stackSpec section (a mapping) containing the attributes of the Stack to be orchestrated. Within the dbt Stack context, the stackSpec includes several configuration sections.
The manifest below shows a sample structure of the dbt stackSpec section:
stackSpec:
  profiles:
    my_new_project:
      target: ${dev}
      outputs:
        dev:
          type: ${data platform type}
          # database-specific connection details
  dbt_project:
    name: ${project_name}
    version: '1.0.0'
    config-version: 2
    model-paths: ${path of the model}
    profile: ${project_name}
    target-path: "target"  # directory which will store compiled SQL files
    clean-targets:  # directories to be removed by `dbt clean`
      - "target"
      - "dbt_packages"
    models:
      ${modelname}:
        +materialized: view
  dbt_packages:
    packages:
      - git: "https://github.com/dataos/dbt101.git"
The following example shows a populated stackSpec for a Snowflake connection:

stackSpec:
  profiles:
    dbtlearnnew:
      target: dev
      outputs:
        dev:
          type: snowflake
          account: wylos-aidb-aq8123  # account identifier
          user: iamgroot
          password: iamgroot@123
          database: SAMPLEDB
          warehouse: COMPUTE_WH
          schema: PUBLIC
          role: DBTADMIN
          threads: 4
  dbt_project:
    name: 'dbtlearnnew'
    version: '1.0.0'
    config-version: 2
    model-paths: ["models"]
    profile: 'dbtlearnnew'
    target-path: "target"  # directory which will store compiled SQL files
    clean-targets:  # directories to be removed by `dbt clean`
      - "target"
      - "dbt_packages"
    models:
      dbtlearnnew:
        +materialized: view
  dbt_packages:
    packages:
      - git: "https://github.com/dataos/dbt101.git"
Stack spec section¶
The dbt stackSpec section comprises the following key sections: profiles, dbt_project, and dbt_packages.
Each of these sections is a mapping and comprises several section-specific attributes. The subsequent parts provide details on the necessary configurations.
Profiles section¶
The profiles.yml file is a configuration file that specifies connection details for target databases. It includes the information needed to connect to the data warehouse/platform.
The following table provides a comprehensive overview of the attributes within the profiles-specific section.
Common Attributes
Attributes | Data Type | Default Value | Possible Value | Requirement
---|---|---|---|---
profile | mapping | none | my_project | mandatory
target | string | none | dev | mandatory
outputs | mapping | none | dev | mandatory
type | string | none | bigquery, postgres | mandatory
threads | integer | none | 1-4 | mandatory
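For reference, a minimal sketch of a profiles entry for a Postgres connection; the host, credentials, and database values are placeholders, and the exact connection keys vary by adapter:

profiles:
  my_new_project:
    target: dev
    outputs:
      dev:
        type: postgres
        host: ${postgres-host}
        port: 5432
        user: ${username}
        password: ${password}
        dbname: ${database-name}
        schema: public
        threads: 2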
Project section¶
The dbt_project.yml file serves as the primary configuration file; it informs dbt about the context of your project and how to transform your data.
It is crucial for DRY (Don't Repeat Yourself) analytics code. Essentially, it acts as the repository for project-wide default configurations, and all objects will inherit from it unless overridden at the model level.
# Project names should contain only lowercase characters and underscores.
name: ${project_name}
version: ${project_version}
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: ${profile_name}

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: [${model_paths}]

# Configuring models
models:
  ${project_name}:
    ${model_group}:
      # Config indicated by + applies to all files under models/example/
      +materialized: ${materialization}
A filled-in example:

name: training_dbt
version: '1.0.0'
config-version: 2
profile: training_dbt

# These configurations specify where dbt should look for different types of files.
# The model-paths config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: [models]

# Configuring models
models:
  training_dbt:
    example:
      # Config indicated by + applies to all files under models/example/
      +materialized: view
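As noted above, project-wide defaults can be overridden at the model level. A minimal sketch with a hypothetical model file, using dbt's config() macro:

-- models/example/my_first_model.sql
{{ config(materialized='table') }}

select 1 as id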
Attributes of the project section in stackSpec

Attributes | Data Type | Default Value | Possible Value | Requirement
---|---|---|---|---
name | string | none | Your project's name in snake case | mandatory
version | version | none | Version of your project | mandatory
require-dbt-version | version range | none | Restrict your project to a range of dbt Core versions | optional
profile | string | none | The profile dbt uses to connect to your data platform | mandatory
model-paths | list of strings | none | Directories where your model and source files live | mandatory
seed-paths | list of strings | none | Directories where your seed files live | optional
test-paths | list of strings | none | Directories where your test files live | optional
analysis-paths | list of strings | none | Directories where your analyses live | optional
macro-paths | list of strings | none | Directories where your macros live | optional
snapshot-paths | list of strings | none | Directories where your snapshots live | optional
docs-paths | list of strings | none | Directories where your docs blocks live | optional
clean-targets | list of strings | none | Directories removed by `dbt clean` | optional
models | mapping | none | Model configurations for the project | mandatory
dbt_packages section¶
The dbt_packages section specifies the dbt projects you wish to incorporate as dependencies. These projects, referred to as dbt packages, can be seamlessly integrated into your own dbt project, fostering modularity and code sharing.
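As a sketch, this section accepts the same entries as a standard dbt packages.yml: packages pulled from the dbt Hub by name and version, or from a git repository (the dbt_utils version shown is illustrative):

dbt_packages:
  packages:
    - package: dbt-labs/dbt_utils
      version: 1.1.1
    - git: "https://github.com/dataos/dbt101.git"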
Apply the manifest using CLI¶
Use the apply command to deploy the Workflow manifest via the DataOS CLI.
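A representative invocation, by analogy with the get commands shown below (the manifest path is a placeholder):

dataos-ctl resource apply -f ${path/to/dbt-workflow.yaml} -w sandbox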
Get Status of the Workflow
dataos-ctl resource get -t workflow -w sandbox -n dbt-workflow-v01
#Expected Output
INFO[0000] 🔍 get...
INFO[0001] 🔍 get...complete
NAME | VERSION | TYPE | WORKSPACE | STATUS | RUNTIME | OWNER
----------------------|---------|----------|-----------|--------|---------|-------------
dbt-workflow-v01 | v1 | workflow | sandbox | active | running | tmdc
To check this information for all users in a specific Workspace, add the -a flag to the command, as shown below.
dataos-ctl resource get -t workflow -w sandbox -a
#Expected Output
INFO[0000] 🔍 get...
INFO[0001] 🔍 get...complete
NAME | VERSION | TYPE | WORKSPACE | STATUS | RUNTIME | OWNER
-------------------------|---------|----------|-----------|--------|-----------|--------------------
dbt-workflow-v01 | v1 | workflow | sandbox | active | succeeded | wonderwoman
cnt-product-demo-01 | v1 | workflow | sandbox | active | running | tmdc
cnt-product-demo-01-01 | v1 | workflow | sandbox | active | failed | otheruser
cnt-city-demo-01001 | v1 | workflow | sandbox | active | succeeded | iamgroot
Get Runtime Information
To refresh or see updates on the Workflow's progress, add the -r flag to the [get runtime](/interfaces/cli/command_reference/#get-runtime) command:
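A representative invocation, assuming the Workflow created earlier:

dataos-ctl get runtime -w sandbox -t workflow -n dbt-workflow-v01 -r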
Press Ctrl + C to exit.
For any additional flags, append -h to the respective command to see its help.
Case study¶
Summary¶
This dbt project aims to transform and prepare data through various stages, starting with cleaning and renaming in the staging phase, moving to intermediate transformations, building a core data layer, creating non-volatile datasets for mapping, and finally performing detailed analytics. The use cases highlight the specific objectives of each transformation, contributing to a comprehensive and well-organized data analytics workflow.
Staging Folder¶
Objective: Clean and rename source data before further transformation.

-- stg_jaffle_shop_customers.sql: rename columns and add a load timestamp
select
    id as customer_id,
    concat(first_name, ' ', last_name) as customer_name,
    current_timestamp() as dp_load_date
from {{ source('jaffle_shop', 'customers') }}
-- stg_jaffle_shop_orders.sql: rename columns for clarity
select
    id as order_id,
    user_id as customer_id,
    order_date,
    status,
    _etl_loaded_at
from {{ source('jaffle_shop', 'orders') }}
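Both staging models read via the source() function, which requires a source declaration in a YAML file alongside the models. A minimal sketch, assuming the database and schema from the Snowflake profile example above:

version: 2

sources:
  - name: jaffle_shop
    database: SAMPLEDB
    schema: PUBLIC
    tables:
      - name: customers
      - name: orders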
Intermediate Folder¶
Possible business use cases:
- Payment Type Analysis: analyze payment data to understand the distribution of cash and credit payments.
- Total Payment Amount Calculation: calculate the total amount paid for orders with a "success" payment status.

in_payment_type_amount_per_order.sql:
- Transformation objective:
  - Create a CTE (order_payments) based on stg_stripe_payment.
  - Summarize payment amounts by payment type and order status.
-- in_payment_type_amount_per_order.sql
with order_payments as (
    select * from {{ ref('stg_stripe_payment') }}
)

select
    order_id,
    sum(
        case
            when payment_type = 'cash'
                and status = 'success'
            then amount
            else 0
        end
    ) as cash_amount,
    sum(
        case
            when payment_type = 'credit'
                and status = 'success'
            then amount
            else 0
        end
    ) as credit_amount,
    sum(
        case
            when status = 'success'
            then amount
        end
    ) as total_amount
from order_payments
group by 1
Marts/Core Folder¶
Possible business use cases:
- Customer Dimension Creation: create a dimensional view of customers, incorporating information from the staged customer data.
- Customer Data Exploration: use the dim_customers view for exploratory analysis and understanding customer attributes.

dim_customers.sql:
- Transformation objective:
  - Create a CTE (customers) based on stg_jaffle_shop_customers.
  - Select all columns from the customers CTE.
-- dim_customers.sql
with customers as (
    select * from {{ ref('stg_jaffle_shop_customers') }}
)

select * from customers
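Although not part of this case study, a dimension like this is commonly paired with dbt generic tests declared in a YAML file next to the model; the tests below are an illustrative assumption:

version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null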
Possible business use cases:
- Order Fact Creation: create a fact table (fact_orders) representing detailed order information, including payment types and amounts.
- Order Completion Flag: introduce a flag (is_order_completed) indicating whether an order is completed based on its status.

fact_orders.sql:
- Transformation objective:
  - Create CTEs (orders and payment_type_orders) based on stg_jaffle_shop_orders and in_payment_type_amount_per_order.
  - Perform the final transformations, including joining the CTEs and calculating additional fields.
-- fact_orders.sql
with orders as (
    select * from {{ ref('stg_jaffle_shop_orders') }}
),

payment_type_orders as (
    select * from {{ ref('in_payment_type_amount_per_order') }}
)

select
    ord.order_id,
    ord.customer_id,
    ord.order_date,
    pto.cash_amount,
    pto.credit_amount,
    pto.total_amount,
    case
        when status = 'completed'
        then 1
        else 0
    end as is_order_completed
from orders as ord
left join payment_type_orders as pto
    on ord.order_id = pto.order_id
Possible business use cases:
- Customer Paid Amount Aggregation: aggregate the total paid amount for each customer based on completed orders.
- Customer Classification: classify customers into categories (Regular, Bronze, Silver, Gold) based on their total paid amounts.
with fact_orders as (
    select * from {{ ref('fact_orders') }}
),

dim_customers as (
    select * from {{ ref('dim_customers') }}
),

total_amount_per_customer_on_orders_complete as (
    select
        cust.customer_id,
        cust.customer_name,
        sum(total_amount) as global_paid_amount
    from fact_orders as ord
    left join dim_customers as cust
        on ord.customer_id = cust.customer_id
    where ord.is_order_completed = 1
    group by cust.customer_id, customer_name
),

customer_range_per_paid_amount as (
    select * from {{ ref('seed_customer_range_per_paid_amount') }}
)

select
    tac.customer_id,
    tac.customer_name,
    tac.global_paid_amount,
    crp.classification
from total_amount_per_customer_on_orders_complete as tac
left join customer_range_per_paid_amount as crp
    on tac.global_paid_amount >= crp.min_range
    and tac.global_paid_amount <= crp.max_range
Seeds Folder¶
Objective: Create non-volatile datasets for mapping and enriching data.
Use Case:
In the seeds folder, we store small datasets used to map values.

seed_customer_range_per_paid_amount.csv:
- Transformation objective:
  - Define ranges for customer classifications based on total paid amounts.
  - Ranges include Regular, Bronze, Silver, and Gold classifications.
Example of the ranges mapping:
min_range,max_range,classification
0,9.999,Regular
10,29.999,Bronze
30,49.999,Silver
50,9999999,Gold
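If the numeric columns need explicit types when the seed is loaded, a seed configuration can be added in dbt_project.yml; the column types below are assumptions:

seeds:
  training_dbt:
    seed_customer_range_per_paid_amount:
      +column_types:
        min_range: float
        max_range: float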
Analyses Folder¶
Objective: Perform detailed analytics and generate insights for reporting.
Use Case:
In the analyses folder, we create SQL queries to analyze and derive insights.

customer_range_based_on_total_paid_amount.sql:
- Transformation objective:
  - Create CTEs based on fact_orders, dim_customers, total_amount_per_customer_on_orders_complete, and customer_range_per_paid_amount.
  - Analyze and categorize customers based on their total paid amounts.
Example:

-- Classify each customer based on completed orders and the total amount paid.
with fact_orders as (
    select * from {{ ref('fact_orders') }}
),

dim_customers as (
    select * from {{ ref('dim_customers') }}
),

total_amount_per_customer_on_orders_complete as (
    select
        cust.customer_id,
        cust.customer_name,
        sum(total_amount) as global_paid_amount
    from fact_orders as ord
    left join dim_customers as cust
        on ord.customer_id = cust.customer_id
    where ord.is_order_completed = 1
    group by cust.customer_id, customer_name
),

customer_range_per_paid_amount as (
    select * from {{ ref('seed_customer_range_per_paid_amount') }}
)

select
    tac.customer_id,
    tac.customer_name,
    tac.global_paid_amount,
    crp.classification
from total_amount_per_customer_on_orders_complete as tac
left join customer_range_per_paid_amount as crp
    on tac.global_paid_amount >= crp.min_range
    and tac.global_paid_amount <= crp.max_range
The query returns, for each customer, the total amount paid and the corresponding classification range.