Build a Semantic Model¶
Overview
This guide outlines the end-to-end process for building and deploying a semantic model (Lens) For your Data Product.
📘 Scenario¶
To provide business users with consistent, trustworthy metrics across BI tools and applications, you need to create a semantic model (Lens) that encapsulates business logic and definitions. This model translates complex data structures into familiar business terms, facilitating self-service analytics. Leverage your source-aligned Data Product, which processes raw data into structured formats, ensuring the semantic model is grounded in accurate, well-prepared data sources. Metrics become accessible via APIs for use in BI, AI, and embedded analytics tools.
Quick concepts¶
Semantic models aim to represent data in a way that is easily understandable by business users. They provide a logical structure that defines key entities, their relationships.
Importance of semantic models:
- Improved Data Accessibility: Semantic models present data in a understandable way, using a common business language that enables users of all technical levels to access and use data effectively, promoting self-service in Data Product development.
- Enhanced Data Discovery: By organizing data clearly, semantic models simplify the process of discovering and exploring data, especially in large or complex datasets, helping users quickly find relevant information.
- Easier Data Integration: Semantic models define common meanings and relationships between data elements, making it easier to integrate and maintain consistency across various data sources and systems.
- Support for Data Governance: These models enforce data quality standards, ensuring accuracy and reliability, which is essential for regulatory compliance and maintaining trust in Data Products.
- Empowering Business Users: By simplifying technical complexities, semantic models allow business users to access and analyze data independently, fostering a data-driven culture and reducing reliance on IT teams.
Based on the identified goals of analysis, value objectives, and the key factors or drivers that will influence our value objectives, create the semantic model with the following steps.
Creating the semantic model¶
Once your conceptual model is finalized, the next step is to implement it within the Lens framework. This process involves setting up a clear Lens folder structure, defining SQL scripts for data extraction, organizing tables, and implementing dimensions, measures, and metrics to reflect your business logic.
Step 1: Set Up the Semantic Model Folder Structure¶
A well-structured directory is key to managing your model components.
🎯 Your Action:
Set up your model folder:
semantic_model
├── sqls
│ ├── customer.sql
│ ├── product.sql
│ └── purchase.sql
├── tables
│ ├── customer.yml
│ ├── product.yml
│ └── purchase.yml
├── views
│ ├── total_spend.yml
│ ├── purchase_frequency.yml
│ └── cross_sell_opportunities.yml
└── user_groups.yml
Step 2: Write SQL Scripts for Tables¶
These SQL scripts extract relevant fields from your input datasets.
🎯 Your Actions:
- Create SQL scripts for each entity.
- Add SQL files to the
sqls
folder.
customer.sql
product.sql
purchase.sql
# Replace xx with your initials
SELECT
customer_id as p_customer_id,
cast(purchase_date as timestamp) as purchase_date,
recency as recency_in_days,
mntwines,
mntmeatproducts,
mntfishproducts,
mntsweetproducts,
mntgoldprods,
mntfruits,
numdealspurchases,
numwebpurchases,
numcatalogpurchases,
numstorepurchases,
numwebvisitsmonth,
numwebpurchases + numcatalogpurchases + numstorepurchases + numstorepurchases as purchases,
(mntwines+mntmeatproducts+mntfishproducts+mntsweetproducts+mntgoldprods+mntfruits) as spend
FROM
postgresxx.public.purchase_data
Step 3: Define Manifest Files for Tables¶
This tables
folder contains logical table definitions in YAML format. These definitions help structure your data into business entities.
1. Table definition section¶
This is the foundational section where you define the core table properties.
-
name: Specifies the table name.
-
sql: Points to the SQL defined in the
/sql
folder. for example:customer.sql
-
description: Provides context about the table.
-
public: Indicates whether the table is publicly accessible.
After defining the base table, you define measures, dimensions, joins and segments to provide meaningful analysis. Each section serves a specific purpose to ensure a well-structured and accessible data model.
2. Joins section¶
Defines relationships between tables, enabling connected data insights. The relationship could be one-to-one
, one-to-many
.
-
name: The name of the table being joined.
-
relationship: Specifies the relationship type (e.g., one-to-many).
-
sql: Defines the join condition in SQL syntax.
3. Dimensions section¶
The Dimension section captures the attributes or fields of the table, detailing metadata and properties for each dimension. These attributes provide critical context and usability for downstream analysis. Here, you add (attributes like customer_id
, product_id
).
-
name: Specifies the name of the dimension.
-
type: Defines the data type of the dimension (e.g., number, string).
-
column: Maps the dimension to the corresponding database column.
-
description: Provides a brief explanation of the dimension.
-
primary_key: Indicates whether the dimension is a primary key.
-
public: Specifies if the dimension is publicly accessible.
-
meta.secure: Ensures sensitive data is masked for specified user groups.
-
sql: Custom SQL logic to create derived dimensions.
4. Measures section¶
The Measure section defines aggregated metrics derived from the table data such as total_customers
, total_products
. These measures provide quantitative insights and often include policies to protect sensitive information.
-
name: Specifies the measure name.
-
sql: Contains the aggregation logic.
-
type: Indicates the data type of the measure (e.g., number, string).
-
description: Provides additional details about the measure.
-
secure: Highlights measures that include data policy logic to protect sensitive information. For instance, in the
customer
table a data policy is applied to secure the maritial_status column of theproduct
table. The applied data policy redacts the data from the users in the ‘dataconsumer’ group.
5. Segments section¶
The Segments section defines filters or conditions that segment data based on specific criteria, enabling the creation of subsets of the dataset for more granular analysis. Segments are often used to focus on specific groups of data that meet certain conditions, like customers from a particular region or products from a certain category.
In the YAML manifest, each segment is defined by:
-
name: Specifies the segment's name, which serves as an identifier for the condition or subset of data.
-
sql: Contains the SQL logic or condition that defines the segment. The SQL condition is used to filter the data, ensuring that only the records that meet the condition are included in the segment. In the provided YAML example, the condition is {TABLE}.state = 'Illinois', which means only rows where the state column equals "Illinois" will be included in this segment.
🎯 Your actions:
- In the tables folder, define logical table definitions in YAML format. The templates show how to define tables referencing the SQL scripts created in the previous step and defining dimensions, joins, measures, and segments.
- Add these files to the
tables
folder.
Click to view YAML definition for the `customer` table
tables:
- name: customer
sql: {{ load_sql('customer') }}
description: "This table stores key details about the customers, including their personal information, income, and classification into different risk segments. It serves as the central reference point for customer-related insights."
public: true
joins:
- name: purchase
relationship: one_to_many
sql: "{TABLE.customer_id}= {purchase.p_customer_id}"
dimensions:
- name: customer_id
type: number
column: customer_id
description: "The unique identifier for each customer."
primary_key: true
public: true
- name: birth_year
type: number
column: birth_year
description: "The birth year of the customer, used for demographic analysis and customer segmentation."
- name: education
type: string
column: education
description: "The educational level of the customer, which could be used for profiling and targeted campaigns."
- name: marital_status
type: string
column: marital_status
description: "The marital status of the customer, which may provide insights into purchasing behavior and lifestyle preferences."
- name: income
type: number
column: income
description: "The annual income of the customer in the local currency, which is often a key factor in market segmentation and purchasing power analysis."
- name: country
type: string
column: country
description: "The country where the customer resides, providing geographic segmentation and analysis opportunities."
- name: customer_segments
type: string
sql: >
CASE
WHEN random() < 0.33 THEN 'High Risk'
WHEN random() < 0.66 THEN 'Moderate Risk'
ELSE 'Low Risk'
END AS
description: "Risk-based customer segments derived from the customer_id, used to categorize customers into high, moderate, and low risk groups for targeted campaigns or analysis."
measures:
- name: total_customers
sql: "COUNT( Distinct {customer_id})"
type: number
description: "The total number of customers in the dataset, used as a basic measure of customer volume."
segments:
- name: country_india
sql: country = 'India'
description: This segment filters customers by a specific country India.
meta:
secure:
user_groups:
includes:
- india
excludes:
- default
- name: country_usa
sql: country = 'USA'
description: This segment filters customers by a specific country USA.
meta:
secure:
user_groups:
includes:
- usa
excludes:
- default
Click to view YAML definition for the `product` table
<details>
<summary>Click to view YAML definition for the `product` table</summary>
```yaml
tables:
- name: product
sql: {{ load_sql('product') }}
description: "This table stores the product information."
public: true
dimensions:
- name: product_customer_id
type: string
column: product_customer_id
description: "The unique identifier representing the combination of a customer and a product."
- name: product_id
type: string
column: product_id
description: "The unique identifier for the product associated with a customer."
primary_key: true
- name: product_category
type: string
column: product_category
description: "The category of the product, such as Wines, Meats, Fish, etc."
- name: product_name
type: string
column: product_name
description: "The name of the product associated with the customer."
- name: price
type: number
column: price
description: "The price of the product assigned to the customer, in monetary units."
- name: product_affinity_score
sql: ROUND(50 + random() * 50, 2)
type: number
description: "customer who purchases one product category will also purchase another category."
- name: purchase_channel
type: string
sql: >
CASE
WHEN random() < 0.33 THEN 'Web'
WHEN random() < 0.66 THEN 'Store'
ELSE 'Catalog'
END
description: "Describes how the purchase was made (e.g., Web, Store, Catalog)."
measures:
- name: total_products
sql: product_id
type: count_distinct
description: "The number of products."
</details>
<details>
<summary>Click to view YAML definition for the `purchase.yaml` table</summary>
```yaml
tables:
- name: purchase
sql: {{ load_sql('purchase') }}
description: "This table captures detailed purchase behavior of customers, including their transaction frequency, product category spends, and web interaction history. It serves as the foundation for customer segmentation, recency analysis, and churn prediction."
public: true
joins:
- name: product
relationship: many_to_one
sql: "{TABLE.p_customer_id} = {product.product_customer_id} "
dimensions:
- name: p_customer_id
type: number
column: p_customer_id
description: "The unique identifier for a customer."
primary_key: true
- name: purchase_date
type: time
column: purchase_date
description: "The date when the customer made their last purchase."
- name: recency_in_days
type: number
column: recency_in_days
description: "The number of days since the customer’s last purchase."
public: false
- name: mntwines
type: number
column: mntwines
description: "The total amount spent by the customer on wine products."
public: false
- name: mntmeatproducts
type: number
column: mntmeatproducts
description: "The total amount spent by the customer on meat products."
public: false
- name: mntfishproducts
type: number
column: mntfishproducts
description: "The total amount spent by the customer on fish products."
public: false
- name: mntsweetproducts
type: number
column: mntsweetproducts
description: "The total amount spent by the customer on sweet products."
public: false
- name: mntgoldprods
type: number
column: mntgoldprods
description: "The total amount spent by the customer on gold products."
public: false
- name: mntfruits
type: number
column: mntfruits
description: "The total amount spent by the customer on fruit products."
public: false
- name: numdealspurchases
type: number
column: numdealspurchases
description: "The number of purchases made by the customer using deals or discounts."
public: false
- name: numwebpurchases
type: number
column: numwebpurchases
description: "The number of purchases made by the customer through the web."
public: false
- name: numcatalogpurchases
type: number
column: numcatalogpurchases
description: "The number of purchases made by the customer through catalogs."
public: false
- name: numstorepurchases
type: number
column: numstorepurchases
description: "The number of purchases made by the customer in physical stores."
public: false
- name: numwebvisitsmont
type: number
column: numwebvisitsmont
description: "The number of times the customer visited the website in the last month."
public: false
- name: purchases
type: number
column: purchases
public: false
- name: spend
type: number
column: spend
public: false
- name: country_name
type: string
sql: "{customer.country}"
description: "The name of the country where the customer is located."
measures:
- name: recency
sql: datediff(current_date,date(purchase_date))
type: min
description: Number of days since the customers last purchase.
- name: purchase_frequency
sql: purchases
type: sum
description: The number of purchases made by the customer in a specific period.
- name: total_spend
sql: spend
type: sum
description: The total amount a customer has spent across all purchases.
- name: average_spend
sql: sum(spend)/nullif(sum(purchases),0)
type: number
description: The average amount spent per transaction by the customer.
- name: churn_probability
sql: "CASE WHEN {recency} < 30 AND {total_spend} > 500 THEN 0.9 WHEN {recency} BETWEEN 30 AND 90 AND {total_spend} BETWEEN 100 AND 500 THEN 0.5 WHEN {recency} > 90 OR {total_spend} < 100 THEN 0.2 ELSE 0.1 END"
type: number
description: "The predicted likelihood that a customer will churn, based on purchase behavior and recency."
- name: cross_sell_opportunity_score
sql: >
sum((mntwines * 0.25 + mntmeatproducts * 0.2 + mntfishproducts * 0.15 + mntsweetproducts * 0.1 + mntgoldprods * 0.3 + mntfruits * 0.2))
/sum(numwebpurchases + numcatalogpurchases + numstorepurchases + 1)
description: "The potential for cross-selling a secondary product based on previous purchase patterns."
type: number
Step 4: Define Manifest Files for Business Views¶
The views
folder contains business views, encapsulating the identified drivers and metrics to provide targeted insights.
1. Meta section¶
The Meta section provides essential metadata for the metric, which includes the title and tags. This section helps categorize the metric within specific domains, use cases, and tier.
2. Metric Section¶
The Metric section defines the actual measure being tracked and the rules for how it is calculated. This includes the expression, time window, and any exclusions.
3. Tables Section¶
The Tables section defines the data sources and the structure for the metric. It specifies the join paths, the relationships between different tables, and which columns should be included in the metric calculation.
🎯 Your Actions:
- Create business views within the views folder using the given templates.
- Add them to
views
folder.
Click to view YAML definition for the `Cross-Sell Opportunity Score` view
views:
- name: cross_sell_opportunity_score
description: This metric calculate the potential for cross-selling a secondary product to customers based on past purchases.
public: true
meta:
title: Cross-Sell Opportunity Score
tags:
- DPDomain.Marketing
- DPUsecase.Customer Segmentation
- DPUsecase.Product Recommendation
- DPTier.Consumer Aligned
metric:
expression: "*/45 * * * *"
timezone: "UTC"
window: "day"
excludes:
- mntwines
- mntmeatproducts
- mntfishproducts
- mntsweetproducts
- mntgoldprods
- mntfruits
- numwebpurchases
- numcatalogpurchases
- numstorepurchases
tables:
- join_path: purchase
prefix: true
includes:
- cross_sell_opportunity_score
- mntwines
- mntmeatproducts
- mntfishproducts
- mntsweetproducts
- mntgoldprods
- mntfruits
- numwebpurchases
- numcatalogpurchases
- numstorepurchases
- customer_id
- purchase_date
- join_path: product
prefix: true
includes:
- product_category
- product_name
Click to view YAML definition for the `Purchase Frequency` view
views:
- name: purchase_frequency
description: This metric calculates the average number of times a product is purchased by customers within a given time period
public: true
meta:
title: Product Purchase Frequency
tags:
- DPDomain.Marketing
- DPUsecase.Customer Segmentation
- DPUsecase.Product Recommendation
- DPTier.Consumer Aligned
metric:
expression: "*/45 * * * *"
timezone: "UTC"
window: "day"
excludes:
- purchases
tables:
- join_path: purchase
prefix: true
includes:
- purchase_date
- customer_id
- purchase_frequency
- purchases
- join_path: product
prefix: true
includes:
- product_category
- product_name
Click to view YAML definition for the `Total Spend` view
views:
- name: total_spending
description: This metric measures how many marketing campaigns a customer has engaged with.
public: true
meta:
title: Customer Spending by Product Category
tags:
- DPDomain.Marketing
- DPUsecase.Customer Segmentation
- DPUsecase.Product Recommendation
- DPTier.Consumer Aligned
metric:
expression: "*/45 * * * *"
timezone: "UTC"
window: "day"
excludes:
- spend
tables:
- join_path: purchase
prefix: true
includes:
- purchase_date
- customer_id
- total_spend
- spend
- join_path: product
prefix: true
includes:
- product_category
- product_name
Step 5: Define Access Control¶
The user_groups.yml
file manages access permissions for different user groups within your data model.
🎯 Your Actions:
- Create a user_groups.yml file with the given template.
- Define user groups and their API scopes.
- Add users to each group.
- Add a default group at the end.
user_groups:
- name: usa
api_scopes:
- meta
- data
- graphql
- jobs
- source
includes:
- users:id:manishagrawal
- name: india
api_scopes:
- meta
- data
- graphql
- jobs
- source
includes:
- users:id:nandapage
- name: default
api_scopes:
- meta
- data
- graphql
- jobs
- source
includes: "*"
Step 6: Create a Lens Resource Manifest File¶
This file is your blueprint for configuring your Lens Resource on DataOS. This file contains a meta section and a Lens Lens-specific section to define all the technical configurations, such as secrets, code base, repo, and source configuration that your Lens will be mapped to.
🎯 Your Actions:
- Create a manifest file for your Lens Resource.
- Provide metadata, source information, and secrets.
- Specify URL and base directory.
version: v1alpha
name: "productaffinity-xx"
layer: user
type: lens
tags:
- lens
description: This semantic model provides comprehensive insights for product affinity analysis.
lens:
compute: runnable-default
secrets:
- name: bitbucket-cred
allKeys: true
source:
type: depot
name: postgresxx
repo:
url: <url> # Ex: https://bitbucket.org/tmdc/product-affinity-training
lensBaseDir: <basedir> # Ex: product-affinity-training/build/semantic-model/model
syncFlags:
- --ref=main
api: # optional
replicas: 1 # optional
logLevel: info # optional
# envs:
# LENS2_SCHEDULED_REFRESH_TIMEZONES: "UTC,America/Vancouver,America/Toronto"
resources: # optional
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 500Mi
worker: # optional
replicas: 1 # optional
logLevel: debug # optional
# envs:
# LENS2_SCHEDULED_REFRESH_TIMEZONES: "UTC,America/Vancouver,America/Toronto"
resources: # optional
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 500Mi
router: # optional
logLevel: info # optional
# envs:
# LENS2_SCHEDULED_REFRESH_TIMEZONES: "UTC,America/Vancouver,America/Toronto"
resources: # optional
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 500Mi
metric:
logLevel: info
Step 7: Push the code to Repo¶
Push the code to Repo and update the repo URL and base directory location under lens section. Use clear commit messages to describe changes effectively.
Next step¶
Your semantic model is now in place. The final milestone is to deploy your first consumer-aligned Data Product.