Running Data Profiling with Soda Stack¶
Data profiling helps you understand the quality, structure, and completeness of your datasets. Using Soda Stack, you can automate profiling jobs that scan datasets, run data quality checks, and generate profiling summaries — all as part of a workflow.
The following example demonstrates how to configure and run a profiling workflow using Soda Stack within a DataOS environment.
Sample Workflow Manifest¶
name: profile-soda
version: v1
type: workflow
tags:
- profile
description: This job profiles datasets using Soda checks and profiling features.
workflow:
dag:
- name: sample-profile-soda-01
title: Sample Data Profiling with Soda
spec:
stack: soda+python:1.0
soda:
- dataset: dataos://lakehouse:retail/customer
checks:
- row_count between 10 and 1000:
attributes:
category: Accuracy
title: Validate dataset size between 10 and 1000 records
- missing_count(birth_date) = 0:
attributes:
category: Completeness
title: Ensure birth_date field is not missing
# Example for validity check (commented)
# - invalid_percent(phone) < 1%:
# valid format: phone number
# attributes:
# category: Validity
# title: Validate phone number format
- invalid_count(number_cars_owned) = 0:
valid min: 1
valid max: 6
attributes:
category: Validity
title: Ensure number of cars owned is between 1 and 6
- duplicate_count(phone) = 0:
attributes:
category: Uniqueness
title: Ensure no duplicate phone numbers
profile:
columns:
- "*"
engine: minerva
- dataset: dataos://lakehouse:retail/customer_360
checks:
- row_count between 10 and 1000:
attributes:
category: Accuracy
title: Validate record count range for customer_360
- missing_count(birth_date) = 0:
attributes:
category: Completeness
title: Ensure birth_date is populated
- invalid_percent(phone) < 1%:
valid format: phone number
attributes:
category: Validity
title: Check phone number format accuracy
- invalid_count(number_cars_owned) = 0:
valid min: 1
valid max: 6
attributes:
category: Validity
title: Verify number_cars_owned value range
- duplicate_count(phone) = 0:
attributes:
category: Uniqueness
title: Detect duplicate phone entries
profile:
columns:
- "*"
Attribute Details¶
| Attribute | Description |
|---|---|
| name | Defines the workflow name for profiling. |
| type | Specifies that this configuration represents a workflow. |
| tags | Helps categorize and filter profiling workflows. |
| stack | Indicates the runtime stack — soda+python:1.0 combines Soda for checks with Python for orchestration. |
| dataset | References the dataset to be profiled. |
| checks | Defines various data quality checks to be executed during the profiling run. |
| profile | Specifies which columns to include in the profiling process. "*" means all columns are profiled. |
| engine | Indicates the query execution engine (minerva in this example). |
Example Use Cases¶
- Validate completeness of key columns before loading data into analytics tables.
- Detect duplicate or invalid records in CRM or customer datasets.
- Monitor freshness and accuracy of operational data feeds.