Completeness checks¶
Ensuring data completeness is vital for maintaining the integrity and reliability of datasets. In Soda, completeness checks can be defined to monitor and enforce the presence of necessary data. Below are explanations and sample configurations for various types of completeness checks.
1. Check for missing values in a column: Ensure that a specific column does not contain any missing values.
In this example, the check verifies that the customer_no
column has no missing entries.
- missing_count(customer_no) = 0:
name: Customer number should not have any missing values
attributes:
category: Completeness
title: Ensure customer number is present in all records
2. Check for missing values with specific missing indicators: Identify missing values that are represented by specific indicators, such as 'NA' or 'n/a'.
The following check ensures that the license_type
column does not contain 'NA' or 'n/a' as values.
- missing_count(license_type) < 1:
missing values: [NA, n/a]
name: License type should not have 'NA' or 'n/a' as values
attributes:
category: Completeness
title: Detect and handle specific missing value indicators in license type
3. Check for missing values matching a regular expression: Detects missing values that match a specific pattern, such as improperly formatted dates.
Here, the check identifies entries in the license_expiry_date
column that match the specified regular expression, indicating potential missing or invalid data.
- missing_count(license_expiry_date) = 0:
missing regex: '(0?[0-9]|1[012])[/](0?[0-9]|[12][0-9]|3[01])[/](0000|(19|20)?\\d\\d)'
name: License expiry date should not have improperly formatted dates
attributes:
category: Completeness
title: Identify and address improperly formatted dates in license expiry date
4. Check for missing values as a percentage of total entries: Ensure that the percentage of missing values in a column does not exceed a specified threshold.
This check verifies that the phone_number
column has no missing values, ensuring complete contact information.
- missing_percent(phone_number) = 0:
name: Phone number should not have any missing values
attributes:
category: Completeness
title: Ensure phone number is present in all records
5. Check for missing values in multiple columns: Ensure that multiple columns do not contain any missing values.
These checks ensure that both first_name
and last_name
columns are complete, maintaining the integrity of personal information.
- missing_count(first_name) = 0:
name: First name should not have any missing values
attributes:
category: Completeness
title: Ensure first name is present in all records
- missing_count(last_name) = 0:
name: Last name should not have any missing values
attributes:
category: Completeness
title: Ensure last name is present in all records
6. Check for missing values with custom sample limits: Limit the number of failed row samples collected for analysis when missing values are detected.
This check ensures that the email_address
column does not have more than 50% missing values and limits the collection of failed row samples to 2 for analysis.
- missing_percent(email_address) < 50:
samples limit: 2
name: Email address should not have more than 50% missing values
attributes:
category: Completeness
title: Monitor and limit missing values in email address
Incorporating these completeness checks into workflows enables proactive monitoring and maintenance of essential data presence in datasets, ensuring compliance with organizational data quality standards.