Validity checks
Ensuring data validity is crucial for maintaining the integrity and reliability of datasets. In Soda, validity checks can be defined to monitor and enforce data correctness. Below are explanations and sample configurations for various types of validity checks.
1. Check for invalid values based on a set of valid options: This check verifies that values in the specified column belong to an acceptable set of predefined options.
In the following example, the check verifies that the status column contains only 'active', 'inactive', or 'pending'.
- invalid_count(status) = 0:
    valid values: [active, inactive, pending]
    name: Status should have valid values
    attributes:
      category: Validity
      title: Status column should contain only valid values
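The snippet above shows only the check itself. In a SodaCL checks file, each check sits under a checks for header that names the dataset. A minimal sketch, assuming a hypothetical dataset called customers and a hypothetical file name:

```yaml
# checks.yml (hypothetical file name)
# "customers" is an assumed dataset name; replace it with your own table.
checks for customers:
  - invalid_count(status) = 0:
      valid values: [active, inactive, pending]
      name: Status should have valid values
```

The other examples in this section can be placed under the same header in the same way.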
2. Check for invalid values based on a regular expression: This check ensures that a column's values match a specific pattern.
The following check ensures that all entries in the email column match a common email format.
- invalid_count(email) = 0:
    valid regex: '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
    name: Email should have a valid format
    attributes:
      category: Validity
      title: Email column should contain valid email addresses
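For common patterns, SodaCL also provides built-in formats through the valid format key, which can stand in for a hand-written regular expression. A sketch, assuming the email format is supported for the data source in use (built-in formats apply to text columns):

```yaml
- invalid_count(email) = 0:
    # Assumption: the built-in "email" format is available for your data source.
    valid format: email
    name: Email should have a valid format
```

A custom valid regex remains the better choice when the accepted pattern is stricter or looser than the built-in one.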
3. Check for invalid values based on length constraints: This check ensures that a column's values meet specified length requirements.
Here, the check verifies that the username column contains values with a length between 5 and 15 characters.
- invalid_count(username) = 0:
    valid min length: 5
    valid max length: 15
    name: Username should have a valid length
    attributes:
      category: Validity
      title: Username should be between 5 and 15 characters long
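A length rule says nothing about rows where the value is absent. Assuming the default behaviour in which NULL values are treated as missing rather than invalid (worth verifying for the Soda version in use), the validity check can be paired with a missing check so that empty usernames are also caught:

```yaml
# Assumption: NULLs are counted by missing checks, not by invalid_count.
- missing_count(username) = 0:
    name: Username should not be missing
- invalid_count(username) = 0:
    valid min length: 5
    valid max length: 15
    name: Username should have a valid length
```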
4. Check for invalid values based on numerical ranges: This check ensures that numerical columns have values within a specified range.
The following check ensures that the age column contains values between 18 and 65.
- invalid_count(age) = 0:
    valid min: 18
    valid max: 65
    name: Age should be within the valid range
    attributes:
      category: Validity
      title: Age should be between 18 and 65
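A strict = 0 threshold fails the check on a single out-of-range row. Where some tolerance is acceptable, the same validity rule can be combined with SodaCL alert configurations. A sketch with illustrative thresholds (the numbers 1, 10 are assumptions, not recommendations):

```yaml
- invalid_count(age):
    valid min: 18
    valid max: 65
    # Illustrative thresholds: warn on a few invalid rows, fail on many.
    warn: when between 1 and 10
    fail: when > 10
    name: Age should be within the valid range
```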
5. Check for invalid values based on format and length constraints: This check combines a regular expression with length constraints, so a value must satisfy both to count as valid.
The following check ensures that the product_code column matches the specified pattern and has a length of exactly 8 characters.
- invalid_count(product_code) = 0:
    valid regex: '^[A-Z]{3}-\d{4}$'
    valid min length: 8
    valid max length: 8
    name: Product code should have a valid format and length
    attributes:
      category: Validity
      title: Product code should match the pattern and have a length of 8 characters
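When the minimum and maximum lengths are equal, the pair can be expressed more compactly. A sketch, assuming the valid length key is available in the Soda version in use:

```yaml
- invalid_count(product_code) = 0:
    # Assumption: "valid length" requires an exact length, here 8 characters.
    valid length: 8
    name: Product code should be exactly 8 characters long
```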
Incorporating these validity checks into workflows enables proactive monitoring and maintenance of dataset correctness, ensuring compliance with organizational data quality standards.