Validity checks¶
Ensuring data validity is crucial for maintaining the integrity and reliability of datasets. In Soda, validity checks can be defined to monitor and enforce data correctness. Below are explanations and sample configurations for various types of validity checks.
Check for invalid values based on a set of valid options: This check verifies that values in the specified column belong to an acceptable set of predefined options.
In the following example, the check verifies that the status column contains only 'active', 'inactive', or 'pending'.
- invalid_count(status) = 0:
valid values: [active, inactive, pending]
name: Status should have valid values
attributes:
category: Validity
title: Status column should contain only valid values
Check for invalid values based on a regular expression: Ensure that a column's values match a specific pattern.
The following check ensures that all entries in the email column match the standard email format.
- invalid_count(email) = 0:
valid regex: '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'
name: Email should have a valid format
attributes:
category: Validity
title: Email column should contain valid email addresses
Check for invalid values based on length constraints: This check ensure that a column's values meet specified length requirements.
Here, the check verifies that the username column contains values with a length between 5 and 15 character.
- invalid_count(username) = 0:
valid min length: 5
valid max length: 15
name: Username should have a valid length
attributes:
category: Validity
title: Username should be between 5 and 15 characters long
Check for invalid values based on numerical ranges: This check ensures that numerical columns have values within a specified range.
The following check ensures that the age column contains values between 18 and 65.
- invalid_count(age) = 0:
valid min: 18
valid max: 65
name: Age should be within the valid range
attributes:
category: Validity
title: Age should be between 18 and 65
Check for invalid values based on format and length constraints: It combines length constraint conditions for comprehensive checks.
The following check ensures that the product_code column matches the specified pattern and has a length of 8 characters.
- invalid_count(product_code) = 0:
valid regex: '^[A-Z]{3}-\\d{4}$'
valid min length: 8
valid max length: 8
name: Product code should have a valid format and length
attributes:
category: Validity
title: Product code should match the pattern and have a length of 8 characters
Some other Examples
# Check for valid values
- invalid_count(customer_id) = 0:
invalid regex: ^(?!\d{8}$).+$
attributes:
category: Validity
- invalid_count(email_address) = 0:
valid format: email
attributes:
category: Validity
- invalid_percent(english_education) = 0:
valid length: 100
attributes:
category: Validity
- invalid_percent(total_children) <= 2:
valid max: 6
attributes:
category: Validity
- invalid_percent(marital_status) = 0:
valid max length: 10
attributes:
category: Validity
- invalid_count(number_cars_owned) = 0:
valid min: 1
attributes:
category: Validity
- invalid_percent(marital_status) = 0:
valid min length: 1
attributes:
category: Validity
- invalid_count(house_owner_flag) = 0:
valid values: [0, 1]
attributes:
category: Validity
# a check with a fixed threshold
- invalid_count(email_address) = 0:
valid format: email
attributes:
category: Validity
title: Fixed threshold
# a check with a relative threshold
- invalid_percent(english_education) < 3%:
valid max length: 100
attributes:
category: Validity
title: Relative threshold
Relative threshold
When Soda scans a column in your dataset, it automatically separates all values in the column into one of three categories:
- missing
- invalid
- valid
Soda then performs two calculations. The sum of the count for all categories in a column is always equal to the total row count for the dataset.
missing_count(column_name) + invalid_count(column_name) + valid_count(column_name) = row_count
Similarly, a calculation that uses percentage always adds up to a total of 100 for the column.
missing_percent(name) + invalid_percent(name) + valid_percent(name) = 100
These calculations enable you to write checks that use relative thresholds.
In the example above, the invalid values of the english_education column must be less than three percent of the total row count, or the check fails.
Percentage thresholds are between 0 and 100, not between 0 and 1.
List of validity metrics¶
| Metric | Column config keys | Description | Supported data types |
|---|---|---|---|
invalid_count |
invalid format, invalid values, valid format, valid length, valid max, valid max length, valid min, valid min length, valid values |
The number of rows in a column that contain values that are not valid. | number, text, time |
| (same as above) | invalid regex, valid regex |
The number of rows in a column that contain values that are not valid. | text |
invalid_percent |
invalid format, invalid values, valid format, valid length, valid max, valid max length, valid min, valid min length, valid values |
The percentage of rows in a column, relative to the total row count, that contain values that are not valid. | number, text, time |
| (same as above) | invalid regex, valid regex |
The percentage of rows in a column, relative to the total row count, that contain values that are not valid. | text |
List of configuration keys¶
The column configuration key:value pair defines what SodaCL ought to consider as valid values.
| Column config key | Description | Values |
|---|---|---|
| invalid format | Defines the format of a value that Soda ought to register as invalid. Only works with columns that contain data type TEXT. | See List of valid formats. |
| invalid regex | Specifies a regular expression to define your own custom invalid values. | regex, no forward slash delimiters |
| invalid values | Specifies the values that Soda ought to consider invalid. | — |
| valid format | Defines the format of a value that Soda ought to register as valid. Only works with columns that contain data type TEXT. | See List of valid formats. |
| valid length | Specifies a valid length for a string. Works with columns that contain data type TEXT, and also with INTEGER on most databases, where implicit casting from string to integer is supported. Note: PostgreSQL does not support this behavior, as it does not implicitly cast strings to integers for this use case. | integer |
| valid max | Specifies a maximum numerical value for valid values. | integer or float |
| valid max length | Specifies a valid maximum length for a string. Only works with columns that contain data type TEXT. | integer |
| valid min | Specifies a minimum numerical value for valid values. | integer or float |
| valid min length | Specifies a valid minimum length for a string. Only works with columns that contain data type TEXT. | integer |
| valid regex | Specifies a regular expression to define your own custom valid values. | regex, no forward slash delimiters |
| valid values | Specifies the values that Soda ought to consider valid. | values in a list |
List of valid formats¶
- Though table below lists valid formats, the same apply for invalid formats.
- Valid formats apply only to columns using data type TEXT, not DATE or NUMBER.
| Format | Description / Example |
|---|---|
| credit card number | Four four-digit numbers separated by spaces. Four four-digit numbers separated by dashes. Sixteen-digit number. Four five-digit numbers separated by spaces. |
| date eu | Validates date only, not time. dd/mm/yyyy |
| date inverse | Validates date only, not time. yyyy/mm/dd |
| date iso 8601 | Validates date and/or time according to ISO 8601 format. Example: 2021-04-28T09:00:00+02:00 |
| date us | Validates date only, not time. mm/dd/yyyy |
| decimal | Number uses a , or . as a decimal indicator. |
| decimal comma | Number uses , as a decimal indicator. |
| decimal point | Number uses . as a decimal indicator. |
name@domain.extension |
|
| integer | Number is whole. |
| ip address / ipv4 address | Four whole numbers separated by . |
| ipv6 address | Eight values separated by : |
| money | A money pattern with currency symbol + decimal point or comma + currency abbreviation. |
| money comma | A money pattern with currency symbol + decimal comma + currency abbreviation. |
| money point | A money pattern with currency symbol + decimal point + currency abbreviation. |
| negative decimal | Negative number uses a , or . as a decimal indicator. |
| negative decimal comma | Negative number uses , as decimal indicator. |
| negative decimal point | Negative number uses . as decimal indicator. |
| negative integer | Number is negative and whole. |
| negative percentage | Negative number is a percentage. |
| negative percentage comma | Negative number is a percentage with a , decimal indicator. |
| negative percentage point | Negative number is a percentage with a . decimal indicator. |
| percentage | Number is a percentage. |
| percentage comma | Number is a percentage with a , decimal indicator. |
| percentage point | Number is a percentage with a . decimal indicator. |
| phone number | Examples: +12 123 123 1234 123 123 1234 +1 123-123-1234 +12 123-123-1234 +12 123 123-1234 555-2368 555-ABCD |
| positive decimal | Positive number uses a , or . as a decimal indicator. |
| positive decimal comma | Positive number uses , as decimal indicator. |
| positive decimal point | Positive number uses . as decimal indicator. |
| positive integer | Number is positive and whole. |
| positive percentage | Positive number is a percentage. |
| positive percentage comma | Positive number is a percentage with a , decimal indicator. |
| positive percentage point | Positive number is a percentage with a . decimal indicator. |
| time 12h | Validates against the 12-hour clock. hh:mm:ss |
| time 12h nosec | Validates against the 12-hour clock. hh:mm |
| time 24h | Validates against the 24-hour clock. hh:mm:ss |
| time 24h nosec | Validates against the 24-hour clock. hh:mm |
| timestamp 12h | Validates against the 12-hour clock. hh:mm:ss |
| timestamp 24h | Validates against the 24-hour clock. hh:mm:ss |
| uuid | Universally unique identifier. |
Incorporating these validity checks into workflows enables proactive monitoring and maintenance of dataset correctness, ensuring compliance with organizational data quality standards.