Validity checks¶

Ensuring data validity is crucial for maintaining the integrity and reliability of datasets. In Soda, validity checks can be defined to monitor and enforce data correctness. Below are explanations and sample configurations for various types of validity checks.

Check for invalid values based on a set of valid options: This check verifies that values in the specified column belong to an acceptable set of predefined options.

In the following example, the check verifies that the status column contains only 'active', 'inactive', or 'pending'.

- invalid_count(status) = 0:
    valid values: [active, inactive, pending]
    name: Status should have valid values
    attributes:
      category: Validity
      title: Status column should contain only valid values

Check for invalid values based on a regular expression: Ensure that a column's values match a specific pattern.

The following check ensures that all entries in the email column match the standard email format.

- invalid_count(email) = 0:
    valid regex: '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'
    name: Email should have a valid format
    attributes:
      category: Validity
      title: Email column should contain valid email addresses

Check for invalid values based on length constraints: This check ensure that a column's values meet specified length requirements.

Here, the check verifies that the username column contains values with a length between 5 and 15 character.

- invalid_count(username) = 0:
    valid min length: 5
    valid max length: 15
    name: Username should have a valid length
    attributes:
      category: Validity
      title: Username should be between 5 and 15 characters long

Check for invalid values based on numerical ranges: This check ensures that numerical columns have values within a specified range.

The following check ensures that the age column contains values between 18 and 65.

- invalid_count(age) = 0:
    valid min: 18
    valid max: 65
    name: Age should be within the valid range
    attributes:
      category: Validity
      title: Age should be between 18 and 65

Check for invalid values based on format and length constraints: It combines length constraint conditions for comprehensive checks.

The following check ensures that the product_code column matches the specified pattern and has a length of 8 characters.

- invalid_count(product_code) = 0:
    valid regex: '^[A-Z]{3}-\\d{4}$'
    valid min length: 8
    valid max length: 8
    name: Product code should have a valid format and length
    attributes:
      category: Validity
      title: Product code should match the pattern and have a length of 8 characters

Some other Examples

# Check for valid values
- invalid_count(customer_id) = 0:
    invalid regex: ^(?!\d{8}$).+$
    attributes:
      category: Validity
- invalid_count(email_address) = 0:
    valid format: email
    attributes:
      category: Validity
- invalid_percent(english_education) = 0:
    valid length: 100
    attributes:
      category: Validity
- invalid_percent(total_children) <= 2:
    valid max: 6
    attributes:
      category: Validity
- invalid_percent(marital_status) = 0:
    valid max length: 10
    attributes:
      category: Validity
- invalid_count(number_cars_owned) = 0:
    valid min: 1
    attributes:
      category: Validity
- invalid_percent(marital_status) = 0:
    valid min length: 1
    attributes:
      category: Validity
- invalid_count(house_owner_flag) = 0:
    valid values: [0, 1]
    attributes:
      category: Validity



# a check with a fixed threshold
- invalid_count(email_address) = 0:
    valid format: email
    attributes:
      category: Validity
      title: Fixed threshold
# a check with a relative threshold
- invalid_percent(english_education) < 3%:
      valid max length: 100
    attributes:
      category: Validity
      title: Relative threshold

Relative threshold

When Soda scans a column in your dataset, it automatically separates all values in the column into one of three categories:

missing
invalid
valid

Soda then performs two calculations. The sum of the count for all categories in a column is always equal to the total row count for the dataset. missing_count(column_name) + invalid_count(column_name) + valid_count(column_name) = row_count Similarly, a calculation that uses percentage always adds up to a total of 100 for the column. missing_percent(name) + invalid_percent(name) + valid_percent(name) = 100 These calculations enable you to write checks that use relative thresholds.

In the example above, the invalid values of the english_education column must be less than three percent of the total row count, or the check fails.

Percentage thresholds are between 0 and 100, not between 0 and 1.

List of validity metrics¶

Metric	Column config keys	Description	Supported data types
`invalid_count`	`invalid format`, `invalid values`, `valid format`, `valid length`, `valid max`, `valid max length`, `valid min`, `valid min length`, `valid values`	The number of rows in a column that contain values that are not valid.	number, text, time
(same as above)	`invalid regex`, `valid regex`	The number of rows in a column that contain values that are not valid.	text
`invalid_percent`	`invalid format`, `invalid values`, `valid format`, `valid length`, `valid max`, `valid max length`, `valid min`, `valid min length`, `valid values`	The percentage of rows in a column, relative to the total row count, that contain values that are not valid.	number, text, time
(same as above)	`invalid regex`, `valid regex`	The percentage of rows in a column, relative to the total row count, that contain values that are not valid.	text

List of configuration keys¶

The column configuration key:value pair defines what SodaCL ought to consider as valid values.

Column config key	Description	Values
invalid format	Defines the format of a value that Soda ought to register as invalid. Only works with columns that contain data type TEXT.	See List of valid formats.
invalid regex	Specifies a regular expression to define your own custom invalid values.	regex, no forward slash delimiters
invalid values	Specifies the values that Soda ought to consider invalid.	—
valid format	Defines the format of a value that Soda ought to register as valid. Only works with columns that contain data type TEXT.	See List of valid formats.
valid length	Specifies a valid length for a string. Works with columns that contain data type TEXT, and also with INTEGER on most databases, where implicit casting from string to integer is supported. Note: PostgreSQL does not support this behavior, as it does not implicitly cast strings to integers for this use case.	integer
valid max	Specifies a maximum numerical value for valid values.	integer or float
valid max length	Specifies a valid maximum length for a string. Only works with columns that contain data type TEXT.	integer
valid min	Specifies a minimum numerical value for valid values.	integer or float
valid min length	Specifies a valid minimum length for a string. Only works with columns that contain data type TEXT.	integer
valid regex	Specifies a regular expression to define your own custom valid values.	regex, no forward slash delimiters
valid values	Specifies the values that Soda ought to consider valid.	values in a list

List of valid formats¶

Though table below lists valid formats, the same apply for invalid formats.
Valid formats apply only to columns using data type TEXT, not DATE or NUMBER.

Format	Description / Example
credit card number	Four four-digit numbers separated by spaces. Four four-digit numbers separated by dashes. Sixteen-digit number. Four five-digit numbers separated by spaces.
date eu	Validates date only, not time. `dd/mm/yyyy`
date inverse	Validates date only, not time. `yyyy/mm/dd`
date iso 8601	Validates date and/or time according to ISO 8601 format. Example: `2021-04-28T09:00:00+02:00`
date us	Validates date only, not time. `mm/dd/yyyy`
decimal	Number uses a `,` or `.` as a decimal indicator.
decimal comma	Number uses `,` as a decimal indicator.
decimal point	Number uses `.` as a decimal indicator.
email	`name@domain.extension`
integer	Number is whole.
ip address / ipv4 address	Four whole numbers separated by `.`
ipv6 address	Eight values separated by `:`
money	A money pattern with currency symbol + decimal point or comma + currency abbreviation.
money comma	A money pattern with currency symbol + decimal comma + currency abbreviation.
money point	A money pattern with currency symbol + decimal point + currency abbreviation.
negative decimal	Negative number uses a `,` or `.` as a decimal indicator.
negative decimal comma	Negative number uses `,` as decimal indicator.
negative decimal point	Negative number uses `.` as decimal indicator.
negative integer	Number is negative and whole.
negative percentage	Negative number is a percentage.
negative percentage comma	Negative number is a percentage with a `,` decimal indicator.
negative percentage point	Negative number is a percentage with a `.` decimal indicator.
percentage	Number is a percentage.
percentage comma	Number is a percentage with a `,` decimal indicator.
percentage point	Number is a percentage with a `.` decimal indicator.
phone number	Examples: `+12 123 123 1234` `123 123 1234` `+1 123-123-1234` `+12 123-123-1234` `+12 123 123-1234` `555-2368` `555-ABCD`
positive decimal	Positive number uses a `,` or `.` as a decimal indicator.
positive decimal comma	Positive number uses `,` as decimal indicator.
positive decimal point	Positive number uses `.` as decimal indicator.
positive integer	Number is positive and whole.
positive percentage	Positive number is a percentage.
positive percentage comma	Positive number is a percentage with a `,` decimal indicator.
positive percentage point	Positive number is a percentage with a `.` decimal indicator.
time 12h	Validates against the 12-hour clock. `hh:mm:ss`
time 12h nosec	Validates against the 12-hour clock. `hh:mm`
time 24h	Validates against the 24-hour clock. `hh:mm:ss`
time 24h nosec	Validates against the 24-hour clock. `hh:mm`
timestamp 12h	Validates against the 12-hour clock. `hh:mm:ss`
timestamp 24h	Validates against the 24-hour clock. `hh:mm:ss`
uuid	Universally unique identifier.

Incorporating these validity checks into workflows enables proactive monitoring and maintenance of dataset correctness, ensuring compliance with organizational data quality standards.