Schema checks¶
Maintaining a consistent and accurate schema is essential for data integrity and the reliability of downstream processes. In Soda, schema checks can be defined to monitor and enforce schema consistency. Below are explanations and sample configurations for various types of schema checks.
Check for required columns: Verify that specific columns are present in the dataset, ensuring it meets business or operational requirements. This check helps confirm that essential data is included for processing and analysis.
In this example, the check issues a warning if either customer_name or premise_code columns are missing, indicating potential issues in data completeness.
- schema:
name: Ensure essential columns are present
warn:
when required column missing: [customer_name, premise_code]
attributes:
category: Schema
title: Essential columns should be present for meaningful analytics
Check for forbidden columns: This check identifies and flags columns that should not exist in the dataset, typically due to being deprecated, sensitive, or irrelevant. It helps maintain data quality by ensuring that unwanted or insecure columns are not included in the dataset.
The following check fails if credit_card_number or social_security_number columns are found, helping to enforce data governance policies.
- schema:
name: Detect forbidden columns
fail:
when forbidden column present: [credit_card_number, social_security_number]
attributes:
category: Schema
title: Forbidden columns should not be present to ensure data compliance
Check for correct column data types: This check ensures that each column in the dataset adheres to the expected data type. Verifying the correct data types helps prevent errors in data processing, improves data consistency, and ensures accurate analysis and reporting.
The following check issues a warning if site_number is not an integer or premise_code is not a string, ensuring data type consistency.
- schema:
name: Validate column data types
warn:
when wrong column type:
site_number: integer
premise_code: string
attributes:
category: Schema
title: Columns should have appropriate data types to prevent processing errors
Check for column order: This check verifies that columns are arranged in a specified order, which might be required for certain applications or data processing tasks.
The following check issues a warning if customer_name is not the first column or premise_code is not the second, maintaining the desired column sequence.
- schema:
name: Verify column order
warn:
when wrong column index:
customer_name: 1
premise_code: 2
attributes:
category: Schema
title: Columns should appear in the specified order
Check for column name patterns: This check enables the identification of columns whose names match specific patterns. It is useful for detecting columns that may be deprecated, sensitive, or follow a particular naming convention. By applying name patterns, one can ensure compliance with naming standards or flag columns that require attention due to their naming structure.
The following check issues a warning if any column names end with _temp or start with old_, indicating potential issues with outdated or temporary columns.
- schema:
name: Detect columns with specific name patterns
warn:
when forbidden column present: ['%_temp', 'old_%']
attributes:
category: Schema
title: Columns with deprecated name patterns should not be present
Check for column presence with wildcards: This check allows to verify if columns matching specific patterns or partial names exist in the dataset, using wildcard characters. It is particularly useful when column names vary but follow a common naming convention.
The following check issues a warning if columns ending with _id or starting with address_ are missing, ensuring that key columns are included.
- schema:
name: Ensure presence of columns matching patterns
warn:
when required column missing: ['%_id', 'address_%']
attributes:
category: Schema
title: Columns matching specified patterns should be present
List of validation keys¶
| Validation key | Values |
|---|---|
| when required column missing | One or more column names in an inline list of comma-separated values, or a nested list. |
| when forbidden column present | One or more column names in an inline list of comma-separated values, or a nested list. |
| when wrong column type | Nested key:value pair to identify column:expected_data_type. |
| when wrong column index | Nested key:value pair to identify column:expected_position_in_dataset_index. |
| when schema changes | any as an inline value.column add as a nested list item.column delete as a nested list item.column index change as a nested list item.column type change as a nested list item. |
Incorporating these schema checks into workflows enables proactive monitoring and maintenance of the structural integrity of datasets, ensuring alignment with organizational data quality standards.