Schema checks
Maintaining a consistent and accurate schema is essential for data integrity and the reliability of downstream processes. In Soda, schema checks can be defined to monitor and enforce schema consistency. Below are explanations and sample configurations for various types of schema checks.
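For context, each sample configuration below is written in SodaCL and is meant to be listed under a checks for <dataset> header in a checks YAML file, which Soda evaluates against the dataset's actual schema during a scan. The following minimal sketch shows that wrapper, assuming a hypothetical dataset named dim_customer; the examples that follow show only the schema check items themselves.

    checks for dim_customer:
      - schema:
          name: Example schema check
          warn:
            when required column missing: [customer_name]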
1. Check for required columns: Verify that specific columns are present in the dataset, ensuring it meets business or operational requirements. This check helps confirm that essential data is included for processing and analysis.
In this example, the check issues a warning if either the customer_name or premise_code column is missing, indicating potential gaps in data completeness. A variant that combines warn and fail alert levels is sketched after the example.
    - schema:
        name: Ensure essential columns are present
        warn:
          when required column missing: [customer_name, premise_code]
        attributes:
          category: Schema
          title: Essential columns should be present for meaningful analytics
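A single schema check can also combine warn and fail alert levels, for example to warn about a desirable column while failing on a critical one. The sketch below treats customer_name as critical and premise_code as optional; this split is only an assumption made for illustration.

    - schema:
        name: Ensure essential columns are present
        warn:
          when required column missing: [premise_code]
        fail:
          when required column missing: [customer_name]
        attributes:
          category: Schema
          title: Essential columns should be present for meaningful analytics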
2. Check for forbidden columns: This check identifies and flags columns that should not exist in the dataset, typically due to being deprecated, sensitive, or irrelevant. It helps maintain data quality by ensuring that unwanted or insecure columns are not included in the dataset.
The following check fails if a credit_card_number or social_security_number column is found, helping to enforce data governance policies. A sketch that combines this condition with a required-column condition in a single check follows the example.
    - schema:
        name: Detect forbidden columns
        fail:
          when forbidden column present: [credit_card_number, social_security_number]
        attributes:
          category: Schema
          title: Forbidden columns should not be present to ensure data compliance
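Several schema validation keys can also sit under the same alert level, so one check can enforce both required and forbidden columns at once. A minimal sketch reusing the column names from the examples above:

    - schema:
        name: Enforce core schema requirements
        fail:
          when required column missing: [customer_name, premise_code]
          when forbidden column present: [credit_card_number, social_security_number]
        attributes:
          category: Schema
          title: Required columns must exist and forbidden columns must be absent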
3. Check for correct column data types: This check ensures that each column in the dataset adheres to the expected data type. Verifying the correct data types helps prevent errors in data processing, improves data consistency, and ensures accurate analysis and reporting.
The following check issues a warning if site_number is not an integer or premise_code is not a string, ensuring data type consistency. A variant that uses data-source-specific type names is sketched after the example.
    - schema:
        name: Validate column data types
        warn:
          when wrong column type:
            site_number: integer
            premise_code: string
        attributes:
          category: Schema
          title: Columns should have appropriate data types to prevent processing errors
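The expected types are compared against the column types reported by the data source, so the exact type names can differ between warehouses. The sketch below assumes a PostgreSQL-backed data source where text columns are reported as character varying; this is an assumption, so verify the type names your own data source reports before relying on it.

    - schema:
        name: Validate column data types (PostgreSQL type names)
        fail:
          when wrong column type:
            site_number: integer
            premise_code: character varying
        attributes:
          category: Schema
          title: Columns should have appropriate data types to prevent processing errors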
4. Check for column order: This check verifies that columns are arranged in a specified order, which might be required for certain applications or data processing tasks.
The following check issues a warning if customer_name is not the first column or premise_code is not the second, maintaining the desired column sequence.
    - schema:
        name: Verify column order
        warn:
          when wrong column index:
            customer_name: 1
            premise_code: 2
        attributes:
          category: Schema
          title: Columns should appear in the specified order
5. Check for column name patterns: This check flags columns whose names match specific patterns. It is useful for detecting columns that are deprecated, sensitive, or follow a particular naming convention. Applying name patterns helps enforce naming standards and highlights columns that need attention because of how they are named.
The following check issues a warning if any column name ends with _temp or starts with old_, indicating potentially outdated or temporary columns. A sketch that mixes exact column names and wildcard patterns in a single check follows the example.
    - schema:
        name: Detect columns with specific name patterns
        warn:
          when forbidden column present: ['%_temp', 'old_%']
        attributes:
          category: Schema
          title: Columns with deprecated name patterns should not be present
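Exact column names and wildcard patterns can generally be mixed in the same list, letting one check cover both known problem columns and deprecated naming patterns. The following sketch combines names from the earlier examples; the combination itself is illustrative and should be validated against your Soda version.

    - schema:
        name: Detect forbidden columns and deprecated name patterns
        fail:
          when forbidden column present: [credit_card_number, '%_temp', 'old_%']
        attributes:
          category: Schema
          title: Forbidden columns and deprecated name patterns should not be present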
6. Check for column presence with wildcards: This check verifies whether columns matching specific patterns or partial names exist in the dataset, using wildcard characters. It is particularly useful when column names vary but follow a common naming convention.
The following check issues a warning if columns ending with _id or starting with address_ are missing, ensuring that key columns are included.
    - schema:
        name: Ensure presence of columns matching patterns
        warn:
          when required column missing: ['%_id', 'address_%']
        attributes:
          category: Schema
          title: Columns matching specified patterns should be present
Incorporating these schema checks into workflows enables proactive monitoring and maintenance of the structural integrity of datasets, ensuring alignment with organizational data quality standards.
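For reference, the individual conditions above can also be consolidated into a single schema check for one dataset. The following is a rough sketch that assumes a hypothetical dataset named dim_customer and simply carries over the column names, wildcard patterns, and type names from the examples; adapt them to the actual dataset before use.

    checks for dim_customer:
      - schema:
          name: Enforce schema contract
          warn:
            when required column missing: ['%_id', 'address_%']
            when forbidden column present: ['%_temp', 'old_%']
            when wrong column index:
              customer_name: 1
              premise_code: 2
          fail:
            when required column missing: [customer_name, premise_code]
            when forbidden column present: [credit_card_number, social_security_number]
            when wrong column type:
              site_number: integer
              premise_code: string
          attributes:
            category: Schema
            title: Dataset schema must match the agreed contract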