Skip to content

Recipes of Scanner Stack

This section provides various recipes for executing the Scanner in different scenarios, offering detailed guidance on how to adapt the Scanner's functionality to specific use cases.

Context and Problem Statement

In modern data ecosystems, enterprises require efficient mechanisms to scan, catalog, and manage metadata from various sources. Consider a scenario in which an organization is using Snowflake as a Data source for data storage and analytics. To streamline metadata ingestion, a Scanner Workflow is employed, enabling the organization to configure rules that specify which databases, schemas, and tables should be included or excluded. These configurations leverage regular expressions (regex) to define filter patterns that precisely control metadata ingestion.

Following examples provides guidance on configuring metadata filters in the Scanner Workflow manifest file. It explains how to apply regex-based filtering for databases, schemas, and tables, ensuring efficient metadata scanning.

Example manifest configuration

Let the organization have the following structure:


└─── TEST_SPORTS_RETAIL # DB Name
│   │
│   └─── PUBLIC # Schema Name
│   │   │
│   │   └─── CUSTOMER # Table Name
│   │   │
│   │   └─── CUSTOMER_ADDRESS # Table Name
│   │   │
│   │   └─── CUSTOMER_DEMOGRAPHICS # Table Name
│   │   │
│   │   └─── CALL_CENTER # Table Name
│   │
│   └─── INFORMATION # Schema Name
│       │
│       └─── ORDERS # Table Name
│       │
│       └─── REGION # Table Name
│       │
│       └─── CUSTOMER # Table Name

└─── RETAIL_DB # DB Name
    
    └─── PUBLIC # Schema Name
        
        └─── CUSTOMER_ONLINE # Table Name
        
        └─── CUSTOMER_OFFLINE # Table Name
        
        └─── LOYAL_CUSTOMER # Table Name

So the complete configuration file looks like :

name: scanner2-snowflake-depot  # Name of the scanner workflow
version: v1  # Version of the workflow
type: workflow  # Specifies that this is a workflow
tags:
  - scanner  # Tagging as a scanner workflow
  - snowflake  # Tagging for Snowflake data source
description: The workflow scans Snowflake data source through depot scan  
# Description of the workflow

workflow:
  dag:
    - name: scanner2-snowflake-job  # Name of the scanner job
      description: The job scans schema datasets referred to by Snowflake Depot and registers in Metis2
      tags:
          - scanner2  # Tags related to the scanner job
      spec:
        stack: scanner:2.0  # Specifies the scanner stack version
        compute: runnable-default  # Defines the compute resource to be used for processing
        runAsUser: metis  # Specifies the user with execution privileges
        stackSpec:
          depot: snowflake03  # Name of the depot providing the source metadata
          sourceConfig:
            config:
              type: DatabaseMetadata  # Specifies that the source is a database metadata scan

              databaseFilterPattern:  # Filtering databases for metadata ingestion
                includes:
                  - ^TEST_SPORTS_RETAIL$  # Regex for including TEST_SPORTS_RETAIL database
                  - RETAIL_DB  
  # When User mention name in the filter pattern, Scanner workflow will automatically convert the name to the filter pattern considering it as a prefix so the created regex will be `^RETAIL_DB.*`

              schemaFilterPattern:  # Filtering schemas for metadata ingestion
                excludes:
                  - INFORMATION # Excluding INFORMATION schema 

              tableFilterPattern:  # Filtering tables for metadata ingestion
                includes:
                    - .*CUSTOMER.*    # Including tables which contains CUSTOMER in their name

              markDeletedTables: false  # Do not mark deleted tables as soft-deleted
              includeTags: true  # Include metadata tags in the scan
              includeViews: true  # Include views in the metadata scan

Info

Note that the filter supports regex as includes OR excludes. When a user specify a pattern in the includes section, the Scanner will evaluate which entities match the pattern and include them in the metadata scanning and entities that do not match the pattern will be automatically excluded. The same principle applies to the excludes section, where the Scanner excludes entities that match the specified pattern, while automatically including the rest.

By combining all three filters, user can achieve a hierarchical filtering approach that successively narrows down the scope of the metadata scanning in the Scanner Workflow. This ensures that only the desired databases, schemas, and tables are included in the Workflow based on user specified criteria. If the user do not explicitly specify any of these filters then all available entities are scanned.

Example Scenarios

Let just consider a scenario where user aim to ingest metadata from a Snowflake instance that comprises multiple databases, as shown below. These databases contain various schemas and tables.


└─── SNOWFLAKE # DB Name

└─── SNOWFLAKE_SAMPLE_DATA # DB Name

└─── TEST_SNOWFLAKE_DB # DB Name

└─── TEST_HEALTHCARE # DB Name

└─── TEST_SPORTS_RETAIL # DB Name

└─── TEST_DUMMY_DB # DB Name

└─── RETAIL_DB # DB Name

Database Filters

Use databaseFilterPattern to determine which databases to include/exclude during metadata ingestion.

Example 1:

In this particular example, the objective is to ingest metadata of all databases that include the term "SNOWFLAKE" in their names. To achieve this, the filter pattern .*SNOWFLAKE.* would apply in the includes property. Consequently, this filter pattern will ensure the ingestion of databases such as  SNOWFLAKESNOWFLAKE_SAMPLE_DATA and TEST_SNOWFLAKE_DB.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              databaseFilterPattern:
                includes:
                  - .*SNOWFLAKE.*

Example 2:

If user want to scan only databases that do not start with SNOWFLAKE  then the filter pattern regex applied would be ^SNOWFLAKE.* and they will exclude SNOWFLAKESNOWFLAKE_SAMPLE_DATA.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              databaseFilterPattern:
                excludes:
                  - ^SNOWFLAKE.*

Example 3:

In order to exclusively scan the only database with name RETAIL_DB, the filter pattern would be ^RETAIL_DB$.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              databaseFilterPattern:
                includes:
                  - ^RETAIL_DB$

Example 4:

When a user want to ingest metadata of all databases for which the name starts with TEST OR ends with DB , then the filter pattern applied would be ^TEST & DB$ in the includes property. The scanning process will include databases such as  TEST_SNOWFLAKEDB TEST_HEALTHCARE TEST_SPORTS_RETAIL TEST_DUMMY_DBDUMMY_DB.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              databaseFilterPattern:
                includes: 
                  - ^SNOWFLAKE.*
                  - .*DB$

Schema Filters

Schema filter patterns determine which schemas to include/exclude during metadata ingestion. These are just examples of schemas that could exist in a Snowflake instance, taken for demonstration purpose. The actual schemas and table names may vary based on your specific use case and requirements.

  
  └─── SNOWFLAKE # DB Name
  │   │
  │   └─── PUBLIC # Schema Name
  │   │
  │   └─── TPCH_SF1 # Schema Name
  
  │   └─── TPCH_SF2 # Schema Name
  │   │
  │   └─── INFORMATION_SCHEMA # Schema Name
  
  └─── SNOWFLAKE_SAMPLE_DATA # DB Name
  │   │
  │   └─── PUBLIC # Schema Name
  │   │
  │   └─── INFORMATION_SCHEMA # Schema Name
  │   │
  │   └─── TPCH_SF1 # Schema Name
  │   │
  │   └─── TPCH_SF10 # Schema Name
  │   │
  │   └─── TPCH_SF100 # Schema Name

Example 1:

Let's consider a scenario where user want to scan the metadata from the "public" schema present in all databases. The filter pattern would be public. This will include the schema public present in databases SNOWFLAKE  SNOWFLAKE_SAMPLE_DATA.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              schemaFilterPattern:
                includes: 
                  - public

Info

When user mention name in the filter pattern, Scanner Workflow will automatically convert the name to the filter pattern considering it as a prefix so the created regex will be ^public.*

Example 2:

When user wish to exclude the schema TPCH_SF100 from metadata scanning. As this schema is present only in one database,  excludes property can be used with the pattern ^TPCH_SF100$ .

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              schemaFilterPattern:
                excludes:
                  - ^TPCH_SF100$

Example 3:

Suppose user intend to include all schemas that begin with the prefix "TPCH" in all databases. In this case, the appropriate regular expression (regex) to achieve this would be ^TPCH.*. This regex pattern will result in the metadata scan of schemas such as TPCH_SF1 and TPCH_SF2 from the SNOWFLAKE database, as well as schemas like TPCH_SF1TPCH_SF10, and TPCH_SF100 from the SNOWFLAKE_SAMPLE_DATA database.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              schemaFilterPattern:
                includes:
                  - ^TPCH*

Example 4:

Now if a user want to include only the schema TPCH_SF1 present in all the databases but not TPCH_SF100 or TPCH_SF1000.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              schemaFilterPattern:
                includes:
                  - ^TPCH_SF1$     # This will include all the schemas with exact name from all the databases

Table Filter pattern

Use tableFilterPattern to determine which tables to include/exclude during metadata ingestion.


└─── TEST_SPORTS_RETAIL # DB Name
│   │
│   └─── PUBLIC # Schema Name
│   │   │
│   │   └─── CUSTOMER # Table Name
│   │   │
│   │   └─── CUSTOMER_ADDRESS # Table Name
│   │   │
│   │   └─── CUSTOMER_DEMOGRAPHICS # Table Name
│   │   │
│   │   └─── CALL_CENTER # Table Name
│   │
│   └─── INFORMATION # Schema Name
│       │
│       └─── ORDERS # Table Name
│       │
│       └─── REGION # Table Name
│       │
│       └─── CUSTOMER # Table Name

└─── RETAIL_DB # DB Name
    
    └─── PUBLIC # Schema Name
        
        └─── CUSTOMER_ONLINE # Table Name
        
        └─── CUSTOMER_OFFLINE # Table Name
        
        └─── LOYAL_CUSTOMER # Table Name

Example 1:

In this example, the objective is to scan only the table with the name CUSTOMER from all the schemas of TEST_SPORTS_RETAIL database. To achieve this, the filter pattern ^CUSTOMER$ can be used. By applying this pattern during the metadata scanning process, only the tables named CUSTOMER from the TEST_SPORTS_RETAIL.PUBLIC and TEST_SPORTS_RETAIL.INFORMATION schemas will be included.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              databaseFilterPattern:
                includes:
                  - TEST_SPORTS_RETAIL   # The database filter include only the TEST_SPORTS_RETAIL database
              tableFilterPattern:
                includes:
                  - ^CUSTOMER$    # With this filter pattern only the CUSTOMER tables present in the differnt schemas of the TEST_SPORTS_RETAIL are ingested

Example 2:

In this example scenario, user want to scan all the tables having CUSTOMER in their name, filter pattern .*CUSTOMER.* will be used. This will result in scanning of all the tables such as CUSTOMERCUSTOMER_ONLINECUSTOMER_OFFLINELOYAL_CUSTOMER, etc. from the entire database.

name: scanner2-snowflake-depot
version: v1
type: workflow
tags:
  - scanner
  - snowflake
description: The workflow scans Snowflake data source through depot scan
workflow:
  dag:
    - name: scanner2-snowflake-job
      description: The job scans schema datasets referred to by Oracle Depot and registers in Metis2
      tags:
          - scanner2
      spec:
        stack: scanner:2.0               
        compute: runnable-default        
        runAsUser: metis                 
        stackSpec:
          depot: dataos://snowflake03
          sourceConfig:
            config:
              tableFilterPattern:
                includes:
                  - .*CUSTOMER.*
Was this page helpful?