How to configure a Scanner?¶

Manifest file attributes in Scanner Workflows¶

The manifest file within a Scanner stack includes attributes designed to facilitate metadata extraction. These attributes specify source configurations, apply filtering criteria, and manage metadata control. Key functionalities include:

Source configuration: Defines the connection details and parameters for the data source.
Filtering: Enables metadata extraction to be limited to specific databases, schemas, tables, or topics.
Metadata control: Manages the extraction process by identifying and flagging deleted tables and topics.

Syntax for Manifest configuration file:

name: ${{scanner2-snowflake-depot}}
version: v1
type: workflow
tags:
  - ${{tag1}}
  - ${{tag2}}
description: ${{The description of the scanner}}
workflow:
  dag:
    - name: ${{scanner2-snowflake-job}}
      description: ${{The job description}}
      tags:
        - ${{tag}}
      spec:
        stack: scanner:2.0
        compute: runnable-default
        runAsUser: metis
        stackSpec:
          depot: dataos://${{path of the depot}}  # UDL (Uniform Data Link)
          sourceConfig:
            config:
              type: DatabaseMetadata
              databaseFilterPattern:
                includes:
                  - ${{regex}}
                excludes:
                  - ${{regex}}
              schemaFilterPattern:
                includes:
                  - ${{regex}}
                excludes:
                  - ${{regex}}
              tableFilterPattern:
                includes:
                  - ${{regex}}
                excludes:
                  - ${{regex}}
              # the flags below belong to config, not to tableFilterPattern
              markDeletedTables: false
              includeTags: true
              includeViews: true
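
As an illustration of the template above, here is a minimal filled-in sketch. The Workflow and job names, tags, descriptions, and the schema pattern are hypothetical placeholders chosen for this example; the attribute names and the `dataos://icebase` address come from this page.

```yaml
# Hypothetical filled-in manifest; names, tags, and patterns are illustrative only.
name: scanner2-icebase-depot
version: v1
type: workflow
tags:
  - scanner
  - icebase
description: Scans metadata of the datasets referred to by the icebase Depot
workflow:
  dag:
    - name: scanner2-icebase-job
      description: Scanner job for the icebase Depot
      spec:
        stack: scanner:2.0
        compute: runnable-default
        runAsUser: metis
        stackSpec:
          depot: dataos://icebase
          sourceConfig:
            config:
              type: DatabaseMetadata
              schemaFilterPattern:
                includes:
                  - ^retail$        # hypothetical schema name
              markDeletedTables: false
              includeViews: true
```

Filter patterns that are not needed are simply left out in this sketch, on the assumption that an omitted filter means no restriction at that level.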

Scanner Configuration Attributes details¶

The Scanner Workflow attributes given below provide further details on their roles in metadata extraction:

`schedule`¶

Description: Scanner Workflows either run once or are scheduled to run at a specific cadence.

Data Type	Requirement	Default Value	Possible Value
mapping	Optional	None	None

Example Usage:

workflow: 
  title: scheduled Scanner Workflow
  schedule:  
    cron: '*/2 * * * *'  # every 2 minutes; fields: minute, hour, day of month, month, day of week
    concurrencyPolicy: Allow  # other options: Forbid, Replace
    endOn: 2024-11-01T23:40:45Z
    timezone: Asia/Kolkata

To schedule a Workflow, add the schedule property, with a cron expression, to the workflow section.
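
The sketch below shows how the five cron fields read for a different cadence. It assumes the standard five-field cron syntax described in the comment of the example above; the title is a hypothetical placeholder, and the remaining values are reused from that example.

```yaml
workflow:
  title: daily Scanner Workflow     # hypothetical title
  schedule:
    # minute=0, hour=2, any day of month, any month, any day of week
    # i.e. once a day at 02:00 in the configured timezone
    cron: '0 2 * * *'
    concurrencyPolicy: Allow
    endOn: 2024-11-01T23:40:45Z
    timezone: Asia/Kolkata
```

Other common expressions follow the same pattern, for example `'0 * * * *'` for the start of every hour or `'0 2 * * 1'` for 02:00 every Monday.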

`spec`¶

Description: Specification of the Scanner Workflow job, such as the stack, compute, and stackSpec.

Data Type	Requirement	Default Value	Possible Value
mapping	Mandatory	None	None

Example Usage:

spec:
  stack: scanner:2.0

`stack`¶

Description: A Stack is a Resource that serves as a secondary extension point, enhancing the capabilities of a Workflow Resource by introducing additional programming paradigms.

Data Type	Requirement	Default Value	Possible Value
string	Mandatory	None	flare/toolbox/scanner/alpha

Additional Details: You can also specify a version of the stack. If no version is explicitly specified, the system automatically selects the latest version.

Example Usage:

stack: scanner:2.0

`compute`¶

Description: A Compute resource provides processing power for the job.

Data Type	Requirement	Default Value	Possible Value
string	Mandatory	None	runnable-default or any other custom compute created by the user

Example Usage:

compute: runnable-default

`runAsUser`¶

Description: When the "runAsUser" field is configured with the UserID of the use-case assignee, it grants the authority to perform operations on behalf of that user.

Data Type	Requirement	Default Value	Possible Value
string	Mandatory	None	UserID of the Use Case Assignee

Additional information: The default value here is metis. However, the 'Run as a Scanner user' use case must be granted to run the Scanner Workflow.

Example Usage:

runAsUser: metis

`depot`¶

Description: Name or address of the Depot. The Depot provides a reference to the source from which metadata is read/ingested.

Data Type	Requirement	Default Value	Possible Value
string	Mandatory	None	icebase, redshift_depot, dataos://icebase, etc.

Additional information: The Scanner job scans all the datasets referred to by the Depot. The Scanner Workflow automatically creates a source (with the same name as the Depot) where the scanned metadata is saved within Metastore.

Example Usage:

stackSpec:    
  depot: dataos://icebase

`type`¶

Description: Type of the dataset to be scanned. This depends on the underlying data source.

Data Type	Requirement	Default Value	Possible Value
string	Mandatory	None	snowflake, bigquery, redshift, etc.

Example Usage:

stackSpec:  
  type: snowflake

`sourceConnection`¶

Description: Source connection configuration properties required to connect with the underlying data source to be scanned.

Data Type	Requirement	Default Value	Possible Values
mapping	Optional	None	None

`type`¶

Description: Data source type in the sourceConnection section.

Data Type	Requirement	Default Value	Possible Values
string	Optional	None	Redshift, Snowflake, Bigquery, etc.

Example Usage:

sourceConnection: 
  config: 
    type: Snowflake

`sourceConfig`¶

Description: Source configuration properties required to control the metadata scan.

Data Type	Requirement	Default Value	Possible Values
mapping	Mandatory	None	None

`type`¶

Description: Specifies the source config type, that is, the type of metadata to be scanned.

Data Type	Requirement	Default Value	Possible Values
string	Optional	None	DatabaseMetadata, DashboardMetadata

Additional information: Further properties under the 'sourceConfig' section customize and control metadata scanning.

Example Usage:

sourceConfig: 
  config: 
    type: DatabaseMetadata

`databaseFilterPattern`¶

Description: To determine which databases to include/exclude during metadata ingestion.

Data Type	Requirement	Default Value	Possible Values
mapping	Mandatory	None	None

Additional information: Applicable in case of databases/warehouses

includes OR excludes

includes: Add an array of regular expressions to this property in the YAML. The Scanner Workflow will include any databases whose names match one or more of the provided regular expressions. All other databases will be excluded.
excludes: Add an array of regular expressions to this property in the YAML. The Scanner Workflow will exclude any databases whose names match one or more of the provided regular expressions. All other databases will be included.

Data Type	Requirement	Default Value	Possible Values
string	Optional	None	Exact values (e.g., 'employee'), regular expressions (e.g., '^sales.*')

Example Usage:

sourceConfig:  
  config:  
    type: DatabaseMetadata
    databaseFilterPattern:
      includes: 
        - TMDCSNOWFLAKEDB
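
For comparison, a sketch of the opposite case, using excludes instead of includes. The database names below are hypothetical; every database that does not match one of the patterns would be scanned.

```yaml
sourceConfig:
  config:
    type: DatabaseMetadata
    databaseFilterPattern:
      excludes:
        - ^SANDBOX_DB$     # hypothetical database name, anchored for an exact match
        - ^TEST_.*         # hypothetical prefix: any database whose name starts with TEST_
```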

`schemaFilterPattern`¶

Description: To determine which schemas to include/exclude during metadata ingestion.

Data Type	Requirement	Default Value	Possible Values
mapping	Mandatory	None	None

Additional information: Applicable in case of databases/warehouses

includes OR excludes

includes: Add an array of regular expressions to this property in the YAML. The Scanner Workflow will include any schemas whose names match one or more of the provided regular expressions. All other schemas will be excluded.
excludes: Add an array of regular expressions to this property in the YAML. The Scanner Workflow will exclude any schemas whose names match one or more of the provided regular expressions. All other schemas will be included.

Data Type	Requirement	Default Value	Possible Values
string	Optional	None	Exact values (e.g., 'employee'), regular expressions (e.g., '^sales.*')

Example Usage:

sourceConfig: 
  config: 
    schemaFilterPattern: 
      excludes: 
        - mysql.*
        - information_schema.*
        - ^sys.*

`tableFilterPattern`¶

Description: To determine which tables to include/exclude during metadata ingestion.

Data Type	Requirement	Default Value	Possible Values
mapping	Mandatory	None	Exact values (e.g., 'employee'), regular expressions (e.g., '^sales.*')

Additional information: Applicable in case of databases/warehouses

includes OR excludes

includes: Add an array of regular expressions to this property in the YAML to include any tables whose names match one or more of the provided regular expressions. All other tables will be excluded.
excludes: Add an array of regular expressions to this property in the YAML. The Scanner Workflow will exclude any tables whose names match one or more of the provided regular expressions. All other tables will be included.

Data Type	Requirement	Default Value	Possible Values
string	Optional	None	Exact values (e.g., 'employee'), regular expressions (e.g., '^sales.*')

Example Usage:

sourceConfig: 
  config: 
    tableFilterPattern: 
      includes: 
        - ^cust.*

`topicFilterPattern`¶

Description: To determine which topics to include/exclude during metadata ingestion.

Data Type	Requirement	Default Value	Possible Values
mapping	Mandatory	None	None

Additional information: Applicable in case of stream data.

includes OR excludes

includes: Add an array of regular expressions to this property in the YAML to include any topics whose names match one or more of the provided regular expressions. All other topics will be excluded.
excludes: Add an array of regular expressions to this property in the YAML to exclude any topics whose names match one or more of the provided regular expressions. All other topics will be included.

Data Type	Requirement	Default Value	Possible Values
string	Optional	None	Exact values (e.g., 'employee'), regular expressions (e.g., '^sales.*')

Example Usage:

sourceConfig: 
  config: 
    topicFilterPattern: 
      includes: 
        - ^topic00.*

Info

Filter patterns support Regex in includes and excludes expressions.
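
This page does not state whether a pattern has to match the whole name or only part of it. A cautious approach, sketched below with hypothetical schema and table names, is to anchor each expression with `^` and `$` so that the intent is explicit either way.

```yaml
sourceConfig:
  config:
    type: DatabaseMetadata
    schemaFilterPattern:
      includes:
        - ^sales$          # only the schema named exactly "sales"
    tableFilterPattern:
      includes:
        - ^cust.*          # table names starting with "cust"
        - .*_daily$        # table names ending with "_daily" (hypothetical suffix)
```

Note that a bare `.` matches any character in a regular expression, so a literal dot in a name needs to be escaped as `\.`.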

`markDeletedTables`¶

Description: Set the markDeletedTables property to true to flag tables as soft-deleted if they are no longer present in the source system.

Data Type	Requirement	Default Value	Possible Values
boolean	Optional	false	true, false

Additional information: If a dataset is deleted from the source and was never ingested into Metis by a previous Scanner run, the scanned metadata on the Metis UI does not change. However, if the deleted dataset was already ingested into MetisDB by a previous Scanner run, run a Scanner Workflow for the relevant Depot with markDeletedTables: true in the Workflow configuration. After a successful run, check the Metis UI to see the tables that have been marked as deleted.

Example Usage:

sourceConfig: 
  config: 
    markDeletedTables: false
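
The rerun described under Additional information could look like the fragment below. It is a sketch only: the Depot address is the example value used elsewhere on this page, and the rest of the Workflow manifest is assumed to be unchanged.

```yaml
stackSpec:
  depot: dataos://icebase          # the Depot whose deleted tables should be flagged
  sourceConfig:
    config:
      type: DatabaseMetadata
      markDeletedTables: true      # soft-delete tables no longer present in the source
```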

`markDeletedTablesfromFilterOnly`¶

Description: Set this property to true to flag tables as soft-deleted if they are no longer present within the filtered schema or database only.

Data Type	Requirement	Default Value	Possible Values
boolean	Optional	false	true, false

Additional information: This flag is useful when you have more than one ingestion pipeline.

Example Usage:

sourceConfig: 
  config: 
    markDeletedTablesfromFilterOnly: false
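
A sketch of how this flag might be combined with a filter when several pipelines scan the same source. The schema name is hypothetical, and how this flag interacts with markDeletedTables is not described on this page, so treat the fragment as an assumption to verify.

```yaml
sourceConfig:
  config:
    type: DatabaseMetadata
    schemaFilterPattern:
      includes:
        - ^sales$                            # hypothetical schema handled by this pipeline
    markDeletedTablesfromFilterOnly: true    # flag deletions only within the filtered schema
```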

`ingestSampleData`¶

Description: Set this property to true to ingest sample data from the topics.

Data Type	Requirement	Default Value	Possible Values
boolean	Optional	false	true, false

Example Usage:

sourceConfig: 
  config: 
    ingestSampleData: false

`markDeletedTopics`¶

Description: Set this property to true to flag topics as soft-deleted if they are not present anymore in the source system.

Data Type	Requirement	Default Value	Possible Values
boolean	Optional	false	true, false

Example Usage:

sourceConfig:
  config:
    markDeletedTopics: false

`includeViews`¶

Description: Set this property to true to include views in metadata scanning.

Data Type	Requirement	Default Value	Possible Values
boolean	Optional	false	true, false

Example Usage:

sourceConfig: 
  config: 
    includeViews: true

`enableDebugLog`¶

Description: Set this property to true to set the default log level to debug.

Data Type	Requirement	Default Value	Possible Values
boolean	Optional	false	true, false

Example Usage:

sourceConfig: 
  config: 
    enableDebugLog: true
