# Scanner for Redshift
The Redshift metadata Scanner Workflow can be configured and scheduled through the DataOS CLI. Ensure that all prerequisites are met before initiating the Workflow.
## Prerequisites
To scan metadata from a Redshift data source, the prerequisites are as follows:

- Credentials to connect to the Redshift database.
- Access to the `SVV_TABLE_INFO` view: this view contains summary information for tables in the Redshift database and is visible only to superusers. Permission to query the view is needed when scanning metadata with the Scanner Workflow.
- The Redshift user must be granted the `SELECT` privilege on the `SVV_TABLE_INFO` table to fetch the metadata of tables and views.
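    For illustration, a minimal sketch of such a grant, run from a SQL client connected to the Redshift database; the user name `scanner_user` is a hypothetical placeholder for the user whose credentials the Depot is configured with:

    ```sql
    -- scanner_user is a hypothetical placeholder; substitute the
    -- Redshift user the Depot connects as.
    GRANT SELECT ON svv_table_info TO scanner_user;
    ```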
- Access Permissions in DataOS: To execute a Scanner Workflow in DataOS, verify that at least one of the following role tags is assigned:

    - `roles:id:data-dev`
    - `roles:id:system-dev`
    - `roles:id:user`
Use the following command to check assigned roles:
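A sketch of the check from the CLI, assuming the standard `dataos-ctl user get` command, which lists the tags assigned to the current user:

```bash
dataos-ctl user get
```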
If any required tags are missing, contact a DataOS Operator or submit a Grant Request for role assignment.
Alternatively, if access is managed through use cases, ensure the following use cases are assigned:
- Read Workspace
- Run as Scanner User
- Manage All Depot
- Read All Dataset
- Read All Secrets from Heimdall
To validate assigned use cases, refer to the Bifrost Application Use Cases section.
- Pre-created Redshift Depot: To check the Depot, go to the Metis UI of DataOS or use the following command:
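    A sketch of the CLI check, assuming the standard `dataos-ctl get` syntax (the `-a` flag lists Depots across all owners):

    ```bash
    dataos-ctl get -t depot -a
    ```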
    ```
    # expected output
    INFO[0000] 🔍 get...
    INFO[0000] 🔍 get...complete

    | NAME             | VERSION | TYPE  | WORKSPACE | STATUS | RUNTIME | OWNER      |
    | ---------------- | ------- | ----- | --------- | ------ | ------- | ---------- |
    | mongodepot       | v2alpha | depot |           | active |         | usertest   |
    | snowflakedepot   | v2alpha | depot |           | active |         | gojo       |
    | redshiftdepot    | v2alpha | depot |           | active |         | kira       |
    | mysqldepot       | v2alpha | depot |           | active |         | ryuk       |
    | oracle01         | v2alpha | depot |           | active |         | drdoom     |
    | mariadb01        | v2alpha | depot |           | active |         | tonystark  |
    | demopreppostgres | v2alpha | depot |           | active |         | slimshaddy |
    | demoprepbq       | v2alpha | depot |           | active |         | pengvin    |
    | mssql01          | v2alpha | depot |           | active |         | hulk       |
    | kafka01          | v2alpha | depot |           | active |         | peeter     |
    | icebase          | v2alpha | depot |           | active |         | blackpink  |
    | azuresql         | v2alpha | depot |           | active |         | arnold     |
    | fastbase         | v2alpha | depot |           | active |         | ddevil     |
    ```
- If the Redshift Depot is not created, create one using the following Depot Template:
    ```yaml
    name: ${{redshift-depot-name}}
    version: v2alpha
    type: depot
    tags:
      - ${{redshift}}
    layer: user
    description: ${{Redshift Sample data}}
    depot:
      type: REDSHIFT
      redshift:
        host: ${{hostname}}
        subprotocol: ${{subprotocol}}
        port: ${{5439}}
        database: ${{sample-database}}
        bucket: ${{tmdc-dataos}}
        relativePath: ${{development/redshift/data_02/}}
      external: ${{true}}
      secrets:
        - name: ${{redshift-instance-secret-name}}-r
          allkeys: true
        - name: ${{redshift-instance-secret-name}}-rw
          allkeys: true
    ```
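    Once the placeholders are filled in, the Depot manifest can be applied from the CLI; a sketch assuming the standard apply syntax and a hypothetical file name `redshift-depot.yaml`:

    ```bash
    dataos-ctl apply -f redshift-depot.yaml
    ```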
The Depot enables access to all schemas visible to the specified user in the configured Redshift database.
## Scanner Workflow for Redshift
The manifest configuration for the Redshift Scanner Workflow is as follows:
```yaml
version: v1
name: redshift-scanner2
type: workflow
tags:
  - redshift-scanner2.0
description: The workflow scans schema tables and registers metadata
workflow:
  dag:
    - name: scanner2-redshift
      description: The job scans schemas from the Redshift Depot tables and registers metadata to Metis
      spec:
        stack: scanner:2.0
        compute: runnable-default
        runAsUser: metis
        stackSpec:
          depot: demoprepredshift
          sourceConfig:
            config:
              markDeletedTables: false
              includeTables: true
              includeViews: false
              schemaFilterPattern:
                includes:
                  - ^Public$
              tableFilterPattern:
                includes:
                  - ^Customer.*
```
Once the credentials are provided in the sample file and it is saved with a .yaml/.yml extension, the Workflow can be deployed to extract the metadata.
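A sketch of the deployment command, assuming the manifest is saved under the hypothetical file name `redshift-scanner.yaml` and the Workflow is deployed to the `public` Workspace:

```bash
dataos-ctl apply -f redshift-scanner.yaml -w public
```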
Updating the Scanner Workflow: If the Depot or Scanner configurations are updated, the Scanner must be redeployed after deleting the previous instance. Use one of the following commands to delete the existing Scanner.
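A sketch of deletion by name, assuming the Workflow defined above runs in the `public` Workspace:

```bash
dataos-ctl -t workflow -w public -n redshift-scanner2 delete
```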
OR
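Alternatively, a sketch of deletion by manifest file, assuming the same hypothetical file name as above:

```bash
dataos-ctl delete -f redshift-scanner.yaml
```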
Best Practice
As part of best practices, it is recommended to regularly delete Resources that are no longer in use; this saves time and reduces costs.