Scan Metadata without Depot¶
!!! info(Information) This guide will demonstrate the steps to perform metadata scanning from external data stored in structured data sources (RDBMS/Data Warehouses) without using Depot. In this case, you must set up a source connection within the Scanner workflow and provide configuration to capture schema details.
Quick Steps¶
Follow the below steps:
For illustration purposes, we will connect with the BigQuery data source.
Step 1: Check Required Permissions¶
-
You should have enough access to fetch the required metadata from the metadata source. The following list describes the minimum permissions needed for BigQuery.
GCP Permission GCP Role Required For 1 bigquery.datasets.get BigQuery Data Viewer Metadata Ingestion 2 bigquery.tables.get BigQuery Data Viewer Metadata Ingestion 3 bigquery.tables.getData BigQuery Data Viewer Metadata Ingestion 4 bigquery.tables.list BigQuery Data Viewer Metadata Ingestion 5 resourcemanager.projects.get BigQuery Data Viewer Metadata Ingestion 6 bigquery.jobs.create BigQuery Job User Metadata Ingestion 7 bigquery.jobs.listAll BigQuery Job User Metadata Ingestion 8 datacatalog.taxonomies.get BigQuery Policy Admin Fetch Policy Tags 9 datacatalog.taxonomies.list BigQuery Policy Admin Fetch Policy Tags 10 bigquery.readsessions.create BigQuery Admin Bigquery Usage Workflow 11 bigquery.readsessions.getData BigQuery Admin Bigquery Usage Workflow -
To run the scanner workflow, a user must have either Metis admin access or be granted access to the "Run as Scanner User" use case.
Step 2: Create Scanner Workflow¶
Let us create a Scanner workflow with sourceConnection
and sourceConfig
sections, where we can provide connection information for the metadata source.
- Provide the workflow properties, such as version, name, description, tags, etc.
-
Define the Scanner job configuration properties in the dag, such as job name, its description, stack used, etc.
version: v1 name: bigquery-scanner-test type: workflow tags: - bigquery-non-depot-scanner description: The workflow is for the job to scan schema tables and register metadata workflow: dag: - name: bigquery-scanner2 description: The job scans schema from bigquery via Non-depot tables and registers metadata to metis2 spec: tags: - scanner2.0 stack: scanner:2.0 compute: runnable-default
-
Under the ‘Scanner’ stack, provide the following:
-
type: This depends on the underlying data source. Values for type could be snowflake, bigquery, redshift, etc.
-
source: Here, you need to explicitly provide the source name where the scanned metadata is saved within Metastore (Metis). On Metis UI, sources are listed for databases, messaging, dashboards, workflows, ML models, etc.
-
-
When the metadata source is not referenced by the Depot, you need to provide the
source connection
details and credentials explicitly. These details will depend on the underlying data source to be scanned. Click here to learn more about the source specific configuration properties.This requires us to provide the following information for our example, all provided by BigQuery:
type
,projected
,privateKey
,privateKeyId
,clientEmail
,clientId
,authUri
,tokenUri
,authProviderX509CertUrl
,clientX509CertUrl
These details are sent to the metadata source while making the connection.
sourceConnection: config: type: BigQuery credentials: gcsConfig: type: service_account projectId: ${{project_id}} privateKeyId: ${{provate_keyID}} privateKey: | | # divide every \n to new line between BEGIN & END PRIVATE KEY -----BEGIN PRIVATE KEY----- xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxx -----END PRIVATE KEY----- clientEmail: ${{client_mail}} clientId: ${{clientID}} # authUri: https://accounts.google.com/o/oauth2/auth (default) # tokenUri: https://oauth2.googleapis.com/token (default) # authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs (default) clientX509CertUrl: "https://www.googleapi.com/robot/v1/metadata/x509/ds-demo-write%40dataos-ck-res-yak-dev.iam.gserviceaccount.com" sourceConfig:
-
Provide a set of configurations specific to the source type under
source configuration
to customize and control metadata scanning.- type:
DatabaseMetadata
- filterPattern: Specify which databases/ schemas/ tables/ charts/ dashboards/ topics are of interest.
includes:
andexcludes:
are used to specify schema/table names or a regex rule to include/exclude tables while scanning the schema and registering with Metis.To know more about how to specify filters in different scenarios, refer to Filter Pattern Examples.sourceConfig: config: databaseFilterPattern: includes: - dataos-ck-res-yak-dev schemaFilterPattern: excludes: - manufacturing
Here is the complete YAML code.
version: v1 name: bigquery-scanner-test type: workflow tags: - bigquery-non-depot-scanner description: The workflow is to scan schema tables and register metadata workflow: dag: - name: bigquery-scanner2 description: The job scans schema from bigquery via Non-depot tables and registers metadata to metis2 spec: tags: - scanner2.0 stack: scanner:2.0 compute: runnable-default stackSpec: type: bigquery source: BigQueryTestSource sourceConnection: config: type: BigQuery credentials: gcsConfig: type: service_account projectId: dataos-ck-res-yak-dev privateKeyId: ${{project_id}} privateKey: | -----BEGIN PRIVATE KEY----- ${{private_key}} -----END PRIVATE KEY----- clientEmail: ${{client_mail}} clientId: ${{client_id}} authUri: https://accounts.google.com/o/oauth2/auth (default) tokenUri: https://oauth2.googleapis.com/token (default) authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs (default) clientX509CertUrl: "https://www.googleapi.com/robot/v11/metadata/x509/ds-demo-writer%40dataos-ck-res-yak-dev.iam.gserviceaccount.com" sourceConfig: config: databaseFilterPattern: includes: - dataos-ck-res-yak-dev schemaFilterPattern: excludes: - manufacturing
- type:
Step 3: Check Scanned Metadata on Metis¶
After the successful workflow run, scanned metadata is registered with Metis. You can check the metadata of scanned databases/schemas/tables on Metis UI. To see it, go to the Metis app -> Settings -> Databases.
The following metadata source is created (as configured in the Scanner Workflow).
Scanned Database
Scanned Schemas
Scheduling Scanner Workflow Run¶
Scanner workflows are either single-time run or scheduled to run at a specific cadence. To schedule a workflow, you must add the schedule property defining a cron in workflow
section.
workflow:
title: scheduled Scanner Workflow
schedule:
cron: '*/2 * * * *' #every 2 minute [Minute, Hour, day of the month ,month, dayoftheweek]
concurrencyPolicy: Allow #forbid/replace
endOn: 2024-11-01T23:40:45Z
timezone: Asia/Kolkata