Scanner

The Scanner stack in DataOS is a Python-based framework that lets developers extract metadata from external source systems (such as RDBMS, data warehouses, messaging services, and dashboards) as well as from components and services within the DataOS environment, including information about Data products and DataOS Resources.

With the DataOS Scanner stack, you can extract both general information about datasets/tables, such as their names, owners, and tags, and more detailed metadata, such as table schemas, column names, and descriptions. The stack can also retrieve metadata related to data quality and profiling, query usage, and user information associated with your data assets.

It can also connect to dashboard and messaging services to get the related metadata. For example, in the case of dashboards, it extracts information about the dashboard, dashboard elements, and associated data sources.

Using the Scanner stack within DataOS, metadata can be extracted from DataOS Products and DataOS Resources. The extracted metadata offers detailed insights into the inputs, outputs, and SLOs (Service Level Objectives) of every data product, along with its data access permissions, the infrastructure resources used to create it, and more. Users can track the entire life cycle of data product creation. The Scanner stack also collects comprehensive metadata across DataOS Resources such as Workflows, Services, Clusters, and Depots, including their historical runtime and operations data.

How Does the Scanner Stack Work?

In DataOS, metadata extraction is treated as a job, which is accomplished using a DataOS resource called Workflow. This stack provides the ability to write workflows that extract metadata from various sources and store it in a metadata store. The Scanner workflow typically includes a source, transformations, and a sink.

Similar to an ETL (Extract, Transform, Load) job, the Scanner workflow connects to the metadata source, extracts the metadata, and applies transformations to convert it into a standardized format. The transformed metadata is then pushed to a REST API server, which is backed by a centralized metadata store or database such as MySQL or Postgres. This process can be performed in either a batch or scheduled manner, depending on the requirements.
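The source, transformation, and sink steps described above can be sketched in Python. This is a minimal illustration, not the Scanner implementation: the `TableMetadata` shape, the raw row keys (`db`, `schema`, `table`, `owner`, `columns`), and the in-memory sink are all assumptions made for the example. In a real workflow the sink would push to the Metis REST API server.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class TableMetadata:
    """Standardized record shape (hypothetical, for illustration only)."""
    fully_qualified_name: str
    owner: str
    columns: Tuple[str, ...]


def extract(raw_rows: Iterable[dict]) -> Iterator[dict]:
    """Source step: yield raw metadata rows as the source reports them."""
    yield from raw_rows


def transform(row: dict) -> TableMetadata:
    """Transform step: normalize a raw row into the standardized shape."""
    name = f"{row['db']}.{row['schema']}.{row['table']}".lower()
    return TableMetadata(
        fully_qualified_name=name,
        owner=row.get("owner", "unknown"),
        columns=tuple(c.lower() for c in row.get("columns", [])),
    )


def run_scan(raw_rows: Iterable[dict], sink: Callable[[dict], None]) -> int:
    """Run extract -> transform -> sink and return the number of records pushed."""
    count = 0
    for row in extract(raw_rows):
        sink(asdict(transform(row)))
        count += 1
    return count


# Usage: an in-memory list stands in for the REST API sink.
pushed = []
rows = [
    {"db": "SALES", "schema": "public", "table": "Orders",
     "owner": "data-team", "columns": ["ID", "Amount"]},
]
scanned = run_scan(rows, pushed.append)
```

Passing the sink as a callable keeps the extraction logic independent of where the metadata ends up, which mirrors the separation between the Scanner job and the metadata store.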

The stored metadata is used by various DataOS components for discoverability, governance, and observability. External apps running on top of DataOS can also fetch this metadata via Metis server APIs.

DataOS Scanner stack for metadata extraction

Apart from external applications, the Scanner stack can also extract metadata from various applications and services within DataOS. The Scanner job reads the related metadata and pushes it to the metadata store through the Metis REST API server. You can then explore this information through the Metis UI.

The Scanner job connects with the following DataOS components and stores the extracted metadata in Metis DB:

  • Collation Service: To scan and publish metadata related to data pipelines, including workflow information, execution history, and execution states. It also collects metadata for historical data such as pods and logs, as well as data processing stacks like Flare and Benthos, capturing job information and source-destination relationships.
  • Gateway Service: To retrieve information from data profiles (descriptive statistics for datasets) and data quality tables (quality checks for your data along with their pass/fail status). It also scans data related to query usage, enabling insights into heavy datasets, popular datasets, and associations between datasets.
  • Heimdall: To scan and retrieve information about users in the DataOS environment, including their descriptions and profile images. This user information is accessible through the Metis UI.
  • Pulsar Service: To listen continuously for messages published to it by other services and stacks within the system.

Creating and Scheduling Scanner Workflows

Within DataOS, you can deploy and schedule different types of Scanner workflows, each of which connects to a data source to extract metadata.

  • Depot Scan Workflow: This type of Scanner workflow uses a depot to connect to the metadata source and extract entity metadata. It lets you scan all the datasets referenced by the depot. You provide the depot name or address, which is used to connect to the data source.

  • Non-Depot Scan Workflow: With this type of Scanner workflow, you must provide the connection details and credentials for the underlying metadata source in the YAML file. These details depend on the underlying source and may include the host URL, project ID, email, and so on.

Scanner workflows are written as sequential YAML. They drive the pull-based metadata extraction system built into DataOS, which supports a wide variety of sources in your data stack. These workflows can be scheduled to run automatically at a specified frequency.
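The difference between the two workflow types can be illustrated with a small Python check. This is only a sketch: the flat dictionary layout is hypothetical and does not reflect the actual nesting of a Scanner YAML, and the rule that a non-depot scan needs `type`, `source`, and `sourceConnection` is an assumption drawn from the attribute names used in this document.

```python
from typing import Mapping

# Keys assumed to be needed when no depot is given (hypothetical flat layout;
# the real Scanner YAML nesting may differ).
NON_DEPOT_REQUIRED = ("type", "source", "sourceConnection")


def scan_mode(job: Mapping[str, object]) -> str:
    """Classify a job description as a depot scan or a non-depot scan."""
    if job.get("depot"):
        # A depot already holds the connection details for the source.
        return "depot"
    missing = [key for key in NON_DEPOT_REQUIRED if not job.get(key)]
    if missing:
        raise ValueError(f"non-depot scan is missing: {', '.join(missing)}")
    return "non-depot"


# Depot scan: only the depot address is supplied.
depot_job = {"depot": "dataos://icebase"}

# Non-depot scan: connection details are supplied inline (placeholder values).
non_depot_job = {
    "type": "bigquery",
    "source": "bigquery_metasource",
    "sourceConnection": {"type": "BigQuery", "projectID": "<project-id>"},
}
```

Calling `scan_mode(depot_job)` returns `"depot"`, while `scan_mode(non_depot_job)` returns `"non-depot"`. A job with neither a depot nor complete connection details raises a `ValueError` naming the missing keys.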

Scanner YAML

Scanner YAML Components

Learn about the source connection and configuration options to create depot scan/non-depot scan workflow DAGs to scan entity metadata.

Creating Scanner Workflows

Attributes of Scanner Workflow

The table below summarizes the attributes of a Scanner workflow YAML.

| Attribute | Data Type | Default Value | Possible Value | Requirement |
| --- | --- | --- | --- | --- |
| spec | mapping | | | Mandatory |
| stack | string | | scanner | Mandatory |
| compute | string | runnable-default | mycompute | Mandatory |
| runAsUser | string | | metis | Mandatory |
| depot | string | | dataos://icebase | Mandatory |
| type | string | Source-specific | bigquery | Mandatory |
| source | string | | bigquery_metasource | Mandatory |
| sourceConnection | mapping | | | Mandatory |
| type | string | Source-specific | BigQuery | Mandatory |
| username | string | Source-specific | projectID, email, hostport | Mandatory |
| sourceConfig | mapping | | | Mandatory |
| type | string | | DatabaseMetadata, MessagingMetadata | Mandatory |
| databaseFilterPattern | mapping | | | Mandatory |
| includes/excludes | string | | ^SNOWFLAKE.* | Optional |
| schemaFilterPattern | mapping | | | Mandatory |
| includes/excludes | string | | ^public$ | Optional |
| tableFilterPattern | mapping | | | Mandatory |
| includes/excludes | string | | ^public$ | Optional |
| topicFilterPattern | mapping | | | Mandatory |
| includes/excludes | string | | foo, bar | Optional |
| includeViews | boolean | false | true, false | Optional |
| markDeletedTables | boolean | false | true, false | Optional |
| markDeletedTablesFromFilterOnly | boolean | false | true, false | Optional |
| enableDebugLog | boolean | false | true, false | Optional |
| ingestSampleData | boolean | false | true, false | Optional |
| markDeletedTopics | boolean | false | true, false | Optional |

To learn more about these fields, their possible values, and example usage, refer to Attributes of Scanner YAML.
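The `includes`/`excludes` values in the filter patterns look like regular expressions (for example `^SNOWFLAKE.*` and `^public$`). The following Python sketch shows how such patterns could select names. It assumes that patterns are matched from the start of the name and that an exclude takes precedence over an include. This document does not spell out those semantics, so treat them as assumptions and check Attributes of Scanner YAML for the actual behavior.

```python
import re
from typing import Optional, Sequence


def is_selected(
    name: str,
    includes: Optional[Sequence[str]] = None,
    excludes: Optional[Sequence[str]] = None,
) -> bool:
    """Return True if `name` passes the include/exclude patterns.

    Assumed semantics (not confirmed by the source):
    - patterns are regular expressions matched from the start of the name;
    - any matching exclude rejects the name;
    - if includes are given, at least one must match;
    - with no patterns at all, every name is selected.
    """
    if excludes and any(re.match(p, name) for p in excludes):
        return False
    if includes:
        return any(re.match(p, name) for p in includes)
    return True


# Example schema names filtered with the `^public$` pattern from the table.
schemas = ["public", "public_archive", "staging"]
selected = [s for s in schemas if is_selected(s, includes=["^public$"])]
```

Because `^public$` is anchored at both ends, only the schema named exactly `public` is selected. A pattern such as `^SNOWFLAKE.*` matches any name that begins with `SNOWFLAKE`.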

Supported Data Sources

Here you can find templates for the depot/non-depot Scanner workflows.

Databases and Warehouses

Messaging Services

Dashboard Services

System Scanner Workflows

The following workflows run as system workflows. They are scheduled at a set interval to scan the related metadata and save it to Metis DB, so that the stored metadata reflects the current state.

Data Products

The following Scanner workflow collects information about the Data products within DataOS.
Scanner for Data Product

System Metadata Sync

The following Scanner workflow collects information from Icebase and Fastbase for the newly added data assets.
Scanner for System Metadata

Users’ Information

This workflow scans information about the users in DataOS. It is a scheduled workflow that connects with Heimdall on a given cadence to fetch information about users.
Scanner for User's Information

Metadata Update

The Indexer service, a continuously running service within the DataOS environment, keeps track of newly created or updated entities such as Data products, Data Assets (datasets, topics, dashboards, etc.), and DataOS Resources (Workflows, Services, Workers, Monitors, Depots, etc.). Using this information about the changed entity, it creates a reconciliation Scanner YAML with filters that include only the affected entity. This Scanner workflow then extracts the metadata for that entity and updates the target metastore.
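One way to picture the reconciliation filters is as exact-match patterns built from the changed entity's name. The Python sketch below is illustrative only. The function names are hypothetical, the use of a list under `includes` is an assumption, and the source does not say how the Indexer service actually builds its filters.

```python
import re


def exact_match_pattern(name: str) -> str:
    """Build a regex that matches `name` and nothing else.

    re.escape neutralizes characters such as "." that would otherwise
    act as regex operators and match unintended names.
    """
    return f"^{re.escape(name)}$"


def reconciliation_filters(database: str, schema: str, table: str) -> dict:
    """Return filter patterns that include only one table (hypothetical layout)."""
    return {
        "databaseFilterPattern": {"includes": [exact_match_pattern(database)]},
        "schemaFilterPattern": {"includes": [exact_match_pattern(schema)]},
        "tableFilterPattern": {"includes": [exact_match_pattern(table)]},
    }


# Usage: filters for a single changed table (example names are placeholders).
filters = reconciliation_filters("icebase", "retail", "orders.v2")
table_pattern = filters["tableFilterPattern"]["includes"][0]
```

Anchoring and escaping matter here: without them, a filter for `orders.v2` would also pick up names such as `ordersXv2` or `orders.v2_backup`, and the scan would touch more than the affected entity.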

Each of the following continuously running services triggers a specific type of metadata scan.

Data Profiling

The objective of this worker is to proactively scan data profiling information, which includes descriptive statistics for datasets stored in Icebase. It operates in response to a triggered data profiling job, publishing the metadata to the Metis DB.

Indexer Service for Data Profiling

Data Quality

This worker is designed to reactively scan datasets and ingest quality checks and metrics data whenever a data quality scan is initiated. The acquired information is then published to the Metis DB, contributing to a comprehensive understanding of data quality.

Indexer Service for Data Quality

SODA Quality Checks

The primary objective of this worker is to reactively scan datasets and collect quality checks and metrics data whenever a SODA quality scan is triggered. The collected data is saved to the Metis DB, facilitating thorough analysis and monitoring.

Indexer Service for SODA Quality Checks

DataOS Resources

This worker operates reactively to scan specific DataOS Resource information from Poros whenever a lifecycle event is triggered. It captures relevant details and publishes them to the Metis DB, ensuring an up-to-date repository of DataOS Resources metadata.

Indexer Service for DataOS Resources

Query Usage

This worker ingests metadata related to query history. It scans information about queries, users, dates, and completion times.

Indexer Service for Query Usage

Common Errors

Common Scanner Errors