Skip to content

Scanner

The Scanner stack in DataOS is a Python-based framework designed for developers to extract metadata from external source systems (such as RDBMS, Data Warehouses, Messaging services, Dashboards, etc.) and the components/services within the DataOS environment to extract information about Data products and DataOS Resources.

With the DataOS Scanner stack, you can extract both general information about datasets/tables, such as their names, owners, and tags, as well as more detailed metadata like table schemas, column names, and descriptions. Additionally, this stack can help you retrieve metadata related to data quality and profiling, query usage, and user information associated with your data assets.

It can also connect with Dashboard and Messaging services to get the related metadata. For example, in the case of dashboards, it extracts information about the dashboard, dashboard Elements, and associated data sources.

Using the Scanner stackwithin DataOS, metadata can be extracted from DataOS Products and DataOS Resources. The extracted metadata offers detailed insights into the input, output & SLOs (Service Level Objectives) for every data product, along with all the data access permissions, infrastructure resources used for creating it and more. Users can track the entire life cycle of data product creation. The Scanner stack collects comprehensive metadata across DataOS Resources such as Workflows, Services, Clusters, Depots, etc.including their historical runtime and operations data.

How Does Scanner Stack Work?

In DataOS, metadata extraction is treated as a job, which is accomplished using a DataOS resource called Workflow. This stack provides the ability to write workflows that extract metadata from various sources and store it in a metadata store. The Scanner workflow typically includes a source, transformations, and a sink.

Similar to an ETL (Extract, Transform, Load) job, the Scanner workflow connects to the metadata source, extracts the metadata, and applies transformations to convert it into a standardized format. The transformed metadata is then pushed to a REST API server, which is backed by a centralized metadata store or database such as MySQL or Postgres. This process can be performed in either a batch or scheduled manner, depending on the requirements.

The stored metadata is used by various DataOS components for discoverability, governance, and observability. External apps running on top of DataOS can also fetch this metadata via Metis server APIs.

Metadata extraction using the Scanner stack

DataOS Scanner stack for metadata extraction

Apart from the external applications, the Scanner stack can also extract metadata from various applications & services of DataOS. The scanner job reads related metadata and pushes it to the metadata store through the Metis REST API server. You can then explore this information through the Metis UI.

The Scanner job connects with the following DataOS components and stores the extracted metadata to Metis DB:

  • Collation Service: To scan and publish metadata related to data pipelines, including workflow information, execution history, and execution states. It also collects metadata for historical data such as pods and logs, as well as data processing stacks like Flare and Benthos, capturing job information and source-destination relationships.
  • Gateway Service: To retrieve information from data profiles (descriptive statistics for datasets) and data quality tables (quality checks for your data along with their pass/fail status). It also scans data related to query usage, enabling insights into heavy datasets, popular datasets, and associations between datasets.
  • Heimdall: To scan and retrieve information about users in the DataOS environment, including their descriptions and profile images. This user information is accessible through the Metis UI.
  • Pulsar Service: To keep listening to the messages being published on it by various other services and stacks within the system.

Creating and Scheduling Scanner Workflows

Within DataOS, different workflows can be deployed and scheduled, which will connect to the data sources to extract metadata.

  • Depot Scan Workflow: With this type of Scanner workflow, depots are used to get connected to the metadata source to extract Entities’ metadata. It enables you to scan all the datasets referred by a depot. You need to provide the depot name or address, which will connect to the data source.

  • Non-Depot Scan Workflow: With this type of scanner workflow, you must provide the connection details and credentials for the underlying metadata source in the YAML file. These connection details depend on the underlying source and may include details such as host URL, project ID, email, etc.

You can write Scanner workflows in the form of a sequential YAML for a pull-based metadata extraction system built into DataOS for a wide variety of sources in your data stack. These workflows can be scheduled to run automatically at a specified frequency.

Scanner YAML

Scanner YAML Components

Learn about the source connection and configuration options to create depot scan/non-depot scan workflow DAGs to scan entity metadata.

Creating Scanner Workflows

Attributes of Scanner Workflow

The below table summarizes various properties within a Scanner workflow YAML.

Attribute Data Type Default Value Possible Value Requirement
spec mapping Mandatory
stack string scanner Mandatory
compute string runnable-default mycompute Mandatory
runAsUser string metis Mandatory
depot string dataos://icebase Mandatory
type string Source-specific bigquery Mandatory
source string bigquery_metasource Mandatory
sourceConnection mapping Mandatory
type string Source-specific BigQuery Mandatory
username string Source-specific projectID email hostport Mandatory
sourceConfig mapping Mandatory
type string DatabaseMetadata MessagingMetadata Mandatory
databaseFilterPattern mapping Mandatory
includes/exclude string ^SNOWFLAKE.* optional
schemaFilterPattern mapping Mandatory
includes/excludes string ^public$ optional
tableFilterPattern mapping mandatory
includes/excludes string ^public$ optional
topicFilterPattern mapping Mandatory
includes/excludes string foo bar optional
includeViews boolean false true false optional
markDeletedTables boolean false true false optional
markDeletedTablesFromFilterOnly boolean false true false optional
enableDebugLog boolean false true false optional
ingestSampleData boolean false true false optional
markDeletedTopics boolean false true false optional

To learn more about these fields, their possible values, example usage, refer to Attributes of Scanner YAML.

Supported Data Sources

Here you can find templates for the depot/non-depot Scanner workflows for the supported data sources.

Type Data Source Scanner
Database Maria DB Link
Database MSSQL Link
Database MYSQL Link
Database Oracle Link
Database PostgreSQL Link
Data Warehouse BigQuery Link
Data Warehouse AzureSQL Link
Data Warehouse Redshift Link
Data Warehouse Snowflake Link
Lakehouse Icebase Link
Messaging Service Kafka Link
Messaging Service Fastbase/Pulsar Link
Dashboard Service Atlas Link
Dashboard Service Redash Link
Dashboard Service Superset Link

System Scanner Workflows

The following workflows are running as system workflows to periodically scan the related metadata and save it to Metis DB to reflect the updated metadata state. They are scheduled to run at a set interval.

Data Products

Data product Scanner workflows is for collecting metadata related to Data products such as inputs, outputs, SLOs, policies, lineage and associated DataOS Resources.

This Scanner workflow reads the metadata and stores it in Metis DB. This metadata helps you understand data product's life cycle along with the data access permissions, infrastructure resources used for creating it.

Scanner for Data Product
version: v1
name: scanner2-data-product
type: workflow
tags:
  - scanner
  - data-quality
description: The job scans schema tables and register data
workflow:
  # schedule:
  #   cron: '*/20 * * * *'
  #   concurrencyPolicy: Forbid
  dag:
    - name: scanner2-data-product-job
      description: The job scans schema from data-product and register data to metis
      spec:
        tags:
          - scanner2
        stack: scanner:2.0
        compute: runnable-default
        stackSpec:
          type: data-product
          sourceConfig:
            config:
              type: DataProduct
              markDeletedDataProducts: true
              # dataProductFilterPattern:
              #   includes:
              #     - customer-360-all$

System Metadata Sync

This scanner periodically scans the icebase and Fastbase and stores metadata related to tables and topics in Metis DB. It collects information from Icebase and Fastbase for the newly added data assets.

Scanner for System Metadata
name: system-metadata-sync
version: v1
type: workflow
tags:
  - icebase
  - fastbase
  - profile
  - scanner
description: Icebase, fastbase metadata scanner workflow
owner: metis
workspace: system
workflow:
  title: Icebase and Fastbase Depot Scanner
  schedule:
    cron: '*/30 * * * *'
    timezone: UTC
    concurrencyPolicy: Forbid
  dag:
    - name: icebase-scanner
      description: The job scans and publishes all datasets from icebase to metis.
      spec:
        stack: scanner:2.0
        logLevel: INFO
        compute: runnable-default
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits: {}
        runAsApiKey: >-
          ****************************************************************************
        runAsUser: metis
        stackSpec:
          depot: icebase
          sourceConfig:
            config:
              markDeletedTables: true
    - name: fastbase-scanner
      description: The job scans and publishes all datasets from fastbase to metis.
      spec:
        stack: scanner:2.0
        logLevel: INFO
        compute: runnable-default
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 1100m
            memory: 2048Mi
        runAsApiKey: >-
          ****************************************************************************
        runAsUser: metis
        stackSpec:
          depot: fastbase
stamp: '-bkcu'
generation: 33
uid: a105c747-410c-4edc-9953-bda462e94efd
status: {}

Users’ Information

Heimdall, in DataOS, is the security engine managing user access. It ensures only authorized users can access DataOS resources. This Scanner workflow connects with Heimdall to To scan and retrieve information about users in the DataOS environment, including their descriptions and profile images and stores it in Metis DB.

This Scanner workflow will scan the information about the users in DataOS. This is a scheduled workflow that connects with Heimdall on a given cadence to fetch information about users.

Scanner for User's Information
name: heimdall-users-sync
version: v1
type: workflow
tags:
  - users
  - scanner
description: Heimdall users sync workflow
owner: metis
workspace: system
workflow:
  title: Heimdall Users Sync
  schedule:
    cron: '*/10 * * * *'
    timezone: UTC
    concurrencyPolicy: Forbid
  dag:
    - name: users-sync
      description: The job scans and publishes all users heimdall to metis.
      spec:
        stack: scanner:2.0
        stackSpec:
          type: users
        logLevel: INFO
        compute: runnable-default
        runAsUser: metis
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 1100m
            memory: 2048Mi
        runAsApiKey: >-
          ****************************************************************************

Metadata Update

Indexer service, a continuous running service within the DataOS environment keeps track of newly created or updated entities such as Data products, Data Assets(datasets/topics/dashboards, etc.) and DataOS Resources(Workflows, Services, Workers, Monitors, Depots etc.). With this information about the changed entity, it creates a reconciliation Scanner YAML with filters to include only the affected entity. This Scanner workflow will extract the metadata about the entity and update the target metastore.

The following continuous running services are designed for triggering the specific type of metadata scan.

Data Profiling

Flare workflows are run for data profiling on the entire dataset or sample /filtered data and uses basic statistics to know about the validity of the data. This analysis is stored in Icebase.

To learn more about data profiling Flare workflows, click here.

A continuous running service reads about these statistics (metadata extraction related to data profiling) and stores it in Metis DB. This data helps you find your data's completeness, uniqueness, and correctness for the given dataset.

The objective of this worker is to proactively scan data profiling information, which includes descriptive statistics for datasets stored in Icebase. It operates in response to a triggered data profiling job, publishing the metadata to the Metis DB.

Indexer Service for Data Profiling
name: dataset-profiling-indexer
version: v1beta
type: worker
tags:
  - Scanner
description: >-
  The purpose of this worker is to reactively scan workflows and ingest
  profiling data whenever a lifecycle event is triggered.
owner: metis
workspace: system
worker:
  title: Dataset Profiling Indexer
  tags:
    - Scanner
  replicas: 1
  autoScaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 2
    targetMemoryUtilizationPercentage: 120
    targetCPUUtilizationPercentage: 120
  stack: scanner:2.0
  stackSpec:
    type: worker
    worker: data_profile_indexer
  logLevel: INFO
  compute: runnable-default
  runAsUser: metis
  resources:
    requests:
      cpu: 500m
      memory: 1024Mi
    limits: {}
  runAsApiKey: '****************************************************************************'

Data Quality

Service Level objectives(SLOs) are business-specific validation rules applied to test and evaluate the quality of specific datasets if they are appropriate for the intended purpose. DataOS allows you to define your own assertions with a combination of tests to check the rules.

This worker is a continuous running service, designed to reactively scan datasets and ingest quality checks and metrics data along with their pass/fail status whenever a Flare data quality scan is initiated. The acquired metadata related to data quality is then published to the Metis DB, contributing to a comprehensive understanding of data quality.

Indexer Service for Data Quality
name: dataset-quality-checks-indexer
version: v1beta
type: worker
tags:
  - Scanner
description: >-
  The purpose of this worker is to reactively scan workflows and ingest
  quality checks and metrics data whenever a lifecycle event is triggered.
owner: metis
workspace: system
worker:
  title: Dataset Quality Checks Indexer
  tags:
    - Scanner
  replicas: 1
  autoScaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 2
    targetMemoryUtilizationPercentage: 120
    targetCPUUtilizationPercentage: 120
  logLevel: INFO
  runAsUser: metis
  compute: runnable-default
  stack: scanner:2.0
  stackSpec:
    type: worker
    worker: data_quality_indexer
  resources:
    requests:
      cpu: 500m
      memory: 1024Mi
    limits: {}
  runAsApiKey: '****************************************************************************'

SODA Quality Checks

DataOS now also extends support to other quality check platforms, such as SQL-powered SODA. It enables you to use the Soda Checks Language (SodaCL) to turn user-defined inputs into aggregated SQL queries. With Soda, you get a far more expressive language to test the quality of your datasets. It allows for complex user-defined checks with granular control over fail and warn states.

The primary objective of this continuous running service (worker) is to reactively scan datasets, for collecting quality checks and metrics data along with their pass/fail status whenever a SODA quality scan is triggered. The collected data is saved to the Metis DB, facilitating thorough analysis and monitoring.

Indexer Service for SODA Quality Checks
name: soda-quality-checks-indexer
version: v1beta
type: worker
tags:
  - Scanner
description: >-
  The purpose of this worker is to reactively scan datasets and ingest quality
  checks and metrics data whenever a soda scan is triggered.
owner: metis
workspace: system
worker:
  title: Soda Quality Checks Indexer
  tags:
    - Scanner
  replicas: 1
  autoScaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 2
    targetMemoryUtilizationPercentage: 120
    targetCPUUtilizationPercentage: 120
  stack: scanner:2.0
  stackSpec:
    type: worker
    worker: soda_indexer
  logLevel: INFO
  compute: runnable-default
  runAsUser: metis
  resources:
    requests:
      cpu: 500m
      memory: 1024Mi
    limits: {}
  runAsApiKey: '****************************************************************************'

DataOS Resources

DataOS Resources Metadata Scanner Worker is a continuous running service to read the metadata across Workflows, Services, Clusters, Depots, etc., including their historical runtime and operations data, and saves it to the Metis DB a whenever a DataOS Resource is created, deleted within DataOS.

This worker operates reactively to scan specific DataOS Resource information from Poros whenever a lifecycle event is triggered. It captures relevant details and publishes them to the Metis DB, ensuring an up-to-date repository of DataOS Resources metadata.

Indexer Service for DataOS Resources
name: poros-indexer
version: v1beta
type: worker
tags:
  - Scanner
description: >-
  The purpose of this worker is to reactively scan metadata for DataOS Resources whenever a
  lifecycle event is triggered.
owner: metis
workspace: system
worker:
  title: Workflow Indexer
  tags:
    - Scanner
  replicas: 2
  autoScaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 3
    targetMemoryUtilizationPercentage: 120
    targetCPUUtilizationPercentage: 120
  stack: scanner:2.0
  stackSpec:
    type: worker
    worker: poros_indexer
  logLevel: INFO
  compute: runnable-default
  runAsUser: metis
  resources:
    requests:
      cpu: 500m
      memory: 1024Mi
    limits: {}
  runAsApiKey: '****************************************************************************'

Query Usage

DataOS Gateway service backed by Gateway DB keeps query usage data, and dataset usage analytics and harvests required insights such as (heavy datasets, popular datasets, datasets most associated together, etc.). The Scanner Worker is a continuous running service to extract query-related data and saves it to MetaStore. It scans information about queries, users, dates, and completion times. This query history/dataset usage data helps to rank the most used tables.

The following Scanner Worker is for metadata extraction related to query usage data. It reactively reads the queries data published to pulsar topic and ingest it into metis. It scans information about queries, users, dates, and completion times.

Indexer Sertvice for Query Usage
name: query-usage-indexer
version: v1beta
type: worker
tags:
  - Scanner
description: >-
  This worker reactively reads the queries data published to pulsar topic and
  ingest it into metis.
owner: metis
workspace: system
worker:
  title: Query Usage Indexer
  tags:
    - Scanner
  replicas: 1
  autoScaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 2
    targetMemoryUtilizationPercentage: 120
    targetCPUUtilizationPercentage: 120
  stack: scanner:2.0
  stackSpec:
    type: worker
    worker: query_indexer
  logLevel: INFO
  compute: runnable-default
  runAsUser: metis
  resources:
    requests:
      cpu: 500m
      memory: 1024Mi
    limits: {}
  runAsApiKey: '****************************************************************************'

Common Errors

Common Scanner Errors