How to build a Data Product?
A Data Product is built by Data Product Developers. This section covers the building phase of a Data Product, from connecting to the data source to defining SLOs. DataOS provides flexible ways to build a Data Product: you can create one from scratch or reuse an existing one as per your requirements. You can explore existing Data Products in the Data Product Hub, an interface that lists information about all existing Data Products.
Let’s see the steps for building a Data Product:
Pre-requisites
Before proceeding with Data Product creation, ensure that you have the necessary tags or use cases assigned. To know more about use cases, go to Bifrost.
Steps to create a Data Product
From the design phase, it is clear which DataOS Resources are required to build the Data Product: Instance Secret, Depot, Cluster, Scanner, Flare, Policy, SODA Checks, Monitor, Pager, and Bundle. Let's see how to create each one, step by step. As the Depot has already been created and the Scanner has already been run, we'll jump directly into the data transformation step using Flare.
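Since the Depot and Scanner are assumed to already be in place, you can optionally confirm them from the DataOS CLI before proceeding. A minimal check, assuming the standard dataos-ctl get command is available in your environment:
# list the Depots available in the DataOS instance
dataos-ctl get -t depot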
Create the Flare Job for data transformation
Flare is a Stack orchestrated by the Workflow that abstracts Apache Spark for large-scale data processing, including ingestion, transformation, enrichment, profiling, quality assessment, and syndication for both batch and streaming data.
Let’s see how you can utilize Flare for various transformations. We continue with the same Google Analytics example: ingesting the raw data as-is from S3, with the only transformation being the conversion of the date column to date_time, since it is initially in varchar format.
The code snippet below demonstrates a Workflow with a single Flare batch job that reads the input dataset from the S3 Depot, performs the transformation using the Flare Stack, and stores the output dataset in the Icebase Depot, followed by a Toolbox job that sets the Iceberg metadata version of the output dataset to latest.
version: v1
name: wf-ga-sessions-daily-raw
type: workflow
tags:
  - GA-Sessions-Daily-Data-Raw
description: This job ingests Google Analytics Sessions Daily Data from the S3 bucket into DataOS. This workflow will run every day and will keep appending the raw Google Analytics data on a daily basis.
workflow:
  title: GA Sessions Daily Data Raw
  dag:
    - name: dg-ga-sessions-daily-raw
      title: GA Sessions Daily Data Raw
      description: This job ingests GA Sessions Daily Data from the S3 bucket into DataOS
      spec:
        tags:
          - GA-Sessions-Daily-Data-Raw
        stack: flare:5.0
        compute: runnable-default
        stackSpec:
          driver:
            coreLimit: 6000m
            cores: 2
            memory: 8000m
          executor:
            coreLimit: 6000m
            cores: 3
            instances: 2
            memory: 10000m
          job:
            explain: true
            streaming:
              forEachBatchMode: true
              triggerMode: Once
              checkpointLocation: dataos://icebase:sys01/checkpoints/ga-sessions-daily-data-raw/ga-1001?acl=rw
            inputs:
              - name: ga_sessions_daily_data
                dataset: dataos://s3depot:none/ga_data/
                format: parquet
                isStream: true
            logLevel: INFO
            outputs:
              - name: ga_sessions_daily_data_raw_v
                dataset: dataos://icebase:google_analytics/ga_sessions_daily_data_raw?acl=rw
                format: Iceberg
                title: GA Sessions Daily Data
                description: Ingests GA Sessions Data daily
                options:
                  saveMode: append
                  sort:
                    mode: partition
                    columns:
                      - name: date_time
                        order: asc
                  iceberg:
                    properties:
                      write.format.default: parquet
                      write.metadata.compression-codec: gzip
                    partitionSpec:
                      - type: day
                        column: date_time
                        name: day
            steps:
              - sequence:
                  - name: ga_sessions_daily_data_raw_v
                    sql: >
                      SELECT *, cast(to_date(date, 'yyyyMMdd') as timestamp) as date_time
                      FROM ga_sessions_daily_data
          sparkConfig:
            - spark.sql.files.maxPartitionBytes: 10MB
            - spark.sql.shuffle.partitions: 400
            - spark.default.parallelism: 400
            - spark.dynamicAllocation.enabled: true
            - spark.sql.adaptive.enabled: true
            - spark.dynamicAllocation.executorIdleTimeout: 60s
            - spark.dynamicAllocation.initialExecutors: 1
            - spark.dynamicAllocation.minExecutors: 2
            - spark.dynamicAllocation.maxExecutors: 5
            - spark.dynamicAllocation.shuffleTracking.enabled: true
    - name: dt-ga-sessions-daily-raw
      spec:
        stack: toolbox
        compute: runnable-default
        stackSpec:
          dataset: dataos://icebase:google_analytics/ga_sessions_daily_data_raw?acl=rw
          action:
            name: set_version
            value: latest
      dependencies:
        - dg-ga-sessions-daily-raw
Note that the manifest file provided above is just an example of how you can create a Flare job. To know more about Flare, refer to this.
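The Workflow manifest above is applied through the DataOS CLI like any other Resource. A minimal sketch, assuming the manifest is saved locally as flare.yaml and that the target Workspace is public:
# apply the Workflow manifest to the public Workspace
dataos-ctl apply -f flare.yaml -w public
# check the status of the applied Workflow
dataos-ctl get -t workflow -w public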
Create the Monitor for observability
The Monitor Resource in DataOS's Observability System triggers incidents based on events or metrics. Together with the Pager Resource, it enables comprehensive observability and proactive incident management for high system reliability and performance.
Let’s see how you can set up the Monitor for Workflow status.
Create a manifest file for the Monitor as follows:
name: runtime-monitor
version: v1alpha
type: monitor
tags:
  - dataos:type:resource
  - dataos:layer:user
description: Attention! workflow run has failed.
layer: user
monitor:
  schedule: '*/10 * * * *'
  incident:
    name: workflowrunning
    severity: high
    incidentType: workflowruntime
  type: report_monitor
  report:
    source:
      dataOsInstance:
        path: /collated/api/v1/reports/resources/status?id=workflow:v1:wf-ga-sessions-daily-raw:public
    conditions:
      - valueComparison:
          observationType: status
          valueJqFilter: '.value'
          operator: equals
          value: failed
The above sample configuration demonstrates how to set up a Report Monitor that raises an incident whenever the Workflow fails.
To know more about Monitor, refer to this.
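Like other Workspace-level Resources, the Monitor is applied through the DataOS CLI. A minimal sketch, assuming the manifest above is saved as monitor.yaml and applied to the public Workspace:
# apply the Monitor manifest
dataos-ctl apply -f monitor.yaml -w public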
Create the Pager for managing incidents
A Pager in DataOS evaluates incident data against predefined conditions and triggers alerts to the specified destinations when a match is found. Together with the Monitor Resource, it forms the backbone of DataOS Observability and enables proactive alerting.
Let’s see how you can set up a Pager to receive alerts on MS Teams and email whenever the Monitor raises an incident for a failed Workflow run.
To create a Pager in DataOS, simply compose a manifest file for a Pager as shown below:
name: wf-failed-pager
version: v1alpha
type: pager
tags:
  - dataos:type:resource
  - workflow-failed-pager
description: This is for sending Alerts on the Microsoft Teams Channel and Outlook.
workspace: public
pager:
  conditions:
    - valueJqFilter: .properties.name
      operator: equals
      value: workflowrunning
  output:
    email:
      emailTargets:
        - iam.groot@tmdc.io
        - thisis.loki@tmdc.io
    msTeams:
      webHookUrl: https://rubikdatasolutions.webhook.com/webhookb2/e6b48e18-bdb1-4ffc-98d5-cf4a3819fdb4@2e22bdde-3ec2-43f5-bf92-78e9f35a44fb/IncomingWebhook/20413c453ac149db9
The above configuration demonstrates how to create a Pager for MS Teams and Outlook notifications.
To know more about Pager, refer to this.
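Before applying the Pager, you may want to confirm that the MS Teams webhook itself is reachable. A quick check outside DataOS, assuming a standard Teams incoming webhook (replace <webhook-url> with your actual webhook URL):
# send a test message to the Teams channel via the incoming webhook
curl -H "Content-Type: application/json" -d '{"text": "Test alert from DataOS Pager setup"}' <webhook-url>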
Create the Bundle for applying all the Resources
With the help of the Bundle Resource, users can deploy and manage multiple Resources in a single operation, organized as a flattened DAG with interconnected dependency nodes.
Let’s see how you can create the Bundle Resource to apply all the Resources based on their dependencies. To create a Bundle in DataOS, simply compose a manifest file for a Bundle as shown below:
# Resource meta section
name: dp-bundle
version: v1beta
type: bundle
layer: user
tags:
  - dataos:type:resource
description: this bundle resource is for a data product

# Bundle-specific section
bundle:
  # Bundle Workspaces section
  workspaces:
    - name: bundlespace1                        # Workspace name (mandatory)
      description: this is a bundle workspace   # Workspace description (optional)
      tags:                                     # Workspace tags (optional)
        - bundle
        - myworkspace
      labels:                                   # Workspace labels (optional)
        purpose: testing
      layer: user                               # Workspace layer (mandatory)

  # Bundle Resources section
  resources:
    - id: flare-job
      file: /home/office/data_products/flare.yaml
    - id: policy
      file: /home/office/data_products/policy.yaml
      dependencies:
        - flare-job
    - id: soda-checks
      file: /home/office/data_products/checks.yaml
      dependencies:
        - flare-job
    - id: monitor
      file: /home/office/data_products/monitor.yaml
      dependencies:
        - flare-job
    - id: pager
      file: /home/office/data_products/pager.yaml
      dependencies:
        - flare-job

  # Manage As User
  manageAsUser: iamgroot
Replace the placeholders with the actual values and apply the manifest using the following command on the DataOS Command Line Interface (CLI).
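A minimal sketch of the apply step, assuming the Bundle manifest above is saved locally as bundle.yaml:
# apply the Bundle manifest (Bundle is an instance-level Resource, so no Workspace flag is needed)
dataos-ctl apply -f bundle.yaml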
After executing the above command, all the Resources will be applied according to their dependencies.
To learn more about the Bundle, refer to this.
Create a Policy to secure the data
The Policy Resource in DataOS enforces a "never trust, always verify" ethos through Attribute-Based Access Control (ABAC), dynamically evaluating each access request and granting permissions only when they are explicitly requested and authorized at each action attempt.
Let’s see how you can create a Data Policy for data masking:
name: email-masking-policy
version: v1
type: policy
tags:
  - dataos:type:resource
description: data policy to mask user's email ID
owner: iamgroot
layer: user
policy:
  data:
    dataset: ga_sessions_daily_data_raw
    collection: google_analytics
    depot: icebase
    priority: 40
    type: mask
    mask:
      operator: hash
      hash:
        algo: sha256
    selector:
      column:
        tags:
          - PII.Email
      user:
        match: any
        tags:
          - roles:id:user
The above manifest file illustrates a mask data policy that hashes email ID columns tagged PII.Email for users having the tag roles:id:user. Apply it using the following command on the DataOS Command Line Interface (CLI).
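A minimal sketch of the apply step, assuming the Policy manifest above is saved locally as policy.yaml:
# apply the Policy manifest (Policy is an instance-level Resource)
dataos-ctl apply -f policy.yaml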
After executing the above command, the Policy will be applied.
To know more about Policy, refer to this.
Now the Data Product is ready. You can explore it in the Data Product Hub.