Data Product Developer Journey
Data Product Developers are responsible for creating and managing data products within DataOS. They design, build, and maintain the data infrastructure and pipelines, ensuring the data is accurate, reliable, and accessible.
A Data Product Developer is typically responsible for the following activities:
Understand Data Needs
- Gather and understand the business requirements and objectives for the data pipeline.
- Get the required access rights to build and run data movement workflows (see the check sketched after this list).
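For instance, you can inspect the tags assigned to your DataOS user with the CLI; this is a quick way to confirm whether the operator has granted you the roles needed to run Workflows. The exact tags required vary by deployment, so treat this as a starting point:

# List your user details, including the tags (access roles) assigned to you
dataos-ctl user get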
Understanding Data Assets
Data developers need to identify and evaluate the data sources to be integrated (databases, APIs, flat files, etc.) and determine each source's data formats, structures, and protocols. The DataOS Metis app can help you understand these details.
- Log in to DataOS and go to the Metis app.
- Go to the Assets tab. Enter a search string to find your data asset quickly.
- You can also apply filters to narrow down the results.
- Click on a dataset to see details such as its schema and columns.
- You can explore the data asset further by inspecting its data and writing queries; Metis lets you open the data asset directly in Workbench.
Performing Exploratory Data Analysis (EDA)
Exploratory Data Analysis is the initial examination of data that helps data product developers understand it and make informed decisions. Before diving into building pipelines or planning transformations, data exploration is crucial for understanding the dataset's features, data types, and distributions. Through queries, you can identify patterns, relationships, anomalies, and outliers within the data.
- Open the data asset in the Workbench app.
- Select a Cluster to run your queries.
- Select the catalog, schema, and table.
- Write and run queries (a few starter queries are sketched after this list).
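For example, a quick profile of the customer data used later in this journey might look like the following. The catalog, schema, table, and column names are illustrative; substitute the ones Metis shows for your own asset:

-- Row count and basic distribution checks (names are illustrative)
SELECT
  COUNT(*)                     AS row_count,
  COUNT(DISTINCT customer_key) AS distinct_customers,
  MIN(annual_income)           AS min_income,
  MAX(annual_income)           AS max_income
FROM icebase.sports.sample_customer;

-- Inspect the highest incomes to spot potential outliers
SELECT customer_key, annual_income
FROM icebase.sports.sample_customer
ORDER BY annual_income DESC
LIMIT 10;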
Workbench also provides a Studio feature. Whether you're a seasoned SQL pro or just getting started, Studio's intuitive interface will help you craft powerful SQL statements with ease.
Define Data Architecture
- Design the overall data architecture, including data flow, storage, and processing.
- Develop a detailed plan outlining the transformation steps and processes for building the data pipeline.
- Provision and configure the necessary infrastructure and compute resources (a sketch follows this list).
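As an illustration, query infrastructure in DataOS is provisioned as a Cluster resource. The manifest below is a minimal sketch only: it assumes the same version/name/type envelope as the Workflow shown later, and the field values (name, compute, replicas, resource sizes, depot address) are placeholders to confirm with your DataOS operator.

version: v1
name: eda-cluster                # illustrative name
type: cluster
description: Cluster for exploratory queries
cluster:
  compute: query-default         # compute resource granted by the operator
  minerva:
    replicas: 1
    resources:
      limits:
        cpu: 2000m
        memory: 4Gi
      requests:
        cpu: 1000m
        memory: 2Gi
    depots:
      - address: dataos://icebase:default   # depot the Cluster can query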
Build Pipeline for Data Movement
- Write a YAML manifest for the data processing Workflow as per the requirement. In the YAML, you specify input and output locations as depot addresses of the form dataos://[depot]:[collection]/[dataset]; contact the DataOS operator for this information.
- Specify the DataOS processing Stack you are using for your Workflow.
The following example YAML code demonstrates the various sections of the Workflow.
version: v1
name: wf-sports-test-customer # Workflow name
type: workflow
tags:
  - customer
description: Workflow to ingest sports_data customer csv
workflow:
  title: customer csv
  dag:
    - name: sports-test-customer
      title: sports_data Dag
      description: This job ingests customer csv from Azure Blob Storage into the icebase catalog
      tags:
        - customer
      stack: flare:5.0
      compute: runnable-default
      stackSpec:
        job:
          explain: true
          inputs:
            - name: sports_data_customer
              dataset: dataos://azureexternal01:sports_data/customers/
              format: CSV
              options:
                inferSchema: true
          logLevel: INFO
          steps:
            - sequence:
                - name: customer # Select all records from the input dataset
                  sql: >
                    SELECT *
                    FROM sports_data_customer
                  functions:
                    - name: cleanse_column_names
                    # - name: change_column_case
                    #   case: lower
                    - name: find_and_replace
                      column: annual_income
                      sedExpression: "s/[$]//g"
                    - name: find_and_replace
                      column: annual_income
                      sedExpression: "s/,//g"
                    - name: set_type
                      columns:
                        customer_key: int
                        annual_income: int
                        total_children: int
                    - name: any_date
                      column: birth_date
                      asColumn: birth_date
          outputs:
            - name: customer
              dataset: dataos://icebase:sports/sample_customer?acl=rw
              format: Iceberg
              title: sports_data
              description: This dataset contains customer csv from sports_data
              tags:
                - customer
              options:
                saveMode: overwrite
Run the Workflow
- Run the Workflow by using the apply command on the DataOS CLI (the commands are sketched after this list).
- Using DataOS CLI commands, you can check the runtime information of the Workflow.
- Once the run completes, you can check the output data asset on Metis.
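The commands typically look like the following; the manifest file name and the workspace (public here) are placeholders for your own setup:

# Deploy the Workflow manifest (file name is illustrative)
dataos-ctl apply -f wf-sports-test-customer.yaml

# Check the runtime status of the Workflow in the 'public' workspace
dataos-ctl get runtime -w public -t workflow -n wf-sports-test-customer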
Please note that this Workflow will be part of the data product bundle.
Refer to the comprehensive DataOS Resource-specific documentation on dataos.info.
Explore the rich set of DataOS Resources, which serve as the building blocks for data products and processing Stacks. Additionally, refer to the learning assets to build the proficiency required for the role of Data Product Developer.