Mask Data After Moving from Database to Icebase¶
This article explains the steps to access sales transaction data stored in PostgreSQL, a relational database available in DataOS storage. You create a Beacon Service that exposes this database through a REST API, and you can enforce DataOS governance policies for secure access to the data in PostgreSQL.
Connect with Sample Data¶
In this tutorial, you learn how to create a database in PostgreSQL, access and update it through a Beacon Service, and ingest the sample data into it:
- Create SQL files
- Create Database Resource in DataOS
- Create Beacon Service
- Sample Sales Data Ingestion
Create SQL Files¶
These files create the table in the PostgreSQL database:
- Create a migration folder.
- Create up and down SQL files.
  a. In up.sql, write the code to create tables, alter tables, and create functions and triggers.
  b. In down.sql, write the code to drop, truncate, or delete tables, functions, and triggers.
- Test your SQL commands on a local PostgreSQL instance. The up file in this case is 0001_initial.up.sql.
Note: SQL files must be named in the .up.sql and .down.sql format. Name them in sequence when more than one up or down file has to be run against the same database for an update. Declare every column with a valid PostgreSQL data type.
CREATE TABLE IF NOT EXISTS sales_transactions
(
    ProductKey int,
    CustomerKey int,
    TerritoryKey int,
    OrderLineItem int,
    OrderQuantity int,
    FOREIGN KEY (ProductKey, CustomerKey) REFERENCES <parent_table> (ProductKey, CustomerKey) -- replace <parent_table> with the table this constraint points to
);
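A matching down file reverses the migration. The shell snippet below is a minimal sketch, assuming the file is named 0001_initial.down.sql to pair with the up file and lives in the Migration folder referenced later in database.yaml:

cat > ./Migration/0001_initial.down.sql <<'SQL'
-- Reverse the initial migration by removing the table
DROP TABLE IF EXISTS sales_transactions;
SQL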
Create Database Resource in DataOS¶
This step creates the database in PostgreSQL:
- Open the source code editor and create a database.yaml file, providing the database name and the full path of the migration folder containing the SQL files.
version: v1beta1
name: postgresqldbdatabase # database name
type: database
description: Sample sales database created
tags:
  - sales
database:
  migrate:
    includes:
      - ./Migration # all up & down sql files.
    command: up # in case of drop table, write down.
- After creating the database YAML, run the apply command on the DataOS CLI to create the Database Resource in DataOS, as shown below.
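For example (the target Workspace is assumed to be public here, matching the Beacon Service definition in the next section):

dataos-ctl apply -f database.yaml -w public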
Create Beacon Service YAML¶
This Service allows you to access the database and perform CRUD operations. While defining a Service, you provide inputs for properties such as replicas and ingress, which configure the incoming port for the Service so that DataOS Resources can be accessed from external links, along with the destination URL and path.
- Open the source code editor and create the following YAML file.
version: v1beta1
name: salesdata # service name
type: service
tags:
  - service
service:
  replicas: 2
  ingress:
    enabled: true
    stripPath: true
    path: /salesdata/api/v1 # how your URL is going to appear
    noAuthentication: true
  stack: beacon+rest
  envs:
    PGRST_OPENAPI_SERVER_PROXY_URI: https://<dataos-context>/salesdata/api/v1
    # environment URL + path: /salesdata/api/v1
  beacon:
    source:
      type: database
      name: postgresqldbdatabase
      workspace: public
  topology:
    - name: database
      type: input
      doc: Sales database connection
    - name: rest-api
      type: output
      doc: serves the database (containing sales data) as a RESTful API
      dependencies:
        - database
- Run the apply command on the DataOS CLI to create the Service Resource for the database.
- We can check the API through any browser or with Postman, as shown below.
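For example, assuming the Service YAML above is saved as service.yaml and applied to the public Workspace, the sales_transactions table should then be reachable under the ingress path (the table endpoint is an assumption based on how the Beacon REST API exposes tables by name):

dataos-ctl apply -f service.yaml -w public
curl https://<dataos-context>/salesdata/api/v1/sales_transactions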
Sample Sales Data Ingestion¶
You need to create a Workflow to ingest the sales data, given in JSON format, into the Postgres database.
- Open the source code editor and create the following dataingestion.yaml file, providing the source to read from and the destination to write data to.
- Run dataingestion.yaml to ingest the sample data into Postgres.
version: v1beta1
name: sales-transactions
type: workflow
tags:
  - Connect
  - sales_data
description: The job ingests sales transactions data in JSON format into raw zone
workflow:
  title: sales_transactions_data
  dag:
    - name: sales
      title: connect sales transactions data
      spec:
        tags:
          - Connect_jobs
          - transactions_connect
        stack: flare:5.0
        envs:
          HERA_URL: "https://enough-kingfish.dataos.app/hera/api"
          DEPOT_SERVICE_URL: "https://enough-kingfish.dataos.app/ds/api/v2"
        stackSpec:
          driver:
            coreLimit: 2000m
            cores: 2
            memory: 2500m
          executor:
            coreLimit: 1500m
            cores: 2
            instances: 1
            memory: 2000m
          job:
            explain: true
            inputs: # JSON file
              - name: sales_tranactions
                dataset: dataos://thirdparty01:retail/Transactions_data?acl=rw
                format: JSON
            logLevel: ERROR
            outputs: # PostgreSQL database
              - name: output01
                depot: dataos://postgresqldbdatabase:public?acl=rw
                driver: org.postgresql.Driver
            steps:
              - sink:
                  - sequenceName: sales_transactions
                    datasetName: sales
                    outputName: output01
                    outputType: JDBC
                    description: This dataset gives you details of the sports sales transactions table and its corresponding attributes.
                    tags:
                      - sales transactions table data
                      - sportsdatasets
                    outputOptions:
                      saveMode: overwrite
                sequence:
                  - name: sales_transactions_0
                    sql: select * from sales_tranactions
                    functions:
                      - name: any_date
                        column: OrderDate
                        asColumn: Order_Date
                      - name: any_date
                        column: StockDate
                        asColumn: StockDate
                      - name: drop
                        columns:
                          - OrderDate
                      - name: set_type
                        columns:
                          ProductKey: int
                          CustomerKey: int
                          TerritoryKey: int
                          OrderLineItem: int
                          OrderQuantity: int
                  - name: sales_transactions
                    sql: >
                      select *,
                      row_number() over (order by OrderNumber) as primary_key
                      from sales_transactions_0
- Run the apply command on the DataOS CLI.
dataos-ctl apply -f /home/arike/Desktop/Everything_Folder/Beacon/ingestion.yaml
INFO[0000] apply...
INFO[0001] applying(public) sales-transactions:v1beta1:workflow...
v1beta1 has been replaced by versions v1 for type workflow
please update as v1beta1 will be deprecated in the next release, and auto-migrate will be disabled
INFO[0007] applying(public) sales-transactions:v1:workflow...created
INFO[0007] apply...complete
arike@arike-ThinkPad-E15-Gen-2:~$ dataos-ctl -t workflow -w public get
INFO[0000] get...
INFO[0000] get...complete
NAME | VERSION | TYPE | WORKSPACE | STATUS | RUNTIME | OWNER
---------------------|---------|----------|-----------|---------|---------|-------------
sales-transactions | v1 | workflow | public | pending | | arikerawat
arike@arike-ThinkPad-E15-Gen-2:~$ dataos-ctl -t workflow -w public -n sales-transactions get runtime -r
INFO[0000] workflow...
INFO[0001] workflow...complete
NAME | VERSION | TYPE | WORKSPACE | TITLE | OWNER
---------------------|---------|----------|-----------|-------------------------|-------------
sales-transactions | v1 | workflow | public | sales_transactions_data | arikerawat
JOB NAME | STACK | JOB TITLE | JOB DEPENDENCIES
-----------|------------|--------------------------------|-------------------
sales | flare:5.0 | connect sales transactions |
| | data |
system | dataos_cli | System Runnable Steps |
RUNTIME | PROGRESS | STARTED | FINISHED
----------|----------|---------------------------|-----------
running | 2/3 | 2022-09-06T12:59:44+05:30 |
NODE NAME | JOB NAME | POD NAME | DATA PLANE | TYPE | CONTAINERS | PHASE
---------------------------------|----------|------------------------------------|------------|--------------|-------------------------|------------
sales-execute | sales | sales-transactions-n0q5-3146761932 | hub | pod-workflow | main | running
sales-n0q5-0906072944-driver | sales | sales-n0q5-0906072944-driver | hub | pod-flare | spark-kubernetes-driver | running
sales-start-rnnbl | sales | sales-transactions-n0q5-1166669586 | hub | pod-workflow | wait,main | succeeded
sales-transactions-start-rnnbl | system | sales-transactions-n0q5-2168764226 | hub | pod-workflow | wait,main | succeeded
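Once the Workflow run succeeds, you can verify the ingested rows through the Beacon endpoint created earlier. A minimal sketch (the limit query parameter is an assumption about the REST API's paging support):

curl "https://<dataos-context>/salesdata/api/v1/sales_transactions?limit=10"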
Register Metadata¶
Run the following Scanner YAML to register the metadata for your data source with Metis. It will capture schema details such as table names, constraints, primary keys, etc., for this external data source. This will allow you to run queries on the data in DataOS.
- Open the source code editor and create the following YAML file.
version: v1beta1
name: scanner-postgres-db # Name of workflow
type: workflow # Type of YAML
tags:
  - scanner
  - postgres-db
description: Scans metadata of postgres database tables and registers them as datasets on Datanet
workflow:
  dag:
    - name: postgres-db-scanner # Name of the job shown on Runnables(Dag)
      description: The job scans metadata of postgres database tables and registers them as datasets
      spec:
        tags:
          - postgres-database
        stack: scanner:2.0 # Name of stack that will process the data
        stackSpec:
          depot: dataos://postgresqldbdatabase:public
- Use the apply command to create the Resource in DataOS, as shown below.
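For example, assuming the Scanner Workflow above is saved as scanner-postgres-db.yaml and applied to the public Workspace:

dataos-ctl apply -f scanner-postgres-db.yaml -w public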
Create Access and Mask Policy¶
In this use case, we create access and mask policies. To apply these policies, you have to revoke the default access to the dataset, allow only users with a specific tag to access the data, and mask the sensitive data.
Note: You need 'Operator'-level permissions to create a Policy.
- Open the source code editor and create the following YAML files.
Access Policy
name: access-for-postgresqldb
version: v1
type: policy
description: Policy to access data from depot postgresqldbdatabase:public for user tag postgresqldb:user
layer: user
policy:
  access:
    subjects:
      tags:
        - dataos:u:postgresqldb:user
    predicates:
      - read
      - write
      - update
      - delete
      - create
    objects:
      paths:
        - dataos://postgresqldbdatabase:public/*
    allow: true
Mask Policy
name: mask-columns
version: v1
type: policy
layer: user
description: "data policy to hash pii columns by name"
owner:
policy:
  data:
    type: mask
    priority: 99
    depot: azuresqldb
    collection: dbo
    dataset: customers
    selector:
      user:
        match: any
        tags:
          - "dataos:u:user"
      column:
        names:
          - "birthdate"
          - "emailaddress"
    mask:
      operator: hash
      hash:
        algo: sha256
- Use the apply command to create the Policy Resources in DataOS, as shown below.
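A minimal sketch of applying both policies (the file names are assumed):

dataos-ctl apply -f access-policy.yaml
dataos-ctl apply -f mask-policy.yaml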