Mask Data After Moving from Database to Icebase¶
This article walks through the steps to access sales transaction data stored in PostgreSQL, a relational database within DataOS storage. You create a Beacon service that exposes this database through a PostgreSQL REST API, and you can also enforce DataOS governance policies for secure access to the data in PostgreSQL.
Connect with Sample Data¶
In this tutorial, you learn how to create a database in PostgreSQL, expose it through a Beacon service, and ingest the sample data into it:
- Create SQL files
- Create Database Resource in DataOS
- Create Beacon Service
- Sample Sales Data Ingestion
Create SQL Files¶
This is to create a table in the PostgreSQL database:

- Create a migration folder.
- Create up and down SQL files.
    a. In `up.sql`, write the code to create tables, alter tables, and create functions & triggers.
    b. In `down.sql`, write the code to drop, truncate, and delete tables, functions & triggers.
- Test your SQL commands on a local PostgreSQL instance (`0001_initial.up.sql` in this case).

Note: SQL files must be named in the `<name>.up.sql` & `<name>.down.sql` format; number them in sequence when more than one up or down file has to run against the same database for an update. For example, `0001_initial.down.sql` would typically hold the reverse of the up script, such as `DROP TABLE IF EXISTS sales_transactions;`. Declare every column with a PostgreSQL data type.
```sql
CREATE TABLE IF NOT EXISTS sales_transactions
(
    ProductKey    int,
    CustomerKey   int,
    TerritoryKey  int,
    OrderLineItem int,
    OrderQuantity int
    -- Foreign keys on ProductKey and CustomerKey can be added with REFERENCES
    -- clauses once the referenced dimension tables exist in this database.
);
```
Create Database Resource in DataOS¶
This is to create a database in PostgreSQL:
- Open the source code editor and create a `database.yaml` file providing the database name and the full path of the migration folder containing the SQL files (see the sketch after this list).
- After creating the database YAML, run the `apply` command on the DataOS CLI to create the Database resource in DataOS.
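The following is a minimal sketch of such a `database.yaml`, assuming the Database resource is named `postgresqldbdatabase` (the name the Beacon service and the ingestion workflow point at later) and that the migration folder sits next to the manifest; treat the field names as indicative of the Database resource shape rather than a definitive manifest for your DataOS version.

```yaml
version: v1
name: postgresqldbdatabase        # database name; the Beacon service below refers to this resource
type: database
description: Sales database holding the sales_transactions table
tags:
  - database
database:
  migrate:
    includes:
      - ./migration               # folder containing 0001_initial.up.sql / 0001_initial.down.sql
    command: up                   # run the up migrations; use "down" to roll them back
```

It is applied with the same `dataos-ctl apply` command used for the other resources in this article.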
Create Beacon Service YAML¶
This service allows you to access the database and perform CRUD operations on it. While defining the service, you provide inputs for properties such as replicas and ingress (to configure the incoming port for the service and allow access to DataOS resources from external links), the destination URL and path, and so on.
- Open the source code editor and create the following YAML file.

```yaml
version: v1beta1
name: salesdata                           # service name
type: service
tags:
  - service
service:
  replicas: 2
  ingress:
    enabled: true
    stripPath: true
    path: /salesdata/api/v1               # how your URL is going to appear
    noAuthentication: true
  stack: beacon+rest
  envs:
    PGRST_OPENAPI_SERVER_PROXY_URI: https://<dataos-context>/salesdata/api/v1   # environment URL + ingress path
  beacon:
    source:
      type: database
      name: postgresqldbdatabase
      workspace: public
    topology:
      - name: database
        type: input
        doc: Sales database connection
      - name: rest-api
        type: output
        doc: serves the database (containing sales data) as a RESTful API
        dependencies:
          - database
```
- Run the `apply` command on the DataOS CLI to create the Service resource for the database.
- You can check the API through any browser or with Postman; since Beacon serves the database in PostgREST style, the `sales_transactions` table would typically be reachable at `https://<dataos-context>/salesdata/api/v1/sales_transactions`.
Sample Sales Data Ingestion¶
You need to create a workflow to ingest sales data given in JSON format to the Postgres database.
- Open the source code editor and create the following `dataingestion.yaml` file, providing the source to read from and the destination to write the data to.
- Run the `dataingestion.yaml` to ingest the sample data into Postgres.

```yaml
version: v1beta1
name: sales-transactions
type: workflow
tags:
  - Connect
  - sales_data
description: The job ingests sales transactions data in JSON format into raw zone
workflow:
  title: sales_transactions_data
  dag:
    - name: sales
      title: connect sales transactions data
      spec:
        tags:
          - Connect_jobs
          - transactions_connect
        stack: flare:6.0
        envs:
          HERA_URL: "https://enough-kingfish.dataos.app/hera/api"
          DEPOT_SERVICE_URL: "https://enough-kingfish.dataos.app/ds/api/v2"
        stackSpec:
          driver:
            coreLimit: 2000m
            cores: 2
            memory: 2500m
          executor:
            coreLimit: 1500m
            cores: 2
            instances: 1
            memory: 2000m
          job:
            explain: true
            inputs:                                   # JSON file
              - name: sales_tranactions
                dataset: dataos://thirdparty01:retail/Transactions_data?acl=rw
                format: JSON
            logLevel: ERROR
            outputs:                                  # PostgreSQL database
              - name: output01
                depot: dataos://postgresqldbdatabase:public?acl=rw
                driver: org.postgresql.Driver
            steps:
              - sink:
                  - sequenceName: sales_transactions
                    datasetName: sales
                    outputName: output01
                    outputType: JDBC
                    description: This dataset gives you details of all sports sales transactions table and their corresponding attributes.
                    tags:
                      - sales transactions table data
                      - sportsdatasets
                    outputOptions:
                      saveMode: overwrite
                sequence:
                  - name: sales_transactions_0
                    sql: select * from sales_tranactions
                    functions:
                      - name: any_date
                        column: OrderDate
                        asColumn: Order_Date
                      - name: any_date
                        column: StockDate
                        asColumn: StockDate
                      - name: drop
                        columns:
                          - OrderDate
                      - name: set_type
                        columns:
                          ProductKey: int
                          CustomerKey: int
                          TerritoryKey: int
                          OrderLineItem: int
                          OrderQuantity: int
                  - name: sales_transactions
                    sql: >
                      select *, row_number() over (order by OrderNumber) as primary_key
                      from sales_transactions_0
```
- Run the `apply` command on the DataOS CLI.

```bash
dataos-ctl apply -f /home/arike/Desktop/Everything_Folder/Beacon/ingestion.yaml
INFO[0000] apply...
INFO[0001] applying(public) sales-transactions:v1beta1:workflow...
v1beta1 has been replaced by versions v1 for type workflow
please update as v1beta1 will be deprecated in the next release, and auto-migrate will be disabled
INFO[0007] applying(public) sales-transactions:v1:workflow...created
INFO[0007] apply...complete

arike@arike-ThinkPad-E15-Gen-2:~$ dataos-ctl -t workflow -w public get
INFO[0000] get...
INFO[0000] get...complete

        NAME         | VERSION |   TYPE   | WORKSPACE | STATUS  | RUNTIME |    OWNER
---------------------|---------|----------|-----------|---------|---------|-------------
  sales-transactions | v1      | workflow | public    | pending |         | arikerawat

arike@arike-ThinkPad-E15-Gen-2:~$ dataos-ctl -t workflow -w public -n sales-transactions get runtime -r
INFO[0000] workflow...
INFO[0001] workflow...complete

        NAME         | VERSION |   TYPE   | WORKSPACE |          TITLE          |    OWNER
---------------------|---------|----------|-----------|-------------------------|-------------
  sales-transactions | v1      | workflow | public    | sales_transactions_data | arikerawat

  JOB NAME |   STACK    |            JOB TITLE            | JOB DEPENDENCIES
-----------|------------|---------------------------------|-------------------
  sales    | flare:6.0  | connect sales transactions data |
  system   | dataos_cli | System Runnable Steps           |

  RUNTIME | PROGRESS |          STARTED          | FINISHED
----------|----------|---------------------------|-----------
  running | 2/3      | 2022-09-06T12:59:44+05:30 |

            NODE NAME            | JOB NAME |              POD NAME              | DATA PLANE |     TYPE     |       CONTAINERS        |   PHASE
---------------------------------|----------|------------------------------------|------------|--------------|-------------------------|------------
  sales-execute                  | sales    | sales-transactions-n0q5-3146761932 | hub        | pod-workflow | main                    | running
  sales-n0q5-0906072944-driver   | sales    | sales-n0q5-0906072944-driver       | hub        | pod-flare    | spark-kubernetes-driver | running
  sales-start-rnnbl              | sales    | sales-transactions-n0q5-1166669586 | hub        | pod-workflow | wait,main               | succeeded
  sales-transactions-start-rnnbl | system   | sales-transactions-n0q5-2168764226 | hub        | pod-workflow | wait,main               | succeeded
```
Register Metadata¶
Run the following Scanner YAML to register the metadata for your data source with Metis. It will capture schema details such as table names, constraints, primary keys, etc., for this external data source. This will allow you to run queries on the data in DataOS.
- Open the source code editor and create the following YAML file.

```yaml
version: v1beta1
name: scanner-postgres-db               # Name of workflow
type: workflow                          # Type of YAML
tags:
  - scanner
  - postgres-db
description: Scans metadata of postgres database tables and registers them as datasets on Datanet
workflow:
  dag:
    - name: postgres-db-scanner         # Name of the job shown on Runnables (DAG)
      description: The job scans metadata of postgres database tables and registers them as datasets
      spec:
        tags:
          - postgres-database
        stack: scanner:2.0              # Name of stack that will process the data
        stackSpec:
          depot: dataos://postgresqldbdatabase:public
```
- Use the `apply` command to create the resource in DataOS.
Create Access and Mask Policy¶
We create both an access policy and a mask policy in this use case. To apply them, you revoke the default access to the dataset, allow only users carrying a specific tag to access the data, and mask the sensitive columns (a sketch of a revoke policy appears at the end of this section).
Note: You need "Operator"-level permissions to create a policy.
- Open the source code editor and create the following YAML files.

Access Policy

```yaml
name: access-for-postgresqldb
version: v1
type: policy
description: Policy to access data from depot postgresqldbdatabase:public for user tag postgresqldb:user
layer: user
policy:
  access:
    subjects:
      tags:
        - dataos:u:postgresqldb:user
    predicates:
      - read
      - write
      - update
      - delete
      - create
    objects:
      paths:
        - dataos://postgresqldbdatabase:public/*
    allow: true
```
Mask Policy

```yaml
name: mask-columns
version: v1
type: policy
layer: user
description: "data policy to hash pii columns by name"
owner:
policy:
  data:
    type: mask
    priority: 99
    depot: azuresqldb
    collection: dbo
    dataset: customers
    selector:
      user:
        match: any
        tags:
          - "dataos:u:user"
      column:
        names:
          - "birthdate"
          - "emailaddress"
    mask:
      operator: hash
      hash:
        algo: sha256
```
- Use the `apply` command to create the Policy resources in DataOS.
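As mentioned above, the tag-based access policy is usually paired with a policy that first revokes the default access to the depot. The following is a minimal sketch of such a revoke policy; its name and the broad `dataos:u:user` subject tag are illustrative assumptions rather than part of the original walkthrough.

```yaml
# Illustrative sketch: denies the default access to the depot for all users
# (tag dataos:u:user) so that only the allow policy above grants access.
name: revoke-default-access-postgresqldb
version: v1
type: policy
layer: user
description: Revoke default access to depot postgresqldbdatabase:public for all users
policy:
  access:
    subjects:
      tags:
        - dataos:u:user
    predicates:
      - read
      - write
      - update
      - delete
      - create
    objects:
      paths:
        - dataos://postgresqldbdatabase:public/*
    allow: false
```

It is applied with the same `apply` command as the other policies.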