Column Tagging¶
In Flare 4.0, a new feature has been introduced to enable users to tag columns in an output table from the Flare Workflow YAMLs. This new functionality provides immediate governance over tagged columns during dataset writing. To facilitate this, a new columnTags
property has been added to the outputs
section of the Flare YAML.
Three mutually exclusive ways have been defined to identify a column, which includes specifying either columnRegex
, columnName
, or columnNames
within the columnTags
property.
Different Ways to Identify the Column¶
The following section provides a comprehensive overview of these three ways:
columnRegex
¶
The columnRegex
property allows the specification of a regular expression pattern to identify columns that require tagging. The pattern will initiate matching from the beginning of the string.
columnName
¶
The columnName
property is utilized to designate a particular column.
columnTags:
- columnName: "zip_code"
tags:
- PII.Gender
- PHI.DateOfBirth
- columnName: "state_name"
tags:
- PII.Sensitive
- PII.Name
- PII.Name
columnNames
¶
The columnNames
property is utilized when identical tags need to be applied to columns that match the specified name.
columnTags:
- columnNames: # Column Name
- "state_code"
- "zip_code"
- "county_name"
- "state_name"
- "city_id"
- "city_name"
- "version"
tags: # Tags
- PII.None
Code Snippet¶
The code snippet to define column identifiers and tags is provided below
version: v1 # Version
name: columnlevel-tag-workflow # Name of the Workflow
type: workflow # Type of Resource (Here its workflow)
# tags:
# - Platinum
# - PII.Age
# - Tier.Gold
# - XYZ.Workflow
title: Column Level Tagging # Title
description: |
The purpose of this workflow is to provide tags at the column level using Flare.
workflow: # Workflow Section
dag: # Directed Acyclic Graph (DAG)
- name: columnlevel-tag-job # Name of the Job
title: Column Level Tagging Job # Title of the Job
description: |
The purpose of this job is to tag columns at column level.
spec: # Specifications
# tags:
# - System
# - PII.DateOfBirth
# - Tier.Bronze
# - XYZ.Job
stack: flare:5.0 # Stack Version (Here its 4.0)
compute: runnable-default # Compute
stackSpec: # Flare Section
job: # Job Section
explain: true # Explain
logLevel: INFO # Loglevel
showPreviewLines: 2 # Show Preview Lines
inputs: # Inputs Section
- name: a_city_csv # Input Dataset Name
dataset: dataos://thirdparty01:none/city # UDL of the Dataset
format: csv # Input Dataset Format
schemaPath: dataos://thirdparty01:none/schemas/avsc/city.avsc # Schema Path
outputs: # Outputs Section
- name: finalDf # Output Dataset Name
dataset: dataos://icebase:publiccollection/customer_406?acl=rw # Output Dataset UDL
options: # Options
iceberg: # Iceberg Specific Properties Section
partitionSpec:
- column: version
type: identity
properties:
write.format.default: parquet
write.metadata.compression-codec: gzip
sort: # Sort
columns:
- name: version
order: desc
mode: partition
format: iceberg # Format of Output Dataset
# tags:
# - Tier.System
# - Tier.Bronze
# - PII.DateOfBirth
# - XYZ.Dataset
# - Rubber
# - Pencil
title: Column tags
description: Column tags
columnTags: # Column Tag Property
- columnRegex: "city_*" # Column Regular Expression to specify columns to be tagged.
tags: # Tags to be designated
- PII.Sensitive
- PII.Email
- Tier.Bronze
- customer
- columnName: "zip_code" # Name of the column to be tagged
tags: # Tags to be applied to the column
- PII.Gender
- PHI.DateOfBirth
- columnName: "state_name" # Name of the column to be tagged
tags: # Tags to be applied to the column
- PII.Sensitive
- PII.Name
- PII.Name
- columnNames: # Name of the columns to be tagged (multiple)
- "state_name"
- "version"
- "zip_code"
- "zip_code"
tags: # Tags to be applied to the column
- PII.Name
- Tier.Bronze
- Rubber
- Pencil
- PII.*
steps: # Steps Section
- sequence: # Sequence Section
- name: finalDf # Name of the transformation step
sql: SELECT *, date_format(now(), 'yyyyMMddHHmm') AS version FROM a_city_csv LIMIT 10 # sql
functions: # Flare Function
- name: drop
columns:
- "__metadata_dataos_run_mapper_id"