Data Masking in Nilus
Data masking in Nilus enables the protection of sensitive information during data ingestion by replacing or transforming sensitive values with non-sensitive equivalents.
These transformations are applied in-flight, ensuring that source data remains unmodified while preserving its analytical and structural integrity. This process is essential for handling production-grade data in non-production, shared, or compliance-sensitive environments.
Common Use Cases
- Compliance: Satisfies regulatory requirements such as GDPR, CCPA, HIPAA, and PCI DSS by masking sensitive data elements.
- Security: Enables the use of production-grade data in testing or staging environments without exposing sensitive information.
- Privacy: Prevents the disclosure of personally identifiable information (PII) and other confidential data.
- Data Sharing: Facilitates secure access for external teams or partners by masking sensitive content prior to distribution.
Sample Manifest Configuration
To apply column-level masking, define a mask section under the source.options block in the Nilus configuration file, as shown below:
```yaml
source:
  address: postgres://user:pass@localhost/db
  options:
    source-table: "public.users"
    mask:
      email: hash
      phone: partial:3
      ssn: redact
      salary: round:5000
sink:
  address: duckdb://output.duckdb
  options:
    dest-table: "public.masked_users"
    incremental-strategy: append
```
Attribute Details
Each masking rule is made up of the attributes below; `column_name` and `algorithm` are required, and `parameter` is optional. As in the manifest above, a rule is written as `column_name: algorithm`, or as `column_name: algorithm:parameter` when the algorithm takes an argument:

| Field | Description |
|---|---|
| `column_name` | The name of the column to mask (required). |
| `algorithm` | The masking algorithm to apply (required). |
| `parameter` | Optional argument for algorithms that support configuration (e.g., `partial:2`, `round:1000`). |
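To make the rule format concrete, the following is a minimal Python sketch of how a rule value such as `partial:3` could be split into its algorithm and optional parameter. The `MaskRule` type and `parse_rule` helper are hypothetical names used for illustration only; they are not part of Nilus.

```python
from typing import NamedTuple, Optional


class MaskRule(NamedTuple):
    """Illustrative container for one masking rule (hypothetical, not a Nilus type)."""
    column: str
    algorithm: str
    parameter: Optional[str]


def parse_rule(column: str, spec: str) -> MaskRule:
    """Split a rule value such as 'partial:3' into algorithm and optional parameter."""
    # Split on the first colon only, so a parameter may itself contain colons.
    algorithm, sep, parameter = spec.partition(":")
    return MaskRule(column, algorithm, parameter if sep else None)


# Rule values taken from the sample manifest above.
rules = {"email": "hash", "phone": "partial:3", "salary": "round:5000"}
parsed = [parse_rule(column, spec) for column, spec in rules.items()]
```

Here `parse_rule("email", "hash")` yields a rule with no parameter, while `parse_rule("phone", "partial:3")` yields the algorithm `partial` with the parameter `"3"` kept as text for the algorithm to interpret.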
Masking Algorithms
Nilus supports a comprehensive set of masking algorithms, categorized by functional purpose. Each algorithm enables specific masking behavior to meet varying privacy, security, and compliance requirements.
1. Irreversible Masking
Permanently transforms data using one-way functions. Original values cannot be restored.
- `hash` / `sha256`: Generates a SHA-256 hash. Produces consistent output for identical input values.
- `md5`: Produces an MD5 hash. Offers faster processing but lower security than SHA-256.
- `hmac`: Applies an HMAC using a shared secret key. Enables consistent anonymization across systems.
- `redact`: Replaces the entire value with the constant string `"REDACTED"`.
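The behavior of these algorithms can be approximated with Python's standard `hashlib` and `hmac` modules. This is a sketch under stated assumptions, not Nilus's implementation: the hex output encoding, the UTF-8 input encoding, and the use of SHA-256 as the HMAC digest are all assumptions, and the function names are hypothetical.

```python
import hashlib
import hmac


def mask_sha256(value: str) -> str:
    """One-way SHA-256 digest; identical inputs always give identical output."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_md5(value: str) -> str:
    """One-way MD5 digest; shorter and faster, but cryptographically weak."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def mask_hmac(value: str, key: str) -> str:
    """Keyed digest; systems sharing the key produce the same token for a value."""
    # Assumption: SHA-256 as the underlying digest.
    return hmac.new(key.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_redact(value: str) -> str:
    """Discard the value entirely."""
    return "REDACTED"
```

Because the HMAC output depends on the key, two systems produce matching tokens only if they are configured with the same secret.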
2. Format-Preserving Masking
Preserves recognizable data structure or formatting while masking sensitive content.
- `email`: Masks characters before the “@”, retaining only the first and last.
- `phone`: Retains country and area codes; masks remaining digits.
- `credit_card`: Reveals only the last four digits.
- `ssn`: Displays only the last four digits of U.S. Social Security Numbers.
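The following Python sketch illustrates the `email` behavior and the "last four digits" behavior shared by `credit_card` and `ssn`. It is an approximation only: the `*` mask character, the handling of very short local parts, and the preservation of separators are assumptions, and the function names are hypothetical. The `phone` algorithm is omitted because identifying country and area codes depends on formatting rules the description does not specify.

```python
def mask_email(value: str, mask_char: str = "*") -> str:
    """Keep the first and last character before the '@'; mask the rest."""
    local, sep, domain = value.partition("@")
    if len(local) <= 2:
        # Assumption: local parts too short to keep both ends are fully masked.
        masked = mask_char * len(local)
    else:
        masked = local[0] + mask_char * (len(local) - 2) + local[-1]
    return masked + sep + domain


def mask_last_four(value: str, mask_char: str = "*") -> str:
    """Mask every digit except the last four, leaving separators in place."""
    digits_to_mask = max(sum(ch.isdigit() for ch in value) - 4, 0)
    out = []
    for ch in value:
        if ch.isdigit() and digits_to_mask > 0:
            out.append(mask_char)
            digits_to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


print(mask_email("jane.doe@example.com"))      # j******e@example.com
print(mask_last_four("4111-1111-1111-1234"))   # ****-****-****-1234
print(mask_last_four("123-45-6789"))           # ***-**-6789
```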
3. Partial Masking
Exposes limited portions of a value, with the remainder masked.
- `partial`: Reveals the first and last N characters, masking the middle.
- `first_letter`: Retains only the first character, masking the rest.
- `stars`: Replaces all characters with asterisks of equal length.
- `fixed`: Replaces the value with a constant placeholder.
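As a rough illustration of the partial-masking algorithms, consider the Python sketch below. The mask character for `partial` and `first_letter`, and the choice to fully mask values too short to keep N characters at each end, are assumptions; the function names are hypothetical.

```python
def mask_partial(value: str, n: int, mask_char: str = "*") -> str:
    """Reveal the first and last n characters and mask the middle."""
    if n <= 0 or len(value) <= 2 * n:
        # Assumption: nothing is revealed when the value is too short.
        return mask_char * len(value)
    return value[:n] + mask_char * (len(value) - 2 * n) + value[-n:]


def mask_first_letter(value: str, mask_char: str = "*") -> str:
    """Keep only the first character."""
    return value[:1] + mask_char * max(len(value) - 1, 0)


def mask_stars(value: str) -> str:
    """Replace every character with an asterisk, preserving length."""
    return "*" * len(value)


def mask_fixed(value: str, placeholder: str) -> str:
    """Replace the value with a constant placeholder."""
    return placeholder


print(mask_partial("Washington", 2))   # Wa******on
print(mask_first_letter("Alice"))      # A****
print(mask_stars("hunter2"))           # *******
```

Note that `stars` preserves the length of the original value, which can itself leak information for fields such as passwords; `fixed` avoids this by returning the same placeholder for every input.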
4. Tokenization
Substitutes sensitive values with generated identifiers to maintain uniqueness or referential integrity.
- `uuid`: Replaces values with deterministic UUIDs.
- `sequential`: Assigns incremental numeric IDs, starting from 1.
- `random`: Replaces values with randomly generated values of the same type.
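One way to picture deterministic and sequential tokenization is the Python sketch below. It rests on assumptions that the description above does not confirm: that deterministic UUIDs are name-based (version 5) under a fixed namespace, and that `sequential` gives a repeated value the same ID. The namespace, function, and class names are hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
TOKEN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "masking.example")


def mask_uuid(value: str) -> str:
    """Name-based UUID: the same input always maps to the same UUID."""
    return str(uuid.uuid5(TOKEN_NAMESPACE, value))


class SequentialTokenizer:
    """Assign incremental integer IDs starting from 1."""

    def __init__(self) -> None:
        self._ids = {}

    def mask(self, value: str) -> int:
        # Assumption: a value seen before keeps the ID it was first given.
        if value not in self._ids:
            self._ids[value] = len(self._ids) + 1
        return self._ids[value]


tokenizer = SequentialTokenizer()
ids = [tokenizer.mask(v) for v in ["acct-9", "acct-4", "acct-9"]]  # [1, 2, 1]
```

Deterministic tokens allow joins across masked tables, whereas the `random` algorithm breaks that link because each replacement is independent of the input.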
5. Numeric Masking
Modifies numeric data while preserving approximate magnitude or distribution.
Warning: These algorithms require explicit type definitions using the `type-hints` configuration.
- `round`: Rounds numeric values to the nearest specified multiple. Requires the destination column to be declared as `TEXT` in `type-hints`.
- `range`: Maps numeric values to defined buckets (e.g., income bands). Requires the destination column to be declared as `TEXT` in `type-hints`.
- `noise`: Applies a random variation within a defined percentage range. Requires the destination column to be declared as `DOUBLE` in `type-hints`.
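The numeric algorithms can be sketched in Python as follows. The text return type for `round` and `range` mirrors the `TEXT` requirement above, and the float return type for `noise` mirrors `DOUBLE`. Everything else is an assumption made for illustration: the half-up rounding rule, the `lower-upper` bucket label format, the interpretation of the `noise` parameter as a fraction (as in `noise:0.1`), and the function names.

```python
import math
import random


def mask_round(value: float, multiple: int) -> str:
    """Round to the nearest multiple (halves round up) and return text."""
    return str(math.floor(value / multiple + 0.5) * multiple)


def mask_range(value: float, width: int) -> str:
    """Map a value to a fixed-width bucket label such as '40000-49999'."""
    lower = int(value // width) * width
    return f"{lower}-{lower + width - 1}"


def mask_noise(value: float, fraction: float, rng: random.Random) -> float:
    """Scale the value by a random factor within +/- fraction."""
    return value * (1 + rng.uniform(-fraction, fraction))


print(mask_round(52400, 5000))    # 50000
print(mask_range(43250, 10000))   # 40000-49999
noisy = mask_noise(1000.0, 0.1, random.Random())  # somewhere in [900, 1100]
```

Rounding and bucketing are deterministic, so equal inputs remain equal after masking; noise is not, so the same source value may differ between runs unless the random generator is seeded.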
6. Date Masking
Transforms date or datetime values while preserving logical time intervals.
Warning: Date masking algorithms require explicit type definitions for columns, specified as `TEXT` in the `type-hints` configuration.
- `date_shift`: Randomly shifts dates by up to N days. Requires the destination column to be declared as `TEXT` in `type-hints`.
- `year_only`: Retains only the year component. Requires the destination column to be declared as `TEXT` in `type-hints`.
- `month_year`: Preserves the month and year components. Requires the destination column to be declared as `TEXT` in `type-hints`.
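A minimal Python sketch of the three date algorithms is shown below, returning text in line with the `TEXT` requirement. The output formats (`YYYY-MM-DD`, `YYYY`, `YYYY-MM`), the assumption that a shift may go forward or backward, and the function names are illustrative choices, not confirmed Nilus behavior.

```python
import random
from datetime import date, timedelta


def mask_date_shift(value: date, max_days: int, rng: random.Random) -> str:
    """Shift the date by a random offset of at most max_days in either direction."""
    offset = rng.randint(-max_days, max_days)
    return (value + timedelta(days=offset)).isoformat()


def mask_year_only(value: date) -> str:
    """Keep only the year."""
    return f"{value.year:04d}"


def mask_month_year(value: date) -> str:
    """Keep the year and month, dropping the day."""
    return f"{value.year:04d}-{value.month:02d}"


birth_date = date(1990, 7, 14)
shifted = mask_date_shift(birth_date, 30, random.Random())
print(mask_year_only(birth_date))    # 1990
print(mask_month_year(birth_date))   # 1990-07
```

If each value receives an independent random offset, intervals between two dates on the same row are preserved only approximately, within twice the maximum shift.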
Applying Masking to Sensitive Columns in an Iceberg Table on AWS-backed DataOS Lakehouse
Example Case Scenario
A Nilus pipeline is configured to apply column-level data masking to a source table stored in Apache Iceberg format within a DataOS Lakehouse environment backed by AWS. The goal is to protect sensitive information before writing it to a downstream sink while preserving analytical usability.
The table below lists the source table’s schema, which includes a variety of sensitive fields across data types:

| Data Type | Columns |
|---|---|
| `varchar` | `user_id`, `customer_id`, `session_id`, `first_name`, `last_name`, `email`, `phone`, `ssn`, `card_number`, `api_key`, `password`, `address`, `comments`, `account_number` |
| `bigint` | `age`, `income`, `salary`, `score` |
| `double` | `revenue`, `temperature` |
| `date` | `birth_date`, `registration_date`, `purchase_date` |
Sample Manifest Configuration
```yaml
source:
  address: dataos://testawslh
  options:
    source-table: "masking_test.aws_masking_data"
    type-hints:
      income: text
      salary: text
      score: text
      revenue: double
      temperature: double
      birth_date: text
      registration_date: text
      purchase_date: text
    mask:
      user_id: "hash"
      customer_id: "hmac:my-secret-key"
      session_id: "md5"
      email: "email"
      phone: "phone"
      ssn: "ssn"
      card_number: "credit_card"
      api_key: "fixed:MASKED_KEY"
      password: "stars"
      comments: "redact"
      first_name: "first_letter"
      last_name: "partial:2"
      address: "partial:4"
      account_number: "sequential"
      age: "round:10"
      income: "range:10000"
      salary: "round:5000"
      score: "range:100"
      revenue: "noise:0.1"
      temperature: "noise:0.05"
      birth_date: "date_shift:30"
      registration_date: "year_only"
      purchase_date: "month_year"
sink:
  address: dataos://mssqldepot5
  options:
    dest-table: "dbo.aws_masking_mssql9"
    incremental-strategy: append
```
This configuration ensures that:
- Sensitive fields such as `ssn`, `card_number`, and `password` are anonymized using irreversible or obfuscating algorithms.
- Format-preserving and partial masking are used where structural integrity is required (e.g., `email`, `address`).
- Numeric and date values are transformed using rounding, ranges, or temporal shifts to maintain analytical usability while preventing exposure.
Selecting a Masking Algorithm
The following table outlines recommended algorithms based on common data handling scenarios:
| Scenario | Recommended Masking Algorithms |
|---|---|
| PII Protection | hash, redact, email, phone, ssn |
| Development and Testing | uuid, random, partial |
| Analytical Workloads | round, range, date_shift, month_year |
| Regulatory Compliance | hash, redact, uuid, credit_card |
Performance Considerations
- Hash-based algorithms provide high performance with consistent output.
- Format-preserving masking incurs moderate CPU overhead due to pattern retention.
- Randomized masking introduces minimal processing latency.
- Nilus can apply multiple masking rules in a single processing pass.
Security Considerations
- Hashing supports one-way anonymization; however, common input values may be susceptible to reverse mapping.
- Partial masking may expose data patterns and is not recommended for highly sensitive fields.
- Date shifting retains relative intervals and must be used with caution to avoid inference risks.
- Consistent tokenization methods (e.g., `hash`, `uuid`) preserve referential integrity but may expose relational patterns across datasets.
- All masking strategies must be validated against organizational data protection and compliance policies.
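The reverse-mapping risk noted for hashing can be demonstrated with a short Python sketch. An attacker who can guess plausible inputs simply hashes each candidate and looks for a match; a keyed HMAC defeats this particular attack as long as the key stays secret. The candidate list, key, and helper names are hypothetical, and hex-encoded SHA-256 is assumed as the hash output.

```python
import hashlib
import hmac


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hmac_hex(value: str, key: str) -> str:
    return hmac.new(key.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()


# The attacker precomputes hashes for a small set of guessable inputs.
candidates = ["alice@example.com", "bob@example.com", "carol@example.com"]
lookup = {sha256_hex(c): c for c in candidates}

# A plain hash of a guessable value is recovered by a dictionary lookup.
masked_plain = sha256_hex("bob@example.com")
recovered_plain = lookup.get(masked_plain)   # "bob@example.com"

# The keyed digest does not appear in the table built without the key.
masked_keyed = hmac_hex("bob@example.com", "my-secret-key")
recovered_keyed = lookup.get(masked_keyed)   # None
```

Low-cardinality or predictable columns (phone numbers, SSNs, common email addresses) are the most exposed to this kind of lookup.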
Processing Behavior
- Masking is performed in-memory during ingestion, without altering the original source data.
- Processing overhead increases proportionally with dataset size and the number of columns subjected to masking.
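The in-flight behavior described above can be pictured with the following Python sketch, in which every configured column rule is applied while each row streams through once, and a new row is produced instead of modifying the source row. This is a conceptual illustration only; `mask_rows` and the row representation are hypothetical and do not reflect Nilus internals.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]
Masker = Callable[[object], object]


def mask_rows(rows: Iterable[Row], maskers: Dict[str, Masker]) -> Iterator[Row]:
    """Apply all column rules to each row in one pass over the data."""
    for row in rows:
        # Build a new dict so the source row is never mutated.
        yield {
            column: maskers[column](value) if column in maskers else value
            for column, value in row.items()
        }


source_rows = [{"name": "Alice", "ssn": "123-45-6789", "city": "Oslo"}]
maskers = {
    "ssn": lambda v: "REDACTED",
    "name": lambda v: str(v)[:1] + "*" * (len(str(v)) - 1),
}
masked_rows = list(mask_rows(source_rows, maskers))
```

The work per row grows with the number of masked columns, and the total work grows with the number of rows, which matches the proportional overhead noted above.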