# Rewrite Dataset
A DataOS-managed depot (Lakehouse) built on top of the Iceberg format can compact data files in parallel using Flare's `rewrite_dataset` action. The action combines small files into larger files to reduce metadata overhead and runtime file-open costs.
> **Tip:** It is recommended to use Flare stack version 5.0 (`flare:5.0`) for the `rewrite_dataset` action job.
## Attribute configuration
| Attribute | Type | Description |
|---|---|---|
| `strategy` | string | Name of the compaction strategy. Supported values: `binpack` (default) or `sort`. |
| `sort_order` | string | Defines the sort order. For Z-ordering, use `zorder(col1,col2)`. For regular sorting, use `ColumnName SortDirection NullOrder` (e.g., `col1 ASC NULLS LAST`). Defaults to the table's current sort order. |
| `properties` | mapping | Configuration properties to be used during the compaction action. |
| `where` | string | Predicate string used to filter which files are eligible for rewriting. All files that may contain matching data will be considered. |
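All of these attributes are nested under the action's `options` section. A minimal sketch showing them together (the column names and values here are placeholders, not part of the example dataset):

```yaml
actions:
  - name: rewrite_dataset
    input: inputDf
    options:
      strategy: sort                        # binpack (default) or sort
      sort_order: col1 ASC NULLS LAST       # ColumnName SortDirection NullOrder
      where: 'col2 = "some_value"'          # only files that may match are rewritten
      properties:                           # compaction properties (see below)
        "target-file-size-bytes": "536870912"
```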
## Properties
| Property | Default Value | Description |
|---|---|---|
| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups that can be rewritten in parallel. |
| `partial-progress.enabled` | false | Enables committing file groups incrementally before the full rewrite is finished. Useful for partitions larger than available memory. |
| `partial-progress.max-commits` | 10 | Maximum number of commits allowed when partial progress is enabled. |
| `target-file-size-bytes` | 536870912 (512 MB) | Target output file size in bytes. |
| `rewrite-all` | false | Forces rewriting of all provided files, overriding other options. |
| `max-file-group-size-bytes` | 107374182400 (100 GB) | Largest amount of data that should be rewritten in a single file group. Helps split very large partitions into smaller, rewritable chunks to avoid cluster resource limits. |
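For example, to compact a partition larger than available memory, partial progress can be enabled so that file groups are committed incrementally. A minimal sketch (the values shown are illustrative, not recommendations):

```yaml
options:
  properties:
    "partial-progress.enabled": "true"         # commit file groups as they finish
    "partial-progress.max-commits": "20"       # allow up to 20 incremental commits
    "max-concurrent-file-group-rewrites": "10" # rewrite up to 10 file groups in parallel
```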
## Example Strategies
### Binpack strategy
The binpack strategy is Iceberg's default behavior when no explicit sorting strategy is provided. It simply combines files without any global or field-based sorting.

The following workflow rewrites the data files in the table `lakehouse:retail/pos_store_product_cust` using the default bin-packing algorithm to combine small files, and also splits large files according to the table's default write size.
```yaml
version: v1 # Version
name: rewrite # Name of the Workflow
type: workflow # Type of Resource (here, a Workflow)
tags: # Tags
  - Rewrite
workflow: # Workflow-specific section
  title: Compress iceberg data files # Title of the DAG
  dag: # DAG (Directed Acyclic Graph)
    - name: rewrite # Name of the Job
      title: Compress iceberg data files # Title of the Job
      spec: # Specs
        tags: # Tags
          - Rewrite
        stack: flare:5.0 # Stack version (Flare stack version 5.0)
        compute: runnable-default # Compute
        stackSpec: # Flare section
          job: # Job section
            explain: true # Explain
            logLevel: INFO # Log level
            inputs: # Inputs section
              - name: inputDf # Name of input dataset
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw # Dataset UDL
                format: Iceberg # Dataset format
            actions: # Flare action
              - name: rewrite_dataset # Name of the action
                input: inputDf # Input dataset name (strategy defaults to binpack)
```
### Binpack strategy with `target-file-size-bytes`
The `target-file-size-bytes` property is typically used in conjunction with the binpack strategy in Iceberg. It defines the desired size of the output files produced by a compaction or rewrite operation.

When you perform a binpack operation, the data files are merged, or "packed", together, aiming to achieve the target file size. No sorting of data across files is performed; smaller files are simply combined into larger ones while aiming to hit the size defined by `target-file-size-bytes`.

The following code snippet compacts the Iceberg data files of the input dataset `inputDf`, stored in a DataOS Depot, down to the target size (in bytes) set by the `target-file-size-bytes` property.
```yaml
version: v1 # Version
name: rewrite # Name of the Workflow
type: workflow # Type of Resource (here, a Workflow)
tags: # Tags
  - Rewrite
workflow: # Workflow-specific section
  title: Compress iceberg data files # Title of the DAG
  dag: # DAG (Directed Acyclic Graph)
    - name: rewrite # Name of the Job
      title: Compress iceberg data files # Title of the Job
      spec: # Specs
        tags: # Tags
          - Rewrite
        stack: flare:5.0 # Stack version (Flare stack version 5.0)
        compute: runnable-default # Compute
        stackSpec: # Flare section
          job: # Job section
            explain: true # Explain
            logLevel: INFO # Log level
            inputs: # Inputs section
              - name: inputDf # Name of input dataset
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw # Dataset UDL
                format: Iceberg # Dataset format
            actions: # Flare action
              - name: rewrite_dataset # Name of the action
                input: inputDf # Input dataset name
                options: # Options
                  properties: # Properties
                    "target-file-size-bytes": "2500048" # Target file size in bytes
```
### Binpack strategy with `where` filter condition
Since a `where` clause is specified (`id = 3 and name = "foo"`), only the files that may contain data matching the filter condition are selected, and only those files are rewritten.
```yaml
version: v1 # Version
name: rewrite # Name of the Workflow
type: workflow # Type of Resource (here, a Workflow)
tags: # Tags
  - Rewrite
workflow: # Workflow-specific section
  title: Compress iceberg data files # Title of the DAG
  dag: # DAG (Directed Acyclic Graph)
    - name: rewrite # Name of the Job
      title: Compress iceberg data files # Title of the Job
      spec: # Specs
        tags: # Tags
          - Rewrite
        stack: flare:5.0 # Stack version (Flare stack version 5.0)
        compute: runnable-default # Compute
        stackSpec: # Flare section
          job: # Job section
            explain: true # Explain
            logLevel: INFO # Log level
            inputs: # Inputs section
              - name: inputDf # Name of input dataset
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw # Dataset UDL
                format: Iceberg # Dataset format
            actions: # Flare action
              - name: rewrite_dataset # Name of the action
                input: inputDf # Input dataset name (strategy defaults to binpack)
                options: # Options
                  where: 'id = 3 and name = "foo"' # Predicate filter
```
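The `where` value is an SQL-style predicate, so range filters can be expressed as well as equality filters. A sketch with hypothetical column names (not part of the example dataset):

```yaml
options:
  # Hypothetical date-range predicate: rewrite only files that may contain January 2023 data
  where: 'order_date >= "2023-01-01" and order_date < "2023-02-01"'
```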
### Regular sorting
The following workflow rewrites the data files in the table `lakehouse:retail/pos_store_product_cust` by sorting the data on `id`, using the same defaults as bin-pack to determine which files to rewrite.

Provide the following information in the right order (`ColumnName SortDirection NullOrder`):
- **ColumnName**: The column to sort by.
- **SortDirection**: Either `ASC` (ascending) or `DESC` (descending).
- **NullOrder**: Specifies how to handle null values:
    - `NULLS FIRST`: Places all rows with NULL values at the beginning of the sorted data.
    - `NULLS LAST`: Places all rows with NULL values at the end of the sorted data.
```yaml
version: v1 # Version
name: rewrite # Name of the Workflow
type: workflow # Type of Resource (here, a Workflow)
tags: # Tags
  - Rewrite
workflow: # Workflow-specific section
  title: Compress iceberg data files # Title of the DAG
  dag: # DAG (Directed Acyclic Graph)
    - name: rewrite # Name of the Job
      title: Compress iceberg data files # Title of the Job
      spec: # Specs
        tags: # Tags
          - Rewrite
        stack: flare:5.0 # Stack version (Flare stack version 5.0)
        compute: runnable-default # Compute
        stackSpec: # Flare section
          job: # Job section
            explain: true # Explain
            logLevel: INFO # Log level
            inputs: # Inputs section
              - name: inputDf # Name of input dataset
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw # Dataset UDL
                format: Iceberg # Dataset format
            actions: # Flare action
              - name: rewrite_dataset # Name of the action
                input: inputDf # Input dataset name
                options: # Options
                  strategy: sort # Sorting strategy
                  sort_order: id ASC NULLS FIRST # ColumnName SortDirection NullOrder
```
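Multiple sort columns can be supplied as comma-separated `ColumnName SortDirection NullOrder` triplets, as in Iceberg's sort-order expressions. A sketch that sorts on both `id` and `name`:

```yaml
options:
  strategy: sort
  # Comma-separated triplets: sort by id first, then by name within equal ids
  sort_order: id ASC NULLS FIRST,name DESC NULLS LAST
```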
### Z-Order sorting

Z-ordering sorts data based on multiple columns (e.g., `product_type` and `store_location`) to optimize query performance by clustering related data together, which reduces the number of files scanned during queries.

In a retail dataset (`lakehouse:retail/pos_store_product_cust`), Z-ordering by `product_type` and `store_location` makes it efficient to query sales data for specific combinations of these fields.
```yaml
version: v1 # Version
name: rewrite # Name of the Workflow
type: workflow # Type of Resource (here, a Workflow)
tags: # Tags
  - Rewrite
workflow: # Workflow-specific section
  title: Compress iceberg data files # Title of the DAG
  dag: # DAG (Directed Acyclic Graph)
    - name: rewrite # Name of the Job
      title: Compress iceberg data files # Title of the Job
      spec: # Specs
        tags: # Tags
          - Rewrite
        stack: flare:5.0 # Stack version (Flare stack version 5.0)
        compute: runnable-default # Compute
        stackSpec: # Flare section
          job: # Job section
            explain: true # Explain
            logLevel: INFO # Log level
            inputs: # Inputs section
              - name: inputDf # Name of input dataset
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw # Dataset UDL
                format: Iceberg # Dataset format
            actions: # Flare action
              - name: rewrite_dataset # Name of the action
                input: inputDf # Input dataset name
                options: # Options
                  strategy: sort # Sorting strategy
                  sort_order: zorder(product_type,store_location) # Z-order on multiple columns
```
## Pros and Cons of using different strategies
| Strategy | What it does | Pros | Cons |
|---|---|---|---|
| Binpack | Combines files only; no global sorting (local sorting happens within tasks). | The fastest compaction jobs. | Data is not clustered. |
| Sort | Sorts by one or more fields sequentially prior to allocating tasks (e.g., sort by field a, then within that, by field b). | Data clustered by frequently queried fields can lead to much faster read times. | Longer compaction jobs compared to binpack. |
| Z-order | Sorts by multiple fields that are equally weighted, prior to allocating tasks (X and Y values in one range form one grouping; those in another range form another). | If queries often rely on filters on multiple fields, this can improve read times even further. | Longer-running compaction jobs compared to binpack. |