# Flare Job Case Scenarios in DataOS
## Batch Jobs
Batch jobs recompute all affected datasets during each run, ensuring full refresh and deterministic outcomes. They typically involve reading data from source depots, applying transformations, and writing to target depots.
For example Workflow manifests, see the batch job case scenario.
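The sketch below shows the general shape of such a Workflow: a single DAG node running the Flare stack, with inputs, SQL steps, and outputs. The depot addresses, dataset names, and stack version (`flare:6.0`) are placeholder assumptions rather than values from this documentation; the linked case scenario has the authoritative manifest.

```yaml
version: v1
name: wf-sample-batch
type: workflow
description: Minimal batch job sketch (placeholder addresses and stack version)
workflow:
  dag:
    - name: sample-batch-job
      spec:
        stack: flare:6.0                 # assumed Flare stack version
        compute: runnable-default
        stackSpec:
          job:
            inputs:
              - name: city_input
                dataset: dataos://thirdparty01:none/city    # placeholder source depot
                format: csv
            steps:
              - sequence:
                  - name: city_transformed
                    sql: SELECT *, current_timestamp() AS ts_city FROM city_input
            outputs:
              - name: city_transformed
                dataset: dataos://lakehouse:retail/city?acl=rw    # placeholder target depot
                format: Iceberg
```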
## Stream Jobs
Stream jobs enable near real-time processing by ingesting data in continuous micro-batches. These jobs are suitable for time-sensitive use cases such as event tracking, system monitoring, and IoT data analysis.
Detailed configuration is available in the streaming job case scenario.
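A stream job follows the same Workflow shape, with streaming-specific settings added to the job section. In the sketch below, the `streaming` block and its field names (`checkpointLocation`, `triggerDuration`), the `isStream` flag, and all depot addresses are assumptions made for illustration; refer to the streaming case scenario for the exact options.

```yaml
version: v1
name: wf-sample-stream
type: workflow
description: Minimal stream job sketch (streaming field names are assumptions)
workflow:
  dag:
    - name: sample-stream-job
      spec:
        stack: flare:6.0
        compute: runnable-default
        stackSpec:
          job:
            streaming:
              checkpointLocation: /tmp/checkpoints/sample-stream   # assumed checkpoint option
              triggerDuration: 10 seconds                          # assumed micro-batch trigger
            inputs:
              - name: events_input
                dataset: dataos://kafka01:default/events           # placeholder streaming source
                isStream: true
            steps:
              - sequence:
                  - name: events_clean
                    sql: SELECT * FROM events_input WHERE event_type IS NOT NULL
            outputs:
              - name: events_clean
                dataset: dataos://lakehouse:events/clean_events?acl=rw
                format: Iceberg
```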
## Incremental Jobs
Incremental jobs process only the rows or files that have changed since the last execution. This reduces compute cost and latency, making them ideal for frequently updated datasets.
Learn more in the incremental job case scenario.
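The sketch below illustrates the idea with an `incremental` block on the input: a templated SQL filter plus keys and state that carry the watermark between runs. The block's field names, the `$|...|` placeholder usage, and the dataset addresses are illustrative assumptions; the linked case scenario documents the real schema.

```yaml
version: v1
name: wf-sample-incremental
type: workflow
description: Minimal incremental job sketch (the incremental block is illustrative)
workflow:
  dag:
    - name: sample-incremental-job
      spec:
        stack: flare:6.0
        compute: runnable-default
        stackSpec:
          job:
            inputs:
              - name: orders_input
                dataset: dataos://lakehouse:retail/orders          # placeholder source
                format: Iceberg
                incremental:
                  context: orders_incr
                  sql: >
                    SELECT * FROM orders_incr
                    WHERE order_date > '$|start_date|' AND order_date <= '$|end_date|'
                  keys:
                    - name: start_date
                      sql: SELECT '2024-01-01'
                    - name: end_date
                      sql: SELECT date_add('2024-01-01', 1)
                  state:
                    - key: start_date
                      value: end_date        # next run starts where this one ended
            steps:
              - sequence:
                  - name: orders_delta
                    sql: SELECT * FROM orders_input
            outputs:
              - name: orders_delta
                dataset: dataos://lakehouse:retail/orders_delta?acl=rw
                format: Iceberg
```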
## Data Transformation Use Cases
Flare supports several advanced data transformation patterns:
- Perform versioned operations using Iceberg branch read/write.
- Rerun historical data pipelines with data replay.
- Enable parallel job writes using concurrent writes.
- Access data during execution with query dataset for job in progress.
- Apply conditional upserts using merge into functionality (a sketch of such an upsert follows this list).
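As an illustration of the last item, a conditional upsert can be expressed as an Iceberg `MERGE INTO` statement inside a SQL step. The fragment below slots into a Flare job spec; the table, step, and column names are placeholders, and the Flare-native merge configuration described in the linked scenario may use dedicated output options rather than raw SQL.

```yaml
# Fragment of a Flare job spec: a SQL step performing a conditional upsert.
# Table, source, and column names are placeholders.
steps:
  - sequence:
      - name: merge_customers
        sql: >
          MERGE INTO lakehouse.retail.customers AS target
          USING customer_updates AS source
          ON target.customer_id = source.customer_id
          WHEN MATCHED THEN UPDATE SET *
          WHEN NOT MATCHED THEN INSERT *
```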
## Job Performance and Optimization
Flare jobs can be tuned to enhance execution efficiency and reduce resource usage. Techniques include optimizing transformation logic, adjusting compute configurations, and minimizing I/O.
Refer to job optimization by tuning for implementation guidance.
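A common starting point is to size driver and executor resources and set a few Spark properties in the job manifest. In the fragment below, the section names (`driver`, `executor`, `sparkConf`) and the chosen values are assumptions for illustration, not recommended defaults; verify the exact fields and suitable values against the tuning guide.

```yaml
# Fragment of a Flare job spec: illustrative resource sizing and Spark settings.
# Section names and values are assumptions; tune them to the workload.
stackSpec:
  driver:
    cores: 2
    memory: 3072m
  executor:
    cores: 2
    instances: 2
    memory: 4096m
  job:
    sparkConf:
      - spark.sql.shuffle.partitions: "200"              # match partition count to data volume
      - spark.sql.autoBroadcastJoinThreshold: "50MB"     # broadcast small dimension tables
```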
## Metadata and Data Management
DataOS supports metadata management to improve discoverability, governance, and reusability:
- Tag columns for semantic classification using column tagging.
- Distribute datasets across environments with data syndication.
## Iceberg Table Optimization
Efficient handling of data and metadata in Iceberg tables is critical to maintaining performance at scale. Over time, small files and redundant metadata can degrade query efficiency.
### Compaction
- Reduce the number of small files to improve scan performance using data file compaction.
- Manage metadata overhead with manifest rewrite.
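Both actions above are typically declared as maintenance actions in a Flare job. The fragment below is a sketch under that assumption; the action names, option keys, dataset address, and target file size are illustrative and should be checked against the linked scenarios.

```yaml
# Fragment of a Flare job spec: Iceberg compaction and manifest maintenance.
# Action names, option keys, and the dataset address are placeholders.
stackSpec:
  job:
    actions:
      - name: rewrite_dataset                        # compact small data files
        input: dataos://lakehouse:retail/city
        options:
          properties:
            target-file-size-bytes: "536870912"      # ~512 MB target files
      - name: rewrite_manifest                       # consolidate metadata manifests
        input: dataos://lakehouse:retail/city
```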
### Partitioning
- Improve query efficiency through structured data organization with partitioning.
- Adapt the partition scheme as data and query patterns change, without rewriting existing data, using partition evolution.
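The fragment below sketches a partitioned Iceberg write: the output declares a partition spec with a time-based transform and an identity transform. The `partitionSpec` field name, its placement under `options`, and the column names are assumptions; the partitioning scenario documents the exact layout.

```yaml
# Fragment of a Flare job output: writing an Iceberg dataset partitioned by
# order date (daily) and country. Field names and addresses are placeholders.
outputs:
  - name: orders_partitioned
    dataset: dataos://lakehouse:retail/orders?acl=rw
    format: Iceberg
    options:
      partitionSpec:
        - type: day              # time-based partition transform
          column: order_date
          name: order_day
        - type: identity         # partition directly on column values
          column: country_code
```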
### Bucketing and Caching
- Reduce shuffle during joins and aggregations by distributing rows into a fixed number of buckets with bucketing.
- Speed up repeated reads of frequently accessed datasets with caching.
## Data Lifecycle and Maintenance
Maintaining Iceberg datasets involves regular cleanup and space optimization tasks. These actions are supported in DataOS-managed depots (Lakehouse only):
- Remove specific records using delete from dataset.
- Clean up unused metadata with expire snapshots.
- Reclaim storage by deleting untracked files via remove orphans.
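The fragment below sketches these three maintenance tasks as actions in a single Flare job against a Lakehouse-managed dataset. The action names, option keys (`expireOlderThan`, `olderThan`), the `deleteWhere` clause, and the dataset address are illustrative assumptions; consult the linked scenarios for the supported options.

```yaml
# Fragment of a Flare job spec: illustrative lifecycle actions for a
# Lakehouse-managed Iceberg dataset. Names, keys, and values are placeholders.
stackSpec:
  job:
    actions:
      - name: delete_from_dataset              # remove specific records
        input: dataos://lakehouse:retail/city
        deleteWhere: "city_id = 'obsolete-001'"
      - name: expire_snapshots                 # drop metadata for old snapshots
        input: dataos://lakehouse:retail/city
        options:
          expireOlderThan: "1735689600000"     # epoch-millis cutoff
      - name: remove_orphans                   # delete files no longer referenced
        input: dataos://lakehouse:retail/city
        options:
          olderThan: "1735689600000"
```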