Skip to content

Expire Snapshots

The expire_snapshots action expires amassed snapshots. Data files are not deleted until they are no longer referenced by a snapshot that may be used for time travel or rollback. Regularly expiring snapshots deletes unused data files.

Important considerations

  • Expiring old snapshots removes them from metadata, so they are no longer available for time travel queries.
  • Data files are not deleted until they are no longer referenced by a snapshot that may be used for time travel or rollback.
  • Regularly expiring snapshots deletes unused data files.

To check the timestamp and list of snapshot add the following command in your terminal:

dataos-ctl dataset snapshots -a dataos://lakehouse:retail/pos_store_product_cust

Expected output

INFO[0000] 📂 get snapshots...                           
INFO[0003] 📂 get snapshots...completed                  

      SNAPSHOTID         TIMESTAMP       DATE AND TIME (GMT)     
──────────────────────┼───────────────┼────────────────────────────
  7177047349072031975  1744366300306  2025-04-11T10:11:40+00:00  
  580493728505961346   1744366356145  2025-04-11T10:12:36+00:00  
  2613385101287075565  1744366357891  2025-04-11T10:12:37+00:00  

Configuration Options for Snapshot Deletion

Attribute Description
olderThanMillis Milliseconds since epoch before which snapshots will be removed. This replaces expireOlderThan as the primary time-based expiration setting.
olderThanTimestamp A human-readable timestamp before which snapshots will be removed (e.g., 2024-12-01T00:00:00Z).
snapshotIds Specifies the list of specific snapshots to expire.
retainLast Number of ancestor snapshots to preserve regardless of olderThanMillis. (Defaults to 1)
maxConcurrentDeletes Size of the thread pool used for delete file actions. (By default, no thread pool is used.)
streamDeleteResults By default, all files to delete are brought to the driver at once, which may cause memory issues with large file lists. Set to true to use toLocalIterator.

olderThanMillis

Info

expireOlderThan is only available in Flare 7.0. Use olderThanMillis in place of expireOlderThan in Flare 7.0.

The olderThanMillis attribute specifies a cutoff timestamp in unix format. Snapshots created before this timestamp are considered expired and will be deleted, along with their associated metadata and manifest files, if no longer referenced. This helps manage storage and keep the table metadata clean by removing historical data no longer needed for rollback or time travel.

name: expire-snapshots01                            # Name of the Workflow
version: v1                                         # Version
type: workflow                                      # Type of Resource (Here its workflow)
tags:                                               # Tags
  - expire
workflow:                                           # Workflow Section
  title: expire snapshots                           # Title of the DAG
  dag:                                              # Directed Acyclic Graph (DAG)
    - name: expire                                  # Name of the Job
      title: expire snapshots                       # Title of the Job
      spec:                                         # Specs
        tags:                                       # Tags
          - Expire
        stack: flare:7.0                            # Stack is Flare
        compute: runnable-default                   # Compute
        stackSpec:                                  # Flare Stack specific Section
          job:                                      # Job Section
            explain: true                           # Explain
            logLevel: INFO                          # Loglevel
            inputs:                                 # Inputs Section
              - name: inputDf                       # Input Dataset Name
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw        # Input UDL
                format: Iceberg                     # Format
            actions:                                # Action Section
              - name: expire_snapshots              # Name of Flare Action
                input: inputDf                      # Input Dataset Name
                options:                            # Options
                  olderThanMillis: "1741987433222"  # Timestamp in Unix Format (All snapshots older than the timestamp are expired)

retainLast

Remove snapshots older than specific day and time, but retain the last 5 snapshots:

name: expire-snapshots02                         # Name of the Workflow
version: v1                                      # Version
type: workflow                                   # Type of Resource (Here its workflow)
tags:                                            # Tags
  - expire
workflow:                                        # Workflow Section
  title: expire snapshots                        # Title of the DAG
  dag:                                           # Directed Acyclic Graph (DAG)
    - name: expire                               # Name of the Job
      title: expire snapshots                    # Title of the Job
      spec:                                      # Specs
        tags:                                    # Tags
          - Expire
        stack: flare:7.0                         # Stack is Flare
        compute: runnable-default                # Compute
        stackSpec:                               # Flare Stack specific Section
          job:                                   # Job Section
            explain: true                        # Explain
            logLevel: INFO                       # Loglevel
            inputs:                              # Inputs Section
              - name: inputDf                    # Input Dataset Name
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw    # Input UDL
                format: Iceberg                  # Format
            actions:                             # Action Section
              - name: expire_snapshots           # Name of Flare Action
                input: inputDf                   # Input Dataset Name
                options:                         # Options
                  olderThanMillis: "1741987433222"    # Timestamp in Unix Format (All snapshots older than the timestamp are expired)
                  retainLast: 5                       # Retain the last 5 snapshots

snapshotIds

Remove snapshots with snapshot ID 12345679 (note that this snapshot ID should not be the current snapshot):

name: expire-snapshots03                         # Name of the Workflow
version: v1                                      # Version
type: workflow                                   # Type of Resource (Here its workflow)
tags:                                            # Tags
  - expire
workflow:                                        # Workflow Section
  title: expire snapshots                        # Title of the DAG
  dag:                                           # Directed Acyclic Graph (DAG)
    - name: expire                               # Name of the Job
      title: expire snapshots                    # Title of the Job
      spec:                                      # Specs
        tags:                                    # Tags
          - Expire
        stack: flare:7.0                         # Stack is Flare
        compute: runnable-default                # Compute
        stackSpec:                               # Flare Stack specific Section
          job:                                   # Job Section
            explain: true                        # Explain
            logLevel: INFO                       # Loglevel
            inputs:                              # Inputs Section
              - name: inputDf                    # Input Dataset Name
                dataset: dataos://lakehouse:sandbox3/test_pyflare2?acl=rw    # Input UDL
                format: Iceberg                  # Format
            actions:                             # Action Section
              - name: expire_snapshots           # Name of Flare Action     
                input: inputDf                   # Input Dataset Name                     # mandatory
                options:                         # Options                                # mandatory
                  snapshotIds:                   # Snapshots to delete by ID
                    - "12345679"                 # Snapshot with given snapshot ID will be deleted

You can also provide multiple snapshot Ids:

actions:                                         # Action Section
  - name: expire_snapshots                       # Name of Flare Action     
    input: inputDf                               # Input Dataset Name    (mandatory)
    options:                                     # Options               (mandatory)
      snapshotIds:                               # Snapshots to delete by ID
        - "1234567912"                           # Snapshot with given snapshot ID will be deleted
        - "1122344342"                           # Snapshot with given snapshot ID will be deleted

olderThanTimestamp

name: expire-snapshots04                                  # Name of the Workflow
version: v1                                      # Version
type: workflow                                   # Type of Resource (Here its workflow)
tags:                                            # Tags
  - expire
workflow:                                        # Workflow Section
  title: expire snapshots                        # Title of the DAG
  dag:                                           # Directed Acyclic Graph (DAG)
    - name: expire                               # Name of the Job
      title: expire snapshots                    # Title of the Job
      spec:                                      # Specs
        tags:                                    # Tags
          - Expire
        stack: flare:7.0                         # Stack is Flare
        compute: runnable-default                # Compute
        stackSpec:                               # Flare Stack specific Section
          job:                                   # Job Section
            explain: true                        # Explain
            logLevel: INFO                       # Loglevel
            inputs:                              # Inputs Section
              - name: inputDf                    # Input Dataset Name
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw    # Input UDL
                format: Iceberg                  # Format
            actions:                             # Action Section
              - name: expire_snapshots           # Name of Flare Action
                input: inputDf                   # Input Dataset Name
                options:                         # Options
                  olderThanTimestamp: '2025-04-19 00:00:00.000'    # Timestamp (All snapshots older than the timestamp are expired)
                  retainLast: 2                                       # Retain the last 2 snapshots

maxConcurrentDeletes

The maxConcurrentDeletes attribute configures the size of the thread pool used for deleting files during snapshot expiration. By default, deletions happen sequentially. Using this option can significantly improve performance when expiring large numbers of snapshots or when the dataset has many files to delete.

name: expire-snapshots05                                  # Name of the Workflow
version: v1                                      # Version
type: workflow                                   # Type of Resource (Here its workflow)
tags:                                            # Tags
  - expire
workflow:                                        # Workflow Section
  title: expire snapshots                        # Title of the DAG
  dag:                                           # Directed Acyclic Graph (DAG)
    - name: expire                               # Name of the Job
      title: expire snapshots                    # Title of the Job
      spec:                                      # Specs
        tags:                                    # Tags
          - Expire
        stack: flare:7.0                         # Stack is Flare
        compute: runnable-default                # Compute
        stackSpec:                               # Flare Stack specific Section
          job:                                   # Job Section
            explain: true                        # Explain
            logLevel: INFO                       # Loglevel
            inputs:                              # Inputs Section
              - name: inputDf                    # Input Dataset Name
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw    # Input UDL
                format: Iceberg                  # Format
            actions:                             # Action Section
              - name: expire_snapshots           # Name of Flare Action
                input: inputDf                   # Input Dataset Name
                options:                         # Options
                  olderThanMillis: "1741987433222"    # Expire snapshots older than this timestamp
                  maxConcurrentDeletes: 5           # Use 5 concurrent threads for delete operations

Info

Use this option for datasets with large amounts of files to improve delete performance. Be mindful of resource constraints when setting a high value.

streamDeleteResults

By default, Flare collects all files marked for deletion into memory at once. This can lead to memory issues for large datasets with many files. The streamDeleteResults attribute streams results to the driver using toLocalIterator, avoiding high memory usage by processing deletions incrementally.

name: expire-snapshots06                         # Name of the Workflow
version: v1                                      # Version
type: workflow                                   # Type of Resource (Here its workflow)
tags:                                            # Tags
  - expire
workflow:                                        # Workflow Section
  title: expire snapshots                        # Title of the DAG
  dag:                                           # Directed Acyclic Graph (DAG)
    - name: expire                               # Name of the Job
      title: expire snapshots                    # Title of the Job
      spec:                                      # Specs
        tags:                                    # Tags
          - Expire
        stack: flare:7.0                         # Stack is Flare
        compute: runnable-default                # Compute
        stackSpec:                               # Flare Stack specific Section
          job:                                   # Job Section
            explain: true                        # Explain
            logLevel: INFO                       # Loglevel
            inputs:                              # Inputs Section
              - name: inputDf                    # Input Dataset Name
                dataset: dataos://lakehouse:retail/pos_store_product_cust?acl=rw    # Input UDL
                format: Iceberg                  # Format
            actions:                             # Action Section
              - name: expire_snapshots           # Name of Flare Action
                input: inputDf                   # Input Dataset Name
                options:                         # Options
                  olderThanMillis: "1741987433222"    # Expire snapshots older than this timestamp
                  streamDeleteResults: true           # Stream deletion results to avoid memory pressure

Info

Set streamDeleteResults: true for large Iceberg tables where snapshot expiration involves deleting thousands of files. This reduces memory pressure on the driver at the cost of slightly slower delete operations.