Remove Orphans

The remove_orphans action cleans up orphan files older than a specified time period. It may take a long time to finish if the data and metadata directories contain many files. Run it periodically; it usually does not need to run often.

Note

It is dangerous to remove orphan files with a retention interval shorter than the time expected for any write to complete: files belonging to in-progress writes may be treated as orphans and deleted, which can corrupt the table. The default interval is 3 days.
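To illustrate why the retention interval matters, here is a minimal Python sketch of the selection rule. It is not the Flare or Iceberg implementation: the function find_orphans and the sample file names are hypothetical, and it assumes a file is an orphan candidate only when table metadata no longer references it and it was last modified before the cutoff.

```python
from datetime import datetime, timedelta, timezone

def find_orphans(files, referenced, older_than):
    """Return paths that are unreferenced and last modified before older_than.

    files: dict mapping path -> last-modified datetime (hypothetical listing)
    referenced: set of paths that table metadata still points to
    """
    return sorted(
        path for path, modified in files.items()
        if path not in referenced and modified < older_than
    )

now = datetime(2025, 3, 1, tzinfo=timezone.utc)
files = {
    "data/a.parquet": now - timedelta(days=10),  # committed and referenced
    "data/b.parquet": now - timedelta(days=10),  # left over from a failed write
    "data/c.parquet": now - timedelta(hours=1),  # write still in progress
}
referenced = {"data/a.parquet"}

# A 3-day retention interval skips the file that is still being written.
safe = find_orphans(files, referenced, now - timedelta(days=3))
# A zero-length interval also selects the in-progress file.
unsafe = find_orphans(files, referenced, now)
print(safe)
print(unsafe)
```

With the 3-day cutoff only data/b.parquet is selected; with no retention interval the in-progress data/c.parquet is selected as well, which is the failure mode this note warns about.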

Get the list of snapshots by running the following command:

dataos-ctl dataset snapshots -a dataos://lakehouse:retail/city

Expected output

      SNAPSHOTID         TIMESTAMP       DATE AND TIME (GMT)     
──────────────────────┼───────────────┼────────────────────────────
  7002479430618666161  1740643647492  2025-02-27T08:07:27+00:00  
  2926095925031493170  1740737372219  2025-02-28T10:09:32+00:00  
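The TIMESTAMP column appears to hold Unix epoch milliseconds, since each value matches the GMT date shown next to it. As a quick check, the following Python sketch converts one to the other; snapshot_time is a hypothetical helper name, not part of dataos-ctl.

```python
from datetime import datetime, timezone

def snapshot_time(timestamp_ms):
    """Convert an epoch-milliseconds snapshot timestamp to an ISO 8601 UTC string."""
    seconds = timestamp_ms // 1000  # drop the millisecond part
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()

print(snapshot_time(1740643647492))  # 2025-02-27T08:07:27+00:00
print(snapshot_time(1740737372219))  # 2025-02-28T10:09:32+00:00
```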

Configuration

Attribute               Type       Description
older_than              timestamp  Remove orphan files created before this timestamp. Defaults to 3 days ago.
location                string     Directory to look for files in. Defaults to the table's location.
dry_run                 boolean    If true, performs a dry run without actually removing files. Defaults to false.
max_concurrent_deletes  int        Size of the thread pool used for delete operations. By default, no thread pool is used.

The following code snippet removes orphan files older than the time specified in olderThan, given as a Unix epoch timestamp.

The job uses the remove_orphans action, which takes the inputDf dataset as its input. This dataset is defined as dataos://lakehouse:retail/city and is in Iceberg format. The action also accepts options, such as olderThan, which sets the Unix timestamp used as the cutoff for identifying orphan files.

name: orphans                                    # Name of the Workflow
version: v1                                      # Version
type: workflow                                   # Type of Resource (here, a Workflow)
tags:                                            # Tags
  - orphans
workflow:                                        # Workflow Section
  title: Remove orphan files                     # Title of the DAG
  dag:                                           # Directed Acyclic Graph (DAG)
    - name: orphans                              # Name of the Job
      title: Remove orphan files                 # Title of the Job
      spec:                                      # Specs
        tags:                                    # Tags
          - orphans
        stack: flare:5.0                         # Stack is Flare
        compute: runnable-default                # Compute
        stackSpec:                               # Flare Stack Specific Section
          job:                                   # Job Section
            explain: true                        # Explain
            logLevel: INFO                       # Loglevel
            inputs:                              # Inputs Section
              - name: inputDf                    # Input Dataset Name
                dataset: dataos://lakehouse:retail/city                # Input UDL
                format: Iceberg                  # Format
            actions:                             # Flare Action
              - name: remove_orphans             # Action Name
                input: inputDf                   # Input Dataset Name
                options:                         # Options
                  olderThan: "1739734172"        # Timestamp in Unix Format
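The olderThan value in the manifest can be derived from a retention interval rather than typed by hand. The Python sketch below is one way to do that; older_than_value is a hypothetical helper, and it assumes olderThan is expressed in epoch seconds, which is inferred only from the 10-digit value in the example above and should be confirmed for your Flare version.

```python
from datetime import datetime, timedelta, timezone

def older_than_value(now, retention_days=3):
    """Return the cutoff (now minus retention_days) as an epoch-seconds string.

    Assumption: olderThan takes epoch seconds, as the manifest example suggests.
    """
    cutoff = now - timedelta(days=retention_days)
    return str(int(cutoff.timestamp()))

# A fixed reference time keeps the example reproducible.
now = datetime(2025, 2, 19, 19, 29, 32, tzinfo=timezone.utc)
print(older_than_value(now))  # 1739734172
```

In practice, pass datetime.now(timezone.utc) as now and keep retention_days at least as long as your longest expected write.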