
Steps to create Amazon S3 Depot

Pre-requisites specific to Depot creation

  • Tags: A developer must possess the following tags, which can be obtained from a DataOS operator. A command to check the tags currently assigned to your user is shown after this list.

            NAME     │    ID    │  TYPE  │       EMAIL        │              TAGS
        ─────────────┼──────────┼────────┼────────────────────┼─────────────────────────────────
          Iamgroot   │ iamgroot │ person │  iamgroot@tmdc.io  │ roles:id:data-dev,
                     │          │        │                    │ roles:id:user,
                     │          │        │                    │ users:id:iamgroot
    
  • Use cases: Alternatively, instead of assigning tags, a developer can create a Depot if an operator grants them the "Manage All Instance-level Resources of DataOS in the user layer" use case through Bifrost Governance.

    Bifrost Governance
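
To confirm which tags are currently assigned to your user, you can query your user details through the DataOS CLI. This is a quick check rather than part of Depot creation itself; the tabular output shown above follows the format returned by this command.

dataos-ctl user get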

Pre-requisites specific to the S3 Depot

To create an S3 Depot, you must have the following details:

  • AWS access key ID: The Access Key ID used to authenticate and authorize API requests to your AWS account. This can be obtained from the AWS IAM (Identity and Access Management) Console under your user’s security credentials or requested from your AWS administrator.

  • AWS bucket name: The name of the Amazon S3 bucket where the data resides. You can find this in the AWS S3 Console under the list of buckets or request it from the administrator managing the storage.

  • Secret access key: The Secret Access Key associated with your AWS Access Key ID is required for secure API requests. It is available in the AWS IAM Console under your user’s security credentials. Ensure that it is stored securely and shared only with authorized personnel.

  • Scheme: The scheme specifies the protocol to be used for the connection, such as s3 or https. This information depends on your system’s configuration and can be confirmed with the team managing the connection setup.

  • Relative Path: The path within the S3 bucket that points to the specific data or folder you want to access. This path is typically structured according to how your data is organized and can be obtained from the team managing the data or the AWS S3 Console.

  • Data format (format): The format attribute specifies the table format used to store the data in the bucket. Common values are 'ICEBERG' or 'DELTA', depending on how the data is organized.

  • Region: The AWS region where the S3 bucket is hosted (e.g., us-east-1, us-gov-east-1). If the region is not specified, DataOS will use the default AWS region.

  • Endpoint: The custom or standard S3 endpoint used to access the storage. Required when working with non-default AWS regions or S3-compatible services. Example: ${{s3.us-gov-east-1.amazonaws.com}}
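
Before creating the Depot, it can be useful to confirm the bucket name, region, and relative path from your workstation. The sketch below assumes the AWS CLI is installed and configured with the same access key pair; the bucket name, prefix, and region are placeholder values.

# Confirm the credentials are valid and identify the calling principal
aws sts get-caller-identity

# List objects under the relative path you plan to reference in the Depot
aws s3 ls s3://my-bucket/raw/ --region us-east-1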

Create an S3 Depot

Amazon Simple Storage Service (S3) is an object storage system. Object stores are distributed storage systems designed to store and manage large amounts of unstructured data.

DataOS enables the creation of a Depot of type 'S3' to facilitate the reading of data stored in Amazon S3. This Depot provides access to an S3 bucket, which serves as a grouping mechanism for objects, typically organized into folder-like prefixes. It is recommended to define a separate Depot for each bucket. To create a Depot of type 'S3', follow the steps below:

Step 1: Create an Instance Secret for securing S3 credentials

Begin by creating an Instance Secret Resource by following the Instance Secret document.
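
For reference, a minimal sketch of a read-only Instance Secret manifest for S3 credentials is shown below. The resource name follows the -r/-rw naming convention referenced in the Depot manifest in Step 2; the description and data key names are illustrative assumptions, so confirm the exact keys expected for S3 credentials in the Instance Secret document. A read-write variant is defined the same way with acl: rw and a -rw suffix.

name: ${{s3-instance-secret-name}}-r
version: v1
type: instance-secret
layer: user
description: Read-only credentials for the S3 Depot
instance-secret:
  type: key-value-properties
  acl: r                                  # use 'rw' for the read-write Secret
  data:                                   # illustrative key names; confirm against the Instance Secret document
    accesskeyid: ${{aws-access-key-id}}
    secretkey: ${{aws-secret-access-key}}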

Step 2: Create an S3 Depot manifest file

Next, create a manifest file to hold the configuration details for your S3 Depot.

name: ${{depot-name}}
version: v2alpha
type: depot
tags:
  - ${{tag1}}
owner: ${{owner-name}}
layer: user
description: ${{description}}
depot:
  type: S3                                          
  external: ${{true}}
  secrets:
    - name: ${{s3-instance-secret-name}}-r
      allkeys: true
    - name: ${{s3-instance-secret-name}}-rw
      allkeys: true
  s3:                                            
    scheme: ${{s3a}}
    bucket: ${{bucket-name}}
    relativePath: ${{relative-path}}
    format: ${{format}}
    region: ${{us-gov-east-1}}
    endpoint: ${{s3.us-gov-east-1.amazonaws.com}}

To get the details of each attribute, please refer to this link.
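
For illustration, a filled-in version of the manifest might look like the following; the Depot name, owner, bucket, relative path, and region are hypothetical values, and the endpoint is omitted because it is only needed for non-default regions or S3-compatible services.

name: s3depot
version: v2alpha
type: depot
tags:
  - s3
owner: iamgroot
layer: user
description: S3 Depot for raw sales data
depot:
  type: S3
  external: true
  secrets:
    - name: s3depot-r
      allkeys: true
    - name: s3depot-rw
      allkeys: true
  s3:
    scheme: s3a
    bucket: sales-data-bucket
    relativePath: raw
    format: ICEBERG
    region: us-east-1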

Step 3: Apply the Depot manifest file

Once you have the manifest file ready in your code editor, copy its path and apply it through the DataOS CLI by pasting the path into the placeholder in either of the commands given below:

dataos-ctl resource apply -f ${{yamlfilepath}}
dataos-ctl apply -f ${{yamlfilepath}}
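
For example, if the manifest file were saved on your machine, the command might look like this; the file path is hypothetical.

dataos-ctl resource apply -f /home/iamgroot/depots/s3_depot.yaml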

Verify the Depot creation

To ensure that your Depot has been successfully created, you can verify it in two ways:

  • Check the name of the newly created Depot in the list of Depots where you are named as the owner:

    dataos-ctl get -t depot
    
  • Additionally, retrieve the list of all Depots created in your organization:

    dataos-ctl get -t depot -a
    

You can also access the details of any created Depot through the DataOS GUI in the Operations App and Metis UI.

Delete a Depot

If you need to delete a Depot, use the following command in the DataOS CLI:

dataos-ctl delete -t depot -n ${{name of Depot}}
dataos-ctl delete -f ${{path of your manifest file}}

By executing the above command, the specified Depot will be deleted from your DataOS environment.
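
For example, to delete a Depot named s3depot (a hypothetical name used here for illustration):

dataos-ctl delete -t depot -n s3depot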

Limit the data source's file format

Another important function that a Depot can perform is to limit the file format that can be read from and written to a particular data source. In the spec section of the manifest file, simply mention the format of the files you want to allow access to.

depot:
  type: S3
  description: ${{description}}
  external: true
  s3:
    scheme: ${{s3a}}
    bucket: ${{bucket-name}}
    relativePath: "raw" 
    format: ${{format}}  # mention the file format, such as DELTA, ICEBERG, PARQUET, etc

For file-based systems, if you define the format as 'ICEBERG', you can choose between Hadoop and Hive as the metastore catalog. This is how you do it:

depot:
  type: S3
  description: "S3 Iceberg Depot for sanity"
  external: true
  s3:
    scheme: ${{s3a}}
    bucket: ${{bucket-name}}
    relativePath: ${{relative-path}}
    format: ICEBERG
    icebergCatalogType: Hive

If you do not set the catalog to Hive, Hadoop is used as the default catalog for the Iceberg format. Hive automatically keeps the pointer updated to the latest metadata version; with Hadoop, you have to do this manually by running the set metadata command as described on this page: Set Metadata.
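
For reference, updating the metadata pointer manually through the CLI typically looks like the sketch below; the Depot, collection, and dataset names are hypothetical, and the exact syntax is documented on the Set Metadata page.

dataos-ctl dataset set-metadata -a dataos://s3depot:raw/sales -v latest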