Depot¶
Depot in DataOS is a Resource that acts as an intermediary, facilitating connectivity to diverse data sources by abstracting the complexities associated with the underlying source system (including protocols, credentials, and connection schemas). It enables users to establish connections and retrieve data from various data sources, such as file systems (e.g., AWS S3, Google GCS, Azure Blob Storage), data lake systems (e.g., Icebase), database systems (e.g., Redshift, SnowflakeDB, BigQuery, Postgres), and event systems (e.g., Kafka, Pulsar).
The Depot serves as the registration of data locations to be made accessible to DataOS. Through the Depot Service each source system is assigned a unique address, referred to as a Uniform Data Link (UDL). The UDL grants convenient access and manipulation of data within the source system, eliminating the need for repetitive credential entry. The UDL follows this format:
dataos://[depot]:[collection]/[dataset]
Leveraging the UDL enables access to datasets and seamless execution of various operations, including data transformation using various Stacks and Policy assignment.
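As an illustration of the UDL format, the following minimal Python sketch splits a UDL into its depot, collection, and dataset parts. The `UDL` tuple, the `parse_udl` helper, and the regular expression are hypothetical names written for this page, not part of any DataOS SDK. The sketch assumes that the collection and dataset parts may be omitted and that a dataset may contain slashes, as in the examples later on this page.

```python
import re
from typing import NamedTuple, Optional


class UDL(NamedTuple):
    """Parts of a Uniform Data Link (hypothetical helper type)."""
    depot: str
    collection: Optional[str]
    dataset: Optional[str]


# dataos://[depot]:[collection]/[dataset]
# Assumption: collection and dataset are optional; dataset may contain "/".
_UDL_PATTERN = re.compile(
    r"^dataos://(?P<depot>[^:/]+)"
    r"(?::(?P<collection>[^/]+)(?:/(?P<dataset>.+))?)?$"
)


def parse_udl(udl: str) -> UDL:
    """Split a UDL string into depot, collection, and dataset."""
    match = _UDL_PATTERN.match(udl)
    if match is None:
        raise ValueError(f"not a valid UDL: {udl!r}")
    return UDL(match["depot"], match["collection"], match["dataset"])


print(parse_udl("dataos://crmbq:demo/customer_profiles"))
```

Under these assumptions, `parse_udl("dataos://filebase:raw01")` yields a depot and a collection with no dataset, and any string that does not start with `dataos://` raises `ValueError`.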
Regardless of the source system's internal naming conventions and structure, the UDL ensures consistency in referencing data. Within DataOS, the hierarchical structure of a data source is represented as follows:
Once this mapping is established, the Depot Service automatically generates the Uniform Data Link (UDL) that can be used throughout DataOS to access the data. As a reminder, the UDL has the format: dataos://[depot]:[collection]/[dataset].
For a simple file storage system, "Collection" is analogous to "Folder," and "Dataset" corresponds to "File." The Depot's strength lies in its capacity to establish uniformity, eliminating concerns about varying source system terminologies.
Once a Depot is created, all members of an organization gain secure access to datasets within the associated source system. The Depot not only facilitates data access but also assigns default Access Policies to ensure data security. Moreover, users have the flexibility to define and utilize custom Access Policies for the depot and Data Policies for specific datasets within the Depot.
Structure of a Depot YAML¶
To know more about the attributes of Depot YAML Configuration, refer to the link: Attributes of Depot YAML
Depot Service¶
Depot Service is a DataOS Service that manages the Depot Resource. It facilitates in-depth introspection of depots and their associated storage engines. Once a Depot is created, users can obtain comprehensive information about the datasets it contains, including details such as constraints, partitions, and indexes.
How to create a Depot?¶
To establish a Depot in DataOS, simply compose a YAML configuration file for a Depot and apply it using the DataOS Command Line Interface (CLI).
Prerequisites¶
Before proceeding with Depot creation, ensure that you possess the required authorization. To confirm your eligibility, execute the following command in the CLI:
dataos-ctl user get
# Expected Output
INFO[0000] 😃 user get...
INFO[0000] 😃 user get...complete
NAME | ID | TYPE | EMAIL | TAGS
-----------------|---------------|--------|------------------------|---------------------------------
IamGroot | iamgroot | person | iamgroot@tmdc.io | roles:id:data-dev,
| | | | roles:id:operator,
| | | | roles:id:system-dev,
| | | | roles:id:user,
| | | | users:id:iamgroot
To create Depots, ensure that you possess the following tags: roles:id:user, roles:id:data-dev, and roles:id:system-dev.
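The tag check can also be scripted. The sketch below is a hypothetical helper, not a DataOS API: it takes the tags from the TAGS column of the `dataos-ctl user get` output (copied in by hand here) and reports which of the three required tags are missing.

```python
# Tags required to create Depots, as listed above.
REQUIRED_TAGS = {"roles:id:user", "roles:id:data-dev", "roles:id:system-dev"}


def missing_depot_tags(user_tags):
    """Return the required tags that the user lacks, sorted for stable output."""
    return sorted(REQUIRED_TAGS - set(user_tags))


# Tags copied from the sample CLI output above.
iamgroot_tags = [
    "roles:id:data-dev",
    "roles:id:operator",
    "roles:id:system-dev",
    "roles:id:user",
    "users:id:iamgroot",
]

print(missing_depot_tags(iamgroot_tags))  # prints []
```

An empty list means the user holds every tag needed to create Depots.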
The creation of a Depot involves three simple steps:
- Create a YAML configuration file.
- Apply the file using the DataOS CLI.
- Verify the successful creation of the Depot.
Create a YAML File¶
The YAML configuration file for a Depot can be divided into four main sections: Resource section, Depot-specific section, Connection Secrets section, and Specifications section. Each section serves a distinct purpose and contains specific attributes.
Configure Resource Section¶
The Resource section of the YAML configuration file consists of attributes that are common across all resource-types. The following YAML snippet demonstrates the key-value properties that need to be declared in this section:
name: ${{mydepot}}
version: v1
type: depot
tags:
  - ${{dataos:type:resource}}
description: ${{This is a sample depot YAML configuration}}
owner: ${{iamgroot}}
layer: user
For more details regarding attributes in the Resource section, refer to the link: Attributes of Resource Section.
Configure Depot-specific Section¶
The Depot-specific section of the YAML configuration file includes key-value properties specific to the Depot-type being created. Each Depot-type represents a Depot created for a particular data source. Multiple Depots can be established for the same data source, and they will be considered as a single Depot-type. The following YAML snippet illustrates the key-values to be declared in this section:
depot:
  type: ${{BIGQUERY}}
  description: ${{description}}
  external: ${{true}}
  source: ${{bigquerymetadata}}
  compute: ${{runnable-default}}
  connectionSecret:
    {}
  spec:
    {}
The table below elucidates the various attributes in the Depot-specific section:
Attribute | Data Type | Default Value | Possible Value | Requirement |
---|---|---|---|---|
depot | object | none | none | mandatory |
type | string | none | ABFSS, WASBS, REDSHIFT, S3, ELASTICSEARCH, EVENTHUB, PULSAR, BIGQUERY, GCS, JDBC, MSSQL, MYSQL, OPENSEARCH, ORACLE, POSTGRES, SNOWFLAKE | mandatory |
description | string | none | any string | mandatory |
external | boolean | false | true/false | mandatory |
connectionSecret | object | none | varies between data sources | optional |
spec | object | none | varies between data sources | mandatory |
compute | string | runnable-default | any custom Compute Resource | optional |
source | string | depot name | any string which is a valid depot name | optional |
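The requirements in the table can be checked mechanically before a manifest is applied. The following Python sketch is illustrative only: `check_depot_section` is a hypothetical helper, it operates on a manifest that has already been loaded into a dictionary, and it encodes only the "Requirement" column of the table as written. The actual validation performed by DataOS may differ.

```python
# Attributes marked "mandatory" inside the depot section of the table above.
MANDATORY_DEPOT_ATTRIBUTES = ("type", "description", "external", "spec")


def check_depot_section(manifest: dict) -> list:
    """Return a list of problems found in the Depot-specific section."""
    depot = manifest.get("depot")
    if not isinstance(depot, dict):
        return ["missing mandatory attribute: depot"]

    problems = []
    for key in MANDATORY_DEPOT_ATTRIBUTES:
        if key not in depot:
            problems.append(f"missing mandatory attribute: depot.{key}")

    # The table lists 'external' as a boolean.
    if "external" in depot and not isinstance(depot["external"], bool):
        problems.append("depot.external must be a boolean")
    return problems


manifest = {
    "name": "s3depot",
    "type": "depot",
    "depot": {
        "type": "S3",
        "description": "AWS S3 Bucket for dummy data",
        "external": True,
        "spec": {"bucket": "abcdata"},
    },
}

print(check_depot_section(manifest))  # prints []
```

Optional attributes (connectionSecret, compute, source) are deliberately not checked, since the table allows them to be omitted.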
Configure Connection Secrets Section¶
The configuration of connection secrets is specific to each Depot-type and depends on the underlying data source. The details for these connection secrets, such as credentials and authentication information, should be obtained from your enterprise or data source provider. For commonly used data sources, we have compiled the connection secrets here. Please refer to these templates for guidance on how to configure the connection secrets for your specific data source.
Examples
Here are examples demonstrating how the key-value properties can be defined for different depot-types:
For BigQuery, the connectionSecret section of the configuration file would appear as follows:
#Properties depend on the underlying data source
connectionSecret:
  - acl: rw
    type: key-value-properties
    data:
      projectid: ${{project-name}}
      email: ${{email-id}}
    files:
      json_keyfile: ${{secrets/gcp-demo-sa.json}} #JSON file containing the credentials to read-write
  - acl: r
    type: key-value-properties
    files:
      json_keyfile: ${{secrets/gcp-demo-sa.json}} #JSON file containing the credentials to read-only
This is how you can declare connection secrets to create a Depot for AWS S3 storage:
connectionSecret:
  - acl: rw
    type: key-value-properties
    data: #credentials required to access aws
      awsaccesskeyid: ${{AWS_ACCESS_KEY_ID}}
      awsbucketname: ${{bucket-name}}
      awssecretaccesskey: ${{AWS_SECRET_ACCESS_KEY}}
To access a JDBC source, you only need a username and password:
connectionSecret:
  - acl: rw
    type: key-value-properties
    data: #for JDBC, the credentials you get from the data source should have permission to read/write schemas of the database being accessed
      username: ${{username}}
      password: ${{password}}
The basic attributes filled in this section are provided in the table below:
Attribute | Data Type | Default Value | Possible Value | Requirement |
---|---|---|---|---|
acl | string | none | r/rw | mandatory |
type | string | none | key-value-properties | mandatory |
data | object | none | fields within data vary between data sources | mandatory |
files | string | none | valid file path | optional |
Alternative Approach: Using Secrets
Secret is also a Resource in DataOS that allows users to securely store sensitive pieces of information such as usernames and passwords. Using Secrets in conjunction with Depots and Stacks decouples sensitive information from Depot and Stack YAMLs. To illustrate, let’s take the example of a MySQL data source to understand how you can use the Secret Resource for Depot creation:
- Create a YAML file with the details on the connection secret:
name: ${{mysql-secret}}
version: v1
type: secret
secret:
  type: key-value-properties
  acl: rw
  data:
    connection-user: ${{user}}
    connection-password: ${{password}}
- Apply this YAML file using the DataOS CLI.
If you have created this Secret in a public Workspace, any user within your enterprise can refer to the Secret by its name, "mysql-secret".
For example, if a user wishes to create a MySQL Depot, they can define a Depot configuration file as follows:
YAML Configuration File
By referencing the name of the Secret, "mysql-secret," users can easily incorporate the specified credentials into their Depot configuration. This approach ensures the secure handling and sharing of sensitive information. To learn more about Secrets as a Resource and their usage, refer to the documentation here.
Configure Spec Section¶
The spec section in the YAML configuration file plays a crucial role in directing the Depot to the precise location of your data and providing it with the hierarchical structure of the data source. By defining the specification parameters, you establish a mapping between the data and the hierarchy followed within DataOS.
Let's understand this hierarchy through real-world examples:
In the case of BigQuery, the data is structured as "Projects" containing "Datasets" that, in turn, contain "Tables". In DataOS terminology, the "Project" corresponds to the "Depot", the "Dataset" corresponds to the "Collection", and the "Table" corresponds to the "Dataset".
Consider the following structure in BigQuery:
- Project name: bigquery-public-data (Depot)
- Dataset name: covid19_usa (Collection)
- Table name: datafile_01 (Dataset)
The UDL for accessing this data would be dataos://bigquery-public-data:covid19_usa/datafile_01.
In the YAML example below, the necessary values are filled in to create a BigQuery Depot:
Bigquery Depot YAML Configuration
In this example, the Depot is named "covidbq" and references the project "bigquery-public-data" within Google Cloud. As a result, all the datasets and tables within this project can be accessed using the UDL dataos://covidbq:<collection name>/<dataset name>.
By appropriately configuring the specifications, you ensure that the Depot is accurately linked to the data source's structure, enabling seamless access and manipulation of datasets within DataOS.
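The BigQuery mapping described above can be sketched in a few lines of Python. The helper name `bigquery_table_to_udl` and the `depot_by_project` lookup are hypothetical and exist only for illustration. The sketch assumes a fully qualified table identifier of the form `project.dataset.table` and a Depot that has already been created for the project.

```python
def bigquery_table_to_udl(table_id: str, depot_by_project: dict) -> str:
    """Map a 'project.dataset.table' identifier to a DataOS UDL.

    Project -> Depot, BigQuery Dataset -> Collection, Table -> Dataset.
    """
    parts = table_id.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 'project.dataset.table', got {table_id!r}")
    project, bq_dataset, table = parts
    depot = depot_by_project[project]  # KeyError if no Depot covers the project
    return f"dataos://{depot}:{bq_dataset}/{table}"


# The Depot "covidbq" references the project "bigquery-public-data".
depots = {"bigquery-public-data": "covidbq"}
print(bigquery_table_to_udl("bigquery-public-data.covid19_usa.datafile_01", depots))
# prints dataos://covidbq:covid19_usa/datafile_01
```

The lookup makes the point of the section explicit: the first UDL component is the Depot name chosen in the manifest, which stands in for the source system's project.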
Depot provides flexibility in mapping the hierarchy for file storage systems. Let's consider the example of an Amazon S3 bucket, which has a flat structure consisting of buckets, folders, and objects. By understanding the hierarchy and utilizing the appropriate configurations, you can effectively map the structure to DataOS components.
Here's an example of creating a depot named 's3depot' that maps the following structure:
- Bucket: abcdata (Depot)
- Folder: transactions (Collection)
- Objects: file1 and file2 (Datasets)
In the YAML configuration, specify the bucket name and the relative path to the folder. The YAML example below demonstrates how this can be achieved:
name: s3depot
version: v1
type: depot
tags:
  - S3
layer: user
depot:
  type: S3
  description: "AWS S3 Bucket for dummy data"
  external: true
  spec:
    bucket: "abcdata"
    relativePath:
If you omit the relativePath in the YAML configuration, the bucket itself becomes the depot in DataOS. In this case, the following UDLs can be used to read the data:
dataos://s3depot:transactions/file1
dataos://s3depot:transactions/file2
Additionally, if there are objects present in the bucket outside the folder, you can use the following UDLs to read them:
dataos://s3depot:none/online-transaction
dataos://s3depot:none/offline-transaction
However, if you prefer to treat the 'transactions' folder itself as another object within the bucket rather than a folder, you can modify the UDLs as follows:
dataos://s3depot:none/transactions/file1
dataos://s3depot:none/transactions/file2
In this case, the interpretation is that there is no collection in the bucket, and 'file1' and 'file2' are directly accessed as objects with the path '/transactions/file1' and '/transactions/file2'.
When configuring the YAML for S3, if you include the relativePath as shown below, the 'transactions' folder is positioned as the depot:
name: s3depot
version: v1
type: depot
tags:
  - S3
layer: user
depot:
  type: S3
  description: "AWS S3 Bucket for dummy data"
  external: true
  spec:
    bucket: "abcdata"
    relativePath: "/transactions"
Since the folder ‘transactions’ in the bucket has now been positioned as the depot, two things happen.
First, you cannot read the objects online-transaction and offline-transaction using this depot, because they lie outside the 'transactions' folder.
Second, you can read the files within the 'transactions' folder using the following UDLs:
dataos://s3depot:none/file1
dataos://s3depot:none/file2
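The effect of relativePath on the resulting UDLs can be summarized in a small Python sketch. The function `s3_key_to_udl` is a hypothetical helper that mirrors the examples in this section; it is not how DataOS resolves paths internally. It assumes the first folder level below the depot root is treated as the collection and that objects without a folder use the literal collection name `none`.

```python
from typing import Optional


def s3_key_to_udl(depot: str, key: str, relative_path: str = "") -> Optional[str]:
    """Return the UDL for an S3 object key, or None if the depot cannot reach it."""
    prefix = relative_path.strip("/")
    key = key.strip("/")

    if prefix:
        # With relativePath set, only objects under that folder are reachable.
        if not key.startswith(prefix + "/"):
            return None
        key = key[len(prefix) + 1:]

    collection, separator, dataset = key.partition("/")
    if not separator:
        # No folder level left: the object sits directly at the depot root.
        return f"dataos://{depot}:none/{key}"
    return f"dataos://{depot}:{collection}/{dataset}"


# Bucket as the depot (relativePath omitted).
print(s3_key_to_udl("s3depot", "transactions/file1"))
# prints dataos://s3depot:transactions/file1
print(s3_key_to_udl("s3depot", "online-transaction"))
# prints dataos://s3depot:none/online-transaction

# 'transactions' folder as the depot (relativePath: "/transactions").
print(s3_key_to_udl("s3depot", "transactions/file1", "/transactions"))
# prints dataos://s3depot:none/file1
print(s3_key_to_udl("s3depot", "online-transaction", "/transactions"))
# prints None
```

The alternative reading described earlier, where 'transactions' is treated as part of the object path (dataos://s3depot:none/transactions/file1), is not produced by this sketch.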
For accessing data from Kafka, where the structure consists of a broker list and topics, the spec section in the YAML configuration will point the depot to the broker list, and the datasets will map to the topic list. The format of the YAML will be as follows:
depot:
  type: KAFKA
  description: ${{description}}
  external: true
  spec:
    brokers:
      - ${{broker1}}
      - ${{broker2}}
Apply Depot YAML¶
Once the YAML file is ready, copy its path and apply it through the DataOS CLI using the following command:
dataos-ctl apply -f ${{yamlfilepath}}
Verify Depot Creation¶
To ensure that your depot has been successfully created, you can verify it in two ways:
- Check the name of the newly created depot in the list of depots where you are named as the owner:
dataos-ctl get -t depot
- Alternatively, retrieve the list of all depots created in your organization:
dataos-ctl get -t depot -a
You can also access the details of any created Depot through the DataOS GUI in the Operations App and Metis UI.
Delete Depot¶
If you need to delete a depot, use the following command in the DataOS CLI:
dataos-ctl delete -t depot -n ${{name of depot}}
Executing this command deletes the specified depot from your DataOS environment.
Supported Storage Architectures in DataOS¶
DataOS Depots connect to diverse storage systems without requiring data to be relocated, which resolves accessibility challenges across heterogeneous data sources. However, increasingly intricate pipelines and rapid data growth can make storage cumbersome, expensive, and hard to sustain. To address this concern, DataOS supports two specialized storage architectures: Icebase Depot, the Unified Lakehouse designed for OLAP data, and Fastbase Depot, the Unified Streaming solution tailored for streaming data.
Icebase¶
Icebase-type depots are designed to store data suitable for OLAP processes. They offer built-in functionalities such as schema evolution, upsert commands, and time-travel capabilities for datasets. With Icebase, you can perform these actions directly through the DataOS CLI, eliminating the need for additional Stacks like Flare. Moreover, queries executed on data stored in Icebase exhibit enhanced performance. For detailed information, refer to the Icebase page.
Fastbase¶
Fastbase-type depots are optimized for handling streaming data workloads. They provide features such as creating and listing topics, which can be executed using the DataOS CLI. To explore Fastbase further, refer to the Fastbase page.
How to utilize Depots?¶
Once a Depot is created, you can leverage its Uniform Data Links (UDLs) to access data without physically moving it. The UDLs play a crucial role in various scenarios within DataOS.
Work with Stacks¶
Depots are compatible with different Stacks in DataOS. Stacks provide distinct approaches to interact with the system and enable various programming paradigms in DataOS. Several Stacks are available that can be utilized with depots, including Scanner for introspecting depots, Flare for data ingestion, transformation, syndication, etc., Benthos for stream processing and Data Toolbox for managing Icebase DDL and DML.
Flare and Scanner Stacks are supported by all Depots, while Benthos, the stream-processing Stack, is compatible with read/write operations from streaming depots like Fastbase and Kafka Depots.
The UDL references are used as addresses for your input and output datasets within the YAML configuration file.
Referencing Depots within Stack YAML Configuration
Flare YAML Input/Output UDL
# A section of the complete YAML file for Flare
inputs:
  - name: customer_connect
    dataset: dataos://crmbq:demo/customer_profiles # Example of input UDL
outputs:
  - name: output01
    depot: dataos://filebase:raw01 # Example of output UDL
Limit Data Source's File Format¶
Another important function of a Depot is to restrict the file types that can be read from and written to a particular data source. In the spec section of the YAML config file, specify the format of the files you want to allow access to.
depot:
  type: S3
  description: ${{description}}
  external: true
  spec:
    scheme: ${{s3a}}
    bucket: ${{bucket-name}}
    relativePath: "raw"
    format: ${{format}} # mention the file format, such as JSON, to only allow that file type
For file-based systems, if you define the format as ‘Iceberg’, you can choose either Hadoop or Hive as the metastore catalog. This is how you do it:
depot:
  type: ABFSS
  description: "ABFSS Iceberg depot for sanity"
  compute: runnable-default
  spec:
    account:
    container:
    relativePath:
    format: ICEBERG
    endpointSuffix:
    icebergCatalogType: Hive
If you do not set the catalog type to Hive, Hadoop is used as the default catalog for the Iceberg format.
Hive automatically keeps the pointer updated to the latest metadata version. If you use Hadoop, you have to do this manually by running the set metadata command as described on this page: Set Metadata
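The defaulting rule for the Iceberg catalog can be expressed as a short Python sketch. Both helpers are hypothetical and operate on the spec section loaded into a dictionary. They restate the behavior described above rather than reproduce DataOS internals, and they assume the attribute values are compared case-insensitively.

```python
def iceberg_catalog_type(spec: dict) -> str:
    """Return the catalog used for an Iceberg-format depot spec."""
    if str(spec.get("format", "")).upper() != "ICEBERG":
        raise ValueError("icebergCatalogType only applies when format is ICEBERG")
    # Hadoop is the default when icebergCatalogType is absent or empty.
    return spec.get("icebergCatalogType") or "Hadoop"


def needs_manual_set_metadata(spec: dict) -> bool:
    """Hadoop catalogs need the set metadata command; Hive updates the pointer itself."""
    return iceberg_catalog_type(spec).lower() == "hadoop"


hive_spec = {"format": "ICEBERG", "icebergCatalogType": "Hive"}
default_spec = {"format": "ICEBERG"}

print(iceberg_catalog_type(hive_spec), needs_manual_set_metadata(hive_spec))
# prints Hive False
print(iceberg_catalog_type(default_spec), needs_manual_set_metadata(default_spec))
# prints Hadoop True
```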
Scan and Catalog Metadata¶
By running the Scanner, you can scan the metadata from a source system via the Depot interface. Once the metadata is scanned, you can utilize Metis to catalog and explore the metadata in a structured manner. This allows for efficient management and organization of data resources.
Add Depot to Cluster Sources to Query the Data¶
To enable the Minerva Query Engine to access a specific source system, you can add the Depot to the list of sources in the Cluster. This allows you to query the data and create dashboards using the DataOS Workbench and Atlas.
Create Policies upon Depots to Govern the Data¶
Access and Data Policies can be created upon Depots to govern the data. This helps in reducing data breach risks and simplifying compliance with regulatory requirements. Access Policies can restrict access to specific depots, collections, or datasets, while Data Policies allow you to control the visibility and usage of data.
Building Data Models¶
You can use Lens to create Data Models on top of Depots and explore them using the Lens App UI.
Depot Configuration Templates¶
To facilitate the creation of depots accessing commonly used data sources, we have compiled a collection of pre-defined YAML configuration templates. These templates serve as a starting point, allowing you to quickly set up depots for popular data sources. You can access the list of these templates by visiting the following page: Depot Config Templates
Data Integration - Supported Connectors in DataOS¶
The catalogue of data sources accessible by one or more components within DataOS is provided on the following page: Supported Connectors in DataOS