DataOS SDK for Python¶
The DataOS SDK for Python includes functionality to accelerate development with Python for the DataOS platform. It provides a cohesive set of APIs, each accessible through its respective services, enabling seamless interaction with the platform. By utilizing the SDK, data developers unlock the potential to construct innovative solutions and integrate them seamlessly into their existing resources.
Installation¶
Prerequisites¶
This section describes the steps to follow before installing DataOS SDK for Python.
Ensure you have Python ≥ 3.7 Installed
Prior to installation, ensure that you have Python 3.7 and above installed on your system. You can check the Python version by running:
For Linux/macOS
For Windows
If you do not have Python, please install the latest version from python.org.
Note: If you’re using an enhanced shell like IPython or Jupyter notebook, you can run system commands like those given above by prefacing them with a
!
character:
Ensure you have pip
installed
Ensure that pip
is available by running the following command:
For Linux/macOS
For Windows
If Python was installed from source, via an installer from python.org, or using Homebrew, pip
is typically included by default. On Linux systems where Python was installed using the operating system’s package manager, pip
may need to be installed separately. Refer to the Installing pip/setuptools/wheel with Linux Package Managers guide for instructions.
Installing dataos-sdk-py
from PyPI¶
The DataOS SDK for Python can be installed from the Python Package Index (PyPI) using the following command:
Recommendation
Install the dataos-sdk-py==0.0.1
version of DataOS SDK for Python, as it is the designated stable release.
For Linux/macOS
For Windows
Note: If you’re using an enhanced shell like IPython or Jupyter notebook, you must restart the runtime in order to use the newly installed package.
Getting Started¶
Upon successful installation of the DataOS SDK for Python, it is now time to commence your coding journey.
List Metadata of a Dataset¶
To get started, let's walk through a straightforward example of establishing a connection with DataOS, configuring it with the required parameters, and accessing metadata for a dataset.
Code Snippet
# Import the DepotServiceClientBuilder class from the depot_service.depot_service_client module
from depot_service.depot_service_client import DepotServiceClientBuilder
# Define your DataOS user API key token
# (Replace '{{dataos user apikey token}}' with your actual DataOS API key token. e.g. abcdefghijklmnopqrst)
# Fetch the DataOS User API key token using `dataos-ctl user apikey get/create` command on DataOS CLI
apikey = "{{dataos user apikey token}}"
# Define the base URL for the DataOS Depot service
# (Replace the '{{dataos instance fqdn}}' with your specific DataOS Instance FQDN e.g. https://sunny-prawn.dataos.app/ds)
base_url = "https://{{dataos instance fqdn}}/ds"
# Create an instance of the DepotServiceClientBuilder
ds_client = DepotServiceClientBuilder()
# Configure the DepotServiceClientBuilder with the API key and base URL
ds_client.set_apikey(apikey)
ds_client.set_base_url(base_url)
# Build the DepotServiceClient, which is ready to interact with DataOS
ds_client_instance = ds_client.build()
# Access the dataset_api and list metadata for a specific dataset
metadata = ds_client_instance.dataset_api.list_metadata(depot="lakehouse", collection="retail", dataset="city")
# Print Metadata
print(metadata)
Expected Output
[MetadataVersionResponse(version='v1.gz.metadata.json', timestamp=1696940109201), MetadataVersionResponse(version='v2.gz.metadata.json', timestamp=1696940212855), MetadataVersionResponse(version='v3.gz.metadata.json', timestamp=1697550809632), MetadataVersionResponse(version='v4.gz.metadata.json', timestamp=1698387825353), MetadataVersionResponse(version='v5.gz.metadata.json', timestamp=1699016002681)]
For additional information regarding the subpackages and submodules contained within the depot_service
package, please refer to the respective links within the Python SDK Library Reference.
Retrieve Dataset Statistics¶
In the following code snippet, we demonstrate the setup of the Depot Service client and the process of retrieving statistics for a dataset using the DataOS SDK for Python.
Code Snippet
# Import the DepotServiceClientBuilder class from the depot_service.depot_service_client module
from depot_service.depot_service_client import DepotServiceClientBuilder
# Create an instance of DepotServiceClientBuilder for setting up the DataOS client
ds_client = DepotServiceClientBuilder()
# Configure the DataOS client with the following parameters in a method chain:
# - Set the base URL for DataOS.
# - Provide the DataOS API key.
# - Build the DataOS client.
ds_client = ds_client.set_base_url("https://{{dataos fqdn}}/ds") \
.set_apikey("{{dataos user apikey token}}") \
.build()
# Show statistics for a specific dataset in DataOS by specifying the depot, collection, and dataset name.
stats = ds_client.dataset_api.show_stats(depot="{{depot name}}", collection="{{collection name}}", dataset="{{dataset name}}")
# Print the stats to the console.
print(stats)
Expected Output
stats = {
'totalRecords': '213500',
'totalPartitions': '0',
'totalSnapshots': '4',
'totalFileSize': '6742016',
'totalDataFiles': '4'
}
timeline = {
'1697550809632': {
'recordCount': '53375',
'operation': 'append',
'schema': {
"type": "record",
"name": "defaultName",
"fields": [
{"name": "__metadata", "type": {"type": "map", "values": "string", "key-id": 10, "value-id": 11}, "field-id": 1},
{"name": "city_id", "type": ["null", "string"], "default": None, "field-id": 2},
{"name": "zip_code", "type": ["null", "int"], "default": None, "field-id": 3},
{"name": "city_name", "type": ["null", "string"], "default": None, "field-id": 4},
{"name": "county_name", "type": ["null", "string"], "default": None, "field-id": 5},
{"name": "state_code", "type": ["null", "string"], "default": None, "field-id": 6},
{"name": "state_name", "type": ["null", "string"], "default": None, "field-id": 7},
{"name": "version", "type": "string", "field-id": 8},
{"name": "ts_city", "type": {"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": True}, "field-id": 9}
]
},
'versionFile': 'v3.gz.metadata.json'
},
'1696940212855': {
'recordCount': '53375',
'operation': 'append',
'schema': {
"type": "record",
"name": "defaultName",
"fields": [
{"name": "__metadata", "type": {"type": "map", "values": "string", "key-id": 10, "value-id": 11}, "field-id": 1},
{"name": "city_id", "type": ["null", "string"], "default": None, "field-id": 2},
{"name": "zip_code", "type": ["null", "int"], "default": None, "field-id": 3},
{"name": "city_name", "type": ["null", "string"], "default": None, "field-id": 4},
{"name": "county_name", "type": ["null", "string"], "default": None, "field-id": 5},
{"name": "state_code", "type": ["null", "string"], "default": None, "field-id": 6},
{"name": "state_name", "type": ["null", "string"], "default": None, "field-id": 7},
{"name": "version", "type": "string", "field-id": 8},
{"name": "ts_city", "type": {"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": True}, "field-id": 9}
]
},
'versionFile': 'v2.gz.metadata.json'
},
'1696940109201': {
'versionFile': 'v1.gz.metadata.json'
},
'1699016002681': {
'recordCount': '53375',
'operation': 'append',
'schema': {
"type": "record",
"name": "defaultName",
"fields": [
{"name": "__metadata", "type": {"type": "map", "values": "string", "key-id": 10, "value-id": 11}, "field-id": 1},
{"name": "city_id", "type": ["null", "string"], "default": None, "field-id": 2},
{"name": "zip_code", "type": ["null", "int"], "default": None, "field-id": 3},
{"name": "city_name", "type": ["null", "string"], "default": None, "field-id": 4},
{"name": "county_name", "type": ["null", "string"], "default": None, "field-id": 5},
{"name": "state_code", "type": ["null", "string"], "default": None, "field-id": 6},
{"name": "state_name", "type": ["null", "string"], "default": None, "field-id": 7},
{"name": "version", "type": "string", "field-id": 8},
{"name": "ts_city", "type": {"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": True}, "field-id": 9}
]
},
'versionFile': 'v5.gz.metadata.json'
},
'1698387825353': {
'recordCount': '53375',
'operation': 'append',
'schema': {
"type": "record",
"name": "defaultName",
"fields": [
{"name": "__metadata", "type": {"type": "map", "values": "string", "key-id": 10, "value-id": 11}, "field-id": 1},
{"name": "city_id", "type": ["null", "string"], "default": None, "field-id": 2},
{"name": "zip_code", "type": ["null", "int"], "default": None, "field-id": 3},
{"name": "city_name", "type": ["null", "string"], "default": None, "field-id": 4},
{"name": "county_name", "type": ["null", "string"], "default": None, "field-id": 5},
{"name": "state_code", "type": ["null", "string"], "default": None, "field-id": 6},
{"name": "state_name", "type": ["null", "string"], "default": None, "field-id": 7},
{"name": "version", "type": "string", "field-id": 8},
{"name": "ts_city", "type": {"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": True}, "field-id": 9}
]
},
'versionFile': 'v4.gz.metadata.json'
}
}
properties = {
'write.format.default': 'parquet',
'write.metadata.compression-codec': 'gzip'
}
Python SDK Library Reference¶
For a detailed reference guide on the Python SDK and its subpackages, modules, and classes, please visit the Python SDK Library Reference..