Querying Diverse Data Sources with DataOS Workbench¶
Information
This session focuses on leveraging DataOS Workbench to query diverse data sources effectively, enabling organizations to derive meaningful insights from vast amounts of data stored across various platforms.
Organizations handle vast amounts of data across various storage systems, such as relational databases (e.g., MySQL, SQL Server, PostgreSQL) and object storage systems (e.g., S3, HDFS, Azure). The challenge lies in swiftly accessing the right data to derive meaningful insights.
DataOS query engines simplify this process by allowing SQL queries across relational and non-relational databases and object stores. With Minerva and Themis clusters, users can query data stored in diverse sources without needing to learn each database's specific query language. It streamlines data access and analysis, making it easier to derive insights. DataOS Workbench provides a unified interface for querying and analyzing diverse datasets.
Key Steps¶
Querying Across Data Sources¶
Let's consider an example where you need to analyze sales data stored in a PostgreSQL database and customer data stored in Microsoft Azure to identify the Most valuable customers as per Order Quantity. With DataOS Workbench, you can seamlessly connect to these disparate data sources and write federated SQL queries to combine the relevant data.
Step 1: Connect to Data Sources¶
Ensure that Depots for accessing data sources are created and the Minerva cluster is properly configured to access these sources.
For the above query, we need two depots to fetch data from different sources [Azure MS-SQL and PostgreSQL database]. We also need a Minerva Cluster to run analytical queries with joins.
Step 2: Open Workbench App and Explore Data¶
-
Select catalog. Under Catalog, you can see your data sources.
-
Write a query to view the data.
The following query shows the customer data from the Microsoft Azure data source.
The following query shows the sales data from the PostgreSQL data source.
Step 3: Analytical Query on Heterogeneous Data Sources¶
-
Combine data from different data sources using a federated SQL query.
This query will join the data from both data sources.
-- Highest valuable customers as per Order Quantity SELECT B.customerkey, sum(A.orderquantity) as total_order FROM "postgresqldbdatabase"."public".sales as A join "azuresqldb"."dbo".customers as B on A.customerkey = B.customerkey group by 1 order by 2 desc limit 5;
Here is the result of the above query that shows the most valuable customers as per order quantity.