Best Practices¶
This section outlines best practices for optimizing the Flash Service for efficient query processing and managing concurrent queries from multiple analysts without requiring service restarts or encountering performance issues.
Adding indexes¶
To enhance query performance, indexes can be created on columns frequently used in WHERE
clauses or JOIN
conditions. Indexes enable faster data retrieval by quickly locating records based on specific criteria. The in-memory database supports two primary types of indexes:
Min-Max index (Zonemap)¶
This index is automatically created for every column and is designed to speed up queries involving range searches, such as date ranges. By storing the minimum and maximum values for data blocks, the in-memory database can efficiently skip over data blocks that fall outside the specified range.
Example: For a table containing sales data with columns id
, amount
, and sale_date
, the following query retrieves sales records within a specified date range:
The Min-Max index on the sale_date
column enables the in-memory database to bypass data blocks that do not fall within the specified date range.
For more information on the Min-Max Index, refer to this section.
Adaptive Radix Tree (ART)¶
ART indexes must be explicitly defined by the user. They are effective for queries involving exact matches or columns with many unique values, providing rapid lookups.
Example: For a table containing user records with a unique identifier user_id
, the in-memory database automatically creates an ART index on this column. The following query finds a user by their ID:
The ART index allows the database to quickly locate the specified user_id
, ensuring fast query execution.
For additional details on the ART Index, refer to this section.
Creating and dropping indexes¶
Additional indexes can be created on other columns to further improve query performance. Below are examples for creating and dropping indexes:
-- Syntax for creating an index:
CREATE INDEX index_name ON table_name(column_name);
-- Example:
CREATE INDEX idx_state ON sales_360(state);
To remove an index, use the DROP INDEX
statement:
Adding configuration parameters¶
To optimize Flash for read-intensive queries and handle concurrent queries effectively, certain configuration parameters can be adjusted in the INIT
section of the YAML configuration.
Threads and Worker threads¶
The threads
and worker_threads
settings determine the number of threads the in-memory database uses for query execution. By default, these settings match the number of available CPU cores.
- Threads: Represents the maximum number of threads that the in-memory database will use for any operation.
-
Worker threads: A subset of the total
threads
, dedicated specifically to computational tasks during query execution. -
Adjusting the number of threads beyond the physical cores may improve throughput by allowing more concurrent operations. However, it is critical to balance this adjustment to avoid over-saturating the CPU, which can lead to contention and degraded performance.
- Thread contention: Excessive threads competing for CPU resources may result in contention, causing delays as threads wait for CPU access.
- Context switching: Increasing threads beyond available memory can lead to frequent context switching, where the CPU alternates between threads. This overhead can reduce performance as time is spent managing threads rather than executing them.
- Potential slowdown: Overuse of threads may slow query execution if the management overhead outweighs the benefits of parallel processing.
External threads¶
The external_threads
parameter specifies the number of threads for operations involving external resources, such as reading from remote files. This setting is particularly useful for parallelizing I/O operations, such as data retrieval from cloud storage or external databases.
- Increasing the number of
external_threads
can boost performance in data-intensive applications. - By default,
external_threads
is set to 1.
Example¶
-- Define these in the ENV variable FLASH_CONFIG_INIT_SQL section of the service YAML.
-- Query to check existing settings:
SELECT * FROM duckdb_settings() WHERE name IN ('external_threads', 'memory_limit', 'threads', 'worker_threads');
-- Queries to adjust these settings:
SET threads = <number_of_threads>;
SET external_threads = <number_of_threads>;
SET worker_threads = <number_of_threads>;
Using persistent volume¶
For workloads that exceed available memory, the in-memory databases use spilling to disk. This technique moves portions of data to disk when the dataset is too large for memory. Adding a persistent volume to the Flash Service enables effective management of such workloads. A persistent volume provides additional storage space for temporarily offloading data, allowing Flash to handle large datasets without exhausting memory resources.
The persistentVolume
attribute can be added to the Flash Service manifest file as shown below:
name: flash-test-old-10gb
version: v1
type: service
tags:
- service
description: Flash service
workspace: public
service:
servicePort: 5433
replicas: 1
stack: flash+python:1.0
persistentVolume:
name: duckdb-vol
directory: p_volume
logLevel: debug
compute: runnable-default
resources:
requests:
cpu: 2
memory: 1Gi
limits:
cpu: 2
memory: 2Gi
stackSpec:
datasets:
- address: dataos://icebase:flash/f_sales
name: numerous
- address: dataos://icebase:flash/product_data_master
name: product
- address: dataos://icebase:flash/site_check1
name: site
- address: dataos://icebase:flash/customer_data_master
name: customer
init:
- CREATE TABLE IF NOT EXISTS d_customer AS (SELECT * FROM customer);
- CREATE TABLE IF NOT EXISTS f_sales AS (SELECT * FROM numerous);
- CREATE TABLE IF NOT EXISTS d_product AS (SELECT * FROM product);
- CREATE TABLE IF NOT EXISTS d_site AS (SELECT * FROM site);
- SELECT * FROM duckdb_settings() WHERE name IN ('external_threads', 'memory_limit', 'threads', 'worker_threads', 'checkpoint_threshold');