Configurations

This section describes how to configure a Lakesearch Service and index Lakehouse tables with it.

# service specific section
name: testingls
version: v1
type: service
tags:
  - service
  - dataos:type:resource
  - dataos:resource:service
  - dataos:layer:user
description: Lakesearch Service Simple Index Config
workspace: public
service:
  servicePort: 4080
  ingress:
    enabled: true
    stripPath: false
    path: /lakesearch/public:testingls
    noAuthentication: true
  replicas: 1
  logLevel: 'DEBUG'
  compute: runnable-default
  envs:
    LAKESEARCH_SERVER_NAME: "public:testingls"
    DATA_DIR: public/testingls/sample
    USER_MODULES_DIR: /etc/dataos/config
  persistentVolume:
    name: ls-v2-test-vol
    directory: public/testingls/sample
  resources:
    requests:
      cpu: 1000m
      memory: 1536Mi
  stack: lakesearch:1.0
  stackSpec:
    lakesearch:
# index specific section    
      source:
        datasets:
          - name: city
            dataset: dataos://lakehouse:retail/city
            options:
              region: us-gov-east-1
              endpoint: s3.us-gov-east-1.amazonaws.com
      index_tables:
        - name: city
          description: "index for cities"
          tags:
            - cities
          properties:
            morphology: stem_en
          columns:
            - name: city_id
              type: keyword
            - name: zip_code
              type: bigint  
            - name: id
              description: "mapped to zip_code"
              tags:
                - identifier
              type: bigint
            - name: city_name
              type: keyword
            - name: county_name
              type: keyword
            - name: state_code
              type: keyword
            - name: state_name
              type: text
            - name: version
              type: text
            - name: ts_city
              type: timestamp

      indexers:
        - index_table: city
          base_sql: |
            SELECT 
              city_id,
              zip_code,
              zip_code as id,
              city_name,
              county_name,
              state_code,
              state_name,
              version,
              cast(ts_city as timestamp) as ts_city

            FROM 
              city
          options:
            start: 1734979551
            start_query: "SELECT max(source_ts_ms) FROM products"
            step: 86400
            batch_sql: |
              WITH base AS (
                  {base_sql}
              ) SELECT 
                * 
              FROM 
                base 
              WHERE 
                epoch(ts_city) >= {start} AND epoch(ts_city) < {end}
            throttle:
              min: 10000
              max: 60000
              factor: 1.2
              jitter: true

The Service YAML is divided into two main sections: Service-specific configuration and Index-specific configuration.

Service-specific configuration¶

name¶

Description: Declare a name for the Resource.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	Alphanumeric values with regex `[a-z0-9]([-a-z0-9]*[a-z0-9])` (max length: 48 characters)

Additional Information: Two resources in the same workspace cannot have the same name.

Example usage:

name: testingls
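
The naming rule above can be checked before a manifest is applied. The following is a minimal sketch, not part of Lakesearch or the DataOS CLI: the function name `is_valid_resource_name` is hypothetical, and the pattern and the 48-character limit are copied from the table above.

```python
import re

# Pattern and length limit copied from the table above.
NAME_PATTERN = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])")
MAX_NAME_LENGTH = 48


def is_valid_resource_name(name: str) -> bool:
    """Return True if `name` matches the documented pattern and length limit."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    # fullmatch ensures the whole string, not just a prefix, obeys the pattern.
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("testingls", "Testing_LS", "testingls-"):
        print(candidate, is_valid_resource_name(candidate))
```

Uppercase letters, underscores, and a trailing hyphen are all rejected by the pattern as written.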

version¶

Description: The version of the Resource.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	v1, v1beta, v1alpha, v2alpha

Example usage:

version: v1

type¶

Description: Provide the type of the Resource.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	cluster, compute, depot, policy, secret, service, stack, workflow

Example usage:

type: service

tags¶

Description: Assign tags to the Resource instance.

Data Type	Requirement	Default Value	Possible Value
list	mandatory	none	Any string; special characters allowed

Example usage:

tags:
  - service
  - dataos:type:resource
  - dataos:resource:service
  - dataos:layer:user

description¶

Description: Assign a description to the Resource.

Data Type	Requirement	Default Value	Possible Value
string	optional	none	Any string

Additional Information: The description can be within quotes or without.

Example usage:

description: "Lakesearch Service Simple Index Config"

workspace¶

Description: Defines the workspace where the Resource belongs.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	Any valid workspace name

Example usage:

workspace: public

service¶

Description: Defines the service configuration.

Data Type	Requirement	Default Value	Possible Value
mapping	mandatory	none	Contains attributes like `servicePort`, `ingress`, `replicas`, `logLevel`, `compute`, etc.

Example usage:

service:
  servicePort: 4080
  ingress:
    enabled: true
    stripPath: false
    path: /lakesearch/public:testingls
    noAuthentication: true
  replicas: 1
  logLevel: 'DEBUG'
  compute: runnable-default

servicePort¶

Description: Defines the port on which the service runs.

Data Type	Requirement	Default Value	Possible Value
integer	mandatory	none	Any valid port number (e.g., 1024–65535)

Example usage:

servicePort: 4080

ingress¶

Description:

The ingress attribute defines the network exposure configuration for the Lakesearch service. It controls how external requests are routed to the service, whether authentication is required, and how paths are handled.

Data Type	Requirement	Default Value	Possible Values
object	optional	none	Contains enabled, stripPath, path, and noAuthentication

ingress.enabled¶

Description:

Determines whether ingress is enabled for exposing the service.

Data Type	Requirement	Default Value	Possible Values
boolean	mandatory	false	`true` (ingress is enabled), `false` (ingress is disabled)

Example Usage:

ingress:
  enabled: true

ingress.stripPath¶

Description:

Controls whether the request path should be stripped before forwarding it to the service.

Data Type	Requirement	Default Value	Possible Values
boolean	optional	false	`true` (removes path prefix), `false` (keeps full path)

Example Usage:

ingress:
  stripPath: false
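
To illustrate the difference between the two `stripPath` values, here is a small sketch of the path rewriting described above. It models the general behaviour of removing a path prefix and is an assumption about how the gateway applies it; `forwarded_path` and the `/example` endpoint are hypothetical names used only for illustration.

```python
def forwarded_path(request_path: str, ingress_path: str, strip_path: bool) -> str:
    """Return the path the service would receive for a request hitting the ingress."""
    if not request_path.startswith(ingress_path):
        raise ValueError("request path does not start with the ingress path")
    if not strip_path:
        # stripPath: false -> the full path is kept.
        return request_path
    # stripPath: true -> the ingress prefix is removed.
    remainder = request_path[len(ingress_path):]
    return remainder if remainder.startswith("/") else "/" + remainder


if __name__ == "__main__":
    ingress = "/lakesearch/public:testingls"
    request = ingress + "/example"  # hypothetical endpoint
    print(forwarded_path(request, ingress, strip_path=False))
    print(forwarded_path(request, ingress, strip_path=True))
```

With `stripPath: false`, as in the sample manifest, the service has to handle the full `/lakesearch/public:testingls` prefix itself.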

ingress.path¶

Description:

Specifies the URL path that will be exposed through ingress.

Data Type	Requirement	Default Value	Possible Values
string	mandatory	none	A valid URL path

Example Usage:

ingress:
  path: /lakesearch/public:testingls

In this example, the service is exposed at /lakesearch/public:testingls.

ingress.noAuthentication¶

Description:

Defines whether authentication is required to access the exposed service.

Data Type	Requirement	Default Value	Possible Values
boolean	optional	false	`true` (no authentication required), `false` (authentication required)

Example Usage:

ingress:
  noAuthentication: true

This allows unrestricted public access to the service.

replicas¶

Description: Specifies the number of service instances to run.

Data Type	Requirement	Default Value	Possible Value
integer	optional	1	Any positive integer

Example usage:

replicas: 1

logLevel¶

Description: Defines the logging level for debugging and monitoring.

Data Type	Requirement	Default Value	Possible Value
string	optional	INFO	DEBUG, INFO, WARN, ERROR

Example usage:

logLevel: 'DEBUG'

compute¶

Description: Specifies the compute environment where the service runs.

Data Type	Requirement	Default Value	Possible Value
string	optional	none	runnable-default, custom environments

Example usage:

compute: runnable-default

envs¶

Description: Defines environment variables for configuring the service.

Data Type	Requirement	Default Value	Possible Value
object	optional	none	Key-value pairs defining runtime settings

Example usage:

envs:
  LAKESEARCH_SERVER_NAME: "public:testingls"
  DATA_DIR: public/testingls/sample
  USER_MODULES_DIR: /etc/dataos/config

LAKESEARCH_SERVER_NAME¶

Description: Defines the server name that exposes the ingress path.

Data Type	Requirement	Default Value	Possible Values
string	mandatory	none	Any alphanumeric string following naming conventions and the `<workspace>:<service_name>` nomenclature

Example Usage:

envs:
  LAKESEARCH_SERVER_NAME: "public:testingls"

In this example, the server name is set to public:testingls, which uniquely identifies the Lakesearch instance.

DATA_DIR¶

Description: Sets the path on the persistent volume where indexes are stored. The path always follows the nomenclature given in the table below.

Data Type	Requirement	Default Value	Possible Values
string	mandatory	none	A valid directory path that follows the `<workspace>/<service_name>/<folder_name>` nomenclature

Example Usage:

envs:
  DATA_DIR: public/testingls/sample

This configuration sets the data directory to public/testingls/sample, where Lakesearch stores its indexed datasets.

USER_MODULES_DIR¶

Description:

Specifies the directory path for user-defined modules and configurations used by Lakesearch.

Data Type	Requirement	Default Value	Possible Values
string	optional	none	A valid absolute directory path

Example Usage:

envs:
  USER_MODULES_DIR: /etc/dataos/config

This indicates that user-specific configuration files are located in /etc/dataos/config.

persistentVolume¶

Description: Specifies a common storage location for Lakehouse services. This accepts the following two keys:

name: Name of the persistent volume claim.
directory: Path where the indexed documents are stored. It must be identical to the path set in the DATA_DIR environment variable and follows the `<workspace>/<service_name>/<folder_name>` nomenclature.

Data Type	Requirement	Default Value	Possible Value
object	mandatory	none	Defines the volume name and a storage location that follows the `<workspace>/<service_name>/<folder_name>` nomenclature

If a Volume is not already configured in the environment, create one before deploying the Service (see the Volume Resource documentation).

Example usage:

persistentVolume:
  name: ls-v2-test-vol
  directory: public/testingls/sample
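
Because `LAKESEARCH_SERVER_NAME`, `DATA_DIR`, and `persistentVolume.directory` all derive from the workspace and Service name, a mismatch between them is easy to introduce. The sketch below checks the three nomenclature rules stated above against plain dictionaries. It is an illustrative helper, not part of Lakesearch; the function name `check_storage_settings` is hypothetical.

```python
def check_storage_settings(workspace: str, service_name: str,
                           envs: dict, persistent_volume: dict) -> list[str]:
    """Return a list of problems found; an empty list means the settings agree."""
    problems = []

    # LAKESEARCH_SERVER_NAME follows <workspace>:<service_name>.
    expected_server = f"{workspace}:{service_name}"
    if envs.get("LAKESEARCH_SERVER_NAME") != expected_server:
        problems.append(f"LAKESEARCH_SERVER_NAME should be {expected_server!r}")

    # DATA_DIR follows <workspace>/<service_name>/<folder_name>.
    data_dir = envs.get("DATA_DIR", "")
    parts = data_dir.split("/")
    if len(parts) != 3 or parts[0] != workspace or parts[1] != service_name or not parts[2]:
        problems.append("DATA_DIR should follow <workspace>/<service_name>/<folder_name>")

    # persistentVolume.directory must be the same path as DATA_DIR.
    if persistent_volume.get("directory") != data_dir:
        problems.append("persistentVolume.directory must equal DATA_DIR")

    return problems


if __name__ == "__main__":
    envs = {
        "LAKESEARCH_SERVER_NAME": "public:testingls",
        "DATA_DIR": "public/testingls/sample",
    }
    volume = {"name": "ls-v2-test-vol", "directory": "public/testingls/sample"}
    print(check_storage_settings("public", "testingls", envs, volume))
```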

resources¶

Description: Specifies resource requests and limits for compute allocation.

Data Type	Requirement	Default Value	Possible Value
object	optional	none	Defines CPU and memory allocation

Additional Information:

requests.cpu: CPU units requested (e.g., 1000m = 1 core).
requests.memory: Memory requested (in MiB or GiB).

Example usage:

resources:
  requests:
    cpu: 1000m
    memory: 1536Mi

stack¶

Description: Indicates the version of the Lakesearch Stack for the Service to run.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	Format: `<stack-name>:<version>`

Example usage:

stack: lakesearch:6.0

stackSpec¶

Description: The stackSpec attribute defines the stack-specific configurations required to deploy and manage the indexing pipeline for Lakesearch. It encapsulates settings related to data sources, indexing tables, and indexing processes.

Data Type	Requirement	Default Value	Possible Value
object	mandatory	none	Contains lakesearch configurations

lakesearch¶

Description: The lakesearch attribute within stackSpec specifies configurations for indexing data in Lakesearch, including data sources, index tables, and indexing processes.

Data Type	Requirement	Default Value	Possible Value
object	mandatory	none	Contains source, index_tables, and indexers

Index specific configuration¶

This section handles specific table details and indexing configurations. Lakesearch supports indexing either a single table or multiple tables within a service, depending on the setup in the stackSpec.

source¶

Description: Defines the dataset(s) that act as input for indexing.

Data Type	Requirement	Default Value	Possible Value
object	mandatory	none	Contains datasets

Lakesearch supports four types of sources.

Dataset

Syntax:

source:
  datasets:
    - name:
      dataset:
      options:        # optional
        region:       # required for environments created on AWS cloud
        endpoint:     # required for S3 sources

Example:

source:
  datasets:
    - name: devices
      dataset: dataos://lakehouse:lenovo_ls_data/devices_with_d
      options:

Flash

Syntax:

source:
  flash: <workspace>:<name of Flash service>
  options:
    sslmode: disable
# or
source:
  postgres: flash://<workspace>.<name of Flash service>

Example:

source:
  flash: public:flash-test-9
  options:
    sslmode: disable
# or
source:
  postgres: flash://public.flash-test-9

PostgreSQL

Syntax:

source:
  postgres: postgresql://username:password@host:port/database?sslmode=disable
# or
source:
  postgres: depot://<depot_name>

Example:

source:
  postgres: postgresql://admin:admin@flash:5433/main?sslmode=disable
# or
source:
  postgres: depot://stpostgres

Depot (Postgres)

Syntax:

source:
  depot: dataos://<name of the depot>

Example:

source:
  depot: dataos://mysqltest
datasets¶

Description: Defines the input datasets used by Lakesearch.

Data Type	Requirement	Default Value	Possible Value
list	mandatory	none	List of dataset objects containing `name`, `dataset`, and `options`

Additional Information:

name: Defines the dataset identifier.
dataset: Specifies the dataset reference.
options: Contains configuration settings (e.g., region).

Example usage:

source:
  datasets:
    - name: city
      dataset: dataos://lakehouse:retail/city
      options:
        region: us-gov-east-1
        endpoint: s3.us-gov-east-1.amazonaws.com

datasets.name¶

Description: The identifier for the dataset within the Lakesearch Service configuration. This is a user-defined name used to reference the dataset in other parts of the configuration (such as indexers).

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	Any alphanumeric string

Example usage:

datasets:
  - name: city

datasets.dataset¶

Description: The fully qualified reference to the data source. This typically follows the format dataos://<catalog>:<schema>/<dataset> for Lakehouse sources.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	A valid dataset URI (e.g., `dataos://lakehouse:retail/city`)

Example usage:

datasets:
  - name: city
    dataset: dataos://lakehouse:retail/city

datasets.options¶

Description: Additional configuration options for the dataset source. For S3 sources, it is recommended to specify both region and endpoint for correct connectivity, especially in non-default or government regions. For other source types, this field may be omitted or used for other relevant options.

Data Type	Requirement	Default Value	Possible Value
object	optional	none	Key-value pairs for source-specific settings

Example usage:

options:
  region: us-gov-east-1
  endpoint: s3.us-gov-east-1.amazonaws.com

datasets.options.region¶

Description: Specifies the AWS region where the S3 bucket is located. Required for environments created on AWS cloud.

Data Type	Requirement	Default Value	Possible Value
string	recommended	none	Any valid AWS region (e.g., `us-east-1`, `ap-south-1`, `us-gov-east-1`)

Example usage:

region: us-gov-east-1

datasets.options.endpoint¶

Description: Specifies the S3 service endpoint. This is recommended for S3 sources, especially when using non-default or government regions. For other data sources, this is not required.

Data Type	Requirement	Default Value	Possible Value
string	recommended for S3	none	Any valid S3 endpoint (e.g., `s3.us-gov-east-1.amazonaws.com`)

Example usage:

endpoint: s3.us-gov-east-1.amazonaws.com

index_tables¶

Description: Defines the structure and metadata of an index table used in Lakesearch.

Data Type	Requirement	Default Value	Possible Value
list	mandatory	none	List of index table definitions

index_tables.name¶

Description: The name of the index table.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	Alphanumeric string (max length: 48 characters)

Example usage:

name: city

index_tables.description¶

Description: A brief description of the index table's purpose.

Data Type	Requirement	Default Value	Possible Value
string	optional	none	Any descriptive text

Example usage:

description: "index for cities"

index_tables.tags¶

Description: Labels used to categorize the index table.

Data Type	Requirement	Default Value	Possible Value
list	optional	none	List of strings

Example usage:

tags:
  - cities

index_tables.properties¶

Description: Defines specific properties for text processing and indexing behavior.

Data Type	Requirement	Default Value	Possible Value
object	optional	none	Configuration parameters (e.g., morphology)

Example usage:

properties:
  morphology: stem_en

index_tables.columns¶

Description: Specifies the schema of the indexed data, including column names, types, and metadata. Users can index all columns from the dataset or exclude specific columns, depending on their requirements.

🗣️ Note: when defining the columns of an index table, you must add an additional column named `id` of type `bigint`, as shown below:

      index_tables:
        - name: newcity
          description: "index for cities"
          tags:
            - cities
          properties:
            morphology: stem_en
          columns:
            - name: city_id
              type: keyword
            - name: zip_code
              type: bigint  
            - name: id              # additional column (mandatory)
              description: "mapped to zip_code"
              tags:
                - identifier
              type: bigint
            - name: city_name
              type: keyword
            - name: county_name
              type: keyword
            - name: state_code
              type: keyword
            - name: state_name
              type: text
            - name: version
              type: text
            - name: ts_city
              type: timestamp

In the indexer's base SQL, alias the primary key column as `id` so that it maps to this column.

      indexers:
        - index_table: newcity
          base_sql: |
            SELECT 
              city_id,
              zip_code,
              zip_code as id,  -- primary key
              city_name,
              county_name,
              state_code,
              state_name,
              version,
              cast(ts_city as timestamp) as ts_city

            FROM 
              city

Data Type	Requirement	Default Value	Possible Value
list	mandatory	none	List of column definitions
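
The `id` requirement described above spans two places: the column list of the index table and the `base_sql` of its indexer. The sketch below checks both. It is a rough heuristic under stated assumptions, not Lakesearch's own validation: `check_id_column` is a hypothetical name, and the SQL check only looks for an `as id` alias, so a source column that is already named `id` and selected without an alias would be flagged.

```python
import re


def check_id_column(columns: list[dict], base_sql: str) -> list[str]:
    """Return a list of problems with the mandatory `id` column."""
    problems = []

    id_columns = [column for column in columns if column.get("name") == "id"]
    if not id_columns:
        problems.append("index table has no `id` column")
    elif id_columns[0].get("type") != "bigint":
        problems.append("`id` column must be of type bigint")

    # Look for an alias such as `zip_code as id` in the SELECT list.
    if not re.search(r"\bas\s+id\b", base_sql, flags=re.IGNORECASE):
        problems.append("base_sql does not alias any column as `id`")

    return problems


if __name__ == "__main__":
    columns = [
        {"name": "zip_code", "type": "bigint"},
        {"name": "id", "type": "bigint"},
        {"name": "city_name", "type": "keyword"},
    ]
    sql = "SELECT zip_code, zip_code as id, city_name FROM city"
    print(check_id_column(columns, sql))
```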

index_tables.columns.name¶

Description: The name of the column in the index table.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	Alphanumeric string

Example usage:

name: city_id

index_tables.columns.type¶

Description: The data type of the column.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	keyword, bigint, text, timestamp, etc.

Example usage:

type: keyword

Data types allowed

Data Type	Description	Category
`text`	The text data type forms the full-text part of the table. Text fields are indexed and can be searched for keywords.	full-text field
`keyword`	Unlike full-text fields, string attributes are stored as they are received and cannot be used in full-text searches. Instead, they are returned in results, can be used to filter, sort and aggregate results. In general, it's not recommended to store large texts in string attributes, but use string attributes for metadata like names, titles, tags, keys, etc.	attribute
`int`	Integer type allows storing 32 bit unsigned integer values.	attribute
`bigint`	Big integers (bigint) are 64-bit wide signed integers.	attribute
`bool`	Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.	attribute
`timestamp`	The timestamp type represents Unix timestamps, which are stored as 32-bit integers. The system expects a date/timestamp type object from the base_sql.	attribute
`float`	Real numbers are stored as 32-bit IEEE 754 single precision floats.	attribute
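
The bit widths in the table above translate into concrete value ranges, which helps when choosing between `int` and `bigint`. The following sketch derives the ranges from the stated widths only; the helper name `fits` is hypothetical.

```python
# Value ranges implied by the bit widths in the table above.
INTEGER_RANGES = {
    "int": (0, 2**32 - 1),            # 32-bit unsigned
    "bigint": (-(2**63), 2**63 - 1),  # 64-bit signed
}


def fits(column_type: str, value: int) -> bool:
    """Return True if `value` is representable in the given integer column type."""
    low, high = INTEGER_RANGES[column_type]
    return low <= value <= high


if __name__ == "__main__":
    print(fits("int", 4294967295))  # largest 32-bit unsigned value
    print(fits("int", -1))          # negative values need bigint
    print(fits("bigint", -1))
```

Negative values, or values above 4294967295, therefore need `bigint` rather than `int`.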

index_tables.columns.description¶

Description: A brief explanation of the column's role.

Data Type	Requirement	Default Value	Possible Value
string	optional	none	Any descriptive text

Example usage:

description: "mapped to zip_code"

index_tables.columns.tags¶

Description: Labels to classify the column (e.g., identifier columns).

Data Type	Requirement	Default Value	Possible Value
list	optional	none	List of strings

Example usage:

tags:
  - identifier

indexers¶

Description: Defines indexing operations for tables, including SQL queries and execution parameters.

Data Type	Requirement	Default Value	Possible Value
list	mandatory	none	List of indexers defining SQL transformations and processing options

Additional Information:

index_table: Name of the indexed table.
base_sql: SQL query for extracting indexed data.
options: Additional indexing settings (e.g., start time, batch size).
throttle: Controls indexing rate with min/max limits.

Example usage:

indexers:
  - index_table: city
    base_sql: |
      SELECT
        city_id,
        zip_code,
        zip_code as id,
        city_name,
        county_name,
        state_code,
        state_name,
        version,
        cast(ts_city as timestamp) as ts_city
      FROM
        city
    options:
      start: 1734979551
      start_query: "SELECT max(source_ts_ms) FROM products"
      step: 86400
      batch_sql: |
        WITH base AS (
            {base_sql}
        ) SELECT
          *
        FROM
          base
        WHERE
          epoch(ts_city) >= {start} AND epoch(ts_city) < {end}
    throttle:
      min: 10000
      max: 60000
      factor: 1.2
      jitter: true
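
To make the `{base_sql}`, `{start}`, and `{end}` placeholders in `batch_sql` concrete, the sketch below renders the query for one batch. It assumes that each batch covers the half-open window from `start` to `start + step`; that relationship, and the helper names `render_batch` and `windows`, are assumptions for illustration, not Lakesearch's actual substitution logic. The base query is shortened for readability.

```python
BASE_SQL = "SELECT zip_code AS id, city_name, cast(ts_city as timestamp) AS ts_city FROM city"

BATCH_SQL = """WITH base AS (
    {base_sql}
) SELECT * FROM base
WHERE epoch(ts_city) >= {start} AND epoch(ts_city) < {end}"""


def render_batch(base_sql: str, batch_sql: str, start: int, step: int) -> str:
    """Fill the placeholders for one batch, assuming end = start + step."""
    return batch_sql.format(base_sql=base_sql, start=start, end=start + step)


def windows(start: int, step: int, count: int) -> list[tuple[int, int]]:
    """Return the first `count` consecutive (start, end) windows."""
    return [(start + i * step, start + (i + 1) * step) for i in range(count)]


if __name__ == "__main__":
    print(render_batch(BASE_SQL, BATCH_SQL, start=1734979551, step=86400))
    print(windows(1734979551, 86400, 2))
```

With `start: 1734979551` and `step: 86400`, the first window under this assumption ends at 1735065951, one day after the start.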

indexers.index_table¶

Description: Specifies the target index table where data will be stored.

Data Type	Requirement	Default Value	Possible Value
string	mandatory	none	Name of an existing index table

Example usage:

index_table: city

indexers.base_sql¶

Description: The SQL query used to extract data from the source dataset before indexing.

Data Type	Requirement	Default Value	Possible Value
string (SQL)	mandatory	none	Valid SQL query

Example usage:

base_sql: |
  SELECT
    city_id,
    zip_code,
    zip_code as id,
    city_name,
    county_name,
    state_code,
    state_name,
    version,
    cast(ts_city as timestamp) as ts_city
  FROM
    city

indexers.options¶

Description: Defines additional indexing parameters such as batch processing settings.

Data Type	Requirement	Default Value	Possible Value
object	optional	none	Contains `start`, `start_query`, `step`, `batch_sql`, etc.

indexers.options.start¶

Description: Specifies the starting epoch timestamp for indexing data.

Data Type	Requirement	Default Value	Possible Value
integer	optional	none	UNIX timestamp (e.g., `1734979551`)

Example usage:

start: 1734979551

indexers.options.start_query¶

Description: Specifies a SQL query that returns a single integer, typically written as max(<column>), giving the maximum value of a numeric, incrementing column (e.g., row_num or source_ts_ms) in the index. This value is used as the starting point for fetching new records from the source Lakehouse dataset, so the column should be a timestamp, row number, or similar incremental column that identifies the most recently indexed data.

Additional Information: This approach provides a dynamic way to determine the starting point for incremental indexing. Instead of manually specifying a fixed timestamp via the start parameter, the system can automatically determine where to begin by executing the provided query against the existing dataset to find the maximum value of the specified column.

Data Type	Requirement	Default Value	Possible Value
string (SQL query)	optional	none	SQL query that returns a single integer (e.g., `"SELECT max(ts_city) FROM city"`, `"SELECT max(row_num) FROM products"`, `"SELECT max(source_ts_ms) FROM products"`)

Example usage:

start_query: "SELECT max(source_ts_ms) FROM products"

indexers.options.step¶

Description: Defines the time step in seconds between batch executions.

Data Type	Requirement	Default Value	Possible Value
integer	optional	none	Time interval in seconds (e.g., `