
Lakesearch troubleshooting

After deploying the Lakesearch Service, if its runtime state remains unchanged for an extended period, it may have encountered an issue. Users can check the Service logs to identify potential errors and resolve them.
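A long log can bury the one line that matters. The sketch below picks out the fatal lines. It assumes the log output has already been captured as text, for example by redirecting the log command to a file. The line pattern is inferred from the sample logs on this page, and `fatal_messages` is a hypothetical helper, not part of any DataOS tooling.

```python
import re

# Assumed line shape, based on the sample logs on this page:
#   "10:42AM FTL pkg/config/lakesearch.go:227 > message"
# Level codes seen in those samples: INF, WRN, DBG, FTL.
LOG_LINE = re.compile(
    r"^(?P<time>\d{1,2}:\d{2}[AP]M) (?P<level>[A-Z]{3}) (?P<source>\S+) > (?P<message>.*)$"
)


def fatal_messages(log_text):
    """Return the message part of every FTL (fatal) line in log_text."""
    messages = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "FTL":
            messages.append(match.group("message"))
    return messages


sample = """\
10:42AM INF pkg/config/config.go:126 > Loading config... file:///etc/dataos/config/lakesearch.yaml [success]
10:42AM FTL pkg/config/lakesearch.go:227 > column `id` is not present in the index_table: newcity columns schema, mandatory column.
"""

for message in fatal_messages(sample):
    print(message)
```

Lines that do not match the assumed pattern, such as the Python gRPC server's timestamped lines, are ignored.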

Below are some common errors and their respective solutions.

id is not present in the index_table

If you forget to add the additional column id while configuring the Lakesearch Service, the runtime status will remain pending, and the logs will display the following message:

dataos-ctl log -t service -n testingls -w public -r

INFO[0000] 📃 log(public)...                             
INFO[0001] 📃 log(public)...complete                     

              NODE NAME             │     CONTAINER NAME     │ ERROR  
────────────────────────────────────┼────────────────────────┼────────
  testingls-eacp-d-7f4ccb75d8-rtflh │ testingls-eacp-indexer │        

-------------------LOGS-------------------
10:42AM INF pkg/config/config.go:126 > Loading config... file:///etc/dataos/config/lakesearch.yaml [success]
10:42AM FTL pkg/config/lakesearch.go:227 > column `id` is not present in the index_table: newcity columns schema, mandatory column.

Steps to resolve:

  1. In the Lakesearch Service manifest file, add the id column to the index_tables.columns section, as shown below.

          index_tables:
            - name: newcity
              description: "index for cities"
              tags:
                - cities
              properties:
                morphology: stem_en
              columns:
                - name: city_id
                  type: keyword
                - name: zip_code
                  type: bigint  
                - name: id                         #added
                  description: "mapped to row_num"
                  tags:
                    - identifier
                  type: bigint
                - name: city_name
                  type: keyword
                - name: county_name
                  type: keyword
                - name: state_code
                  type: keyword
                - name: state_name
                  type: text
                - name: version
                  type: text
                - name: ts_city
                  type: timestamp
    
  2. Correspondingly, in the indexers.base_sql section, select the column that serves as the primary key under the alias id.

          indexers:
            - index_table: newcity
              base_sql: |
                SELECT 
                  city_id,
                  zip_code,
                  zip_code as id,  -- added
                  city_name,
                  county_name,
                  state_code,
                  state_name,
                  version,
                  cast(ts_city as timestamp) as ts_city
                FROM 
                  city
    
  3. Reapply the LakeSearch Service by running the command below. It will update automatically, eliminating the need to delete the existing Service.

    dataos-ctl resource apply -f ${{path-to-the-manifest-file}}
    

    Expected output:

    INFO[0000] 🛠 apply...                                   
    INFO[0000] 🔧 applying(public) testingls:v1:service...   
    INFO[0002] 🔧 applying(public) testingls:v1:service...updated 
    INFO[0002] 🛠 apply...complete  
    
  4. If no further issues are found, the Service runtime state will change to "running".

    dataos-ctl get -t service -n testingls -w public -r                 
    INFO[0000] 🔍 get...                                     
    INFO[0000] 🔍 get...complete                             
    
        NAME    | VERSION |  TYPE   | WORKSPACE | STATUS |  RUNTIME  |    OWNER     
    ------------|---------|---------|-----------|--------|-----------|--------------
      testingls | v1      | service | public    | active | running:1 | iamgroot  
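
The missing id column can also be caught before applying the manifest. The sketch below is a minimal check, assuming the index_tables section has already been loaded into Python lists and dictionaries (for example, with a YAML parser). `missing_id_tables` is a hypothetical helper written for illustration. It only checks that the column exists and does not validate base_sql.

```python
def missing_id_tables(index_tables):
    """Return names of index tables whose columns list has no column named `id`."""
    missing = []
    for table in index_tables:
        column_names = {column["name"] for column in table.get("columns", [])}
        if "id" not in column_names:
            missing.append(table["name"])
    return missing


# Hypothetical in-memory form of the index_tables section, before the fix.
index_tables = [
    {
        "name": "newcity",
        "columns": [
            {"name": "city_id", "type": "keyword"},
            {"name": "zip_code", "type": "bigint"},
            {"name": "city_name", "type": "keyword"},
        ],
    },
]

print(missing_id_tables(index_tables))  # ['newcity']
```

An empty list means every index table declares the mandatory id column.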
    

Depot type GCS not supported

If you deploy LakeSearch in a DataOS environment built on GCS, the Service will remain in a pending state, as LakeSearch is only supported in environments built on Azure or AWS.

Below is an example of the indexer logs when attempting to run a LakeSearch service in a GCS-based environment.

dataos-ctl log -t service -n ls-test-query-rewrite -w public -r
INFO[0000] 📃 log(public)...                             
INFO[0001] 📃 log(public)...complete                     

                    NODE NAME                   │           CONTAINER NAME           │ ERROR  
────────────────────────────────────────────────┼────────────────────────────────────┼────────
  ls-test-query-rewrite-geh6-d-647487674f-7kgkk │ ls-test-query-rewrite-geh6-indexer │        

-------------------LOGS-------------------
10:52AM INF pkg/config/config.go:126 > Loading config... file:///etc/dataos/config/lakesearch.yaml [success]
10:52AM INF cmd/main.go:390 > starting python gRPC server on :4090...
10:52AM WRN cmd/main.go:443 > Health check failed error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4090: connect: connection refused\""
10:52AM INF cmd/main.go:444 > Retrying... Waiting for Python gRPC server to be ready (1/100)
2025-03-06 10:52:43,434 - WARNING - No requirements file provided. Proceeding without installation.
2025-03-06 10:52:43,435 - INFO - Successfully imported ex_impl_query_rewriter
2025-03-06 10:52:43,435 - INFO - Server is running on address [::]:4090...
10:52AM INF cmd/main.go:440 > Python gRPC server is healthy and ready.
10:52AM DBG cmd/main.go:465 > embedder exists: false
10:52AM DBG cmd/main.go:481 > query rewriter exists: true
10:52AM INF pkg/source/connection.go:33 > Source setup... Source=datasets
10:52AM DBG pkg/source/connection.go:104 > [544.123225ms] << SET azure_transport_option_type = 'curl'; Source=datasets took=544.123225
10:52AM DBG pkg/source/connection.go:104 > [373.267µs] << SET extension_directory='/extensions'; Source=datasets took=0.373267
10:52AM DBG pkg/source/connection.go:104 > [3.798574ms] << LOAD httpfs; Source=datasets took=3.798574
10:52AM DBG pkg/source/connection.go:104 > [3.611732ms] << LOAD aws; Source=datasets took=3.611732
10:52AM DBG pkg/source/connection.go:104 > [242.912µs] << LOAD azure; Source=datasets took=0.242912
10:52AM DBG pkg/source/connection.go:104 > [3.136989ms] << LOAD iceberg; Source=datasets took=3.136989
10:52AM DBG pkg/source/connection.go:119 > [752.606µs] << SELECT extension_name FROM duckdb_extensions() WHERE loaded = true Source=datasets took=0.752606
10:52AM DBG pkg/source/duckdb.go:52 > aws
10:52AM DBG pkg/source/duckdb.go:52 > azure
10:52AM DBG pkg/source/duckdb.go:52 > httpfs
10:52AM DBG pkg/source/duckdb.go:52 > iceberg
10:52AM DBG pkg/source/duckdb.go:52 > jemalloc
10:52AM DBG pkg/source/duckdb.go:52 > json
10:52AM DBG pkg/source/duckdb.go:52 > parquet
10:52AM FTL pkg/source/duckdb.go:94 > depot type=gcs not supported

Unsupported column

If the column used for filtering is mistakenly assigned a text data type, an error will occur when accessing the endpoint.

API endpoint

Steps to resolve:

  1. To fix this, delete the Service by executing the command below.

    dataos-ctl delete -f /home/iamgroot/work/lakesearch/service.yaml
    
  2. Update the data type.

              columns:
                - name: customer_id
                  type: bigint
                - name: birth_year
                  type: bigint  
                - name: id
                  description: "mapped to customer_id"
                  tags:
                    - identifier
                  type: bigint
                - name: education
                  type: text
                - name: country  # column by which data will be filtered
                  type: keyword   #updated data type
                - name: created_at
                  type: timestamp
    
  3. Replace the DATA_DIR and persistentVolume directory paths with new ones.

      envs:
        LAKESEARCH_SERVER_NAME: "public:ls-test-query-rewrite"
        DATA_DIR: public/ls-test-query-rewrite/data01
        USER_MODULES_DIR: /etc/dataos/config
      persistentVolume:
        name: ls-test-vol
        directory: public/ls-test-query-rewrite/data01
    
  4. Re-apply the Service by executing the command below.

    dataos-ctl apply -f /home/iamgroot/work/lakesearch/service.yaml
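
Because fixing a wrong column type means deleting and redeploying the Service, it can help to check the filter columns before applying. The sketch below is a minimal check, assuming the columns section has been loaded into Python dictionaries and that you supply the list of columns you intend to filter on. `text_typed_filter_columns` is a hypothetical helper, not part of LakeSearch.

```python
def text_typed_filter_columns(columns, filter_columns):
    """Return the filter columns that are declared with type `text`."""
    types = {column["name"]: column["type"] for column in columns}
    return [name for name in filter_columns if types.get(name) == "text"]


# Hypothetical in-memory form of the columns section, before the fix.
columns = [
    {"name": "customer_id", "type": "bigint"},
    {"name": "education", "type": "text"},
    {"name": "country", "type": "text"},  # used for filtering, so text is the problem
]

print(text_typed_filter_columns(columns, ["country"]))  # ['country']
```

Any column name it returns is a candidate for the data type change described in step 2.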
    

Now data can be filtered by "country":"India", as shown below.

API endpoint