Worker: First Steps¶
Create a Worker¶
A Worker Resource instance can be deployed by applying its manifest file. But before creating a Worker Resource, ensure you have the required use cases assigned.
Get Appropriate Access Permission Use Case¶
In DataOS, different actions require specific use cases that grant the necessary permissions to execute a task. You can grant these use cases directly to a user or group them under a tag, which is then assigned to the user. The following table outlines various actions related to Worker Resources and the corresponding use cases required:
Action | Required Use Cases |
---|---|
Get | Read Workspaces, Read Resources in User Specified Workspace / Read Resources in User Workspaces (for public and sandbox workspaces) |
Create | Create and Update Resources in User Workspace |
Apply | Create and Update Resources in User Workspace |
Delete | Delete Resources in User Workspace |
Log | Read Resource Logs in User Workspace |
To assign use cases, you can either contact the DataOS Operator or create a Grant Request by creating a Grant Resource. The request will be validated by the DataOS Operator.
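For illustration, here is a minimal sketch of a Grant Resource manifest. The grant attribute names and the use-case identifier shown are assumptions based on the general Grant pattern, not a verified schema; check the Grant documentation for the exact attributes before use:

name: worker-use-case-grant # Hypothetical Grant name
version: v1alpha # Assumed manifest version for a Grant
type: grant
layer: user
grant:
  approve: false # Assumption: the request awaits Operator approval
  subject_ids: # Assumption: users requesting the use cases
    - users:id:iamgroot
  use_cases: # Assumption: identifiers of the requested use cases
    - create-update-resources-in-user-workspace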
Create a manifest file¶
To create a Worker Resource, data developers can define a set of attributes in a manifest file, typically in YAML format, and deploy it using the DataOS Command Line Interface (CLI) or API. Below is a sample manifest file for a Worker Resource:
Sample Worker manifest
# RESOURCE META SECTION
# Attributes common across all DataOS Resources
name: benthos3-worker-sample # Name of the Worker Resource (mandatory)
version: v1beta # Manifest version of the Worker Resource (mandatory)
type: worker # Type of the Resource, in this case, worker (mandatory)
tags:
  - worker
  - dataos:type:resource
  - dataos:resource:worker
  - dataos:layer:user
  - dataos:workspace:public
description: Random User Console # Description of the Worker Resource (optional)

# WORKER-SPECIFIC SECTION
# Attributes specific to the Worker resource-type
worker:
  tags: # Worker-specific tags
    - worker
  replicas: 1 # Specifies the number of Worker replicas to run
  stack: benthos-worker:3.0 # Specifies the Stack name and version for the Worker
  logLevel: DEBUG # Sets the logging level to DEBUG
  compute: runnable-default # Defines the Compute Resource to be used
  resources:
    requests:
      cpu: 100m # Requests 100 millicores of CPU
      memory: 128Mi # Requests 128 MiB of memory
    limits:
      cpu: 1000m # Limits the Worker to 1000 millicores of CPU
      memory: 1024Mi # Limits the Worker to 1024 MiB of memory

  # STACK-SPECIFIC SECTION
  # Attributes specific to the chosen Stack
  stackSpec:
    input:
      http_client:
        headers:
          Content-Type: application/octet-stream # Sets the content type header
        url: https://randomuser.me/api/ # URL to fetch data from
        verb: GET # HTTP method to use (GET)
    output:
      stdout:
        codec: | # Example delimiter codec
          delim:
          -----------GOOD------------
The manifest file for a Worker Resource consists of three main sections, each requiring specific configuration:
- Resource Meta Section: Contains attributes shared among all types of Resources.
- Worker-Specific Section: Includes attributes unique to the Worker Resource.
- Stack-Specific Section: Encompasses attributes specific to the Stack orchestrated by the Worker, which can vary depending on the Stack.
Resource meta section¶
In DataOS, a Worker is categorized as a Resource type. The YAML configuration file for a Worker Resource includes a Resource meta section, which contains attributes shared among all Resource types.
The following YAML excerpt illustrates the attributes specified within this section:
name: ${{resource_name}} # Name of the Resource (mandatory)
version: v1beta # Manifest version of the Resource (mandatory)
type: worker # Type of Resource (mandatory)
tags: # Tags for categorizing the Resource (optional)
  - ${{tag_example_1}}
  - ${{tag_example_2}}
description: ${{resource_description}} # Description (optional)
owner: ${{resource_owner}} # Owner of the Resource (optional, default value: user-id of the user deploying the Resource)
layer: ${{resource_layer}} # DataOS Layer (optional, default value: user)
For example:

name: my-first-worker # Name of the Resource
version: v1beta # Manifest version of the Resource
type: worker # Type of Resource
tags: # Tags for categorizing the Resource
  - dataos:worker
  - worker
description: Common attributes applicable to all DataOS Resources # Description
owner: iamgroot # Owner of the Resource
layer: user # DataOS Layer
To configure a Worker Resource, replace the values of name, layer, tags, description, and owner with appropriate values. For additional configuration information about the attributes of the Resource meta section, refer to the link: Attributes of Resource meta section.
Worker-specific section¶
The Worker-specific section of a manifest file encompasses attributes specific to the Worker Resource.
worker: # Worker-specific configuration
  title: ${{worker_title}} # Title of the Worker
  tags:
    - ${{worker_tag1}} # Tags for the Worker
    - ${{worker_tag2}} # Additional tags
  replicas: ${{worker_replicas}} # Number of replicas
  autoscaling: # Autoscaling configuration
    enabled: ${{autoscaling_enabled}} # Enable or disable autoscaling
    minReplicas: ${{min_replicas}} # Minimum number of replicas
    maxReplicas: ${{max_replicas}} # Maximum number of replicas
    targetMemoryUtilizationPercentage: ${{memory_utilization}} # Target memory utilization percentage
    targetCPUUtilizationPercentage: ${{cpu_utilization}} # Target CPU utilization percentage
  stack: ${{worker_stack}} # Stack used (mandatory)
  logLevel: ${{log_level}} # Logging level
  configs: # Configuration settings
    ${{config_key1}}: ${{config_value1}} # Example configuration
    ${{config_key2}}: ${{config_value2}} # Additional configuration
  envs: # Environment variables
    ${{env_key1}}: ${{env_value1}} # Example environment variable
    ${{env_key2}}: ${{env_value2}} # Additional environment variable
  secrets:
    - ${{secret_name}} # List of secrets
  dataosSecrets: # DataOS Secrets configuration
    - name: ${{secret_name}} # Name of the Secret
      workspace: ${{secret_workspace}} # Workspace
      key: ${{secret_key}} # Key
      keys:
        - ${{secret_key1}} # List of keys
        - ${{secret_key2}} # Additional key
      allKeys: ${{all_keys_flag}} # Whether to include all keys
      consumptionType: ${{consumption_type}} # Type of consumption
  dataosVolumes: # DataOS Volumes configuration
    - name: ${{volume_name}} # Name of the Volume
      directory: ${{volume_directory}} # Directory
      readOnly: ${{read_only_flag}} # Read-only flag
      subPath: ${{volume_subpath}} # Sub-path
  tempVolume: ${{temp_volume_name}} # Temporary Volume
  persistentVolume: # Persistent Volume configuration
    name: ${{persistent_volume_name}} # Name of the Volume
    directory: ${{persistent_volume_directory}} # Directory
    readOnly: ${{persistent_volume_read_only}} # Read-only flag
    subPath: ${{persistent_volume_subpath}} # Sub-path
  compute: ${{compute_resource_name}} # Compute Resource to be used
  resources: # Resource requests and limits
    requests:
      cpu: ${{cpu_request}} # CPU request
      memory: ${{memory_request}} # Memory request
    limits:
      cpu: ${{cpu_limit}} # CPU limit
      memory: ${{memory_limit}} # Memory limit
  dryRun: ${{dry_run_flag}} # Dry run flag
  runAsApiKey: ${{api_key}} # API key for running the Worker
  runAsUser: ${{run_as_user}} # User to run the Worker as
  topology: # Topology configuration
    - name: ${{topology_name}} # Name of the topology element
      type: ${{topology_type}} # Type of the topology element
      doc: ${{topology_doc}} # Documentation link or description
      properties:
        ${{property_key}}: ${{property_value}} # Example property
      dependencies: # List of dependencies
        - ${{dependency1}} # Example dependency
        - ${{dependency2}} # Additional dependency
  stackSpec:
    # Attributes specific to the chosen Stack
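As a concrete illustration, a hypothetical Worker-specific section that enables autoscaling might look like the following; the replica bounds and utilization targets are example values, not recommendations:

worker:
  tags:
    - worker
  replicas: 1 # Initial number of replicas
  autoscaling:
    enabled: true # Turn autoscaling on
    minReplicas: 1 # Never scale below one replica
    maxReplicas: 3 # Never scale above three replicas
    targetMemoryUtilizationPercentage: 80 # Example memory target
    targetCPUUtilizationPercentage: 80 # Example CPU target
  stack: benthos-worker:3.0
  logLevel: INFO
  compute: runnable-default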
Stack-specific section¶
The Stack-specific section of a manifest file includes attributes unique to the Stack orchestrated by the Worker. A Stack Resource within DataOS allows data developers to integrate custom programming paradigms into the platform while utilizing all the native guarantees provided by DataOS.
While users have the flexibility to bring any Stack that supports long-running orchestration, Stacks such as Benthos and Fast Fun, which come out-of-the-box with DataOS, are compatible with Workers.
Stack | Purpose of Worker orchestrated by the Stack |
---|---|
Benthos | Benthos Stack enables stream data processing. When orchestrated using a Worker Resource, Benthos Stack facilitates long-running processes that continuously process stream data and write to a sink of choice. |
Fast Fun | Fast Fun is a declarative Stack that enables data sinking from Pulsar-type depots such as Fastbase and systemstream to DataOS Lakehouse storage (or depots supporting the Iceberg file format). While Benthos Workers support processing, Fast Fun sinks data as-is without transformation. |
The attributes within the stackSpec section vary between Stacks and are determined by the workerConfig attribute within the definition of that specific Stack. Below are sample stackSpec sections for the respective Stacks.
A sample stack specification for a Benthos Worker, where an end user wants to read data from a specific URL and write the result to standard output, appears in the sample manifest above. To get templates for Benthos Workers, click on this link. For details about the Benthos Stack, refer to the link: Benthos.

Below is a template of the stackSpec section for a Fast Fun Worker:

stackSpec:
  input:
    datasets:
      - ${{input_dataset_udl_address}}
    options:
      cleanupSubscription: ${{cleanup_subscription_or_not}}
      processingGuarantees: ${{processing_guarantees}}
      subscriptionPosition: ${{subscription_position}}
  output:
    datasets:
      - ${{output_dataset_udl_address}}
    options:
      commitInterval: ${{commit_interval}}
      iceberg:
        properties:
          write.format.default: ${{write_file_format}}
          write.metadata.compression-codec: ${{write_metadata_compression_codec}}
        recordsPerCommit: ${{number_of_records_per_commit}}
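For illustration, a hypothetical filled-in Fast Fun stackSpec might look like the following; the depot UDLs and option values are assumptions for the sketch, not tested settings:

stackSpec:
  input:
    datasets:
      - dataos://fastbase:default/random_users_topic # Hypothetical Pulsar topic UDL
    options:
      cleanupSubscription: true
      processingGuarantees: EXACTLY_ONCE
      subscriptionPosition: Latest
  output:
    datasets:
      - dataos://icebase:sample/random_users # Hypothetical Iceberg dataset UDL
    options:
      commitInterval: 30000 # Example commit interval
      iceberg:
        properties:
          write.format.default: parquet
          write.metadata.compression-codec: gzip
        recordsPerCommit: 100000 # Example number of records per commit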
To look at a few examples related to the Fast Fun Stack, refer to the Fast Fun Stack documentation.
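Once the manifest is ready, apply it from the DataOS CLI. A sketch of the command, assuming the manifest is saved as worker.yaml (a hypothetical path) and targets the public Workspace:

dataos-ctl apply -f worker.yaml -w public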
Manage a Worker¶
Verifying Worker creation¶
To ensure that your Worker has been successfully created, you can verify it in two ways:
Check that the name of the newly created Worker appears in the list of Workers created by you in a particular Workspace:
Sample
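A sketch of the command, assuming the public Workspace:

dataos-ctl get -t worker -w public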
Alternatively, retrieve the list of all Workers created in the Workspace by appending the -a flag:
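For instance, again assuming the public Workspace:

dataos-ctl get -t worker -w public -a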
You can also access the details of any created Worker through the DataOS GUI in the Resource tab of the Operations app.
Getting Worker logs¶
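Retrieve the runtime logs of a Worker with the log command. A sketch, assuming the standard dataos-ctl log syntax and the Worker from the sample manifest above:

dataos-ctl log -t worker -w public -n benthos3-worker-sample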
Deleting a Worker¶
Use the delete command to remove the Worker from the DataOS environment. As shown below, there are three ways to delete a Worker.
Method 1: Copy the Worker name, version, Resource-type, and Workspace name from the output of the get command, separate them with '|', enclose the string within quotes, and use it in the delete command.
Command
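A sketch of the command shape, assuming the standard dataos-ctl delete syntax with the -i identifier flag:

dataos-ctl delete -i "${{worker-name | version | resource-type | workspace}}"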
Example
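For instance, using the values shown in the output below:

dataos-ctl delete -i "demo-01 | v1beta | worker | public"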
Output
INFO[0000] 🗑 delete...
INFO[0001] 🗑 deleting(public) demo-01:v1beta:worker...
INFO[0003] 🗑 deleting(public) demo-01:v1beta:worker...deleted
INFO[0003] 🗑 delete...complete
Method 2: Specify the path of the YAML file and use the delete command.
Command
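A sketch of the command shape, assuming the standard -f flag for a manifest path:

dataos-ctl delete -f ${{manifest-file-path}}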
Example
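For instance, with a hypothetical manifest path:

dataos-ctl delete -f /home/iamgroot/worker/demo-01.yaml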
Output
INFO[0000] 🗑 delete...
INFO[0001] 🗑 deleting(public) demo-01:v1beta:worker...
INFO[0003] 🗑 deleting(public) demo-01:v1beta:worker...deleted
INFO[0003] 🗑 delete...complete
Method 3: Specify the Workspace, Resource-type, and Worker name in the delete command.
Command
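A sketch of the command shape, assuming the standard type, workspace, and name flags:

dataos-ctl delete -t worker -w ${{workspace}} -n ${{worker-name}}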
Example
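For instance, matching the output below:

dataos-ctl delete -t worker -w public -n demo-01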
Output
INFO[0000] 🗑 delete...
INFO[0001] 🗑 deleting(public) demo-01:v1beta:worker...
INFO[0003] 🗑 deleting(public) demo-01:v1beta:worker...deleted
INFO[0003] 🗑 delete...complete
Next steps¶
Your next steps depend on whether you want to learn more about Workers in general or configure a specific Worker further. Here are some how-to guides to help you with that process:
- Learn how to configure the manifest file of a Worker. See Worker: Configuration
- Set up a Worker to sync data from Fastbase topics to an Icebase dataset.
- Set up a Worker to transform stream data.