AWS Athena
Amazon Athena is a serverless, interactive query service that allows you to analyze data stored in Amazon S3 using standard SQL.
Nilus supports Athena as a batch ingestion source, enabling queries over S3 data (typically in Parquet, ORC, or Iceberg format) and movement of results into the DataOS Lakehouse or other supported destinations.
Info
- Nilus does not support connecting to AWS Athena through a Depot. To configure the connection, pass service account credentials in the URI, supplied through an Instance Secret Resource.
- Contact the DataOS Administrator or Operator to obtain the configured URI and other required parameters.
Prerequisites
The following are the requirements for enabling Batch Data Movement in AWS Athena:
- S3 Configuration
  - Ensure the S3 bucket exists.
  - Proper IAM permissions (read, write if staging).
  - Encryption and lifecycle policies are recommended.
- Athena Setup
  - Workgroup defined (`primary` by default).
  - Query results location configured in S3.
  - Glue Data Catalog is integrated with external tables.
- IAM Permissions
  - `AmazonAthenaFullAccess` or equivalent (restricted as per least privilege).
  - `AmazonS3ReadOnlyAccess` for source buckets.
  - Glue permissions (`GetDatabase`, `GetTable`, etc.).
- Required Parameters
  - `bucket`: S3 bucket containing the dataset (e.g., `my-bucket` or `s3://my-bucket`)
  - `access_key_id`: AWS access key ID
  - `secret_access_key`: AWS secret access key
  - `region_name`: AWS region (e.g., `us-east-1`)
- Optional Parameters
  - `session_token`: Temporary AWS session token
  - `workgroup`: Athena workgroup (default: `primary`)
  - `profile`: AWS profile name (Nilus loads credentials from local AWS config)
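The connection parameters listed above end up as query-string fields of an `athena://` address. The following is a minimal Python sketch of assembling such an address. It is not part of Nilus: the helper name `build_athena_uri` and its validation rules are hypothetical, and the assumption that access keys may be omitted when `profile` is set is inferred from the sample address rather than confirmed.

```python
from urllib.parse import urlencode

# Parameter names come from the lists above; the validation rules are assumptions.
REQUIRED = ("bucket", "access_key_id", "secret_access_key", "region_name")
OPTIONAL = ("session_token", "workgroup", "profile")


def build_athena_uri(**params: str) -> str:
    """Assemble an athena:// source address from keyword parameters."""
    unknown = sorted(set(params) - set(REQUIRED) - set(OPTIONAL))
    if unknown:
        raise ValueError(f"unknown parameter(s): {', '.join(unknown)}")

    # Assumption: when a profile is given, the access keys may be omitted,
    # because credentials are then loaded from the local AWS config.
    needed = ("bucket", "region_name") if params.get("profile") else REQUIRED
    missing = [name for name in needed if not params.get(name)]
    if missing:
        raise ValueError(f"missing parameter(s): {', '.join(missing)}")

    # A fixed ordering keeps the address stable; urlencode percent-encodes
    # characters such as "/" and "+" that often appear in secret keys.
    ordered = [(name, params[name]) for name in REQUIRED + OPTIONAL if params.get(name)]
    return "athena://?" + urlencode(ordered)


print(build_athena_uri(bucket="my_bucket", profile="my_profile",
                       region_name="us-east-1", workgroup="analytics"))
```

Percent-encoding matters mainly for `secret_access_key` and `session_token`, whose raw values could otherwise be misread as URI delimiters.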
Sample Workflow Config
name: nb-athena-test-01
version: v1
type: workflow
tags:
  - workflow
  - nilus-batch
description: Nilus Batch Workflow Sample for AWS Athena to DataOS Lakehouse
workspace: research
workflow:
  dag:
    - name: nb-job-01
      spec:
        stack: nilus:1.0
        compute: runnable-default
        logLevel: INFO
        resources:
          requests:
            cpu: 400m
            memory: 512Mi
        stackSpec:
          source:
            address: athena://?bucket=my_bucket&profile=my_profile&region_name=us-east-1&workgroup=analytics
            options:
              source-table: analytics.orders
              incremental-key: updated_at
          sink:
            address: dataos://testaswlh
            options:
              dest-table: athena_retail.batch_orders
              incremental-strategy: replace
Info
Ensure that all placeholder values and required fields (e.g., connection addresses, table names, and access credentials) are updated before applying the configuration to a DataOS workspace.
Deploy the manifest file using the following command:
Supported Attribute Details
Nilus supports the following source options for AWS Athena:
| Option | Required | Description |
|---|---|---|
| `source-table` | Yes | Table name (`schema.table`) |
| `incremental-key` | No | Column used for incremental ingestion |
| `workgroup` | No | Athena workgroup (default: `primary`) |
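To illustrate what an `incremental-key` is for, the sketch below shows the general watermark technique in plain Python: only rows whose key is greater than the last value seen are picked up on each run. This is a conceptual illustration, not Nilus's implementation; the function `select_new_rows` and the sample rows are hypothetical.

```python
from datetime import datetime


def select_new_rows(rows, key, watermark=None):
    """Return rows whose `key` is past the watermark, plus the next watermark."""
    new_rows = [row for row in rows if watermark is None or row[key] > watermark]
    # If nothing new arrived, the watermark stays where it was.
    next_watermark = max((row[key] for row in new_rows), default=watermark)
    return new_rows, next_watermark


orders = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 2)},
    {"order_id": 3, "updated_at": datetime(2024, 1, 3)},
]

# First run: no watermark yet, so every row is selected.
first_batch, watermark = select_new_rows(orders, "updated_at")

# A new row lands before the second run; only that row is selected.
orders.append({"order_id": 4, "updated_at": datetime(2024, 1, 4)})
second_batch, watermark = select_new_rows(orders, "updated_at", watermark)
print([row["order_id"] for row in second_batch])  # [4]
```

A strictly increasing column such as an update timestamp or a sequential ID suits this pattern, because rows at or below the watermark are never revisited.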
Core Concepts
- Data Storage
  - Athena queries external tables defined in AWS Glue Catalog.
  - Data is typically stored as columnar Parquet files for efficiency.
- Incremental Loading
  - Supported via timestamp/sequential ID columns (e.g., `last_updated_at`).
  - Default incremental key: `submission_date_time`.
- Cost Model
  - Athena charges per TB scanned.
  - Partition pruning and columnar formats significantly reduce cost.
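The cost model above can be made concrete with a rough back-of-the-envelope calculation. In this sketch the per-TB rate, the table size, and the even split of data across partitions and columns are all illustrative assumptions; check current AWS pricing for the actual rate in your region.

```python
TIB = 1024 ** 4  # assumption: one "TB" is treated as 2**40 bytes


def scan_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of a query that scans `bytes_scanned` bytes at a per-TB rate."""
    return bytes_scanned / TIB * price_per_tb


PRICE_PER_TB = 5.0      # hypothetical rate, not taken from this page
table_bytes = 4 * TIB   # hypothetical table size

# Unpartitioned row-oriented data: the whole table is scanned.
full_scan = scan_cost(table_bytes, PRICE_PER_TB)

# Partition pruning reads 1 of 4 equal partitions; a columnar format
# then reads only 2 of 10 equally sized columns.
pruned_bytes = table_bytes // 4 * 2 // 10
pruned_scan = scan_cost(pruned_bytes, PRICE_PER_TB)

print(f"full: {full_scan:.2f}, pruned: {pruned_scan:.2f}")  # full: 20.00, pruned: 1.00
```

Under these assumptions the pruned query scans one twentieth of the data, so its cost drops by the same factor.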