Introduction

Storing Parquet data in object storage, like AWS S3, has become standard practice in current data lake architectures. Using Trino with the Hive Standalone Metastore to query this data is also a very standard practice, as shown here.

What is less well documented is how to deploy these services in a Kubernetes cluster (for example EKS) while adhering to security best practices when establishing the connection between Trino and S3. One way to do this is to use IAM Roles for Service Accounts (IRSA). This allows us to use a Kubernetes service account (used by Trino) to get read/write access to specific S3 buckets through an IAM policy setup.

This post will guide you through the setup process using mostly Terraform (or OpenTofu if you prefer) and Helm. I have drawn heavy inspiration from here.

This walkthrough assumes intermediate to advanced familiarity with Terraform (OpenTofu), Kubernetes, EKS (and AWS resources in general) and Helm. Some familiarity with Trino and the Hive Metastore is also helpful.

The full code can be found on GitHub.

Hive Metastore Container Image

The Trino team provides an official container image and Helm chart we can use, so we are covered there (we will be using them later). But there is no official Hive Standalone Metastore container image. The most up-to-date one I could find was in this EMR on EKS example.

So let's go through the process of creating a container image for the Hive Metastore that we can use. In the repo, the Dockerfile and the entrypoint script are in the folder: metastore/image.

As of April 2024, version 4.0.0 of Hive has been released, so we shall use that (together with Hadoop v3.4.0) to make the image a bit more future-proof, and to take advantage of the Hadoop upgrade to AWS SDK v2.

Dockerfile

I have chosen to use the Red Hat UBI Image as the base image, which is also what the official Trino image uses:

FROM registry.access.redhat.com/ubi9/ubi-minimal:latest

Then we add build args to set the non-root user:

# user and group IDs to run image as
ARG RUN_AS_USER=1000
ARG RUN_AS_GROUP=1000

and set up the versions to use:

# Versions
ARG METASTORE_VERSION=4.0.0
ARG HADOOP_VERSION=3.4.0
ARG POSTGRES_DRIVER_VERSION=42.7.3

The dependencies we now have to install are:

  1. shadow-utils to create the non-root user and group
  2. java-17-openjdk-headless for the JVM runtime
  3. tar and gzip to extract the Hive and Hadoop packages

# update and install java and dependencies
RUN microdnf update -y \
  && microdnf --nodocs install shadow-utils java-17-openjdk-headless tar gzip -y \
  && microdnf clean all -y

Then we set up the non-root user and modify the ownership of the /opt directory to that non-root user (hsm):

# set up non root user
RUN groupadd -g ${RUN_AS_GROUP} hsm && \
  useradd -u ${RUN_AS_USER} -g hsm hsm

# setup opt dir for hsm user
RUN chown -R hsm:hsm /opt

USER hsm

Then we set all the relevant environment variables:

# Set Hadoop/HiveMetastore Classpath
ENV JAVA_HOME=/usr/lib/jvm/jre-17
ENV HADOOP_HOME="/opt/hadoop"
ENV METASTORE_HOME="/opt/hive-metastore"
ENV HIVE_HOME="/opt/hive-metastore"
ENV HADOOP_CLASSPATH="${HADOOP_HOME}/share/hadoop/tools/lib/*:${HADOOP_HOME}/share/hadoop/common/lib/*"

Installing Hadoop to the /opt directory:

# Download Hadoop
RUN curl https://dlcdn.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz \
    | tar xz -C /opt/  \
    && ln -s ${HADOOP_HOME}-$HADOOP_VERSION ${HADOOP_HOME} \
    && rm -r ${HADOOP_HOME}/share/doc

Installing Hive Standalone Metastore:

RUN curl https://repo1.maven.org/maven2/org/apache/hive/hive-standalone-metastore-server/${METASTORE_VERSION}/hive-standalone-metastore-server-${METASTORE_VERSION}-bin.tar.gz \
    | tar xz -C /opt/ \
    && ln -s /opt/apache-hive-metastore-${METASTORE_VERSION}-bin ${METASTORE_HOME} \
    # fix for schemaTool script
    && sed -i -e 's/org.apache.hadoop.hive.metastore.tools.MetastoreSchemaTool/org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool/g' ${METASTORE_HOME}/bin/ext/schemaTool.sh

NOTE: As of Hive 4.0.0, there seems to be a bug in the schemaTool.sh script here, where the Java class name is incorrect. Hence the sed command to fix it to the correct class.

Next, we download the Postgres JDBC driver, for the connection to the Metastore database:

RUN curl https://repo1.maven.org/maven2/org/postgresql/postgresql/${POSTGRES_DRIVER_VERSION}/postgresql-${POSTGRES_DRIVER_VERSION}.jar \
    -o ${METASTORE_HOME}/lib/postgresql-${POSTGRES_DRIVER_VERSION}.jar

Finally, we switch to the METASTORE_HOME directory, copy in the entrypoint script, and set up CMD to run it:

WORKDIR ${METASTORE_HOME}
COPY --chown=hsm:hsm --chmod=775 entrypoint.sh bin/entrypoint.sh
CMD ["bash", "-c", "bin/entrypoint.sh"]

Container Entrypoint

For the entrypoint.sh script, we set up two functions. First, a logging function, taken from here:

function log () {
    level=$1
    message=$2
    echo $(date  '+%d-%m-%Y %H:%M:%S') [${level}]  ${message}
}

And an initSchema function:

function initSchema () {
  log "INFO" "checking DB schemas"
  if ${METASTORE_HOME}/bin/schematool -info -dbType postgres
  then
    log "INFO" "scheme found in DB"
  else
    log "INFO" "schema not found DB, running initSchema"
    ${METASTORE_HOME}/bin/schematool -initSchema -dbType postgres
  fi
}

This function uses the schematool script provided in the Hive Standalone Metastore package to check if the DB schema and tables are present, and if not, create them.

Finally, we call initSchema and then actually start the Metastore, with some error handling:

if initSchema 
then 
  log "INFO" "starting metastore"
  ${METASTORE_HOME}/bin/start-metastore
else 
  log "ERROR" "error checking schema or running initSchema"
  exit 1
fi
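
If the schema check ever misbehaves, the same schematool commands the script uses can be run by hand from a shell in the image for debugging. A sketch, assuming the image tag from the build step above (the DB connection settings still have to be provided, e.g. through the Metastore configuration):

# start a shell in the image instead of the entrypoint
docker run -it metastore:4.0.0-hadoop-3.4.0 bash

# inside the container: check the schema, or initialize it manually
bin/schematool -info -dbType postgres
bin/schematool -initSchema -dbType postgres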

AWS Resources

The AWS resources that we need are:

  1. The S3 bucket, and associated IAM policies for access
  2. The VPC to hold all the resources
  3. The EKS cluster, with an IAM role that can be assumed to access the S3 bucket from within the cluster
  4. The RDS instance for the Hive Metastore

We will create all of these using Terraform (or OpenTofu if you prefer, as I did). In the repo, the Terraform files can be found in the folder: terraform/aws-resources.

Setup

First we need to setup the provider versions (AWS), in versions.tf:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.43.0"
    }
  }
}

And set up the input variables in variables.tf (with descriptions):

variable "name" {
  type        = string
  description = "Common name of the S3 bucket, EKS cluster, RDS instance and other resources"
}

variable "region" {
  type        = string
  description = "AWS region to use"
}

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block to use"

}

variable "kube_namespace_name" {
  type        = string
  description = "Kubernetes Namespace name where the Trino and Metastore deployments will be done to"

}

variable "kube_sa_name" {
  type        = string
  description = "Kubernetes Service account name, which will be used to access S3 using IAM/IRSA"
}

variable "cluster_endpoint_public_access_cidrs" {
  type        = list(string)
  description = "List of CIDR blocks which can access the Amazon EKS public API server endpoint"
}

variable "kubeconfig_location" {
  type        = string
  description = "Location to save the Kubeconfig file to"
}

variable "enable_eks" {
  type        = bool
  default     = true
  description = "Turn on or off the EKS resources"
}

variable "enable_rds" {
  type        = bool
  default     = true
  description = "Turn on or off the RDS resources"
}

For the variable cluster_endpoint_public_access_cidrs, we need to specify the CIDR blocks/IP addresses of the machines from which we need to be able to connect to the EKS API endpoint (to run kubectl commands). This connection will be needed to deploy the Kubernetes resources later, so we should put in the CIDR block/IP address of the machine we are using to run these Terraform scripts. We will also reuse this to allow those IPs to access the Metastore RDS instance.
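
If you are not sure what your current public IP is, one quick way to look it up (purely a convenience; any IP-echo service works):

# prints your public IP; append /32 to use it as a CIDR block
curl -s https://checkip.amazonaws.com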

This is the example terraform.tfvars I used to set the variable values (this file should not be committed to git, as per best practices):

name                                 = "trino-on-eks"
region                               = "ap-southeast-1"
vpc_cidr                             = "10.0.0.0/24"
kube_namespace_name                  = "trino"
kube_sa_name                         = "s3-access"
cluster_endpoint_public_access_cidrs = ["your_ip_here/32"]
kubeconfig_location                  = "../../local/kubeconfig.yaml"
enable_eks                           = true
enable_rds                           = true

Of course do put in your own IP for cluster_endpoint_public_access_cidrs.

All the main resources are defined in the main.tf file.

First we get the information on all the availability zones (AZs) in our region:

data "aws_availability_zones" "available" {}

Then, using that information, we select the AZs to use and set up the common tags, in locals:

locals {
  azs = slice(data.aws_availability_zones.available.names, 0, 3)

  tags = {
    role = var.name
  }
}

Then we set up the AWS provider, with the region value from the variables:

provider "aws" {
  region = var.region
}

S3 Bucket

For the S3 resources, we first set up the main S3 bucket:

resource "aws_s3_bucket" "trino_on_eks" {
  bucket = var.name
  tags   = local.tags
}

We then create the policy document, which specifies read/write and list permissions on that bucket only:

data "aws_iam_policy_document" "trino_s3_access" {
  statement {
    actions = [
      "s3:ListBucket"
    ]
    resources = [aws_s3_bucket.trino_on_eks.arn]
  }

  statement {
    actions = [
      "s3:PutObject",
      "s3:GetObject",
      "s3:DeleteObject"
    ]
    resources = ["${aws_s3_bucket.trino_on_eks.arn}/*"]
  }
}

And we create the actual IAM policy from this policy document:

resource "aws_iam_policy" "trino_s3_access_policy" {
  name   = "trino_s3_access_policy"
  path   = "/"
  policy = data.aws_iam_policy_document.trino_s3_access.json
}

This policy is how the services in EKS (Trino and the Hive Metastore) will get access to the S3 bucket, as we will see later.

VPC

Using the VPC module, we create the VPC, using the values from locals and carving a public subnet out of the VPC CIDR for each AZ:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = var.name
  cidr = var.vpc_cidr

  azs                     = local.azs
  public_subnets          = [for k, v in local.azs : cidrsubnet(var.vpc_cidr, 4, k)]
  enable_dns_support      = true
  enable_dns_hostnames    = true
  map_public_ip_on_launch = true

  tags = local.tags
}

For the sake of simplicity, and to save cost, we are only creating public subnets. However, it is recommended to use private subnets with a NAT gateway for the EKS cluster and RDS.

EKS Cluster resources

Using the EKS module, we create the EKS cluster in the public subnets of the VPC we created:

module "eks" {
  count = var.enable_eks ? 1 : 0

  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = var.name
  cluster_version = "1.29"

  cluster_endpoint_private_access      = true
  cluster_endpoint_public_access       = true
  cluster_endpoint_public_access_cidrs = var.cluster_endpoint_public_access_cidrs

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
  }

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.public_subnets

  eks_managed_node_groups = {
    trino = {
      min_size     = 1
      max_size     = 3
      desired_size = 1

      instance_types = ["t3.medium"]
      capacity_type  = "SPOT"
    }
  }

  # Cluster access entry
  # To add the current caller identity as an administrator
  enable_cluster_creator_admin_permissions = true

  tags = local.tags
}

I have chosen to use t3.medium spot instances, with a max size of 3. As hinted at when creating the variables, to be able to access this cluster’s API endpoint (to run kubectl commands), we have to set cluster_endpoint_public_access to true and specify the public access CIDRs in cluster_endpoint_public_access_cidrs, which comes from the variables.

Then we create the IAM role that connects the IAM policy for S3 bucket access we created before to a service account in the EKS cluster. We can use this submodule:

module "trino_s3_access_irsa" {
  count = var.enable_eks ? 1 : 0

  source = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"

  create_role                   = true
  role_name                     = "trino_s3_access_role"
  provider_url                  = module.eks[0].oidc_provider
  role_policy_arns              = [aws_iam_policy.trino_s3_access_policy.arn]
  oidc_fully_qualified_subjects = ["system:serviceaccount:${var.kube_namespace_name}:${var.kube_sa_name}"]
}

This gives the service account specified in var.kube_sa_name, in the namespace specified in var.kube_namespace_name, permission to assume the created role (through the cluster’s OIDC provider), which in turn has the S3 bucket access policy created earlier attached to it.
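
Once applied, a quick way to confirm the trust relationship is wired up correctly is to inspect the role with the AWS CLI and check that the OIDC subject condition matches the namespace and service account name:

# show the assume-role (trust) policy of the IRSA role
aws iam get-role --role-name trino_s3_access_role \
  --query 'Role.AssumeRolePolicyDocument'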

And finally, we need to save the kubeconfig file for access to the cluster. For this, we first set up the kubeconfig YAML template:

apiVersion: v1
preferences: {}
kind: Config

clusters:
- cluster:
    server: ${endpoint}
    certificate-authority-data: ${clusterca}
  name: ${cluster_name}

contexts:
- context:
    cluster: ${cluster_name}
    user: ${cluster_name}
  name: ${cluster_name}

current-context: ${cluster_name}

users:
- name: ${cluster_name}
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: aws
      args:
      - --region
      - ${region}
      - eks
      - get-token
      - --cluster-name
      - ${cluster_name}
      - --output
      - json

In the repo this is located at terraform/aws-resources/templates/kubeconfig.tpl.

This template is used to create the final kubeconfig file using the local_sensitive_file resource:

resource "local_sensitive_file" "kubeconfig" {
  count = var.enable_eks ? 1 : 0

  content = templatefile("${path.module}/templates/kubeconfig.tpl", {
    cluster_name = module.eks[0].cluster_name,
    clusterca    = module.eks[0].cluster_certificate_authority_data,
    endpoint     = module.eks[0].cluster_endpoint,
    region       = var.region
  })
  filename = var.kubeconfig_location
}

The location where this is saved to is set by the variable kubeconfig_location. Do be careful not to save it where it might get accidentally committed and pushed to git.
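
As an aside, once the cluster is up, the AWS CLI can generate an equivalent kubeconfig entry, if you prefer that over the templated file:

# write/merge a kubeconfig entry for the cluster
aws eks update-kubeconfig --name trino-on-eks --region ap-southeast-1 \
  --kubeconfig ../../local/kubeconfig.yaml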

RDS Resources for Metastore

As the Metastore needs a DB, I chose RDS Postgres as a simple, straightforward solution. First, we create a random password and save it to AWS Secrets Manager:

resource "random_password" "rds_password"{
  length           = 16
  special          = true
  override_special = "_!%^"
}

resource "aws_secretsmanager_secret" "rds_password" {
  name = "trino-on-eks-rds-password"
}

resource "aws_secretsmanager_secret_version" "rds_password" {
  secret_id     = aws_secretsmanager_secret.rds_password.id
  secret_string = random_password.rds_password.result
}
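
The generated password can be pulled back out of Secrets Manager whenever we need it, for example to connect to the DB manually later:

# fetch the RDS password stored in Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id trino-on-eks-rds-password \
  --query SecretString --output text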

Then we set up the DB subnet group, which uses the public subnets of the VPC we created:

resource "aws_db_subnet_group" "trino_on_eks" {
  name       = var.name
  subnet_ids = module.vpc.public_subnets

  tags = local.tags
}

Similar to our EKS cluster, we are putting the RDS instance in the public subnets, for simplicity and cost. Ideally, this too should go in a private subnet.

Then we set up the security group, which should only allow ingress from within the VPC, and from our own machine (so we can query the RDS instance directly to check on it). Thus the ingress CIDR blocks will be a combination of the VPC CIDR and the variable cluster_endpoint_public_access_cidrs, reused for the local machine:

resource "aws_security_group" "trino_on_eks_rds" {
  name   = "trino-on-eks-rds"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = concat([var.vpc_cidr], var.cluster_endpoint_public_access_cidrs)
  }

  egress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = local.tags
}
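
Since our own IP is allowed through this security group, we can later verify connectivity to the instance with psql once it is up. A sketch; the endpoint placeholder comes from the RDS console or Terraform output, and the password from Secrets Manager as shown above:

# connect to the metastore DB from the local machine (replace the endpoint)
psql "host=<rds-endpoint> port=5432 dbname=trino_on_eks user=trino_on_eks"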

Then we set up the DB parameter group:

resource "aws_db_parameter_group" "trino_on_eks_rds" {
  name   = var.name
  family = "postgres16"

  parameter {
    name  = "log_connections"
    value = "0"
  }
}

And finally, the RDS instance itself:

resource "aws_db_instance" "trino_on_eks_rds" {
  count = var.enable_rds ? 1 : 0

  identifier             = var.name
  instance_class         = "db.t4g.micro"
  allocated_storage      = 10
  engine                 = "postgres"
  engine_version         = "16.2"
  db_name                = "trino_on_eks"
  username               = "trino_on_eks"
  password               = random_password.rds_password.result
  db_subnet_group_name   = aws_db_subnet_group.trino_on_eks.name
  vpc_security_group_ids = [aws_security_group.trino_on_eks_rds.id]
  parameter_group_name   = aws_db_parameter_group.trino_on_eks_rds.name
  publicly_accessible    = true
  skip_final_snapshot    = true
}
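
With all the definitions in place, the AWS resources are provisioned from the terraform/aws-resources folder in the usual way (substitute tofu for terraform if you are using OpenTofu):

cd terraform/aws-resources
terraform init
terraform plan
terraform apply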

With all these resources, we are now ready to deploy our application into the EKS cluster.

Kubernetes (EKS) Resources

We separate the kubernetes resources from the AWS (infra) resources, to avoid dependency issues between the AWS and Kubernetes providers.

So, although we are still using Terraform to deploy the Kubernetes resources, this will be done in a separate folder, with a separate state. In the repo, these can be found in: terraform/kube-resources.

Setup

As before, we start with defining the provider versions in the versions.tf file:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.43.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.27.0"
    }
  }
}

We are still defining the AWS provider, because we will use it to get information about the EKS cluster and RDS instance we set up previously.

Then we set up the input variables in variables.tf (with descriptions):

variable "name" {
  type        = string
  description = "Common name of the S3 bucket, EKS cluster, RDS instance and other resources"
}

variable "region" {
  type        = string
  description = "AWS region to use"
}

variable "kube_namespace_name" {
  type        = string
  description = "Kubernetes Namespace name where the Trino and Metastore deployments will be done to"

}

variable "kube_sa_name" {
  type        = string
  description = "Kubernetes Service account name, which will be used to access S3 using IAM/IRSA"
}

With an example terraform.tfvars (as before, be careful not to commit this):

name                = "trino-on-eks"
region              = "ap-southeast-1"
kube_namespace_name = "trino"
kube_sa_name        = "s3-access"

The values should match what was set as variable values in the AWS resources Terraform.

Then, in the main.tf, with the AWS provider, we get the following info:

  1. EKS cluster general and auth info:

data "aws_eks_cluster" "trino_on_eks" {
  name = var.name
}

data "aws_eks_cluster_auth" "trino_on_eks" {
  name = var.name
}

  2. Info about the IAM role that can be used to access the S3 bucket:

data "aws_iam_role" "trino_s3_access_role" {
  name = "trino_s3_access_role"
}

  3. RDS DB general and auth information:

data "aws_db_instance" "trino_on_eks_rds" {
  db_instance_identifier = var.name
}

data "aws_secretsmanager_secret" "rds_password" {
  name = "trino-on-eks-rds-password"

}

data "aws_secretsmanager_secret_version" "rds_password" {
  secret_id = data.aws_secretsmanager_secret.rds_password.id
}

With the EKS cluster information, we can set up the Kubernetes and Helm providers:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.trino_on_eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.trino_on_eks.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.trino_on_eks.token
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.trino_on_eks.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.trino_on_eks.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.trino_on_eks.token
  }
}

Namespace

The namespace is the first Kubernetes resource we create, to hold all the other resources:

resource "kubernetes_namespace" "trino" {
  metadata {
    name = var.kube_namespace_name
  }
}

Service Account

Then we create the service account, which will be the one that assumes the role created previously to allow access to the S3 bucket. This is done by annotating the service account resource with the role ARN:

resource "kubernetes_service_account" "trino_s3_access_sa" {
  depends_on = [kubernetes_namespace.trino]
  metadata {
    name      = var.kube_sa_name
    namespace = var.kube_namespace_name

    annotations = {
      "eks.amazonaws.com/role-arn" = data.aws_iam_role.trino_s3_access_role.arn
    }
  }
}

We will then make sure that the Trino and Metastore deployments use this service account, which will result in the EKS Pod Identity Webhook injecting environment variables and a projected service account token volume, allowing the AWS SDKs in the applications to use the sts:AssumeRoleWithWebIdentity credential provider to assume the role, authenticate, and access the S3 bucket, as described in more detail here and here.
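
Once the deployments covered next are up, one quick way to confirm the webhook has done its job is to look for the injected IRSA environment variables in a pod that uses this service account (assuming the Metastore Deployment ends up named metastore, as its pods suggest later):

# AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE should show up here
kubectl -n trino exec deploy/metastore -- env | grep AWS_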

Metastore Deployment

We will now deploy the Metastore using a Helm chart, which in the repo can be found in the folder: metastore/helm-chart. I will not be going into the details of creating this Helm chart here, as it is quite involved and deserves its own write-up. Do let me know if you would like me to do that write-up.

We will use the helm_release Terraform resource to deploy it, with the appropriate values:

resource "helm_release" "metastore" {
  depends_on = [kubernetes_service_account.trino_s3_access_sa]

  name      = "metastore"
  namespace = var.kube_namespace_name
  chart     = "../../metastore/helm-chart"

  set {
    name  = "image"
    value = "ghcr.io/binayakd/metastore:4.0.0-hadoop-3.4.0"
  }

  set {
    name  = "dbUrl"
    value = "jdbc:postgresql://${data.aws_db_instance.trino_on_eks_rds.endpoint}/${data.aws_db_instance.trino_on_eks_rds.db_name}"
  }
  set {
    name  = "dbUser"
    value = data.aws_db_instance.trino_on_eks_rds.master_username
  }

  set {
    name  = "dbPassword"
    value = data.aws_secretsmanager_secret_version.rds_password.secret_string
  }
  set {
    name  = "dbDriver"
    value = "org.postgresql.Driver"
  }

  set {
    name  = "s3Bucket"
    value = "s3://trino-on-eks"
  }

  set {
    name  = "serviceAccountName"
    value = var.kube_sa_name
  }
}

The first value set is the Metastore image, the creation of which we went through previously.

NOTE: I have built and pushed this image to the GitHub Container Registry. However, since it has the full Hadoop binaries in it, the image is pretty huge, and pulling it using public access from GitHub will take a very long time. It would be better to build it yourself and push it to your own container registry.

The next few values are set for access to the RDS instance (connection endpoint, DB name, username and password). These are all obtained from the AWS data sources defined in the setup section.

And finally we set the S3 bucket name, and the service account to use for the Metastore deployment.
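
After the release is applied, it is worth tailing the Metastore logs to watch the schema check from the entrypoint script and the Thrift server startup (again assuming the Deployment is named metastore):

# follow the metastore logs
kubectl -n trino logs deploy/metastore -f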

Trino Deployment

For the Trino deployment, we will be using the official Helm chart. We first need to define the extra chart values, which we put in the file terraform/kube-resources/trino-helm-values.yaml:

server:
  workers: 1
additionalCatalogs:
  s3_hive: |-
    connector.name=hive
    hive.metastore.uri=thrift://metastore:9083
    fs.native-s3.enabled=true
    s3.region=ap-southeast-1    

Here, we are setting the Trino workers to 1 (since we will not be doing any heavy querying), and adding our Metastore as an additional Hive catalog, with all the connection details, based on the documentation here. Setting fs.native-s3.enabled=true allows Trino to access the files in S3 directly, without having to rely on the Hive libraries in the Metastore instance, as described here.

With the values YAML set up, we can again use the helm_release resource to deploy the main Trino cluster:

resource "helm_release" "trino" {
  name      = "trino"
  namespace = var.kube_namespace_name

  repository = "https://trinodb.github.io/charts"
  chart      = "trino"

  values = [
    "${file("trino-helm-values.yaml")}"
  ]

  set {
    name  = "serviceAccount.name"
    value = var.kube_sa_name
  }
}

Again, here we are additionally setting the service account, to allow Trino access to the S3 bucket.
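
As with the AWS resources, these Kubernetes resources are applied from their own folder and state (again, tofu works in place of terraform):

cd terraform/kube-resources
terraform init
terraform apply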

With all that deployed, we should have everything in place for a working setup.

Testing the Setup

Kubectl Context Setup and Connection Test

The first step is to check that all the Kubernetes resources are deployed properly and running. To do this, we first ensure our kube context is set up, by setting the KUBECONFIG environment variable to point to the location where we told Terraform to save the kubeconfig file. For example, I have set it to save to a folder called local in the root of my repo (which I have added to my gitignore), so we can set up the context in the terminal like this:

cd ./local
export KUBECONFIG=./kubeconfig.yaml

Then we can check if the pods are running, using the kubectl get pods command:

$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
metastore-796bb9dc7d-bl45s          1/1     Running   0          5m30s
trino-coordinator-58ddd58b6-t2v7g   1/1     Running   0          5m30s
trino-worker-69ff875bbb-qs4mc       1/1     Running   0          5m30s

You should see one Metastore pod, one Trino coordinator pod, and a number of Trino worker pods (in our case, 1).

Since we have not set up any ingress or load balancer resources to access Trino (which I will leave as an exercise for another time), we can access it using the kubectl port-forward command:

kubectl port-forward service/trino 8080

This will allow us to access the Trino endpoint on our local machine on port 8080.
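
If you prefer a terminal over a GUI client, the Trino CLI can also be pointed at the forwarded port (assuming you have the trino CLI downloaded and on your PATH):

# connect to the forwarded endpoint and default to the s3_hive catalog
trino --server http://localhost:8080 --catalog s3_hive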

So, using your favourite DB client (I am using DBeaver), we can connect to Trino: Dbeaver Trino Connection Setup

Once it is successfully connected, we can see the s3_hive catalog (together with some default ones), which we configured in the Trino Helm values to connect to the Metastore: Trino Schema list

Loading Sample Data

So now that we have our Trino connection set up, we need some data to run queries on. We could either use Trino to write some data into our data store, or we can register some existing files in S3 as an external table.

Let's do the latter, using the well-known Iris dataset. I will use the Parquet version, but CSV also works.

So we load it into our S3 bucket:

$ aws s3 cp ./iris.parquet s3://trino-on-eks/iris.parquet/iris.parquet
upload: ./iris.parquet to s3://trino-on-eks/iris.parquet/iris.parquet

NOTE: the file has to be in a “subfolder” in S3, as Trino will only be able to create the table at the folder level. In this case, the table will be created on the path s3://trino-on-eks/iris.parquet.

Now that we have some data in our S3 bucket, we first create a schema, with a location property pointing to the S3 bucket:

CREATE SCHEMA s3_hive.test
WITH (location = 's3a://trino-on-eks/')

NOTE: The S3 URL starts with s3a (not s3), as it is handled by a special S3 client (S3A) that is part of the hadoop-aws module.

Then we create the actual table that registers the parquet file location:

CREATE TABLE s3_hive.test.iris (
  id DOUBLE,
  sepal_length DOUBLE,
  sepal_width DOUBLE,
  petal_length DOUBLE,
  petal_width DOUBLE
)
WITH (
  format = 'PARQUET',
  external_location = 's3a://trino-on-eks/iris.parquet'
)

Finally, when we do a select query on this table:

select * from s3_hive.test.iris;

We can see the results: Query Results