Registering Parquet Files Into Iceberg Tables Without Rewrites Using PyIceberg

Introduction In my last post, I explored the fundamentals of creating Apache Iceberg tables using various catalogs, and of using Spark and Trino to write data into and read data from those tables. That involved using Spark as the Iceberg client to write data into the Iceberg table. However, when the data is already in object storage, following that process to create Iceberg tables would involve a full migration (read, write, delete) of the data, which can prove time-consuming and costly for large datasets....

December 25, 2024 · 13 min · 2747 words · Binayak Dasgupta
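That post centers on PyIceberg's ability to register existing Parquet files into an Iceberg table without rewriting them. As a teaser, here is a minimal sketch of that approach using PyIceberg's `add_files` API; the catalog name, table identifier, and S3 paths below are placeholders, and catalog connection details are assumed to come from a `.pyiceberg.yaml` config.

```python
from pyiceberg.catalog import load_catalog

# Load a catalog configured in ~/.pyiceberg.yaml ("default" is a placeholder name).
catalog = load_catalog("default")

# Load an existing Iceberg table (identifier is hypothetical).
table = catalog.load_table("analytics.events")

# Register Parquet files already sitting in object storage.
# No data is read or rewritten; Iceberg only commits new metadata
# (manifests) pointing at these files.
table.add_files(
    file_paths=[
        "s3://my-bucket/events/part-0000.parquet",
        "s3://my-bucket/events/part-0001.parquet",
    ]
)
```

Because only metadata is written, the Parquet files must already match the table's schema; the post covers the details.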

Exploring Apache Iceberg

Introduction With the recent buzz around Apache Iceberg tables, I am cashing in on this buzz to explore what Apache Iceberg is all about: exploring the iceberg that is Apache Iceberg, if you will. The way I see it, Iceberg provides improvements over the Hive table format, which itself has been used to provide a relational-table-like interface on top of “unstructured” data in distributed storage. It is better because not only does it have additional features (schema evolution, hidden partitioning, snapshots, improved performance, etc....

September 28, 2024 · 30 min · 6370 words · Binayak Dasgupta
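Of the features that post explores, snapshots and schema evolution are easy to demonstrate from Python. A minimal sketch using PyIceberg, assuming the same placeholder catalog and a hypothetical table identifier and column name:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("default")            # placeholder catalog name
table = catalog.load_table("analytics.events")  # hypothetical identifier

# Snapshots: every commit produces an immutable snapshot that can be
# inspected (and time-travelled to) later.
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms)

# Schema evolution: adding a column is a pure metadata operation,
# with no rewrite of existing data files.
with table.update_schema() as update:
    update.add_column("country_code", StringType())
```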

Trino on AWS EKS with IAM/IRSA

Introduction Storing Parquet data in object storage like AWS S3 has become standard practice in current data lake architectures. Using Trino with the standalone Hive Metastore to query this data is also very much standard practice, as shown here. What is less well documented is how to deploy these services in a Kubernetes cluster (for example, EKS) while adhering to security best practices when establishing the connection between Trino and S3....

May 30, 2024 · 18 min · 3713 words · Binayak Dasgupta
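To make the end state of that setup concrete, here is a minimal sketch using the trino Python client; the hostname, catalog, schema, and table names are hypothetical. If IRSA is wired up correctly, the Trino pods assume an IAM role through their Kubernetes service account, so no static AWS keys appear anywhere in the stack.

```python
from trino.dbapi import connect

# Connect to the Trino coordinator (hypothetical in-cluster endpoint).
conn = connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# If the IRSA role grants S3 read access, Trino can scan the Parquet
# files backing this Hive-metastore table without embedded credentials.
cur.execute("SELECT count(*) FROM my_parquet_table")
print(cur.fetchone())
```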