Registering Parquet Files Into Iceberg Tables Without Rewrites Using PyIceberg

Introduction In my last post, I explored the fundamentals of creating Apache Iceberg tables using various catalogs, and of using Spark and Trino to write data into and read data from those tables. That involved using Spark as the Iceberg client to write data into the Iceberg table. However, when the data is already in object storage, following that process to create Iceberg tables would involve a full migration (read, write, delete) of the data, which can prove time-consuming and costly for large datasets....

December 25, 2024 · 13 min · 2747 words · Binayak Dasgupta
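That post centers on PyIceberg's ability to register existing Parquet files into an Iceberg table without rewriting them. As a teaser, here is a minimal sketch of that approach using PyIceberg's `add_files` API; the catalog name, table identifier, and S3 paths below are placeholders, and catalog connection details are assumed to come from a `.pyiceberg.yaml` config.

```python
from pyiceberg.catalog import load_catalog

# Load a catalog configured in ~/.pyiceberg.yaml ("default" is a placeholder name).
catalog = load_catalog("default")

# Load an existing Iceberg table (identifier is hypothetical).
table = catalog.load_table("analytics.events")

# Register Parquet files already sitting in object storage.
# No data is read or rewritten; Iceberg only commits new metadata
# (manifests) pointing at these files.
table.add_files(
    file_paths=[
        "s3://my-bucket/events/part-0000.parquet",
        "s3://my-bucket/events/part-0001.parquet",
    ]
)
```

Because only metadata is written, the Parquet files must already match the table's schema; the post covers the details.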

Exploring Apache Iceberg

Introduction With the recent buzz around Apache Iceberg tables, I am cashing in on this buzz to explore what Apache Iceberg is all about: exploring the iceberg that is Apache Iceberg, if you will. The way I see it, Iceberg provides improvements over the Hive table format, which itself has been used to provide a relational-table-like interface on top of “unstructured” data in distributed storage. It is better because not only does it have additional features (schema evolution, hidden partitioning, snapshots, improved performance, etc....

September 28, 2024 · 30 min · 6370 words · Binayak Dasgupta
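Of the features that post explores, snapshots and schema evolution are easy to demonstrate from Python. A minimal sketch using PyIceberg, assuming the same placeholder catalog and a hypothetical table identifier and column name:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("default")            # placeholder catalog name
table = catalog.load_table("analytics.events")  # hypothetical identifier

# Snapshots: every commit produces an immutable snapshot that can be
# inspected (and time-travelled to) later.
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms)

# Schema evolution: adding a column is a pure metadata operation,
# with no rewrite of existing data files.
with table.update_schema() as update:
    update.add_column("country_code", StringType())
```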

Trino on AWS EKS with IAM/IRSA

Introduction Storing Parquet data in object storage like AWS S3 has become standard practice in current data lake architectures. Using Trino with the standalone Hive Metastore to query this data is also very much standard practice, as shown here. What is less well documented is how to deploy these services in a Kubernetes cluster (for example, EKS) while adhering to security best practices when establishing the connection between Trino and S3....

May 30, 2024 · 18 min · 3713 words · Binayak Dasgupta
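To make the end state of that setup concrete, here is a minimal sketch using the trino Python client; the hostname, catalog, schema, and table names are hypothetical. If IRSA is wired up correctly, the Trino pods assume an IAM role through their Kubernetes service account, so no static AWS keys appear anywhere in the stack.

```python
from trino.dbapi import connect

# Connect to the Trino coordinator (hypothetical in-cluster endpoint).
conn = connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# If the IRSA role grants S3 read access, Trino can scan the Parquet
# files backing this Hive-metastore table without embedded credentials.
cur.execute("SELECT count(*) FROM my_parquet_table")
print(cur.fetchone())
```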