We’ve seen the rise and fall of Apache Hive as the default choice when it comes to data lakes.
Hive lost its footing for a few big reasons.
It was too coupled with Hadoop and Spark
Hive, the query engine, became a painful reminder of the slow MapReduce days of data engineering
It didn’t handle small updates very effectively
These major drawbacks from Hive allowed space for new players to enter the field.
In 2019, I was working with Ryan Blue, testing out a new open table format called Apache Iceberg.
At the time, I was tasked with migrating a 200 TB/hr pipeline from Hive to Iceberg to “stress test” the format. And it didn’t work! In the very early days of Iceberg, it required much more sorting than it does now. And sorting 200 TBs/hr is pretty much impossible.
A lot has changed in the 6 years since I ran that beta test of Iceberg though. Databricks spent $2B to acquire Tabular and capture more of this unstoppable trend.
They did this even though they had their own competing format, Delta. Why, you might ask?
The growth trajectories of the two formats are quite different!
In this article, we’re going to talk about when you should use each format, covering:
A quick TLDR in decision making
How Hudi got to incremental updates first
Tooling support comparison
Schema evolution
Partitioning Differences
Optimization and file compaction differences
An analogy to relational databases to let this “sink in”
Quick TLDR of when you should pick each one
The very simplest way of thinking about this:
If you use Databricks, pick Delta because of the proprietary support.
If you have a ton of real-time use cases, pick Hudi because it was developed by Uber, a kappa architecture company.
Otherwise, pick Iceberg for the broader tool support, flexibility and overall well roundedness.
This is an EXTREMELY simplistic view that has a billion caveats, which I’ll discuss further in this article!
This line of thinking might seem a little bit simplistic but look at the adoption of Hudi:
Amazon Transportation Services uses Hudi for real-time logistics
ByteDance uses Hudi for TikTok’s real-time recommendations page
Uber created Hudi for real-time ride sharing analytics
Are there companies not on Databricks using Delta successfully? Of course.
Are there companies using Hudi that don’t leverage its streaming features much? Absolutely.
The differences between these technologies are shrinking over time.
Incremental Update Support
If you’re using tools like Spark Streaming or Flink, doing incremental updates to the data is critical!
There are two ways to incrementally update data in a data lake; Hudi solved this problem first, which is why I recommend it for streaming use cases (there’s a quick MERGE sketch after these two):
Merge-on-read (i.e. MOR):
Simply add the new data into the table (or add a delete record for deletes). The updates are then “merged” in when you read. This does hurt read performance though. This is much better suited for real-time or streaming use cases!
Copy-on-write (i.e. COW)
This rewrites any data file that contains an updated record, which can mean copying over records that were not updated at all! This hurts write performance but keeps reads really fast. This strategy is much better suited for batch use cases.
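To make “incremental update” concrete, here’s a minimal upsert sketch in Spark SQL. It assumes a staged view of new records called updates (a hypothetical name) and reuses the my_iceberg_table defined later in this article; Delta accepts the same MERGE INTO syntax, and Hudi routes upserts through its primaryKey and preCombineField settings.
MERGE INTO iceberg_db.my_iceberg_table t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
The same statement works under either strategy; the table’s configuration decides whether the engine writes delete/log files (merge-on-read) or rewrites the affected data files (copy-on-write).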
The issue with saying incremental support is the reason to pick Hudi is that Delta already supports it, and the Iceberg v2 format now has it as a feature too!
One of the things you’ll notice when reading through this article is that the features of these three options are “merging” into each other.
Tooling and file format support
All three of these formats have extremely good support if you’re using Apache Spark or other open source tooling like Trino or Hive.
Beyond those tools, Iceberg seems to be leading in this category. Its nice integration with Flink (although Delta almost has this ready too) lets companies like Confluent use Flink + Tableflow to easily create Iceberg tables from Kafka topics! The notion of “data-at-rest” versus “data-in-motion” gets a little blurrier.
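As a rough sketch of what that looks like in plain open source (hypothetical topic and broker names; Confluent’s Tableflow does the equivalent without you hand-writing the pipeline), Flink SQL can read a Kafka topic and stream it straight into an Iceberg table:
-- register an Iceberg catalog for Flink (sketch; assumes the S3 filesystem is configured)
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 's3://my-bucket/iceberg'
);

-- expose the Kafka topic as a streaming table
CREATE TABLE kafka_events (
  id STRING,
  name STRING,
  amount DECIMAL(10,2),
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- continuously append the stream into the Iceberg table from later in this article
INSERT INTO iceberg_catalog.iceberg_db.my_iceberg_table
SELECT id, name, amount, event_time FROM kafka_events;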
Delta is a little more opinionated about the file format. You need to commit to Parquet if you’re using Delta. At the end of the day, this focus is actually a good thing because it enables powerful features like Z-ordering for better data skipping and compression.
Iceberg supports Avro and ORC as well as Parquet. This is a common theme you might notice: Iceberg supports more things while being less focused, while Delta Lake is more focused on doing things a specific way. (Hudi supports Parquet and ORC for base files, with Avro for its log files.)
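For example, switching the data file format an Iceberg table writes is just a table property (a sketch using the my_iceberg_table defined later in this article; write.format.default is Parquet if you never touch it):
ALTER TABLE iceberg_db.my_iceberg_table SET TBLPROPERTIES ('write.format.default' = 'orc');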
One area where Iceberg is painfully behind Delta is its Python library, PyIceberg. I tried teaching PyIceberg in my boot camp in January and discovered both bugs and feature gaps. This is much different from delta-rs, a Rust library for interacting with Delta tables that has Python bindings.
If we are talking plain Python APIs, Iceberg is in last place with Hudi taking silver and Delta taking gold.
Schema Evolution
Back in the day, when you wanted to change a Hive schema, it required a ton of rewriting and backfill pain. Ryan Blue wanted to end that pain and suffering once and for all with Iceberg.
So Iceberg manages this with isolated snapshots. This means you can do pretty much whatever you want to your table, with the one exception being converting a data type to one that isn’t safe to convert to (a quick sketch follows the examples below).
An example of safe conversion:
integer to bigint
float to double
An example of unsafe conversion:
bigint to integer (might run out of space)
string to bigint (might not be a number)
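As a concrete sketch (reusing the my_iceberg_table DDL from the partitioning section below, plus a hypothetical quantity integer column), these Spark SQL statements are metadata-only changes in Iceberg, so no data files get rewritten:
ALTER TABLE iceberg_db.my_iceberg_table ADD COLUMN currency STRING;
-- widening the hypothetical quantity column from integer to bigint is a safe promotion
ALTER TABLE iceberg_db.my_iceberg_table ALTER COLUMN quantity TYPE BIGINT;
-- narrowing it back from bigint to integer would be rejected as unsafe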
Partitioning Differences
Let’s compare how the three formats handle partitioning:
Iceberg supports something called “hidden partitioning.” As an Iceberg fanboy, this is one of the few design decisions that I do not like very much.
The benefits are:
You don’t have to have an extra partition column
The user doesn’t have to know what the partitioning of the table actually is
The drawbacks are:
As a data engineer, you can’t just query the table and know how it’s partitioned. You have to “guess” or look up the DDL (though there’s a workaround shown after the DDL example below). In big tech, we intentionally used “ds” as a design pattern for the partition column, which made querying tables much easier.
Here’s an example Iceberg DDL partitioned on the “date” of the event_time.
CREATE TABLE iceberg_db.my_iceberg_table (
id STRING,
name STRING,
amount DECIMAL(10,2),
event_time TIMESTAMP
)
USING ICEBERG
PARTITIONED BY (days(event_time))
LOCATION 's3://my-bucket/iceberg/my_iceberg_table';
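One mitigation for the drawback above: Iceberg exposes metadata tables, so you can ask the table how it’s partitioned instead of guessing. A quick sketch, assuming Spark with the Iceberg extensions enabled:
SELECT partition, record_count, file_count
FROM iceberg_db.my_iceberg_table.partitions;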
Delta Lake and Hudi have partitioning schemes that are mostly identical to Apache Hive’s. I like this because I’m a dinosaur of a data engineer now who doesn’t like change.
CREATE TABLE delta_db.my_delta_table (
id STRING,
name STRING,
amount DECIMAL(10,2),
event_time TIMESTAMP,
date DATE
)
USING DELTA
PARTITIONED BY (date)
LOCATION 's3://my-bucket/delta/my_delta_table';
Make sure to set things like primaryKey and MERGE_ON_READ in your Hudi tables for great real-time performance. Keep in mind that the primaryKey is only deduplicated within its partition!
CREATE TABLE hudi_db.my_hudi_table (
id STRING,
name STRING,
amount DECIMAL(10,2),
event_time TIMESTAMP,
date DATE
)
USING HUDI
TBLPROPERTIES (
'primaryKey' = 'id',
'type' = 'MERGE_ON_READ',
'preCombineField' = 'event_time'
)
PARTITIONED BY (date)
LOCATION 's3://my-bucket/hudi/my_hudi_table';
Optimization and File Compaction
Iceberg, Hudi, and Delta all give you different ways to optimize your files with different tradeoffs.
The API for optimization in Delta is very simple. OPTIMIZE defaults to a bin-packing strategy; adding ZORDER BY clusters related rows together so reads can skip more files.
OPTIMIZE delta_db.my_delta_table ZORDER BY (event_time);
For Iceberg, you want to do something like this:
CALL iceberg_db.system.rewrite_data_files(
  table => 'iceberg_db.my_iceberg_table',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '1073741824'
  )
);
Iceberg offers three compaction strategies, very similar to Delta:
binpack (the default in Delta Lake too)
sort (uses the table’s sort order, or a sort_order you pass to the procedure)
zorder (similar to Delta Lake’s Z-ordering)
It also offers a bunch of options, the most important being target-file-size-bytes: the size, in bytes, that rewritten files aim for.
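For example, a sort-based rewrite would look something like this (a sketch; in Iceberg’s procedure, zorder is expressed through the sort_order argument):
CALL iceberg_db.system.rewrite_data_files(
  table => 'iceberg_db.my_iceberg_table',
  strategy => 'sort',
  sort_order => 'zorder(event_time, id)'
);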
Hudi compaction is a bit different. It has two table types: Copy-on-Write and Merge-on-Read.
If you use the Copy-on-Write type, files are compacted as the data is written. This makes for slower writes.
If you use the Merge-on-Read type, writes are fast but reads are slower. You can speed reads back up by running compaction in Spark SQL with something like:
CALL run_compaction(op => 'run', table => 'hudi_db.my_hudi_table');
Compaction is complicated. At the end of the day, you can get very good file compaction with any of these frameworks!
Relational Database Analogy
All three of these options will continue to see long-term growth because they solve slightly different use cases.
Iceberg will become like the Postgres of data lakes because:
It has strong community support
It has the most open and “do what you want” mentality
Delta Lake will become like the Oracle of data lakes because:
It has great performance features when running on Databricks
It has a more opinionated “best” way of doing things
Hudi will become like the MariaDB of data lakes because:
It has niche use cases that are powerful
It is fully open source and doesn’t want to be like Oracle
If you found this article useful, make sure to share it with your friends! What else should you be considering when picking an open table format?
I cover Iceberg in excruciating detail in my Analytics Engineering boot camp along with Snowflake, dbt, and end-to-end capstone projects. It will be starting on April 14th and ending May 16th. Seats are limited. The first 10 people reading this can get 30% off with code EARLYBIRDAE at DataExpert.io!