Tabular.io was acquired by Databricks back in June 2024 for a reported two billion dollars. This is despite Tabular only making $1-2M/year in revenue. How is a company making $1M/year worth $2 billion?
It’s simple. Apache Iceberg is the hottest technology in data engineering right now and Databricks and Snowflake had to fight over it.
In this article, we will talk about:
Why Apache Iceberg is so much better than things like Hive
What future data lake architectures will look like
How to build scalable data lake systems with Iceberg
Why Apache Iceberg is so much better than Hive
Apache Hive unlocked so many things in the big data space from 2010 to today! But it had a few critical flaws that needed to be addressed!
No INSERT INTO, only INSERT OVERWRITE
You had to operate on the Hive metastore at the partition level, never at the file level. If you wanted to add more data for a day, you had to overwrite the entire day!
Hive forced partition overwrites as a means of preventing file corruption and keeping everything consistent. This pattern is where functional data engineering was born!
The ethos of functional data engineering is:
Idempotent pipelines
Parallelizable backfills
Table partitions are immutable objects (must be overwritten, never added to)
In the world of batch processing, these principles created an era where we could finally achieve DATA QUALITY AT SCALE.
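For illustration, here's a minimal sketch of that Hive-era pattern in Spark (the table and column names are hypothetical):

# Hive-era idempotency: every run overwrites the entire partition for that day
# (table and column names are hypothetical)
spark.sql("""
    INSERT OVERWRITE TABLE db.fct_events PARTITION (ds = '2025-01-01')
    SELECT user_id, event_type, event_ts
    FROM db.staging_events
    WHERE ds = '2025-01-01'
""")
# Re-running this job produces the exact same partition contents:
# an idempotent, parallelizable backfill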
But what about other use cases? What about low-latency use cases where we don’t want to keep overwriting partitions? This is the first reason why Iceberg was developed at Netflix!
Iceberg unlocks INSERT INTO for low-latency use cases. If you’re a seasoned data engineer, you might be asking, “But what about the small file problem?”
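We'll get to that in a second. First, here's a minimal sketch of what the append pattern looks like (catalog and table names are hypothetical):

# Iceberg appends: INSERT INTO adds new data files without rewriting the partition
spark.sql("""
    INSERT INTO my_catalog.db.fct_events
    SELECT user_id, event_type, event_ts
    FROM db.staging_events
    WHERE event_ts > '2025-01-01 00:00:00'
""")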
The small file problem
Hive handles the small file problem by forcing the writer to write one consolidated partition, which makes the problem a non-issue.
Iceberg offers file compaction, both manual and platform-managed. Tabular offers regularly scheduled file compaction so you don't have millions of tiny files bogging down the efficiency of your low-latency pipelines!
Compacting files is a breeze in Spark. Just call this single line:
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.my_iceberg_table')")
Restoring data after failure or error
In Hive, when you overwrote good data with bad data, it was a huge problem! Bouncing back from data failures was costly, time-consuming, and painful!
Iceberg has a snapshotting feature that is similar to Git and allows for things like time travel and rolling back to a known-good version of your table.
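Here's a minimal sketch of both features in Spark (catalog/table names and the snapshot ID are hypothetical):

# Time travel: query the table as it existed at an earlier point in time
spark.sql("""
    SELECT * FROM my_catalog.db.fct_events
    TIMESTAMP AS OF '2025-01-01 00:00:00'
""")

# Rollback: restore the table to a known-good snapshot after a bad write
spark.sql("CALL my_catalog.system.rollback_to_snapshot('db.fct_events', 123456789)")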
How to build scalable architecture with Iceberg
In a conversation with Jason Reid (cofounder of Tabular), we talked about how Iceberg will become like Postgres but for data lakes.
Iceberg will act as the agnostic, cheap, high-latency storage layer for the rest of your data systems.
You can see how vendors are scrambling to support Iceberg:
Snowflake supports external and managed Iceberg tables
Databricks bought Tabular to integrate Iceberg with Unity Catalog and have a fully supported platform!
Confluent developed TableFlow, which allows you to have Flink write directly to Iceberg tables!
AWS introduced S3 Tables, which are Iceberg-based
The billions of dollars being poured into this technology are overwhelming!
Iceberg isn't suitable for all use cases
Iceberg data travels slowly because it's files on S3, so the latency will never be great! Please never build software applications or user-facing dashboards on top of Iceberg data. The latency will always be "kinda bad"!
Instead, move latency-sensitive data to more suitable stores, such as managed Iceberg tables in Snowflake (not external ones, since those will still be slow), or better yet, pick a low-latency store like Redis or Druid!
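As a minimal sketch of that pattern, assuming a hypothetical aggregate table and a local Redis instance:

import redis

# Push a small, latency-sensitive aggregate from Iceberg into Redis for serving
# (catalog/table names are hypothetical)
r = redis.Redis(host="localhost", port=6379)
rows = spark.sql("SELECT user_id, total_spend FROM my_catalog.db.agg_user_spend").collect()
for row in rows:
    r.set(f"user_spend:{row['user_id']}", float(row['total_spend']))
# Dashboards now read from Redis in sub-millisecond time instead of scanning S3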
Common architecture pattern changes
Daily production dimension snapshots will be shifted to change data capture (CDC) pipelines that are low latency and generate small files (see the MERGE INTO sketch below).
Kappa architecture will beat Lambda architecture for most use cases, given how Iceberg is blurring the line between data-in-motion and data-at-rest.
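Here's a minimal sketch of that CDC upsert pattern using Iceberg's MERGE INTO support in Spark (table names, paths, and keys are hypothetical):

# Apply a micro-batch of CDC records to a dimension table (hypothetical names)
spark.read.parquet("s3://my-bucket/cdc/dim_users/").createOrReplaceTempView("updates")
spark.sql("""
    MERGE INTO my_catalog.db.dim_users AS t
    USING updates AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *    -- update changed rows in place
    WHEN NOT MATCHED THEN INSERT *    -- insert brand-new rows
""")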
What other changes do you think will happen in 2025? If you enjoyed this article, make sure to share with your friends! What other topics should I write about this year?
Some materials from this article were sourced from the first day of the January 2025 boot camp! The first 5 people to use code ICEBERG30 at checkout at DataExpert.io will get access to either the January 2025 boot camp (we started last week but you still have time to join!) or an annual/monthly subscription to the DataExpert.io academy!
Hi Zach,
Nice introductory post but how can you write about Iceberg, compare it to Hive and not mention Delta table at all? Looks like a paid post without being tagged as such.
Cheers
Michał