Spark offers so many different APIs and languages that it can be overwhelming to figure out which one is “best.”
In this article I’ll discuss the tradeoffs between them, since there’s a lot of dogma and misinformation out there!
The SparkSQL API
SQL APIs are data scientists’ and analysts’ best friend. Since SQL is the lingua franca of the data space, SparkSQL should be associated with openness.
SparkSQL is often best for pipelines that:
Are built in collaboration with non-engineers
Are subject to a lot of mutation and change
Only work on data sources that are already in the warehouse / data lake
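To make this concrete, here’s a minimal SparkSQL sketch that aggregates a table already registered in the warehouse. The table, column, and app names are hypothetical, just to illustrate the shape of this kind of pipeline:

```scala
import org.apache.spark.sql.SparkSession

object SqlPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sparksql-sketch")
      .getOrCreate()

    // The query reads a table assumed to already exist in the
    // warehouse / metastore (hypothetical names).
    val dailyRevenue = spark.sql(
      """
        |SELECT order_date, SUM(amount) AS total_revenue
        |FROM warehouse.orders
        |GROUP BY order_date
        |""".stripMargin)

    // Write the aggregate back as another warehouse table.
    dailyRevenue.write.mode("overwrite").saveAsTable("warehouse.daily_revenue")

    spark.stop()
  }
}
```

The SQL itself is the part analysts and data scientists can read, review, and change without touching any of the surrounding plumbing.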
On the flip side, software engineers think SQL is terrible. They will say, “pick DataFrames because SQL isn’t modular.”
SparkSQL isn’t the best for pipelines that:
Leverage 3rd party sources such as REST APIs, Kafka topics, or GraphQL
Have complex integrations with other systems (e.g., compiling against server libraries)
Need extensive unit and integration test coverage
Need modularity
The DataFrame API
DataFrames should be associated with a middle-ground approach. Analysts and data scientists sometimes know them, and that’s okay if they don’t!
DataFrames are often best for pipelines that:
Require fewer changes and are more “hardened”
Have 3rd party integrations from REST APIs or other non-table sources
Need extensive unit and integration test coverage (Chispa is pretty good for PySpark testing)
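Here’s a minimal DataFrame sketch along those lines. The paths and column names are hypothetical; the point is that keeping the transformation as a pure DataFrame-to-DataFrame function is what makes it easy to cover with unit tests:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object OrdersJob {
  // A pure DataFrame => DataFrame function: easy to call from a unit test
  // with a small hand-built DataFrame (column names are hypothetical).
  def dailyRevenue(orders: DataFrame): DataFrame =
    orders
      .filter(col("status") === "completed")
      .groupBy(col("order_date"))
      .agg(sum(col("amount")).alias("total_revenue"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataframe-sketch").getOrCreate()

    // Example of a non-table source: JSON dumped from a REST API
    // (the bucket path is hypothetical).
    val orders = spark.read.json("s3://my-bucket/raw/orders/")

    dailyRevenue(orders)
      .write.mode("overwrite")
      .parquet("s3://my-bucket/curated/daily_revenue/")

    spark.stop()
  }
}
```

In a test you’d build a tiny input DataFrame, run it through `dailyRevenue`, and compare the result against the expected output.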
DataFrames are less familiar to non-engineering data professionals, and they have their own limitations as well.
DataFrames aren’t the best for pipelines that:
Need collaboration between many non-engineer professionals
Need static typing guarantees that the Dataset API offers
The Dataset API
Datasets are the least common API to work with. The main reason is that they’re offered only in Java and Scala! The rise of PySpark has made this API less relevant. But when I worked at Airbnb, we were required to use this API for any MIDAS pipelines!
Datasets are often best for pipelines that:
Need static typing guarantees. This makes CI/CD much more powerful than it is for Python-based pipelines. Unit testing with Datasets is so good!
Are owned by strong JVM-based developers. If your company has tons of strong Java and Scala engineers, or you have a backend like Spring Boot, the integrations here can be powerful!
Are part of a larger ecosystem of pipelines with many dependencies. Python dependency management is terrible. Scala’s is vastly superior. Gradle makes pip look like little league!
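Here’s a minimal Dataset sketch showing what those static typing guarantees buy you. The case class fields and paths are hypothetical; the key point is that a misspelled field or a wrong type fails at compile time rather than at 2 a.m. in production:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Case classes are what give the Dataset API its compile-time guarantees
// (field names and types here are hypothetical).
case class Order(orderId: Long, orderDate: String, amount: Double, status: String)
case class DailyRevenue(orderDate: String, totalRevenue: Double)

object TypedOrdersJob {
  // Typing against Order means any reference to a nonexistent field
  // is caught by the compiler, which is what makes CI/CD and unit testing
  // so much stronger than with untyped pipelines.
  def dailyRevenue(orders: Dataset[Order]): Dataset[DailyRevenue] = {
    val spark = orders.sparkSession
    import spark.implicits._
    orders
      .filter(_.status == "completed")
      .groupByKey(_.orderDate)
      .mapGroups((date, rows) => DailyRevenue(date, rows.map(_.amount).sum))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
    import spark.implicits._

    // as[Order] turns an untyped DataFrame into a typed Dataset.
    val orders = spark.read.parquet("s3://my-bucket/curated/orders/").as[Order]

    dailyRevenue(orders).show()
    spark.stop()
  }
}
```

Unit tests get the same benefit: you construct a `Dataset[Order]` from a small in-memory sequence, call `dailyRevenue`, and assert on typed results.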
Datasets aren’t the best for pipelines that:
Are owned by engineers who don’t want to cry while learning Scala
Need faster iteration cycles. Uploading a built JAR is significantly slower than uploading a PySpark script or a SparkSQL query
Need to be collaborated on by many non-engineers
We cover this and much more Spark stuff in greater detail with hands-on examples in the DataExpert.io Academy. The first ten people to use code DATAFRAME at checkout before August 31st can get 30% off!
What is your favorite API to use with Spark? Please share this with all your data engineer friends who are learning Spark!