When you start coding in Spark, you will wonder when to use a Dataset, an RDD, or a DataFrame.
A clear answer is given below:
If you are developing primarily in Java then it is worth considering a move to Scala before adopting the DataFrame or Dataset APIs. Although there is an effort to support Java, Spark is written in Scala and the code often makes assumptions that make it hard (but not impossible) to deal with Java objects.
If you are developing in Scala and need your code to go into production with Spark 1.6.0 then the DataFrame API is clearly the most stable option available and currently offers the best performance.
However, the Dataset API preview looks very promising and provides a more natural way to code. Given the rapid evolution of Spark, it is likely that this API will mature very quickly through 2016 and become the de facto API for developing new applications.
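To make the "more natural way to code" point concrete, here is a minimal sketch that applies the same filter through each of the three APIs. It assumes Spark 1.6.x on the classpath and a local master; the `Person` case class, the `ApiComparison` object, and the sample data are hypothetical names invented for this illustration, not taken from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, used only for this illustration.
// It is defined at the top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master is good enough for a quick experiment.
    val conf = new SparkConf().setAppName("api-comparison").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables toDF() and as[Person]

    val people = Seq(Person("Ann", 34), Person("Bob", 19))

    // RDD: typed lambdas, but no query optimizer behind them.
    val rdd = sc.parallelize(people)
    val adultsRdd = rdd.filter(_.age >= 21)

    // DataFrame: optimized execution, but columns are referenced by name,
    // so a misspelled column only fails at runtime.
    val df = rdd.toDF()
    val adultsDf = df.filter(df("age") >= 21)

    // Dataset: typed lambdas again, this time on top of the SQL engine.
    val ds = df.as[Person]
    val adultsDs = ds.filter(_.age >= 21)

    adultsRdd.collect().foreach(println)
    adultsDf.show()
    adultsDs.collect().foreach(println)

    sc.stop()
  }
}
```

The RDD and Dataset versions read almost identically, and that is the appeal: the Dataset API keeps the compile-time-checked lambda style of RDDs while running on the same engine as DataFrames.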
All the details are in this post: