Criteo & Spark: Under the hood of Spark performance, or why query compilation matters

In this post, I discuss how to write efficient Spark code and use toy examples to demonstrate common pitfalls. I show that Spark SQL (Datasets) should generally be preferred to the Spark Core API (RDDs), and that making the right choice can make your big data jobs 2 to 10 times faster, a difference that matters.

Read more: 

https://medium.com/criteo-labs/under-the-hood-of-spark-performance-or-why-query-compilation-matters-c084e749be87
