Criteo & Spark: Under the hood of Spark performance, or why query compilation matters


Criteo is a data-driven company. Every day we digest dozens of terabytes of new data to train recommendation models that serve requests at internet scale. Spark is our tool of choice for processing big data. It is a powerful and flexible instrument, but it has a fairly steep learning curve, and using it effectively often requires reading the framework's source code.

Processing big data quickly has a critical business impact for us:

we refresh our models often, which brings extra performance for our clients
we have a low time-to-market for new ML-powered products because we can iterate quickly
it impacts our infrastructure cost

In this post, I discuss how to write efficient Spark code and use toy examples to demonstrate common pitfalls. I show that Spark SQL (Datasets) should generally be preferred to the Spark Core API (RDD), and that making the right choice can speed up your big data jobs by a factor of 2 to 10, which matters.

https://medium.com/criteo-labs/under-the-hood-of-spark-performance-or-why-query-compilation-matters-c084e749be87
