Criteo is a data-driven company. Every day we digest dozens of terabytes of new data to train recommendation models that serve requests at the scale of the internet. Spark is our tool of choice for processing big data. It is a powerful and flexible instrument, but it has a steep learning curve, and using it effectively often requires reading the framework's source code.
The fast processing of big data has a critical business impact for us:
- we refresh our models often, which brings extra performance for our clients
- we have a low time-to-market for new ML-powered products because we can iterate quickly
- it impacts our infrastructure cost
In this post, I will discuss writing efficient Spark code and demonstrate
common pitfalls on toy examples. I show that Spark SQL (Datasets) should
generally be preferred to the Spark Core API (RDD) and that making the
right choice can speed up your big data jobs by a factor of 2 to 10,
a difference that matters.