Tom Gianos, senior software engineer at Netflix, and Dan Weeks, engineering manager for Big Data Compute at Netflix, discussed the company's big data strategy and analytics infrastructure during QCon San Francisco 2016. This included a summary of the scale of their data, their S3 data warehouse, and Genie, their big data federated orchestration system.
To set out their requirements, Weeks explained that the biggest big data challenge at Netflix is scale. The company has over 86 million members globally, streaming over 125 million hours of content per day. This results in a data warehouse of over 60 petabytes.
Whilst people may think of video streaming data as the source for analytics at Netflix, Weeks explained that they mainly deal with other types of data, such as events produced by their various microservices and campaigns. Weeks elaborated:
“Netflix is a very data driven company. We like to make decisions based on evidence. We don’t want to make changes to the platform that we can’t substantiate will actually improve the experience for users.”
Weeks listed one use case for this type of data as A/B testing. Data scientists are able to analyse user interactions and then decide on what features to push out permanently.
Weeks also gave an overview of the data pipeline architecture at Netflix. There are two streams, one for event data, and one for dimension data. Event data goes through their Kafka data pipeline, whereas dimension data is pulled from their Cassandra cluster using their open source tool, Aegisthus. Ultimately, all types of data end up in S3.
Whilst data warehouses have traditionally used HDFS, Weeks explained some of the advantages of using S3 as an alternative. These include 99.99% availability, versioning, and the ability to decouple compute from storage. The last point was key: despite the increase in latency from the lack of data locality, it made it easy to scale compute clusters and perform upgrades without having to move the data.
Once the data is in the warehouse, Weeks explained, they locate what they need using a metadata system called Metacat. It provides information on what the data is, where it is located, and how to process it. Because it’s a federated system, it works on top of Hive, RDS, S3 and more.
Weeks also explained that the data itself is stored in the Parquet file format. It is column oriented, allowing for improved compression. Parquet files also store additional metadata, such as information about the min / max length of columns and their sizes. This allows operations such as counts or skipping to be performed very quickly.
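To illustrate why stored statistics make counts and skipping cheap, here is a minimal Python sketch. It is a toy model rather than the Parquet format or any Parquet library: the `RowGroup` class and both functions are hypothetical names, and it assumes each chunk of a column records its row count and the minimum and maximum of its values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical chunk of one integer column plus summary statistics."""
    values: List[int]  # assumed non-empty

    def __post_init__(self) -> None:
        # Statistics are computed once, when the chunk is "written".
        self.num_rows = len(self.values)
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def count_rows(groups: List[RowGroup]) -> int:
    """Count rows from the stored statistics alone, without reading values."""
    return sum(g.num_rows for g in groups)


def scan_equal(groups: List[RowGroup], target: int) -> Tuple[int, int]:
    """Return (matching rows, chunks actually scanned) for value == target."""
    matches = 0
    scanned = 0
    for g in groups:
        # Skip any chunk whose min/max range cannot contain the target.
        if target < g.min_value or target > g.max_value:
            continue
        scanned += 1
        matches += sum(1 for v in g.values if v == target)
    return matches, scanned
```

In this sketch a count touches only the per-chunk metadata, and a filtered scan reads only the chunks whose value range could contain the target.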
More information on Parquet tuning is published by Ryan Blue, senior software engineer at Netflix.
Whilst Weeks had discussed big data at a lower level, Gianos explained things from a higher level perspective. Most of the focus was on Genie, a federated orchestration engine designed to manage various big data jobs such as Hadoop, Pig, Hive and more.
To explain the requirement for Genie, Gianos started with a simple use case of a few users all accessing the same cluster. Whilst this is straightforward to manage, in a larger organisation it can evolve into a situation that demands more client resources, more cluster resources, and more complicated deployments. Data scientists then struggle with issues such as slow jobs and out-of-date data processing libraries, and it becomes difficult for a system administrator to respond to them.
Gianos explained that Genie allowed system administrators to launch and manage clusters, install binaries, and more, whilst being decoupled from the end user. From a user perspective, it then gives them an abstraction for accessing the clusters without having to worry about how to connect to them or know what to run on them.
For cluster updates, Gianos explained that all that’s needed is for tags to be migrated across to the new cluster once its tests have passed. Genie orchestrates everything such that old jobs will still finish on the old cluster, whilst any new jobs will be run on the new one. This results in no downtime.
Gianos also explained that Genie’s tagging mechanism can be used for load balancing, by simply replicating tags across clusters in order to split load. From a client perspective, this is completely transparent.
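The tag-based approach to cluster updates and load balancing can be sketched in Python as follows. This is an illustration under stated assumptions, not Genie's API: the `TagRouter` class, its methods, the cluster names and the tag strings are all hypothetical, and round-robin is an assumed balancing policy that the talk did not specify.

```python
from itertools import count
from typing import Dict, List, Set


class TagRouter:
    """Toy registry mapping cluster names to tag sets (hypothetical)."""

    def __init__(self) -> None:
        self._tags: Dict[str, Set[str]] = {}
        self._counter = count()  # drives the assumed round-robin policy

    def register(self, cluster: str, tags: Set[str]) -> None:
        self._tags[cluster] = set(tags)

    def move_tag(self, tag: str, old: str, new: str) -> None:
        """Migrate a tag so that new jobs resolve to the new cluster."""
        self._tags[old].discard(tag)
        self._tags[new].add(tag)

    def candidates(self, wanted: Set[str]) -> List[str]:
        """Clusters carrying every requested tag, in a stable order."""
        return sorted(c for c, t in self._tags.items() if wanted <= t)

    def route(self, wanted: Set[str]) -> str:
        matches = self.candidates(wanted)
        if not matches:
            raise LookupError(f"no cluster carries tags {sorted(wanted)}")
        # Replicated tags yield several matches, so the load is split.
        return matches[next(self._counter) % len(matches)]


router = TagRouter()
router.register("cluster-old", {"sched:sla", "type:yarn"})
router.register("cluster-new", {"type:yarn"})
before = router.route({"sched:sla"})  # resolves to the old cluster
router.move_tag("sched:sla", "cluster-old", "cluster-new")
after = router.route({"sched:sla"})   # resolves to the new cluster
```

Clients only ever name tags, so in this model neither the migration nor the replication of a tag requires any change on their side.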
Gianos also explained the binary update mechanism of Genie. New binaries are moved across to a centralised download location, and then automatically used to replace the old ones on the next invocation.
From a data scientist perspective, Gianos demonstrated the Genie workflow. Essentially, they submit a job to Genie, which includes metadata such as cluster tags, and the big data processing engine they would like to use. Genie is then able to find the appropriate cluster to run the job on. The Genie UI then gives the user feedback on the job that’s running.
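As a rough illustration of this workflow, the Python sketch below models a job request that carries cluster tags and an engine, an orchestrator that resolves it to a cluster, and a job record the user can query for feedback. None of these names come from Genie: `JobRequest`, `Cluster`, `JobRecord`, `Orchestrator` and the status strings are hypothetical, and the first-match selection rule is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class JobRequest:
    cluster_tags: FrozenSet[str]  # where the job is allowed to run
    engine: str                   # e.g. "hive" or "pig"
    script: str


@dataclass(frozen=True)
class Cluster:
    name: str
    tags: FrozenSet[str]
    engines: FrozenSet[str]


@dataclass
class JobRecord:
    job_id: int
    request: JobRequest
    cluster: Optional[str]
    status: str  # hypothetical states: "RUNNING" or "FAILED"


class Orchestrator:
    """Toy stand-in for the orchestration layer (hypothetical)."""

    def __init__(self, clusters: List[Cluster]) -> None:
        self._clusters = list(clusters)
        self._jobs: Dict[int, JobRecord] = {}

    def submit(self, request: JobRequest) -> int:
        job_id = len(self._jobs) + 1
        # Pick the first cluster with all requested tags and the engine.
        chosen = next(
            (c.name for c in self._clusters
             if request.cluster_tags <= c.tags and request.engine in c.engines),
            None,
        )
        status = "RUNNING" if chosen is not None else "FAILED"
        self._jobs[job_id] = JobRecord(job_id, request, chosen, status)
        return job_id

    def status(self, job_id: int) -> JobRecord:
        """Feedback for the user, standing in for what a UI would show."""
        return self._jobs[job_id]
```

The user never names a cluster directly in this model; the request states what is needed and the orchestrator decides where it runs.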
For more information, the full presentation can be watched online.