BlinkDB : Queries with Bounded Errors and Bounded Response Times on Very Large Data


Today a few words about impressive technology: BlinkDB.

Sameer Agarwal† from Berkeley wrote a paper on 2013 about this new technology for querying large datasets.

BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade-off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. 
To achieve this, BlinkDB uses two key ideas: 
(1) An adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time
(2) A dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy and/or response time requirements. 

BlinkDB has been evaluated on the well-known TPC-H benchmarks, a real-world analytic workload derived from Conviva Inc. and are in the process of deploying it at Facebook Inc. 

Read more here: