Matei Zaharia, Ph.D. Student (UC Berkeley)
High-level cluster programming models such as MapReduce and Dryad have been very successful at implementing large-scale data-intensive applications. However, most of these systems are built around an acyclic data flow model that is not well suited to other popular classes of applications. We present Spark, a new cluster computing framework motivated by one such class: applications that reuse a "working set" of data across multiple parallel operations. This class includes many iterative machine learning algorithms, as well as interactive data mining tools. Spark provides a set of distributed memory abstractions that let it efficiently support applications with working sets, while retaining the scalability and fault tolerance of MapReduce. We show that Spark can outperform MapReduce by 10x on iterative machine learning jobs. In addition, Spark can be used interactively from an interpreter, allowing users to load multi-gigabyte datasets into memory across a set of machines and query them with sub-second latency.
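To make the "working set" idea concrete, the following is a minimal single-machine sketch (plain Python, not Spark's API) of the access pattern the abstract describes: an iterative algorithm loads its dataset once and reuses it in memory across every iteration, whereas an acyclic data flow engine like MapReduce would re-read it from storage on each pass. The dataset and the logistic-regression loop here are illustrative assumptions, not taken from the paper.

```python
import math

def load_points():
    # Stand-in for loading a dataset; in Spark this working set would be
    # partitioned across the cluster and kept in distributed memory.
    # Each element is (features, label) with label in {+1, -1}; toy data.
    return [((1.0, 2.0), 1.0), ((2.0, 0.5), -1.0), ((0.5, 1.5), 1.0)]

def logistic_regression(points, iterations=10, lr=0.1):
    # Iterative gradient descent: every iteration scans the SAME in-memory
    # working set, which is what makes caching it worthwhile.
    w = [0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0, 0.0]
        for x, y in points:
            margin = y * (w[0] * x[0] + w[1] * x[1])
            coef = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
            grad[0] += coef * x[0]
            grad[1] += coef * x[1]
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w

cached = load_points()            # loaded once, reused every iteration
w = logistic_regression(cached)   # no per-iteration re-read of the data
```

In Spark the same pattern is expressed by marking a distributed dataset as cacheable, so repeated parallel operations over it hit memory instead of re-running the acyclic job that produced it.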