Spark: Distributed Memory Abstractions for Cluster Computing

Wednesday, October 27, 2010 - 16:30
TH 331
Matei Zaharia, Ph.D. Student (UC Berkeley)
High-level programming models for clusters, such as MapReduce and Dryad, have been very successful in implementing large-scale data-intensive applications. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. We present Spark, a new cluster computing framework motivated by one such class of use cases: applications that reuse a "working set" of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data mining tools. Spark provides a set of distributed memory abstractions that allow it to efficiently support applications with working sets, while retaining the scalability and fault tolerance of MapReduce. We show that Spark can outperform MapReduce by 10x in iterative machine learning jobs. In addition, Spark can be used interactively from an interpreter, allowing users to load multi-gigabyte datasets into memory across a set of machines and query them with sub-second latency.

Matei Zaharia is a fourth-year graduate student at UC Berkeley. He works with professors Scott Shenker and Ion Stoica on topics in cloud computing, operating systems, and networking. He is also a committer on the Apache Hadoop MapReduce project. He received his undergraduate degree from the University of Waterloo in Canada.