Matei Zaharia, UC Berkeley
Today's most popular computer applications are Internet services like Google, Facebook, and Amazon. In addition to serving millions of hits per day on the front end, these services must analyze hundreds of terabytes of data on the back end for applications like search, spam detection, and business intelligence, using clusters of thousands of machines. I will talk about MapReduce, a simple but surprisingly versatile programming model for clusters that was developed at Google and popularized through the Hadoop open-source project. I will also survey some of the higher-level programming tools being developed on top of MapReduce and related systems, such as Yahoo's Pig and Microsoft's DryadLINQ, which simplify large-scale parallel programming. Finally, I will show how cloud computing services have made it possible for small companies and research groups to take advantage of these large-scale data processing systems.
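To make the programming model concrete, here is a minimal, self-contained sketch of MapReduce applied to word count, the canonical example. It simulates the map, shuffle, and reduce phases in ordinary Python on one machine; a real system like Hadoop runs each phase in parallel across a cluster. The function names here are illustrative, not part of the Hadoop API.

```python
# Word count in the MapReduce style, simulated locally.
# In a real deployment, map and reduce tasks run on many machines,
# and the framework performs the shuffle (group-by-key) in between.
from collections import defaultdict

def map_phase(document):
    # User-supplied map function: emit (word, 1) for each word in a record.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Framework step: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # User-supplied reduce function: sum the counts for one word.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The appeal of the model is that the programmer writes only the two pure functions; partitioning, scheduling, and fault tolerance are handled by the framework, which is what makes it practical on clusters of thousands of machines.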