Cloud Computing with MapReduce and Hadoop

Wednesday, September 16, 2009 - 17:30
TH 331
Matei Zaharia, UC Berkeley

Today's most popular computer applications are Internet services like Google, Facebook, and Amazon. In addition to serving millions of hits per day on the front end, these services must analyze hundreds of terabytes of data on the back end for applications like search, spam detection, and business intelligence, using clusters of thousands of machines. I will talk about MapReduce, a simple but surprisingly versatile programming model for clusters that was developed at Google and popularized through the open-source Hadoop project. I will also tour some higher-level programming tools being developed on top of MapReduce and related systems to simplify large-scale parallel programming, such as Yahoo's Pig and Microsoft's DryadLINQ. Finally, I will show how cloud computing services have made it possible for small companies and research groups to take advantage of these large-scale data processing systems.


Matei Zaharia received his undergraduate degree from the University of Waterloo in Canada and is currently a third-year PhD student at UC Berkeley. He works in the Reliable Adaptive Distributed Systems Laboratory (RAD Lab) with Professors Scott Shenker and Ion Stoica on topics in cluster computing and networking. He worked on Facebook's data infrastructure team in summer 2008. He is also a committee member on the Apache Hadoop project.