Data-centric Program Analysis for Distributed Systems

Tuesday, February 10, 2015 - 09:45
TH 409
Peter Alvaro, Ph.D. Candidate (UC Berkeley)
Modern data management systems are increasingly distributed across large collections of machines, both to store and process the rapidly increasing data volumes now produced by even modest-sized enterprises and to satisfy the growing hunger of analysts for “big data.” Due to fundamental complexities such as asynchronous communication and partial failure, distributed systems are notoriously difficult to program and reason about. To make matters worse, distributed systems are no longer the sole domain of experts. The relatively recent accessibility of large-scale computing resources (e.g., the public cloud) and the proliferation of reusable data management components (e.g., “NoSQL” stores, data processing frameworks, caches and message queues) have created a crisis: all programmers must learn to be distributed programmers. Few tools exist to help application programmers, data analysts and mobile developers grapple with these complexities. In this talk, I describe my thesis work developing new languages, analyses and tools to simplify the task of implementing and reasoning about large-scale distributed data-management systems. I focus in particular on analysis techniques that ensure determinism of program outcomes despite pervasive nondeterminism in their distributed executions. The first technique, monotonicity analysis, identifies programs that are tolerant to nondeterministic message orders, and repairs programs that are not. The second, lineage-driven fault injection, helps provide guarantees that distributed programs are fault-tolerant.

Peter Alvaro is a PhD candidate (degree expected in May 2015) at the University of California, Berkeley, where he is advised by Joseph M. Hellerstein. His research focuses on applying data-centric languages and analysis techniques to program and study data-intensive distributed systems. He is the creator of the Dedalus language and co-creator of the Bloom language.
Peter holds a Master's degree in Computer Science from UC Berkeley and a Bachelor of Arts in English Literature from Middlebury College. At Berkeley, he served as Co-Instructor for Programming the Cloud, an undergraduate course that studied distributed systems through the lens of software development. Prior to attending UC Berkeley, Peter worked as a Senior Software Engineer in the data analytics team at His principal interests are databases, distributed systems and programming languages. His webpage is