Troubleshooting Petascale Distributed Science

Wednesday, April 30, 2008 - 17:30
TH 331
Dan Gunter, Lawrence Berkeley National Laboratory

Petascale science involves not only the creation of massive datasets at supercomputers or experimental facilities, but also the subsequent analysis of that data by a user community that may be distributed across many laboratories and universities. This talk will describe, at a high level, three challenges: developing tools to detect and diagnose failures in end-to-end data placement and distributed application hosting configurations; constructing an end-to-end monitoring architecture that uses instrumented services to provide detailed data for both background collection and run-time, event-driven collection; and constructing new monitoring analysis tools that can detect failures and performance anomalies and predict system behavior from archived data and event logs.


Dan Gunter is a Computer Scientist at Lawrence Berkeley National Laboratory (LBNL). He has worked at LBNL since 1999, when he joined as a student intern while finishing his M.S. in Computer Science at SFSU. He is currently involved in two projects funded by the DOE Office of Science: the Center for Enabling Distributed Petascale Science (CEDPS) and the Performance Engineering Research Institute (PERI).