Troubleshooting Petascale Distributed Science
Petascale science involves not only the creation of massive datasets at supercomputers or experimental facilities, but also the subsequent analysis of that data by a user community that may be distributed across many laboratories and universities. This talk will describe, at a high level, three challenges: developing tools to detect and diagnose failures in end-to-end data placement and distributed application hosting configurations; constructing an end-to-end monitoring architecture that uses instrumented services to provide detailed data for both background collection and run-time, event-driven collection; and constructing new monitoring analysis tools that can detect failures and performance anomalies and predict system behavior from archived data and event logs.
Dan Gunter is a Computer Scientist at Lawrence Berkeley National Laboratory (LBNL). He has been working at LBNL since 1999, when he was a student intern still finishing his M.S. in Computer Science at SFSU. He is currently involved in two projects funded by the DOE Office of Science: the Center for Enabling Distributed Petascale Science (CEDPS) and the Performance Engineering Research Institute (PERI).