Detecting Large-Scale System Problems by Mining Console Logs
The console logs generated by an application contain information that the developers believed would be useful in debugging or monitoring the application. Despite the ubiquity and large size of these logs, they are rarely exploited because they are not readily machine-parsable.
We propose a novel approach for mining this source of information for system problem detection. We first combine log parsing and text mining with source code analysis to extract structure from the console logs. We then generate features from the structured information in order to detect anomalous patterns in the logs using a method based on Principal Component Analysis (PCA). Finally, we use a decision tree to distill the detection results to a format readily understandable by domain experts (e.g., developers, integrators and operators) who need not be familiar with the anomaly detection algorithms. The whole process requires no human intervention. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
This is joint work with Wei Xu, Armando Fox, David Patterson and Michael Jordan at UC Berkeley.
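The PCA-based detection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each log session has already been turned into a count vector of message types, and it scores each vector by its squared distance from the subspace spanned by the top principal components. The function names, the synthetic data, and the percentile-based threshold are all illustrative assumptions, not the procedure used in the work itself.

```python
import numpy as np

def pca_anomaly_scores(X, n_components):
    """Squared prediction error (SPE) of each row of X.

    X is an (n_samples, n_features) matrix. Here each row is assumed to be
    a vector of message-type counts for one session (hypothetical features).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # basis of the "normal" subspace
    residual = Xc - Xc @ P @ P.T                 # part not explained by that subspace
    return np.sum(residual ** 2, axis=1)

def flag_anomalies(X, n_components=1, percentile=99.0):
    """Flag rows whose SPE exceeds an empirical percentile (an assumed, simple threshold)."""
    scores = pca_anomaly_scores(X, n_components)
    threshold = np.percentile(scores, percentile)
    return scores > threshold, scores

# Synthetic example: normal sessions log events in a fixed 1:2:1 ratio,
# while the last session is missing its final event type entirely.
rng = np.random.default_rng(0)
k = rng.integers(1, 6, size=200)
normal = np.outer(k, [1, 2, 1]) + rng.normal(0.0, 0.05, size=(200, 3))
X = np.vstack([normal, [3.0, 6.0, 0.0]])

flags, scores = flag_anomalies(X)
print("highest-scoring session:", int(np.argmax(scores)))
```

The intuition is that normal sessions lie close to a low-dimensional subspace because their message counts are strongly correlated, so a session that breaks the usual correlation leaves a large residual even when its individual counts look unremarkable.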
Ling Huang is a Research Scientist at Intel Labs Berkeley. He received his Ph.D. in Computer Science from the University of California, Berkeley in 2007. During his Ph.D. studies, he was affiliated with RadLab, working with Anthony Joseph and Michael I. Jordan on decentralized anomaly detection. Ling Huang's primary research interests are in machine learning and distributed systems, with a focus on making computers more intelligent in understanding and interacting with each other and with humans. His current projects include low-power sensing and perception, fast machine learning methods for large-scale data and systems, system modeling, optimization and problem diagnosis, and network traffic monitoring and anomaly detection.