Measuring the Accuracy of Clustering Algorithms

Wednesday, November 3, 2010 - 16:30
TH 331
Byron Dom, Ph.D. (Content Sciences, YAHOO)

In contrast to supervised classification, (unsupervised) clustering does not admit an obvious characterization (e.g., classification error) of the associated accuracy. Researchers do not even fully agree on the best general way of approaching this issue. Nonetheless, a commonly used means of assessing the accuracy of clustering algorithms is to compare the clusters they produce with "ground truth" consisting of classes assigned manually or by some other trusted means. Measures of such agreement are referred to as "external" or "extrinsic" and will be the focus of this talk. Having decided to use such an approach, however, one is still left with the question of which measure of agreement between the two classification schemes to use.

This talk will:

1. describe the problem of measuring the accuracy of clustering algorithms, emphasizing its unique aspects.
2. survey methods that have been proposed and used for measuring clustering accuracy.
3. describe in detail an information-theoretic measure that overcomes the problems associated with other measures. This measure has been used at Yahoo! in recent work on clustering Yahoo!-News search results. It measures the utility of the cluster labels as predictors of their associated ground-truth class labels by computing the reduction in the number of bits that would be required to encode the class labels conditioned on the cluster labels.
4. derive the measure of Item 3 based on information-theoretic considerations and show experimentally that it satisfies a set of desired properties.
5. relate the described measure to recent, closely related work by others.
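
The "reduction in the number of bits" idea from the list above can be illustrated with a minimal sketch. It is not the speaker's measure: it assumes plain empirical entropies and computes only the per-item saving H(class) - H(class | cluster). The function names are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy_bits(labels):
    """Empirical entropy of a label sequence, in bits per item."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy_bits(classes, clusters):
    """Empirical H(class | cluster): average per-cluster class entropy,
    weighted by cluster size."""
    n = len(classes)
    members = {}
    for cls, clu in zip(classes, clusters):
        members.setdefault(clu, []).append(cls)
    return sum(len(m) / n * entropy_bits(m) for m in members.values())


def bits_saved_per_item(classes, clusters):
    """Illustrative assumption: bits per item saved when encoding the class
    labels once the cluster labels are known (empirical mutual information)."""
    return entropy_bits(classes) - conditional_entropy_bits(classes, clusters)


# Hypothetical toy data: two ground-truth classes, two clusterings.
truth = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]       # clusters match the classes exactly
uninformative = [0, 0, 0, 0]  # one cluster, tells us nothing

print(bits_saved_per_item(truth, perfect))
print(bits_saved_per_item(truth, uninformative))
```

One limitation of this simplified version: putting every item in its own cluster drives the conditional entropy to zero, so the raw saving alone rewards over-fragmented clusterings. Any practical measure needs some correction for that, which this sketch does not attempt.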


Byron Dom is a Principal Research Scientist in Yahoo's Natural Language Processing department. Prior to taking this position, he was Director of Automated Content Analysis in the Yahoo! Applied Research division, where he and his team focused on applying machine learning to problems in text mining such as document categorization, clustering and information extraction. He led the development of Yahoo's first production automatic document-categorization system, which categorizes merchant product offers for Yahoo Shopping. Prior to joining Yahoo! in April of 2003, he spent twenty years in IBM's Research division (Watson Research Center in Yorktown Heights, New York and Almaden Research Center in San Jose, California), first as a physicist and later as a computer scientist, after his physics research led him into the field of computer vision. His work eventually led him into the areas of machine learning and automated text analysis, where he focuses his work today. At IBM he was a Research Staff Member and manager of Information Management Principles, a group that focused on Web information retrieval and text mining. He has served as an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). He received a PhD in physics from the Catholic University of America.