Automatic Question Categorization in Yahoo! Answers

Wednesday, April 22, 2009 - 17:30
TH 331
Byron Dom, Principal Research Scientist, Yahoo!
Yahoo! Answers addresses the problem of information retrieval by providing a forum in which users ask questions and other users answer them. As these questions and answers accumulate over time in the Yahoo! Answers database, they also form a tremendous reservoir of information that can be searched or browsed to find specific facts, advice and so on. A second dimension of Yahoo! Answers is social interaction. Beyond providing a platform and infrastructure for this activity, technology can be used to facilitate the tasks that users perform in such a system. One example is the categorization of questions by topic, location and type. This talk will begin with a brief description of Yahoo! Answers, including the problems on which machine-learning techniques can be brought to bear. The question-categorization application and its associated machine-learning techniques will then be described in more detail, focusing on how the application of machine learning significantly facilitates tasks faced by users of Yahoo! Answers.

Byron Dom is a Principal Research Scientist in Yahoo!'s Natural Language Processing department. Prior to taking this position, he was Director of Automated Content Analysis in the Yahoo! Applied Research division, where he and his team focused on applying machine learning to problems in text mining such as document categorization, clustering and information extraction. He led the development of Yahoo!'s first production automatic document-categorization system, which categorizes merchant product offers for Yahoo! Shopping. Prior to joining Yahoo! in April of 2003, he spent twenty years in IBM's Research division (Watson Research Center in Yorktown Heights, New York and Almaden Research Center in San Jose, California), first as a physicist and later as a computer scientist, when his physics research led him into the field of computer vision. His work eventually led him into the areas of machine learning and automated text analysis, where he focuses his work today. At IBM he was a Research Staff Member and manager of Information Management Principles, which focused on Web information retrieval and text mining. He has served as an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). He received a PhD in physics from the Catholic University of America.