Hybrid Approach for Classifying and Summarizing Multiple Documents
Oral Defence Date:
Thursday, December 17, 2009 - 10:00
Professors Hui Yang and Rahul Singh
The Internet has become the most prominent venue for communicating with and collecting data from user groups. One important application is to invite a targeted audience to take part in an online survey that collects individual opinions about a new product or an emerging field. Online survey questions can be of two types: Likert-scale questions and open-ended questions. Likert-scale questions do not give users enough freedom to express their thoughts. Answers to open-ended questions, on the other hand, are free-form text and give users complete control and freedom to express their thoughts. Because considerable time and effort go into designing a survey questionnaire, it is important to extract as much information as possible from the answers. Surveys usually target a particular user group, and the answers obtained from that group need to be analyzed. Since these user groups can be quite large, it is infeasible for a human to read and analyze all the answers. In this thesis we describe algorithms that automatically cluster these answers and identify their main topics. Such responses have some peculiar features: they are short, rich in outliers, and diverse in categories. As a result, traditional unsupervised and semi-supervised clustering techniques struggle to achieve the performance that a survey task demands. We address this issue by proposing a semi-supervised, topic-driven approach for clustering answers. It first employs an unsupervised algorithm to generate a preliminary clustering schema for all the answers to a question. A human expert then uses this schema to identify the major topics in these answers. Finally, a topic-driven clustering algorithm is adopted to obtain the final clustering schema. To generate the final set of topics, we find the answer in each cluster that best represents the whole cluster.
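The last step, picking the answer that best represents a cluster, can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes plain bag-of-words term counts, whitespace tokenization, and cosine similarity to the cluster centroid as the measure of representativeness. The function names (term_vector, cosine_similarity, representative_answer) are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one answer (assumption: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def representative_answer(cluster):
    """Return the answer whose term vector is closest to the cluster centroid.

    The centroid here is the sum of all term vectors in the cluster; cosine
    similarity ignores scale, so summing and averaging give the same ranking.
    """
    vectors = [term_vector(answer) for answer in cluster]
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    scored = zip(cluster, vectors)
    return max(scored, key=lambda pair: cosine_similarity(pair[1], centroid))[0]


# Hypothetical survey answers grouped into one cluster.
cluster = [
    "the price is too high",
    "price too high",
    "the price is high for students",
]
print(representative_answer(cluster))
```

A real system would likely add stop-word removal, stemming, and term weighting before comparing answers; those choices are left out here to keep the sketch short.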
To identify topics from the clustering schema, we employ four summarization techniques: (1) topic identification with varied weight, (2) topic identification using cosine distance, (3) topic identification using Kullback-Leibler (KL) distance, and (4) an ensemble approach. Our algorithms use both syntactic and semantic information for clustering analysis and topic identification. The evaluation results show that the clustering and topic identification algorithms are effective in identifying the main topics.
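As an illustration of the KL-distance idea, the following minimal sketch scores each answer by the Kullback-Leibler divergence between the word distribution of the whole cluster and the word distribution of that answer, and returns the answer with the lowest divergence. The details are assumptions made for the example, not taken from the thesis: unigram distributions, add-one smoothing (alpha), whitespace tokenization, and the direction of the divergence. All function names are hypothetical.

```python
import math
from collections import Counter


def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    """Unigram distribution over the vocabulary with additive smoothing (assumed)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / total for term in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same terms."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def closest_answer_by_kl(cluster):
    """Return (answer, scores): the answer with the smallest KL(cluster || answer).

    Expects a non-empty list of non-empty answers.
    """
    tokenized = [answer.lower().split() for answer in cluster]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    all_tokens = [t for tokens in tokenized for t in tokens]
    cluster_dist = smoothed_distribution(all_tokens, vocabulary)
    scores = [
        kl_divergence(cluster_dist, smoothed_distribution(tokens, vocabulary))
        for tokens in tokenized
    ]
    best = min(range(len(cluster)), key=scores.__getitem__)
    return cluster[best], scores


# Hypothetical survey answers grouped into one cluster.
answers = [
    "shipping was slow",
    "slow shipping and late delivery",
    "delivery was late",
]
best_answer, kl_scores = closest_answer_by_kl(answers)
print(best_answer, kl_scores)
```

Smoothing keeps every probability positive, so the logarithm is always defined even when an answer lacks some of the cluster's words. An ensemble could then combine this ranking with rankings from other measures, for example by voting, though how the thesis combines them is not stated here.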
Text mining, clustering, multi-document summarization, topic identification, semi-supervised approach