A Case Study on Using Machine Learning Algorithms to Classify Textual Sentences
Oral Defence Date:
Professors Hui Yang, Dragutin Petkovic, and James Wong
Rapid advances in the biomedical fields have led to explosive growth in the amount of textual data published in the biomedical research literature every day. To harness such data, techniques such as natural language processing (NLP), machine learning, and text mining have been widely adopted to help scientists extract knowledge from this extensive data pool. Entity-entity relationship extraction is one of the most important and common tasks in this process, and sentence-level classification constitutes a preliminary step of such relationship extraction. In this thesis, we describe in detail our study on constructing and selecting a relevant feature space for the effective classification of relationship-bearing sentences from the biomedical literature. We propose a novel set of features, called dependency path features (DPF), constructed using NLP techniques including parsing and typed dependency information. We use our feature space to train four representative machine learning classifiers and compare their accuracy and performance. Results show that a simple rule-based method yields accuracy comparable to, if not greater than, that of the more complex machine learning classifiers. Overall, the machine learning classifiers demonstrate better precision, whereas the rule-based method has better recall.
sentence classification, machine learning, classifier, feature space design, natural language processing, relationship extraction