A Systematic Study Of Twelve Feature Selection Algorithms And Their Impact On Text Classification Task


Suchi Vora

Oral Defence Date: 



TH 434


Profs. Hui Yang, James Wong & Assistant Prof. Anagha Kulkarni


Feature selection (and ranking) is routinely used as a preprocessing step to conquer the ‘curse of dimensionality’ when mining high-dimensional big data (e.g., text data). In contrast to dimensionality reduction techniques such as PCA, feature selection yields features that are more interpretable, and feature selection algorithms are often more scalable when handling today’s big data. A large number of feature selection algorithms have been proposed in the literature, so a critical challenge is deciding which algorithm to use for a given task. In this study, we address this issue by (1) implementing an open-source software system that consists of 12 feature selection algorithms and supports five common classifiers, available at https://github.com/SuchiVora/Feature-Selection-Pipeline; and (2) systematically comparing and evaluating the selected features and their impact on these classifiers using five datasets. Nearly 1000 experiments were conducted in our evaluation. Our main observations include: (1) algorithms in the following groups produce almost identical or highly similar results: (Chi-square Score, Information Gain), (Fisher Score, Laplacian Score, Gini Index), and (FCBF, CFS, mRmR); (2) filter-based methods deliver better classification results; and (3) of our five classifiers, SVM generally performs the best with the majority of the feature selection algorithms. These observations will provide practical guidelines for the data analytics community at large.
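
As a rough illustration of how two filter-based rankers from the first group can be compared, the sketch below scores the terms of a tiny made-up corpus with the Chi-square Score and with Information Gain and checks whether the two rankings agree. It is not taken from the thesis software: the corpus, the labels, and the function names (contingency, chi_square, information_gain, rank) are assumptions chosen for this example, and agreement on four toy documents says nothing about the five datasets used in the study.

```python
import math

# Hypothetical toy corpus: each document is a set of terms, with a class label.
docs = [
    {"cheap", "pills", "offer"},
    {"cheap", "offer", "now"},
    {"meeting", "agenda", "now"},
    {"meeting", "notes", "agenda"},
]
labels = ["spam", "spam", "ham", "ham"]
vocab = sorted(set().union(*docs))


def contingency(term, positive="spam"):
    """Count documents by term presence and class membership.

    Returns (a, b, c, d): term and class, term and not class,
    no term and class, no term and not class.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term, in_class = term in tokens, label == positive
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def entropy(counts):
    """Shannon entropy (in bits) of a list of counts."""
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts if k)


def information_gain(a, b, c, d):
    """Reduction in class entropy after observing term presence or absence."""
    n = a + b + c + d
    h_cond = 0.0
    for pos, neg in ((a, b), (c, d)):
        if pos + neg:
            h_cond += (pos + neg) / n * entropy([pos, neg])
    return entropy([a + c, b + d]) - h_cond


def rank(score_fn):
    """Rank terms by descending score; ties are broken alphabetically."""
    return sorted(vocab, key=lambda t: (-score_fn(*contingency(t)), t))


chi_ranking = rank(chi_square)
ig_ranking = rank(information_gain)
print(chi_ranking)
print(ig_ranking)
print("identical ranking:", chi_ranking == ig_ranking)
```

On this toy input both scores place the class-exclusive terms first and the term shared equally by both classes last. A real comparison would use the full vocabulary of each dataset and a rank-correlation or overlap measure rather than exact equality.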


Feature Selection; Feature Ranking; Text Classification; Evaluation and Comparison

