Microenvironment-based Protein Function Analysis by Random Forest


Lorenzo T. Flores

Oral Defence Date: 



TH 434


Professors Petkovic, Okada and Yang


Machine learning-based protein function prediction allows discovery of new drugs in high-throughput settings. Stanford's FEATURE framework represents protein structural information as microenvironment descriptors and facilitates molecular functional site prediction using Naive Bayes (NB) classification. Prior work at San Francisco State University, in collaboration with Stanford's Helix group, has demonstrated superior functional site predictive capabilities using the Support Vector Machine (SVM) algorithm. This thesis details our efforts to compare Random Forest (RF) classification against both NB and SVM and to evaluate the efficacy of RF variable importance (VI) measures. We compare RF to NB and SVM using the same data sets and the same accuracy measures, namely recall and precision values generated via parameter grid search with stratified 5-fold cross-validation. VI efficacy for RF was evaluated by calculating Mean Decrease in Accuracy (MDA) and Mean Decrease in Gini (MDG), accounting for RF’s stochastic nature. We establish that RF slightly outperforms SVM when predicting functional sites from FEATURE data. We also confirm that the commonly used out-of-bag (OOB) error estimates for RF are not good performance measures for FEATURE data because the data are highly unbalanced. Finally, we find that both VI measures must be modified to properly distinguish FEATURE properties contributing to 'site' classifications. This work has been partially supported by NIH grant 5R01LM005652 and the SFSU Center for Computing for Life Sciences.
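The point about error estimates on unbalanced data can be illustrated with a small worked example. The sketch below is not taken from the thesis: the class ratio (10 'site' examples among 1000 microenvironments) and both sets of predictions are hypothetical, and the helper names are ours. It computes a plain misclassification rate, which is the quantity an OOB error estimate reports over out-of-bag predictions, alongside recall and precision for the 'site' class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the 'site' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Return misclassification rate, recall and precision for 'site'."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "error": (fp + fn) / total,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical, highly unbalanced labels: 10 'site' (1), 990 'non-site' (0).
y_true = [1] * 10 + [0] * 990

# Predictor A never predicts 'site'.
always_non_site = [0] * 1000

# Predictor B finds 8 of the 10 sites at the cost of 20 false positives.
finds_sites = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

trivial = summarize(y_true, always_non_site)
useful = summarize(y_true, finds_sites)

print("always non-site:", trivial)
print("finds sites:    ", useful)
```

Under these assumed numbers, the predictor that never reports a site has the lower error rate (1% versus 2.2%) yet zero recall, while the predictor that recovers 8 of 10 sites looks worse by error alone. This is why recall and precision, rather than an overall error rate, are the more informative measures when one class is rare.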


Protein function analysis, FEATURE, machine learning, Random Forest


Lorenzo T. Flores