Variable Importance in Microenvironment-based Protein Function Analysis


Arthur Vigil

Oral Defense Date: 



TH 434


Prof. Dragutin Petkovic, Assoc. Prof. Kaz Okada, Research Assoc. CCLS Mike


Over the past few decades, Machine Learning (ML) researchers have made great progress in developing methods and approaches for building predictive models that perform well in a variety of environments and situations. ML researchers are often content to assess the success or failure of a particular model based on its performance in making correct predictions. When moving from academic to practical real-world applications, however, it is often critical to be able to explain how ML decisions and classifications are made, so that they are understandable to humans who are experts in their own domain but not ML experts, and who must use, maintain, or convince others to use ML systems. Our work focuses on providing better explainability in ML, specifically for the Random Forest modeling approach. Previous work has demonstrated that Random Forest is competitive with other state-of-the-art ML approaches in terms of predictive performance. In this work we investigate Random Forest as a tool for promoting explainability, using both existing and novel methods for assessing variable importance. We apply our approach to the Stanford FEATURE dataset to demonstrate its utility in better understanding protein function as it relates to the structural properties contained in that dataset. We have formalized the explainability problem and developed a one-page “explainability summary” which we believe increases explainability for non-expert users. A large set of experiments has been performed on 7 FEATURE models to develop and evaluate our approach. This work has been supported in part by NIH grant LM05652 and the SFSU Center for Computing for Life Sciences.
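To illustrate the kind of variable-importance measures referred to above, the sketch below computes both the built-in (impurity-based) importance and permutation importance for a Random Forest using scikit-learn. This is a generic illustration on synthetic data, not the thesis code; the FEATURE dataset itself is not used here, and all names and parameters are illustrative assumptions.

```python
# Illustrative sketch only: Random Forest variable importance via scikit-learn,
# on a synthetic stand-in for a FEATURE-style numeric property matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (10 variables, a few informative).
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Built-in (Gini) importance: mean impurity decrease attributed to each variable.
gini_imp = rf.feature_importances_

# Permutation importance: drop in held-out score when one variable is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_imp = perm.importances_mean

# Rank variables by each measure, most important first.
gini_rank = sorted(range(X.shape[1]), key=lambda i: -gini_imp[i])
perm_rank = sorted(range(X.shape[1]), key=lambda i: -perm_imp[i])
print("Gini ranking:       ", gini_rank)
print("Permutation ranking:", perm_rank)
```

Comparing the two rankings is itself informative: impurity-based importance is computed from the training data and can favor high-cardinality variables, while permutation importance reflects held-out predictive impact.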


Machine learning; random forest; Stanford FEATURE; explainability

