RFEX 2.0 – Improved RF Explainability: Case study of single-cell transcriptomics dataset from J. Craig Venter Institute
Jizhou Yang
Oral Defence Date: Monday, May 6, 2019 - 11:45
Location:
Prof. Dragutin Petkovic and Prof. Hui Yang
Machine learning (ML) offers tremendous promise and has been widely used in life science and many other areas. Despite ML’s widespread success, many ML models are still considered black boxes, since they are often complex and difficult to understand or interpret. This in turn creates problems in adopting these ML solutions, auditing their accuracy, and maintaining them. Such concerns are now surfacing and causing legal and regulatory actions, as well as concerns among adopters and users. Improving the explainability of ML is one approach to addressing these issues. Random Forest (RF) is one of the most popular ML algorithms, due to its classification accuracy as well as the level of explainability offered by its built-in feature ranking. In previous work in the Petkovic lab, the Random Forest EXplainability (RFEX) pipeline was developed to produce an easy-to-interpret summary report, and it has been applied to biomedical data. In this report, our goals were to: a) improve RFEX to make it more statistically reliable, and b) apply RFEX 2.0 to single-cell RNA-seq data from Prof. Dr. Scheuermann’s lab at the J. Craig Venter Institute (JCVI) in San Diego. Working with Prof. Petkovic and other team members, we improved RFEX by adding statistically more reliable measures to its pipeline, creating RFEX 2.0. We then performed in-depth cell-type classification on three major cell types from the JCVI data by applying RFEX 2.0, and we analyzed the results. Our results confirmed the initial approach at JCVI and offered deeper insight into how RF performed its very successful classification. Among other findings, we confirmed the need to use class-specific RF feature ranking in the case of unbalanced data, an approach rarely implemented in current applications of RF. We also explored the use of Balanced Random Forest (recommended for unbalanced data sets) but found it less effective.
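To illustrate the class-specific feature-ranking idea mentioned above, here is a minimal sketch that contrasts a single multi-class RF's global feature ranking with per-class rankings obtained by training one one-vs-rest RF per class. This is a generic illustration using scikit-learn on synthetic unbalanced data, not the RFEX 2.0 code or the JCVI dataset; all names and parameters are hypothetical.

```python
# Sketch: class-specific Random Forest feature ranking via one-vs-rest
# classifiers, so minority classes get their own importance scores.
# This is an assumption-laden illustration, not the RFEX pipeline itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic unbalanced 3-class data standing in for cell-type labels.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

# Global ranking from a single multi-class RF (the usual approach).
rf_global = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
global_rank = np.argsort(rf_global.feature_importances_)[::-1]

# Class-specific ranking: one binary RF per class (class c vs. the rest).
class_rank = {}
for c in np.unique(y):
    rf_c = RandomForestClassifier(n_estimators=200, random_state=0)
    rf_c.fit(X, (y == c).astype(int))
    class_rank[c] = np.argsort(rf_c.feature_importances_)[::-1]

print("global top-5 features:", global_rank[:5])
for c, rank in class_rank.items():
    print(f"class {c} top-5 features:", rank[:5])
```

With unbalanced classes, the global ranking tends to be dominated by features that separate the majority class, while the per-class rankings can surface features important for the minority classes.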