Using Random Forest Machine Learning to Assess and Predict Teamwork Learning in Software Engineering Class 2010


Shenhaochen Zhu

Oral Defence Date: 



TH 434


Professors Dragutin Petkovic and Kazunori Okada


Research suggests that roughly 10% of software (SW) projects are abandoned, about 30% fail, and more than 50% exceed their cost and schedule. Research has also found that the communication, organization and teamwork aspects of Software Engineering (SE) are primary reasons for project failures. There is therefore an urgent need for SE educators to address this issue, which involves not only teaching but also assessing students' ability to learn and apply SE teamwork, namely to adhere to the SE process and to produce SE products in a team setting. The current literature on student learning and assessment of SE teamwork skills relies mostly on qualitative and subjective data from class surveys and instructor observations collected at the end of an academic term. Because these data are qualitative and depend heavily on subjective human judgment, such instruments are difficult to apply consistently and repeatably. Simplistic data analysis methods also fail to capture the complex interactions among team members and the tools they use. Faculty at San Francisco State University, jointly with Florida Atlantic University and Fulda University, Germany, have been researching novel ways of assessing teamwork that use machine learning (ML) to analyze data extracted from their joint SE classes. Our contribution in this project is the use of the Random Forest (RF) ML method to assess and predict student teamwork learning. We applied RF to a training database of student teamwork data collected by a previous graduate student from the Fall 2010 SE class. We ran a grid search to find the RF parameters that yield the lowest out-of-bag (OOB) prediction error for each time interval. We also used the RF Gini variable importance measure to determine the three most important variables for each time interval.
Finally, we verified the reliability of the computed most important variables for the best predictive time interval, for both SE process and SE product. Because the training database is small, our results are very preliminary, but we believe they validate the use of RF to predict student teamwork learning: in some cases the OOB error is in the range of 35%, and the ranking of important variables is very robust. We therefore believe that the overall approach is sound and that it will produce more significant results once a larger training database is available.
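
The grid search over OOB error and the Gini-based ranking described above can be illustrated with a small sketch. It is not the implementation used in this work, whose toolchain and data are not given here. As a simplification, it swaps full decision trees for depth-1 trees (stumps) so that the example stays short and dependency-free, and it runs on synthetic data. All names (fit_stump, forest_oob, grid_search, the parameter grids and the data itself) are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, features):
    """Best single split over the candidate features, by Gini decrease."""
    parent = gini(y)
    best = None  # (decrease, feature, threshold, left_label, right_label)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            decrease = parent - child
            if best is None or decrease > best[0]:
                best = (decrease, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best


def forest_oob(X, y, n_trees, mtry, seed=0):
    """Bagged stumps with random feature subsets; returns (OOB error, importances)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes = [Counter() for _ in range(n)]
    importance = [0.0] * p  # summed Gini decrease per feature
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stump = fit_stump([X[i] for i in bag], [y[i] for i in bag],
                          rng.sample(range(p), mtry))
        if stump is None:
            continue
        decrease, f, t, left_label, right_label = stump
        importance[f] += decrease
        in_bag = set(bag)
        for i in range(n):
            if i not in in_bag:  # only out-of-bag samples get a vote
                votes[i][left_label if X[i][f] <= t else right_label] += 1
    scored = [(v.most_common(1)[0][0], y[i]) for i, v in enumerate(votes) if v]
    oob_error = sum(pred != true for pred, true in scored) / len(scored) if scored else 1.0
    return oob_error, importance


def grid_search(X, y, tree_grid, mtry_grid):
    """Return the (n_trees, mtry) pair with the lowest OOB error and its results."""
    results = {}
    for n_trees, mtry in product(tree_grid, mtry_grid):
        results[(n_trees, mtry)] = forest_oob(X, y, n_trees, mtry)
    best = min(results, key=lambda k: results[k][0])
    return best, results[best]


# Synthetic stand-in data (an assumption): 60 samples, 5 variables,
# with the label driven mainly by variable 0.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(5)] for _ in range(60)]
y = [1 if row[0] + 0.3 * row[1] > 0.65 else 0 for row in X]

best_params, (best_error, importance) = grid_search(X, y, [25, 50], [1, 2, 3])
top3 = sorted(range(5), key=lambda f: importance[f], reverse=True)[:3]
print("best (n_trees, mtry):", best_params)
print("OOB error:", round(best_error, 3))
print("three most important variables:", top3)
```

In this sketch the OOB error needs no separate test set, because each sample is scored only by the trees whose bootstrap sample did not contain it. That property is what makes OOB error attractive when the training database is small.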


Software Engineering Education, assessment, machine learning, random forest

