Recognition of Protein Binding Sites Using Support Vector Machines


Gurgen Tumanyan

Oral Defence Date: 



HH 301


Professors Petkovic, Okada, and Buturovic (adjunct)


Detection of functional sites in proteins is an important problem in computational biology with wide implications for computational drug discovery. Prior work in the area includes prediction of protein function based on protein structure, sequence similarity, and molecular dynamics simulations. FEATURE, developed by the Stanford Helix Group, is a system for predicting protein function from a set of structural and biophysical properties of functional microenvironments; it uses Naïve Bayesian classification to recognize functional sites based on the properties of known protein binding sites. In this work we explore the application of novel recognition methods to the FEATURE system with the goal of increasing its accuracy in predicting protein binding sites, and especially its specificity. The challenges encountered in this recognition problem include highly imbalanced training data, in which the number of positive samples (i.e. protein binding sites) is about two orders of magnitude smaller than the number of negative samples. The Support Vector Machine (SVM) was selected because of its tolerance of high-dimensional, low-sample-size problems. We analyze the performance of the SVM learning algorithm and compare it with the Naïve Bayesian algorithm currently used in FEATURE, using identical accuracy measures, data, and experimental methodology. Our results indicate that, for the examined functional families, SVM classification is advantageous for prediction of functional sites. We also established that the SVM approach is capable of identifying functional sites that both Naïve Bayesian classification and sequence-based methods misclassify, and that for Naïve Bayesian classification neither homology filtering nor the selection of the prior has an effect on classification accuracy. The improved classification mechanisms allow higher confidence in pre-screening of functional protein candidates, e.g. fewer false positives for the same level of recognition accuracy, which will save time and effort in in-vitro verification of the binding functions. The improved software will be made available to the public and benefit the broader research community. The work is done in collaboration with the Stanford Helix Group.
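To illustrate why specificity matters so much when negatives outnumber positives by roughly two orders of magnitude, the following minimal Python sketch computes sensitivity, specificity, and precision from a confusion matrix. All counts and the two classifiers ("A" and "B") are hypothetical numbers chosen for illustration only; they are not results from this work, and the helper names are not part of FEATURE.

```python
def make_predictions(n_pos, n_neg, hits, false_alarms):
    """Build hypothetical labels (1 = binding site) and predictions
    with a given number of detected sites and false alarms."""
    labels = [1] * n_pos + [0] * n_neg
    preds = ([1] * hits + [0] * (n_pos - hits)
             + [1] * false_alarms + [0] * (n_neg - false_alarms))
    return labels, preds


def confusion_counts(labels, preds):
    """Return (tp, fp, tn, fn) for binary labels and predictions."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, preds):
        if y == 1 and p == 1:
            tp += 1
        elif y == 0 and p == 1:
            fp += 1
        elif y == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of true sites that are recognized."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-sites that are correctly rejected."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Fraction of predicted sites that are real sites."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical data set: 100 sites versus 10,000 non-sites (about 1:100).
# Both hypothetical classifiers find 90 of the 100 sites; they differ
# only in how many non-sites they wrongly flag.
for name, false_alarms in (("A", 200), ("B", 50)):
    labels, preds = make_predictions(100, 10_000, 90, false_alarms)
    tp, fp, tn, fn = confusion_counts(labels, preds)
    print(name,
          f"sensitivity={sensitivity(tp, fn):.3f}",
          f"specificity={specificity(tn, fp):.3f}",
          f"precision={precision(tp, fp):.3f}",
          f"false_positives={fp}")
```

In this made-up setting, raising specificity from 0.980 to 0.995 at the same sensitivity cuts the number of candidates that would need in-vitro verification from 290 to 140, which is the kind of saving the abstract refers to.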


Protein binding sites, Support Vector Machines, Machine Learning

