Population Assignment Problem Utilizing Support Vector Machine Algorithm and Domain-Specific Distance Measures
Oral Defence Date:
Tuesday, September 28, 2010 - 13:00
Professors D. Petkovic, Sarah Cohen (Biology), & Ljubomir Buturovic (adjunct)
The goal of this project is to introduce domain-specific (evolutionary) distances into the Support Vector Machine (SVM) based algorithm for associating Single Nucleotide Polymorphisms (SNPs) with phenotype. In the SVM approach, the strength of association is measured by how accurately individuals are classified into populations using the SNP data. The SVM algorithm uses pairwise distances between input sequences to classify individuals into populations (groups). The project aims to optimize these pairwise distances in order to obtain a better association of SNPs with phenotype, and therefore more accurate assignment of an individual to a group. The optimization is achieved by deriving the best-fitting model of evolution from the genomic data, which yields optimized evolutionary distances between the nucleotide sequences. Unlike the original decimal distance measure, the evolutionary distance measure takes into account the importance of polymorphisms between the nucleotide sequences. The ultimate objective of the project is to evaluate the effect of the evolutionary distance measure on the performance of PhenoSVM, that is, whether it results in better classification of individuals into populations. The project is implemented in two steps: phylogenetic data analysis tools are used to obtain the best-fitting model of evolution and the corresponding distances for the population assignment problem, and these distances are then used in the PhenoSVM algorithm. The final results were inconclusive: for some test cases, model-based sequence encoding yielded better classification accuracy, while for others decimal-based sequence encoding performed better, motivating further research in this area.
Genotype-phenotype association, Single Nucleotide Polymorphism, Support Vector Machine algorithm, phylogenetics
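To illustrate the difference between a plain mismatch count and a model-based evolutionary distance, the following is a minimal sketch only. It assumes the Kimura two-parameter (K2P) model purely as an example; the abstract does not state which model of evolution was selected, and the function names, the Gaussian kernel form, and the `gamma` parameter are hypothetical and are not taken from PhenoSVM.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def p_distance(a, b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def k2p_distance(a, b):
    """Kimura two-parameter distance (illustrative model choice).

    Transitions (A<->G, C<->T) and transversions are counted separately,
    so two pairs with the same mismatch count can get different distances.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(a)
    q = transversions / len(a)
    t1 = 1 - 2 * p - q
    t2 = 1 - 2 * q
    if t1 <= 0 or t2 <= 0:
        return math.inf  # sequences too divergent for the model
    return -0.5 * math.log(t1) - 0.25 * math.log(t2)


def kernel_matrix(seqs, dist, gamma=1.0):
    """Turn pairwise distances into similarities via exp(-gamma * d^2).

    Assumption: this Gaussian form is one common choice; it is not
    guaranteed to be positive semi-definite for arbitrary distances.
    """
    return [[math.exp(-gamma * dist(s, t) ** 2) for t in seqs] for s in seqs]


if __name__ == "__main__":
    seqs = ["AAAA", "AAAG", "AAAC"]  # toy sequences, not real SNP data
    for row in kernel_matrix(seqs, k2p_distance):
        print(["%.3f" % v for v in row])
```

A matrix of this kind could be handed to an SVM implementation that accepts precomputed kernels (for example, scikit-learn's `SVC(kernel="precomputed")`), which is one way to compare distance measures while keeping the classifier fixed.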