Validation of Support Vector Machine Techniques for Assessment of Genotype-Phenotype Association


Danielle Gilbert

Oral Defence Date: 



TH 935


Professors Petkovic, Buturovic, and Cohen


The goal of this project was to validate a releasable software package that measures the strength of association between DNA sequence information and a physical trait. The software package, PhenoSVM, uses the machine learning technique of support vector machines (SVMs) to classify input sequences into two or more populations, then uses its classification error rate to measure genotype-phenotype association. We first generated two types of synthetic data and used them to analyze the PhenoSVM output, both to determine the accuracy of prediction and to validate the SVM-based method for population assignment. We then developed a procedure based on backward selection methods to identify the optimal subset of differences in the DNA sequences (in this case, single nucleotide polymorphisms, or SNPs) associated with each phenotype. Finally, we used this backward selection procedure to confirm our hypothesis that the probability of correct population assignment reflects the strength of association between a given set of SNPs and a phenotype. PhenoSVM is thus validated as a useful software tool for the analysis of biological data.
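The backward selection idea summarized above can be illustrated with a minimal sketch. It is not PhenoSVM's implementation. To stay dependency-free, it swaps in a nearest-centroid classifier scored by leave-one-out accuracy where PhenoSVM would use an SVM. The function names (`loo_accuracy`, `backward_select`, `make_synthetic`), the 0/1/2 genotype coding, and the allele frequencies are all assumptions made for illustration.

```python
import random


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier (a stand-in
    for the SVM, assumed here for simplicity) on the chosen SNP columns."""
    if not features:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Build per-class centroids from every sample except sample i.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        # Assign sample i to the closest centroid (squared Euclidean distance).
        def dist(label):
            return sum((X[i][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        predicted = min(sorted(sums), key=dist)
        correct += predicted == y[i]
    return correct / len(X)


def backward_select(X, y, score=loo_accuracy):
    """Greedy backward elimination: repeatedly drop the SNP whose removal
    leaves the highest accuracy, and remember the best subset seen."""
    current = list(range(len(X[0])))
    best_subset, best_score = current[:], score(X, y, current)
    while len(current) > 1:
        # Score every subset obtained by removing exactly one SNP.
        candidates = [(score(X, y, [f for f in current if f != drop]), drop)
                      for drop in current]
        top_score, drop = max(candidates)
        current.remove(drop)
        if top_score >= best_score:  # on ties, prefer the smaller subset
            best_subset, best_score = current[:], top_score
    return best_subset, best_score


def make_synthetic(n_per_class=30, n_snps=6, informative=(0, 1), seed=0):
    """Hypothetical synthetic data: genotypes coded as minor-allele counts
    (0, 1, 2). Informative SNPs differ in allele frequency between the two
    populations (0.1 vs 0.9, assumed); the rest are noise at 0.5."""
    rng = random.Random(seed)
    X, y = [], []
    for label, freq in ((0, 0.1), (1, 0.9)):
        for _ in range(n_per_class):
            row = []
            for s in range(n_snps):
                p = freq if s in informative else 0.5
                row.append((rng.random() < p) + (rng.random() < p))
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    subset, accuracy = backward_select(X, y)
    print("selected SNP columns:", subset, "accuracy:", accuracy)
```

In this sketch the accuracy of the retained subset plays the role that the probability of correct population assignment plays in the abstract: a subset of SNPs that separates the populations well yields a high score, while noise SNPs alone score near chance.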


genotype, phenotype, single nucleotide polymorphism, SNP, machine learning, support vector machine, PhenoSVM, validation, backward selection


Danielle Gilbert