Highly Loss-Sensitive Classification of Protein Crystallization Images


Oliver Newland

Oral Defence Date: 



TH 434


Professors Kazunori Okada, William Hsu and Barry Levine


Protein crystallization is a technique used by medical and biochemical researchers to study the internal structures of proteins. It involves mixing known proteins with formulated chemical cocktails and using the mixtures that form crystals to visually inspect the internal structure. A problem crystallographers face with this technique is that most formulations fail, and the conditions under which a formulation succeeds are largely unknown. To compensate, crystallographers often run massive trials in parallel, automatically mixing cocktails and photographing the resulting solutions. Most laboratories still identify crystals by manual inspection. To obtain high-accuracy, automatic classification of crystalline images in protein crystallization trials, we propose a machine-learning-based approach. The approach uses a large number of bag-of-words features of multiple types. Our experiments were performed on a labelled data set provided by Genentech. We evaluated the efficacy of multiple feature designs and classification methods. Through our experimentation, we find that over-fitting is a persistent and serious risk given this data set and our large feature dimensionality. Testing multiple methods, we demonstrate that a balanced approach for minimizing "precipitate" and "crystal" loss is realized by feature selection using Elastic Net followed by classification using Support Vector Machines.
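
The two-stage approach named above, Elastic Net feature selection followed by Support Vector Machine classification, can be illustrated with a minimal sketch. This is not the implementation used in the thesis. It assumes scikit-learn and uses synthetic stand-in data in place of the bag-of-words image features. The binary "crystal" versus "not crystal" labels, the regularization settings (alpha, l1_ratio), the linear kernel and the balanced class weights are all assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for high-dimensional image features (assumption):
# 200 samples, 500 features, only the first 5 carry any signal.
rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 200, 500, 5
X = rng.standard_normal((n_samples, n_features))
# Hypothetical binary label: 1 = "crystal", 0 = "not crystal".
y = (X[:, :n_informative].sum(axis=1) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stage 1: Elastic Net regression on the 0/1 labels; features whose
# coefficient is shrunk to (near) zero are discarded.
selector = SelectFromModel(
    ElasticNet(alpha=0.1, l1_ratio=0.5),  # illustrative values, not tuned
    threshold=1e-5,
)

# Stage 2: an SVM trained only on the surviving features.
# The linear kernel and balanced class weights are assumptions.
model = make_pipeline(
    StandardScaler(),
    selector,
    SVC(kernel="linear", class_weight="balanced"),
)
model.fit(X_train, y_train)

n_selected = int(model.named_steps["selectfrommodel"].get_support().sum())
pred = model.predict(X_test)
print(f"kept {n_selected} of {n_features} features")
print(f"held-out accuracy: {(pred == y_test).mean():.2f}")
```

Placing the selector inside the pipeline means it is fitted on the training split only, which matters when over-fitting is the main risk: selecting features on the full data set before splitting would leak information from the held-out images into the model.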


biomedical imaging, classification, machine learning, feature selection

