Prediction in Biomolecular Systems as a Classification or Linear Regression Problem
In a classification problem, objects are assigned to the same set or to different sets (classes) depending on their similarity or dissimilarity. In a regression problem, specific values of an observable are assigned to objects. If the true class or the true observable value of an object is unknown, the assignment is a prediction. Numerous devices exist for such assignment tasks. Before a device can be used for prediction, it must be trained on a learning set of objects whose classes or observable values are known. During training, the parameters of the prediction device are optimized. Checking that the device has learned its lesson on the training set is not prediction but recall. If the training set is small and the number of parameters is large, the device may be perfect at recall but poor at prediction, because it has learned irrelevant details of the training set. Avoiding this learning-by-heart phenomenon (also called overfitting) requires special precautions.
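The difference between recall and prediction can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the "device" is a plain minimum-norm least-squares fit, and names such as `make_data`, `recall_mse`, and `prediction_mse` are hypothetical. With 10 training objects and 50 parameters, the fit reproduces the training set almost exactly, yet it generalizes poorly to unseen objects.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 10, 200, 50  # few objects, many parameters


def make_data(n):
    """Synthetic objects: only the first feature carries signal (assumption)."""
    X = rng.normal(size=(n, n_features))
    y = X[:, 0] + 0.1 * rng.normal(size=n)
    return X, y


X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Unregularized fit: with more parameters than training objects, lstsq
# returns the minimum-norm solution that interpolates the training data.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Recall: error on the objects used for training.
recall_mse = np.mean((X_train @ w - y_train) ** 2)
# Prediction: error on objects the device has never seen.
prediction_mse = np.mean((X_test @ w - y_test) ** 2)

print(f"recall MSE:     {recall_mse:.3e}")
print(f"prediction MSE: {prediction_mse:.3e}")
```

The gap between the two printed errors is the learning-by-heart effect; a regularization term in the objective function is one common way to reduce it.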
In our case, the objects are molecules, which are characterized by features. The features carry information about the atomic composition, architecture, and physicochemical properties of the molecule. As the prediction device, a scoring function is introduced that connects features with parameters. It is linear in the parameters but can be nonlinear in the features. The parameters are determined by minimizing an objective function, which contains L1 and L2 regularization terms for feature selection and for the control of overfitting. Applications are presented for molecular databases designed for blind prediction tests (CoEPrA, BI-Kaggle), peptides binding to the major histocompatibility complex, the volume of distribution and clearance of drugs in humans, and secondary structure prediction of proteins. The last case is a classification problem with three or more classes, which poses an additional challenge.
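A scoring function of this kind can be sketched as follows. This is not the speaker's implementation; it is a generic illustration under stated assumptions. The score is s(x) = Σ_k w_k φ_k(x), with the hypothetical nonlinear feature map φ(x) = (x, x²). The assumed objective is F(w) = (1/2n)‖Φw − y‖² + λ₁‖w‖₁ + (λ₂/2)‖w‖², minimized here by proximal gradient descent (ISTA). The synthetic data, the values `lam1` and `lam2`, and all function names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_raw = 500, 5

# Synthetic "molecules": 5 raw features, of which only two matter (assumption).
X = rng.normal(size=(n, n_raw))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)


def expand(X):
    """Nonlinear feature map phi(x) = (x, x^2); the score stays linear in w."""
    return np.hstack([X, X ** 2])


Phi = expand(X)
Phi = Phi - Phi.mean(axis=0)  # centering removes the need for an intercept
y = y - y.mean()

lam1, lam2 = 0.1, 0.01  # L1 and L2 regularization strengths (illustrative)


def objective(w):
    resid = Phi @ w - y
    return (resid @ resid) / (2 * n) + lam1 * np.abs(w).sum() + 0.5 * lam2 * (w @ w)


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1; sets small entries exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


# Step size 1/L, where L is the Lipschitz constant of the smooth part.
L = np.linalg.norm(Phi, 2) ** 2 / n + lam2
step = 1.0 / L

w = np.zeros(Phi.shape[1])
for _ in range(2000):
    grad = Phi.T @ (Phi @ w - y) / n + lam2 * w  # gradient of the smooth part
    w = soft_threshold(w - step * grad, step * lam1)

selected = np.flatnonzero(w)
print("weights:", np.round(w, 3))
print("selected feature indices:", selected)
```

In this sketch the L1 term drives the weights of uninformative features to exactly zero (feature selection), while the L2 term shrinks the remaining weights and thereby limits overfitting. The indices in `selected` refer to the expanded features: 0 to 4 are the linear terms and 5 to 9 the squared terms.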
Ernst-Walter Knapp has been professor of macromolecular modelling at the Institute of Chemistry and Biochemistry of the Freie Universität Berlin since 1991. He received his PhD in physics in 1976. He worked as a postdoc with Prof. Diestler at Purdue University, the late Prof. Schulten at the Max-Planck Institute of Biophysical Chemistry in Göttingen, and Prof. Fischer at the Physics Department of the Technical University of Munich, where he habilitated in Theoretical Physics in 1985 and subsequently became a Heisenberg fellow. His early scientific work dealt with theories of electron-atom and atom-molecule scattering, vibrational spectra of molecules in the condensed phase, Mössbauer spectra, protein dynamics, and protein folding. His current scientific interests are quantum chemistry and electrostatics of transition metal complexes, protein electrostatics, protein-protein docking, protein structure analysis, and drug classification and design. He was director of the Institute of Chemistry and Biochemistry of the Freie Universität Berlin. He has published more than 180 papers in peer-reviewed scientific journals.