FEATURE-Words: An Information Retrieval System to Identify and Investigate Protein Functional Classes
Wei Wei
Oral Defence Date: Wednesday, August 6, 2014 - 14:00
Location:
Profs. Kulkarni, Petkovic, and Mike Wong
WebFEATURE is a web application that enables users to scan protein structures, using machine learning models of protein function, to locate potential active sites. One challenge for users of WebFEATURE is that the models of protein function are cryptically named; as a result, users are often unable to determine which models to select for predicting active sites. To address this problem we propose an information retrieval system, FEATURE-Words, that accepts plain English queries from the user and leverages PubMed literature to identify the protein function models that are relevant to the query. The FEATURE-Words approach consists of two phases: data compilation and query evaluation. For each of the 20 protein functional classes investigated in this work, FEATURE-Words compiles a set of representative documents during the pre-processing phase of data compilation. Four protein databases (PROSITE (function), SwissProt (curated annotation), Protein Data Bank (structure), and PubMed (literature)) are joined to obtain the titles and abstracts of the PubMed articles that are related to each of the protein functional classes of interest. During the query evaluation phase, FEATURE-Words employs one of two document ranking algorithms, Latent Dirichlet Allocation (LDA) or Indri, to obtain a relevance-based ranking of documents for the user query. The protein functional classes that are associated with the top n documents in this ranking are inferred to be relevant to the user query. The value of n is dynamically estimated for each query based on the distribution of the documents' relevance scores. To evaluate the performance of FEATURE-Words, 25 queries were manually selected and the ground truth was collected from the PROSITE and Pfam descriptions. The best-performing FEATURE-Words configuration provides an F1 score of 0.64 (Precision: 0.57 and Recall: 0.83).
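The query evaluation step described above can be illustrated with a minimal sketch. The abstract does not state how n is estimated, so the largest-gap cutoff used here is an assumed heuristic, not the method used in FEATURE-Words. The function names, the one-class-per-document mapping, and the example identifiers are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple


def estimate_top_n(scores: List[float]) -> int:
    """Choose n at the largest drop between consecutive sorted scores.

    ASSUMPTION: this largest-gap rule is an illustrative stand-in; the
    source only says n depends on the relevance score distribution.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Documents ranked before the largest gap form the top n.
    return gaps.index(max(gaps)) + 1


def infer_classes(ranking: List[Tuple[str, float]],
                  doc_to_class: Dict[str, str]) -> Set[str]:
    """Return the functional classes linked to the top-n ranked documents.

    ASSUMPTION: each document maps to exactly one functional class.
    """
    ranked = sorted(ranking, key=lambda pair: pair[1], reverse=True)
    n = estimate_top_n([score for _, score in ranked])
    return {doc_to_class[doc_id] for doc_id, _ in ranked[:n]}


def precision_recall_f1(predicted: Set[str],
                        truth: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for a single query."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical document ids, scores, and class names.
    ranking = [("d1", 0.90), ("d2", 0.85), ("d3", 0.40), ("d4", 0.35)]
    doc_to_class = {"d1": "class_A", "d2": "class_B",
                    "d3": "class_A", "d4": "class_C"}
    predicted = infer_classes(ranking, doc_to_class)
    print(predicted, precision_recall_f1(predicted, {"class_A"}))
```

The relevance scores themselves would come from LDA or Indri; the sketch only covers what happens after a ranking is available. If scores are averaged over queries, the reported F1 need not equal the harmonic mean of the averaged precision and recall.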
A detailed error analysis reveals two recurring sources of error: incomplete ground truth and incorrect selection of documents during data compilation. The first could potentially be addressed by incorporating more external knowledge sources, and the second by filtering out documents that are only tangentially related to the protein function. Both of these research problems are promising directions for future work on the FEATURE-Words project.