Improving Performance of Queries for Chado Database Schema


Prachi Gujar

Oral Defence Date: 

Wednesday, November 19, 2008 - 15:10


SCI 241


3:10 PM


Professors Marguerite Murphy, Dragutin Petkovic, C. Smith


The use of advanced biological databases has increased greatly over the past few years due to an exponential growth in biological data collected. Biological data is highly complex and the design of the underlying database schema is equally complex. It is necessary to represent heterogeneous data correctly and to ensure that none of the data is lost during storage and processing. The Chado Schema is a widely used relational database schema that handles biological knowledge for a variety of organisms, from humans to pathogens. This information includes genome sequences, phenotype and genotype data. The Generic Organism Model Database (GMOD) standard Chado Schema has 133 relations with more than 200 relationships. Moreover, some of the attribute names across different tables are identical. This complexity makes it difficult for the user to accurately query and retrieve data to and from a Chado Database. A simple query strategy is to join all of the tables, and then select the subset of data that is of interest. This is not an optimal strategy, as even a simple query will require joining all 133 tables. The Query Generating System presented in this report was designed and developed at San Francisco State University to automate the process of efficient query generation for select-project-join queries. This system allows the user to select the query output attributes and selection criteria, then a Chado query with a minimum number of joins is generated. Internally, our program represents the full Chado Schema as a graph, and uses an unweighed shortest path algorithm to find the minimal group of intermediate relations that are required to be joined in order to accurately query the database. The system we designed is completely operational and accurately automates the process of optimal query generation for Chado Databases.

Prachi Gujar

Query Generation System, Unweighed Shortest Path Algorithm, Chado Schema