CE-09.09

Title: 

Entering and Configuring Complex Queries for Chado

Author(s): 

Vaibhav Khedekar

Oral Defence Date: 

07/27/2009

Location: 

SCI 241

Committee: 

Professor Murphy, Professor Petkovic,and Assistant Prof. Chris Smith (Biology)

Abstract: 

Collection of biological data has increased exponentially over the past few years. This has led to the need for more advanced biological databases to diverse data types including sequence, video, literature, etc. Every year, new modules (consisting of schema elements and stored procedures) are being added to existing large biological databases, such as Chado [1]. This not only increases the amount of data in the database, but also increases the complexity of the schema. Biological data is increasingly quantitative, requiring more advanced databases and data modeling. Biological databases contain complex data types that vary from species to species. Researchers who understand the biology rarely understand how the database is made and so they are unable to effectively query the database. This project aims to create simpler tools for naïve users to query more effectively. In order for biologists to retrieve meaningful data from such large and complex databases, efficient query generation techniques need to be developed. In this report, we present extensions to a simple, yet efficient, query generation system. The original version of this tool [2] allows the user to build simple ‘Select-Project-Join’ (SPJ) queries, where the JOIN between multiple tables is made efficient by using a shortest path algorithm to compute the minimum number of joins over primary key-foreign key pairs in the underlying schema. In the work presented here, we extend this tool to allow the user to create SQL (‘pSQL’ in case of Chado) queries having more complex constructs, including aggregate functions, group by and order by clauses, and to manually edit the queries that are automatically generated. Multi-query scripts are supported and can be embedded in Perl, the scripting language that is most familiar to biologists. Query results are presented in a tabular format (as an HTML file), for the convenience of the user. Our system is fully functional and correctly executes over any well-formed SQL schema; the full Chado schema is used for demonstration purposes in this report.

Keywords: 

Complex Chado Queries, Chado Schema, Aggregate Queries, GMOD

Copyright: 

Vaibhav Khedekar