| Abstract: |
Collection of biological data has increased exponentially over the past few years. This has led to
the need for more advanced biological databases to diverse data types including sequence, video,
literature, etc. Every year, new modules (consisting of schema elements and stored procedures) are
being added to existing large biological databases, such as Chado [1]. This not only increases the
amount of data in the database, but also increases the complexity of the schema. Biological data is
increasingly quantitative, requiring more advanced databases and data modeling. Biological databases
contain complex data types that vary from species to species. Researchers who understand the biology
rarely understand how the database is made and so they are unable to effectively query the database.
This project aims to create simpler tools for naïve users to query more effectively. In order for
biologists to retrieve meaningful data from such large and complex databases, efficient query
generation techniques need to be developed. In this report, we present extensions to a simple, yet
efficient, query generation system. The original version of this tool [2] allows the user to build
simple ‘Select-Project-Join’ (SPJ) queries, where the JOIN between multiple tables is made efficient
by using a shortest path algorithm to compute the minimum number of joins over primary key-foreign
key pairs in the underlying schema. In the work presented here, we extend this tool to allow the
user to create SQL (‘pSQL’ in case of Chado) queries having more complex constructs, including
aggregate functions, group by and order by clauses, and to manually edit the queries that are
automatically generated. Multi-query scripts are supported and can be embedded in Perl, the
scripting language that is most familiar to biologists. Query results are presented in a tabular
format (as an HTML file), for the convenience of the user. Our system is fully functional and
correctly executes over any well-formed SQL schema; the full Chado schema is used for demonstration
purposes in this report.
|