Barbara Eckman, IBM Life Sciences
Biological research poses significant data integration and management challenges. Identifying and characterizing regions of functional interest in genomic sequence requires full, flexible query access to an integrated, up-to-date view of all related information, irrespective of where it is stored (within an organization or across the Internet) and of its format (traditional database, semi-structured text file, web site, results of runtime analysis). Wide-ranging multi-source queries often return unmanageably large result sets, requiring non-traditional approaches to exclude extraneous data. As high-throughput biology generates large volumes of data about the "parts list" of living organisms, there is a growing need for robust, efficient systems to manage metabolic and signaling pathways, gene regulatory networks, protein interaction networks, and a variety of annotations on the network components. Pathway data are often best represented as graphs, and researchers need to navigate, query, and manipulate these data in ways that may not be well supported by standard relational database tools.
The increasing size and complexity of biological data repositories, together with the requirement for efficient, robust data management, make extending a mature data management technology a compelling choice. This talk presents three approaches for meeting these challenges through extensions to DB2, IBM's relational database management system. 1) IBM's DB2 Information Integrator is a federated database middleware system that provides SQL access to data from multiple heterogeneous sources. Sources accessed include both relational databases and non-relational Life Science data sources: web sites like NCBI's GenBank and PubMed, XML documents, BioRS databanks, text documents, Excel spreadsheets, Documentum docbases, and results of runtime analyses such as BLAST, HMMER, and GeneWise. 2) IBM offers a suite of User-Defined Functions that "build science into DB2" by providing basic sequence manipulations, generalized pattern-matching, and other bioinformatics-specific operations from within a SQL query. 3) A project is currently in development in IBM Research to extend DB2 with graph objects and operations to support data management in systems biology. Like the User-Defined Functions, these graph operations, when used in combination with DB2 Information Integrator (the successor to DiscoveryLink), may be applied to data stored in any format, whether remote or local, relational or non-relational.
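As a rough illustration of the second approach, the sketch below registers a pattern-matching function and calls it from inside a SQL query. It uses Python's built-in sqlite3 module purely as a stand-in, not DB2 or IBM's actual User-Defined Functions; the function name motif_match, the table proteins, and the toy sequences are all hypothetical. The motif syntax is assumed to be a simple PROSITE-like form in which a lowercase x stands for any residue.

```python
import re
import sqlite3


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-like motif into a regular expression.

    Assumption: lowercase 'x' means "any residue" and bracketed classes
    such as [GA] are already valid regex character classes.
    """
    return pattern.replace("x", ".")


def motif_match(sequence, pattern):
    """Return 1 if the motif occurs anywhere in the sequence, else 0."""
    if sequence is None or pattern is None:
        return None
    return 1 if re.search(prosite_to_regex(pattern), sequence) else 0


# In-memory SQLite database standing in for a relational source.
conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("motif_match", 2, motif_match)

conn.execute("CREATE TABLE proteins (name TEXT, seq TEXT)")
conn.executemany(
    "INSERT INTO proteins VALUES (?, ?)",
    [
        ("toy_with_ploop", "MKTLGPPSGKSTLAR"),   # made-up sequence with the motif
        ("toy_without", "MKTLLLPPAAVVTLAR"),     # made-up sequence without it
    ],
)

# The pattern test runs inside the query, so only matching rows are returned.
hits = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM proteins WHERE motif_match(seq, ?) = 1 ORDER BY name",
        ("[GA]xxxGK[ST]",),
    )
]
print(hits)
```

The point of the design is that filtering happens in the WHERE clause, so extraneous rows never leave the database engine; a production function would also need to handle the fuller motif syntax.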
The usefulness of these approaches is demonstrated by real-world examples from basic science and pharmaceutical research. Examples include: "Return all human EST sequences that are >60% identical over 50 AA to mouse channel genes expressed in central nervous system tissue." "Using CED4, the C. elegans regulator of cell death, as the BLAST query sequence against the non-redundant protein database nr, return only alignments in which the subject sequence includes the P-loop ATPase domain [GA]xxxGK[ST]." [Zhang et al. (1998). Nucl. Acids Res., 26, 3986-3990]. "Predict protein-protein interactions and pathways in an organism of interest based on relationships of orthologous proteins in other organisms." "Find all proteins related to protein A (i.e. within a given path length of A) within a protein interaction graph." "Find all proteins connected by paths of at most a certain length to proteins known to have a certain biological function."
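The path-length queries above can be sketched with a breadth-first search. This is a minimal illustration under stated assumptions, not the graph extension under development at IBM Research: the interaction graph is taken to be undirected and unweighted, and the function name within_path_length and the toy edge list are hypothetical.

```python
from collections import deque


def within_path_length(graph, start, max_len):
    """Return {protein: hop count} for every protein reachable from
    `start` in at most `max_len` edges (the start protein is excluded)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_len:
            continue  # do not expand beyond the allowed path length
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:  # first visit gives the shortest distance
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    del dist[start]
    return dist


# Toy undirected protein interaction graph (made-up edges).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

# Proteins within two interaction steps of protein A.
print(sorted(within_path_length(graph, "A", 2)))
```

The last query, which starts from all proteins with a given function, could reuse the same search by seeding the queue with every annotated protein instead of a single start node.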