Interval Overlap Tool for Genome Data Stored in Chado DB
Oral Defence Date:
Professors Murphy, Petkovic (CS) & C. Smith (Biology)
In principle, genomes contain all of the biological information required to build a living instance of an organism. DNA sequence scaffolds are generated during the genome sequencing process and can contain multiple genes as well as other features. These features may overlap with each other. This project develops a software tool, the Interval Overlap Tool, that eliminates the need to write hand-crafted scripts to identify intervals stored in a GFF3 file that overlap features of a specific type stored in a Chado-formatted database. For example, a user can search for all genes stored in a database that overlap by at least 50% with a set of intervals stored in an external GFF3-formatted file. The Interval Overlap Tool is built on two core algorithms. Algorithm I takes the GFF3-formatted file as input and generates a set of output features by querying a Chado-formatted database instance. Algorithm II calculates the overlap percentage from the output of Algorithm I and writes the output files. The author's contributions to this project are: an analysis of the different kinds of overlap that can occur between two features; the design and development of an analysis technique that covers all possible cases; the implementation of both algorithms; and the release of the Interval Overlap Tool as an open-source distribution. The author validated the correctness of the tool's output using test data and manual analysis of the results. Computed results can be exported in two standard file formats, GFF3 and FASTA.
Features, Genome, GFF3, FASTA, Interval, Overlap
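
The overlap percentage that Algorithm II computes can be illustrated with a minimal sketch. This is not the tool's actual code. It assumes that both features use GFF3-style coordinates (1-based, inclusive start and end) and that the percentage is measured relative to the length of the database feature (for example, the gene); the abstract does not state which length is used as the denominator. The names Interval, overlap_length and overlap_percent are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A feature location in GFF3-style coordinates (assumed 1-based, inclusive)."""
    seqid: str   # scaffold or chromosome identifier
    start: int   # first base of the feature
    end: int     # last base of the feature


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by a and b; 0 if they do not overlap."""
    if a.seqid != b.seqid:
        return 0  # features on different scaffolds never overlap
    # The shared region runs from the larger start to the smaller end.
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def overlap_percent(feature: Interval, query: Interval) -> float:
    """Percentage of `feature` covered by `query` (assumed denominator)."""
    feature_length = feature.end - feature.start + 1
    return 100.0 * overlap_length(feature, query) / feature_length


# Hypothetical data: one gene and three query intervals from a GFF3 file.
gene = Interval("scaffold_1", 100, 199)        # 100 bases long
partial = Interval("scaffold_1", 150, 300)     # covers the second half of the gene
contained = Interval("scaffold_1", 120, 129)   # lies entirely inside the gene
adjacent = Interval("scaffold_1", 200, 250)    # starts right after the gene ends

for query in (partial, contained, adjacent):
    pct = overlap_percent(gene, query)
    print(query, f"{pct:.1f}%", "keep" if pct >= 50.0 else "skip")
```

The single min/max expression handles partial overlap, containment in either direction, and disjoint or adjacent features without separate branches, which is one way to cover all of the possible cases between two features.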