Software Pipeline for Discovering Intron Polymorphisms from Samples of Closely Related Organisms


Robert Theis

Oral Defence Date: 



TH 434


Professors Marguerite Murphy, Dragutin Petkovic, and Scott Roy (Biology)


Introns are the portions of genomic DNA sequence that are removed from RNA transcripts by splicing during RNA processing. Specific introns have been shown to have diverse functions; however, the evolutionary origins and general function of introns remain mysterious. A software pipeline has been created that identifies insertions and deletions of intron sequences, given a reference genome and paired-end sequencing reads sampled from closely related organisms. To locate intron insertion and deletion sites, the pipeline identifies mate pairs in which one of the two mates fails to align to the reference genome, as is expected for reads that overlap an insertion or deletion. The unmapped mates are then assembled into contigs, and the contigs are aligned to the reference genome to identify insertion/deletion regions. The data processing stages of the pipeline are implemented as subroutines in a Perl module. A graphical user interface has been developed for running the pipeline within Galaxy, a web-browser-based bioinformatics platform, which enables data sharing and workflow reproducibility. The contributions of this work are the implementation of the Perl module and associated scripts, the creation of a Galaxy tool definition file, and the running of the pipeline on a data sample from Micromonas.
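
The mate-pair selection step described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Perl module itself. It assumes the alignments are available as SAM-format text lines, which the abstract does not state, and the function name `unmapped_mates_with_mapped_partner` is hypothetical. The flag bits used (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) are those defined by the SAM format.

```python
# Illustrative sketch only; not the Perl module described in the abstract.
# Assumption: alignments are supplied as SAM text lines (the abstract does
# not name the aligner or the file format).

FLAG_PAIRED = 0x1         # read is part of a pair
FLAG_UNMAPPED = 0x4       # this read did not align to the reference
FLAG_MATE_UNMAPPED = 0x8  # the read's mate did not align to the reference


def unmapped_mates_with_mapped_partner(sam_lines):
    """Yield (read_name, sequence) for unmapped reads whose mate aligned."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if not (flag & FLAG_PAIRED):
            continue
        # Keep the read only if it is unmapped while its mate is mapped.
        if (flag & FLAG_UNMAPPED) and not (flag & FLAG_MATE_UNMAPPED):
            yield name, seq


def sam_record(name, flag, seq):
    """Build a minimal 11-column SAM line with placeholder values (toy data)."""
    cols = [name, str(flag), "chr1", "100", "0", "*", "=", "100", "0", seq, "*"]
    return "\t".join(cols)


# Toy input: pair1 has one unmapped mate, pair2 is fully mapped,
# pair3 is fully unmapped.
example = [
    "@HD\tVN:1.6",
    sam_record("pair1", 73, "ACGTACGT"),   # mapped, mate unmapped
    sam_record("pair1", 133, "TTGACCAA"),  # unmapped, mate mapped
    sam_record("pair2", 99, "GGGGCCCC"),   # both mates mapped
    sam_record("pair2", 147, "CCCCGGGG"),
    sam_record("pair3", 77, "AAAATTTT"),   # both mates unmapped
    sam_record("pair3", 141, "TTTTAAAA"),
]

candidates = list(unmapped_mates_with_mapped_partner(example))
print(candidates)
```

In this sketch only the unmapped mate of `pair1` is selected. Reads chosen this way would then be passed to the assembly and contig-alignment stages, which are not shown here.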


Introns, Introners, Insertion/deletions, Next generation sequencing, Genome structure polymorphism


Robert Theis