Storing & Analyzing Methylation Data in Chado DB


Surabhi Nigamn

Oral Defence Date: 

Monday, August 17, 2009 - 13:30


TH 331


1:30 PM


Professors Murphy, Petkovic, & Chris Smith (Biology)


DNA (deoxyribonucleic acid) carries hereditary information from one generation to the next. It consists of a long sequence of four kinds of nucleotide bases: adenine (A), guanine (G), thymine (T), and cytosine (C). DNA routinely undergoes mutation in its bases. Methylation is a form of DNA modification in which a methyl group is added to C residues in CG dinucleotides. Over evolutionary time, methylcytosine spontaneously mutates to a T residue, so that CG changes to TG. Methylation can cause genetically identical cells at different locations in the body to express the same gene differently. Mutations in DNA interest biologists because mutated genomes frequently have interesting and unique expression patterns. This project aims to develop software tools that extract sites of potential methylation, and other related information, from the raw DNA reads used to assemble genomes. The software is based on a collection of five algorithms, implemented in Perl, that (1) locate potential methylation sites in a genome from raw reads based on the presence of CG<->TG mutations, (2) calculate the percentage of methylation at each potential methylation site, (3) calculate the frequency of other single nucleotide polymorphism (SNP) base mutations, (4) calculate the rate of transitions in the genome, and (5) calculate the rate of transversions in the genome. The algorithms take a text file in 454AllDiff.txt format as input and produce output in GFF3 format. They were validated by running them on a small set of test input data and manually analyzing the results, and were then run on actual data sets consisting of genomic data for two ant species, L. humile and P. barbatus. The complexity of each algorithm is O(n). The algorithms are implemented so that they can be applied to genomic data of any species stored in 454AllDiff.txt format.
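As a rough illustration of steps (4) and (5), the sketch below classifies single-base substitutions as transitions (purine to purine, or pyrimidine to pyrimidine) or transversions (purine to pyrimidine, or the reverse) and formats one result as a nine-column GFF3 line. It is written in Python for brevity and is not the project's Perl code. The function names, the in-memory list of (reference, variant) base pairs, the `example` source label, the `SNP` feature type, and the attribute tags are all assumptions made for this example. Parsing of 454AllDiff.txt is deliberately left out.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    pair = {ref, alt}
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def substitution_rates(substitutions):
    """Fraction of transitions and transversions among (ref, alt) pairs.

    One pass over the input, so the cost is linear in the number of pairs.
    """
    counts = Counter(classify_substitution(ref, alt) for ref, alt in substitutions)
    total = sum(counts.values())
    if total == 0:
        return {"transition": 0.0, "transversion": 0.0}
    return {kind: counts[kind] / total for kind in ("transition", "transversion")}


def gff3_line(seqid, position, ref, alt, source="example"):
    """Format one substitution as a tab-separated, nine-column GFF3 line.

    The source label, feature type, and attribute tags are hypothetical choices.
    """
    attributes = f"ref={ref};alt={alt};class={classify_substitution(ref, alt)}"
    columns = [seqid, source, "SNP", str(position), str(position),
               ".", "+", ".", attributes]
    return "\t".join(columns)
```

For example, `substitution_rates([("C", "T"), ("C", "G")])` counts one transition and one transversion, and `gff3_line("scaffold1", 100, "C", "T")` yields a line whose start and end columns are both 100.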


Methylation, Mutation, Genomes, Reads, GFF3, 454AllDiff.txt, Algorithms