Coagen, A Program for the Creation of Synthetic DNA Sequences


Jose Eguizabal

Oral Defence Date: 



TH 331


Professor Petkovic, Professor Murphy & Adjunct Professor Buturovic


Generating synthetic DNA sequences is important for developing computational techniques for DNA sequence analysis and helps compensate for the lack of real data. The state-of-the-art program for generating synthetic sequences is Coasim, an open-source program based on the coalescent model that can simulate samples of SNPs and other haplotypes/genotypes using the ancestral recombination graph. Although the program produces good results, the instructions for generating sequences must be written in the Scheme programming language, so users have to learn Scheme before they can run Coasim. For researchers in areas outside Computer Science, learning a new programming language just to generate test data is impractical. In this project we have created a new program called Coagen, which provides an easy-to-use interface for Coasim. Coagen simplifies DNA sequence creation by automatically building the scripts run by Coasim from the parameters provided by the user, which lets users create sequences in Coasim without learning Scheme. We have also tested Coasim in an experiment comparing its output with natural data using the PhenoSVM software previously developed at SFSU. We believe that Coagen will be very useful to the large number of researchers who find Coasim hard to use, and that it will have applications in DNA sequence analysis, such as the validation of programs that use machine learning algorithms.
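The idea of building a script from user-supplied parameters can be illustrated with a minimal Python sketch. It is not Coagen's actual implementation: the parameter names, the `SimulationParams` class, and the Scheme procedure `run-simulation` are hypothetical placeholders chosen for illustration and are not taken from Coasim's interface.

```python
from dataclasses import dataclass


@dataclass
class SimulationParams:
    """User-supplied settings; the field names are illustrative assumptions."""
    num_sequences: int
    num_markers: int
    recombination_rate: float


def build_scheme_script(params: SimulationParams) -> str:
    """Render the parameters as a small Scheme-style script.

    NOTE: `run-simulation` and the variable names below are placeholders,
    not actual Coasim procedures.
    """
    # Validate the input before any text is generated.
    if params.num_sequences <= 0 or params.num_markers <= 0:
        raise ValueError("sequence and marker counts must be positive")
    if params.recombination_rate < 0:
        raise ValueError("recombination rate must be non-negative")

    lines = [
        ";; generated script (placeholder procedure names)",
        f"(define n-sequences {params.num_sequences})",
        f"(define n-markers {params.num_markers})",
        f"(define rho {params.recombination_rate})",
        "(run-simulation n-sequences n-markers rho)",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_scheme_script(SimulationParams(10, 20, 40.0)))
```

The user only fills in numeric values; the Scheme text is produced by the template, so no knowledge of Scheme syntax is needed to obtain a runnable script.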


Synthetic DNA sequences, Coasim, machine learning

