A Tool for Specifying and Retrieving Data Provenance


Jenny Gutierrez

Oral Defence Date: 

Tuesday, December 11, 2007 - 17:00


TH 935


Professors Marguerite Murphy & Dragutin Petkovic


Data provenance describes where a data item comes from and how it has been changed by algorithms or transformations. It is important in many scientific domains because experiments are often data-intensive. This paper presents the design and implementation of a tool for specifying and retrieving data provenance, which allows a developer to store computed data provenance in a database and retrieve it later. Supporting metadata is stored as a graph of nodes and edges, and the transitive closure of that graph is used to compute data provenance properties. The database API is the core contribution of the project. It is implemented using Microsoft SQL Server Express 2005, Microsoft Visual Studio 2005 and C#, running in the Windows environment. Another important component of the tool is the naïve transitive closure algorithm, which computes the transitive closure and is built as a user-defined stored procedure (UDP) using the Microsoft CLR environment newly integrated into the SQL Server 2005 database management system. To demonstrate the usage of the database API, a web application was implemented for this project using Microsoft Visual Studio 2005 and ASP.NET. For demonstration purposes, citation ("cites") annotation data was chosen to illustrate the derived-from relationship, but the tool can specify and retrieve data provenance for other types of data with some modifications to the database API. Currently the tool allows a user to specify cites relationships between documents and to retrieve the data provenance (i.e., the descendants and ancestors of a particular document) from the database.
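The naïve transitive closure idea described above can be illustrated with a minimal sketch. This is not the project's C# stored procedure. It is a small Python illustration under stated assumptions: the function names, the sample documents, and the edge direction (a pair `(x, y)` means "x cites y", so x is derived from y) are all hypothetical.

```python
def transitive_closure(edges):
    """Naive transitive closure: re-join the whole closure with the
    edge set on every pass until a pass adds no new pair."""
    closure = set(edges)
    while True:
        # Join every known pair (a, b) with every edge (b, d) to get (a, d).
        new_pairs = {(a, d)
                     for (a, b) in closure
                     for (c, d) in edges
                     if b == c}
        if new_pairs <= closure:  # nothing new: fixpoint reached
            return closure
        closure |= new_pairs


def ancestors(closure, doc):
    """Documents that doc cites, directly or indirectly (assumed direction)."""
    return {b for (a, b) in closure if a == doc}


def descendants(closure, doc):
    """Documents that cite doc, directly or indirectly (assumed direction)."""
    return {a for (a, b) in closure if b == doc}


# Hypothetical cites relationships: C cites B, B cites A, D cites B.
cites = {("C", "B"), ("B", "A"), ("D", "B")}
closure = transitive_closure(cites)
```

With this sample data, `ancestors(closure, "C")` contains A and B, and `descendants(closure, "A")` contains B, C and D. The naïve variant recomputes every join on each pass, which is simple to express but does redundant work on larger graphs.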


Databases, Data provenance, Microsoft SQL Server, Microsoft Visual Studio, SQLCLR, Transitive Closure.

