Improving Entity Matching with Global Constraints

Wednesday, March 2, 2011 - 16:30
TH 331
Jim Gemmell (Microsoft)
The well-known entity matching (or entity resolution) problem, along with the related record linkage problem from statistics, is an important step in merging datasets from different, possibly heterogeneous, sources. As each database may contain a different view of common entities, it is often desirable to match the databases' rows for equivalence. While the related problem of deduplication (matching equivalent rows within a single database) yields a dataset that uniquely records entities, previous work on matching pairs of databases has focused on many-to-many matchings. We propose the one-to-one entity matching problem, motivating it as a task of direct interest for search and the semantic web. We present approaches to constrained entity matching based on machine learning and graph algorithms, and demonstrate that adding global constraints to the common technique of score-based matching significantly improves precision and recall on real-world data.
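
To make the contrast concrete, the following is a minimal sketch, not the speaker's method: it compares plain score thresholding, which can return many-to-many matches, with a one-to-one constraint enforced by picking the assignment of maximum total score. The score matrix, the threshold, and both function names are hypothetical, and the exhaustive search is only practical for tiny inputs.

```python
from itertools import permutations


def threshold_matches(scores, threshold):
    """Score-based matching: keep every pair whose score clears the threshold."""
    return [
        (i, j)
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
        if score >= threshold
    ]


def one_to_one_matches(scores, threshold):
    """Globally constrained matching: each row and each column is used at most once.

    Exhaustively searches for the assignment with the highest total score,
    then drops assigned pairs that fall below the threshold.
    Assumes there are at least as many columns as rows.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    best_total, best_cols = float("-inf"), ()
    for cols in permutations(range(n_cols), n_rows):
        total = sum(scores[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return [(i, j) for i, j in enumerate(best_cols) if scores[i][j] >= threshold]


# Hypothetical similarity scores: rows are records from database A,
# columns are records from database B.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.30, 0.20],
    [0.10, 0.20, 0.70],
]

many_to_many = threshold_matches(scores, 0.5)
one_to_one = one_to_one_matches(scores, 0.5)
print(many_to_many)
print(one_to_one)
```

In this toy input, thresholding alone pairs row 0 with two columns and column 0 with two rows, while the constrained version gives up the single highest-scoring pair in favor of an assignment with a higher total. For realistic sizes, a polynomial-time assignment algorithm would replace the exhaustive search.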

Jim Gemmell is a senior researcher at Microsoft Research, currently working on the next generation of search. He holds a PhD in computer science from Simon Fraser University. With Gordon Bell, he co-authored the book Your Life Uploaded, based on their MyLifeBits research project, which spawned an entire research community. He has also done research on personal media management, telepresence, and reliable multicast. His research has led to features in Windows XP, Windows Server 2008, Windows 7, and Bing.