Improving graph-walk-based similarity with reranking: Case studies for personal information management

Minkov, Einat; Cohen, William W

doi:10.1145/1877766.1877770

by Einat Minkov, William W Cohen

Abstract:

Relational or semistructured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous, describing different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates, and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading, and person name disambiguation can be all addressed as search queries in this framework, where the graph-walk-based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested to tune edge weight parameters to optimize the graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical runtimes, suggest that the described framework is a useful search system in the PIM domain, as well as in other semistructured domains.

View PDF

Reference:

Improving graph-walk-based similarity with reranking: Case studies for personal information management (Einat Minkov, William W Cohen), In ACM Transactions on Information Systems, volume 29, 2010.

Bibtex Entry:

@article{Minkov2010,
abstract = {Relational or semistructured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous, describing different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates, and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading, and person name disambiguation can be all addressed as search queries in this framework, where the graph-walk-based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested to tune edge weight parameters to optimize the graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical runtimes, suggest that the described framework is a useful search system in the PIM domain, as well as in other semistructured domains.},
author = {Minkov, Einat and Cohen, William W},
doi = {10.1145/1877766.1877770},
issn = {10468188},
journal = {ACM Transactions on Information Systems},
keywords = {SML-LIB-BIBLIO,lang:ENG},
mendeley-tags = {SML-LIB-BIBLIO,lang:ENG},
month = dec,
number = {1},
pages = {1--52},
title = {{Improving graph-walk-based similarity with reranking: Case studies for personal information management}},
url = {http://portal.acm.org/citation.cfm?doid=1877766.1877770},
volume = {29},
year = {2010}
}