A Web Search Engine-based Approach to Measure Semantic Similarity between Words

Bollegala, Danushka; Matsuo, Yutaka; Ishizuka, Mitsuru

doi:10.1109/TKDE.2010.172

Abstract:

Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose a semantic similarity measure using page counts and text snippets retrieved from a Web search engine for two words. Specifically, we define various word co-occurrence measures using page counts and integrate those with lexical patterns extracted from text snippets. To identify the numerous semantic relations that exist between two given words, we propose a novel pattern extraction algorithm and a pattern clustering algorithm. The optimal combination of page-count-based co-occurrence measures and lexical pattern clusters is learned using support vector machines. The proposed method outperforms various baselines and previously proposed web-based semantic similarity measures on three benchmark datasets, showing a high correlation with human ratings. Moreover, the proposed semantic similarity measure significantly improves the accuracy in a community mining task.
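
The abstract does not spell out the co-occurrence measures themselves. As a rough illustration of how page counts can be turned into similarity scores, the sketch below applies four standard coefficients (Jaccard, Dice, overlap, and pointwise mutual information) to hit counts for a query P, a query Q, and the conjunctive query "P AND Q". The function names, the noise threshold `min_cooccurrence`, the `index_size` parameter, and the sample counts are all assumptions made for this example and are not taken from the text above.

```python
import math


def jaccard(h_p, h_q, h_pq, min_cooccurrence=5):
    """Jaccard coefficient computed from page counts.

    h_p, h_q: page counts for the two single-word queries.
    h_pq: page count for the conjunctive query "P AND Q".
    min_cooccurrence: assumed noise threshold; very small joint counts
    are treated as accidental co-occurrence and scored 0.
    """
    if h_pq <= min_cooccurrence:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)


def dice(h_p, h_q, h_pq, min_cooccurrence=5):
    """Dice coefficient computed from page counts."""
    if h_pq <= min_cooccurrence:
        return 0.0
    return 2 * h_pq / (h_p + h_q)


def overlap(h_p, h_q, h_pq, min_cooccurrence=5):
    """Overlap (Simpson) coefficient computed from page counts."""
    if h_pq <= min_cooccurrence:
        return 0.0
    return h_pq / min(h_p, h_q)


def pmi(h_p, h_q, h_pq, index_size, min_cooccurrence=5):
    """Pointwise mutual information (base 2) from page counts.

    index_size: assumed number of pages indexed by the search engine,
    used to turn raw counts into probability estimates.
    """
    if h_pq <= min_cooccurrence:
        return 0.0
    p_joint = h_pq / index_size
    p_p = h_p / index_size
    p_q = h_q / index_size
    return math.log2(p_joint / (p_p * p_q))


# Hypothetical page counts, chosen only to exercise the functions.
counts = {"h_p": 100, "h_q": 50, "h_pq": 25}
features = [
    jaccard(**counts),
    dice(**counts),
    overlap(**counts),
    pmi(**counts, index_size=1000),
]
```

A feature vector like `features` is the kind of input that could be combined with pattern-based features and passed to a classifier such as a support vector machine; how the paper actually builds and weights its features is not described in the abstract.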

Reference:

A Web Search Engine-based Approach to Measure Semantic Similarity between Words (Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka), in IEEE Transactions on Knowledge and Data Engineering, IEEE Computer Society, volume 23, number 7, pages 977-990, 2010.

Bibtex Entry:

@article{Bollegala2010,
abstract = {Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose a semantic similarity measure using page counts and text snippets retrieved from a Web search engine for two words. Specifically, we define various word co-occurrence measures using page counts and integrate those with lexical patterns extracted from text snippets. To identify the numerous semantic relations that exist between two given words, we propose a novel pattern extraction algorithm and a pattern clustering algorithm. The optimal combination of page-count-based co-occurrence measures and lexical pattern clusters is learned using support vector machines. The proposed method outperforms various baselines and previously proposed web-based semantic similarity measures on three benchmark datasets, showing a high correlation with human ratings. Moreover, the proposed semantic similarity measure significantly improves the accuracy in a community mining task.},
author = {Bollegala, Danushka and Matsuo, Yutaka and Ishizuka, Mitsuru},
doi = {10.1109/TKDE.2010.172},
issn = {1041-4347},
journal = {IEEE Transactions on Knowledge and Data Engineering},
keywords = {SML-LIB-BIBLIO,lang:ENG},
number = {7},
pages = {977--990},
publisher = {IEEE Computer Society},
title = {{A Web Search Engine-based Approach to Measure Semantic Similarity between Words}},
url = {http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2010.172},
volume = {23},
year = {2010}
}