Computing Semantic Similarity Using Large Static Corpora

András Dobó, János Csirik

doi:10.1007/978-3-642-35843-2

Abstract:

Measuring the semantic similarity of words is of crucial importance in Natural Language Processing. Although there are many different approaches to this task, there is still room for improvement. In contrast to many other methods that use web search engines or large lexical databases, we developed methods that rely solely on large static corpora. They create a binary or numerical feature vector for each word, making use of statistical information obtained from the corpora. These vectors contain features based on context words or grammatical relations extracted from the corpora, and they employ diverse weighting schemes. After creating the feature vectors, word similarity is calculated using various vector similarity measures. Besides the individual methods, their combinations were also tested. Evaluated on both the Miller-Charles dataset and the TOEFL synonym questions, they achieve results competitive with recent methods.

Reference:

Computing Semantic Similarity Using Large Static Corpora (András Dobó, János Csirik), (Peter Emde Boas, Frans C. A. Groen, Giuseppe F. Italiano, Jerzy Nawrocki, Harald Sack, eds.), Springer Berlin Heidelberg, volume 7741, 2013.

Bibtex Entry:

@book{AndrasDobo2013,
abstract = {Measuring the semantic similarity of words is of crucial importance in Natural Language Processing. Although there are many different approaches to this task, there is still room for improvement. In contrast to many other methods that use web search engines or large lexical databases, we developed methods that rely solely on large static corpora. They create a binary or numerical feature vector for each word, making use of statistical information obtained from the corpora. These vectors contain features based on context words or grammatical relations extracted from the corpora, and they employ diverse weighting schemes. After creating the feature vectors, word similarity is calculated using various vector similarity measures. Besides the individual methods, their combinations were also tested. Evaluated on both the Miller-Charles dataset and the TOEFL synonym questions, they achieve results competitive with recent methods.},
address = {Berlin, Heidelberg},
author = {Dob\'{o}, Andr\'{a}s and Csirik, J\'{a}nos},
doi = {10.1007/978-3-642-35843-2},
editor = {{Emde Boas}, Peter and Groen, Frans C. A. and Italiano, Giuseppe F. and Nawrocki, Jerzy and Sack, Harald},
isbn = {978-3-642-35842-5},
keywords = {SML-LIB-BIBLIO},
mendeley-tags = {SML-LIB-BIBLIO},
publisher = {Springer Berlin Heidelberg},
series = {Lecture Notes in Computer Science},
title = {{Computing Semantic Similarity Using Large Static Corpora}},
url = {http://www.springerlink.com/index/10.1007/978-3-642-35843-2},
volume = {7741},
year = {2013}
}