Measuring semantic similarity between words using web search engines

Bollegala, Danushka

doi:10.1145/1242572.1242675

by Danushka Bollegala

Abstract:

Measuring semantic similarity between words is vital for various applications in natural language processing, such as language modeling, information retrieval, and document clustering. We propose a method that utilizes the information available on the Web to measure semantic similarity between a pair of words or entities. We integrate page counts for each word in the pair and lexico-syntactic patterns that occur among the top-ranking snippets for the AND query using support vector machines. Experimental results on Miller-Charles' benchmark data set show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a named entity clustering task, proving the capability of the proposed measure to capture semantic similarity using web content.
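
As a rough illustration of the page-count component mentioned in the abstract, the sketch below computes a Jaccard-style co-occurrence score from three search hit counts. The function name `web_jaccard`, the choice of the Jaccard coefficient, and the numbers are assumptions made for illustration and are not taken from this record. The lexico-syntactic pattern extraction and the support vector machine step described in the abstract are not shown.

```python
def web_jaccard(count_p: int, count_q: int, count_both: int) -> float:
    """Jaccard-style score from page counts (hypothetical helper).

    count_p    -- hits for the query "P"
    count_q    -- hits for the query "Q"
    count_both -- hits for the conjunctive query "P AND Q"
    """
    union = count_p + count_q - count_both
    # Guard against empty or inconsistent counts.
    if count_both <= 0 or union <= 0:
        return 0.0
    return count_both / union


# Made-up counts, purely for illustration.
score = web_jaccard(1000, 800, 300)
print(score)
```

A score near 1 would mean the two words almost always appear on the same pages, and a score near 0 would mean they rarely co-occur.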

Reference:

Measuring semantic similarity between words using web search engines (Danushka Bollegala), In Proceedings of the 16th International Conference on World Wide Web (WWW '07), ACM Press, 2007.

Bibtex Entry:

@article{Bollegala2007,
abstract = {Measuring semantic similarity between words is vital for various applications in natural language processing, such as language modeling, information retrieval, and document clustering. We propose a method that utilizes the information available on the Web to measure semantic similarity between a pair of words or entities. We integrate page counts for each word in the pair and lexico-syntactic patterns that occur among the top-ranking snippets for the AND query using support vector machines. Experimental results on Miller-Charles' benchmark data set show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a named entity clustering task, proving the capability of the proposed measure to capture semantic similarity using web content.},
address = {New York, New York, USA},
author = {Bollegala, Danushka},
doi = {10.1145/1242572.1242675},
isbn = {9781595936547},
journal = {Proceedings of the 16th International Conference on World Wide Web - WWW '07},
keywords = {SML-LIB-BIBLIO,lang:ENG,semantic similarity},
pages = {757},
publisher = {ACM Press},
title = {{Measuring semantic similarity between words using web search engines}},
url = {http://portal.acm.org/citation.cfm?doid=1242572.1242675 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.4602},
year = {2007}
}