A new semantic similarity metric for solving sparse data problem in ontology based information retrieval system

Saruladha, K; Aghila, Gnanasekaran; Raj, Sajina

by K Saruladha, Gnanasekaran Aghila, Sajina Raj

Abstract:

Semantic similarity assessment methods play a central role in many research areas such as psychology, cognitive science, information retrieval, biomedicine and artificial intelligence. This paper discusses the existing semantic similarity assessment methods and identifies how these could be exploited to accurately calculate the semantic similarity of WordNet concepts. The semantic similarity approaches can broadly be classified into three categories: ontology based approaches (structural approach), information theoretic approaches (corpus based approach) and hybrid approaches. All of these similarity measures should preferably adhere to certain basic properties of information. The survey revealed the following drawback: the information theoretic measures depend on the corpus, and the presence or absence of a concept in the corpus affects the information content metric. For concepts not present in the corpus, the value of information content tends to become zero or infinity, and hence the semantic similarity measure calculated from this metric does not reflect the actual information content of the concept. Hence in this paper we propose a new information content metric which provides a solution to the sparse data problem prevalent in corpus based approaches. The proposed measure is corpus independent and takes into consideration hyponymy and meronymy relations. Empirical studies compute the similarity of the R&G data set using the existing Resnik, Lin and J&C semantic similarity methods together with the proposed information content metric and hypernym relations. The correctness of the proposed information content metric is to be verified by comparing the results against the human judgments available for the R&G set. Further, the information content metric used earlier by the Resnik, Lin, and Jiang and Conrath methods may produce better results with corpora other than the Brown corpus. Hence the effect of the corpus based information content metric on alternate corpora is also investigated. We also propose a new semantic similarity measure.

Reference:

A new semantic similarity metric for solving sparse data problem in ontology based information retrieval system (K Saruladha, Gnanasekaran Aghila, Sajina Raj), In International Journal of Computer Science Issues, volume 7, number 3, pages 40-48, 2010.

Bibtex Entry:

@article{Saruladha2010a,
abstract = {Semantic similarity assessment methods play a central role in many research areas such as psychology, cognitive science, information retrieval, biomedicine and artificial intelligence. This paper discusses the existing semantic similarity assessment methods and identifies how these could be exploited to accurately calculate the semantic similarity of WordNet concepts. The semantic similarity approaches can broadly be classified into three categories: ontology based approaches (structural approach), information theoretic approaches (corpus based approach) and hybrid approaches. All of these similarity measures should preferably adhere to certain basic properties of information. The survey revealed the following drawback: the information theoretic measures depend on the corpus, and the presence or absence of a concept in the corpus affects the information content metric. For concepts not present in the corpus, the value of information content tends to become zero or infinity, and hence the semantic similarity measure calculated from this metric does not reflect the actual information content of the concept. Hence in this paper we propose a new information content metric which provides a solution to the sparse data problem prevalent in corpus based approaches. The proposed measure is corpus independent and takes into consideration hyponymy and meronymy relations. Empirical studies compute the similarity of the R\&G data set using the existing Resnik, Lin and J\&C semantic similarity methods together with the proposed information content metric and hypernym relations. The correctness of the proposed information content metric is to be verified by comparing the results against the human judgments available for the R\&G set. Further, the information content metric used earlier by the Resnik, Lin, and Jiang and Conrath methods may produce better results with corpora other than the Brown corpus. Hence the effect of the corpus based information content metric on alternate corpora is also investigated. We also propose a new semantic similarity measure.},
author = {Saruladha, K and Aghila, Gnanasekaran and Raj, Sajina},
journal = {International Journal of Computer Science Issues},
keywords = {Ontology,SML-LIB-BIBLIO,Semantic Similarity,conceptual similarity,corpus based,information content,information retrieval,lang:ENG,similarity method,taxonomy},
mendeley-tags = {SML-LIB-BIBLIO,Semantic Similarity,information content,lang:ENG},
number = {3},
pages = {40--48},
title = {{A new semantic similarity metric for solving sparse data problem in ontology based information retrieval system}},
volume = {7},
year = {2010}
}