The Role of Local and Global Weighting in Assessing the Semantic Similarity of Texts Using Latent Semantic Analysis.

by Mihai C Lintean, Cristian Moldovan, Vasile Rus, Danielle S McNamara
Abstract:
In this paper, we investigate the impact of several local and global weighting schemes on Latent Semantic Analysis’ (LSA) ability to capture semantic similarity between two texts. We worked with texts varying in size from sentences to paragraphs. We present a comparison of 3 local and 3 global weighting schemes across 3 different standardized data sets related to semantic similarity tasks. For local weighting, we used binary weighting, term-frequency, and log-type. For global weighting, we relied on binary, inverted document frequencies (IDF) collected from the English Wikipedia, and entropy, which is the standard weighting scheme used by most LSA-based applications. We studied all possible combinations of these weighting schemes on the following three tasks and corresponding data sets: paraphrase identification at sentence level using the Microsoft Research Paraphrase Corpus, paraphrase identification at sentence level using data from the intelligent tutoring system iSTART, and mental model detection based on student-articulated paragraphs in MetaTutor, another intelligent tutoring system. Our experiments revealed that for sentence-level texts a combination of type frequency local weighting in combination with either IDF or binary global weighting works best. For paragraph-level texts, a log-type local weighting in combination with binary global weighting works best. We also found that global weights have a greater impact for sentence-level similarity as the local weight is undermined by the small size of such texts.
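The abstract names three local and three global weighting schemes without giving formulas. The following is a minimal sketch of how such weights are commonly computed, assuming the usual textbook definitions (log local weight as log(1 + tf), IDF as log(N / df), entropy weight as 1 plus the normalized sum of p log p); the paper itself may define them differently. The function names and the toy counts are hypothetical, and the IDF here is computed from a small in-memory corpus rather than from the English Wikipedia.

```python
import math

def local_weight(tf, scheme):
    """Local weight of a term inside one text, given its raw count tf."""
    if scheme == "binary":
        return 1.0 if tf > 0 else 0.0
    if scheme == "tf":
        return float(tf)
    if scheme == "log":
        # Assumed definition: log(1 + tf), which dampens repeated terms.
        return math.log(1 + tf)
    raise ValueError(f"unknown local scheme: {scheme}")

def global_weight(doc_counts, scheme):
    """Global weight of a term, given its count in each document of a
    reference corpus (doc_counts has one entry per document)."""
    n_docs = len(doc_counts)
    if scheme == "binary":
        return 1.0
    if scheme == "idf":
        df = sum(1 for c in doc_counts if c > 0)
        return math.log(n_docs / df) if df else 0.0
    if scheme == "entropy":
        gf = sum(doc_counts)
        if gf == 0 or n_docs < 2:
            return 0.0
        # Sum of p * log(p) over documents containing the term (<= 0).
        h = sum((c / gf) * math.log(c / gf) for c in doc_counts if c > 0)
        return 1.0 + h / math.log(n_docs)
    raise ValueError(f"unknown global scheme: {scheme}")

def weighted_entry(tf, doc_counts, local_scheme, global_scheme):
    """Term weight for one text: local weight times global weight."""
    return local_weight(tf, local_scheme) * global_weight(doc_counts, global_scheme)

# Hypothetical counts: a term concentrated in one of four documents,
# and a term spread evenly across all four.
rare = [3, 0, 0, 0]
common = [2, 2, 2, 2]
print(weighted_entry(3, rare, "log", "idf"))
print(global_weight(rare, "entropy"), global_weight(common, "entropy"))
```

Under these assumed definitions, a term spread evenly over every document gets an entropy weight near 0, while a term confined to a single document gets a weight of 1, so the global weight rewards terms that discriminate between documents.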
Reference:
The Role of Local and Global Weighting in Assessing the Semantic Similarity of Texts Using Latent Semantic Analysis. (Mihai C Lintean, Cristian Moldovan, Vasile Rus, Danielle S McNamara), In FLAIRS Conference, 2010.
Bibtex Entry:
@inproceedings{lintean2010role,
abstract = {In this paper, we investigate the impact of several local and global weighting schemes on Latent Semantic Analysis’ (LSA) ability to capture semantic similarity between two texts. We worked with texts varying in size from sentences to paragraphs. We present a comparison of 3 local and 3 global weighting schemes across 3 different standardized data sets related to semantic similarity tasks. For local weighting, we used binary weighting, term-frequency, and log-type. For global weighting, we relied on binary, inverted document frequencies (IDF) collected from the English Wikipedia, and entropy, which is the standard weighting scheme used by most LSA-based applications. We studied all possible combinations of these weighting schemes on the following three tasks and corresponding data sets: paraphrase identification at sentence level using the Microsoft Research Paraphrase Corpus, paraphrase identification at sentence level using data from the intelligent tutoring system iSTART, and mental model detection based on student-articulated paragraphs in MetaTutor, another intelligent tutoring system. Our experiments revealed that for sentence-level texts a combination of type frequency local weighting in combination with either IDF or binary global weighting works best. For paragraph-level texts, a log-type local weighting in combination with binary global weighting works best. We also found that global weights have a greater impact for sentence-level similarity as the local weight is undermined by the small size of such texts.},
author = {Lintean, Mihai C and Moldovan, Cristian and Rus, Vasile and McNamara, Danielle S},
booktitle = {FLAIRS Conference},
keywords = {SML-LIB-BIBLIO,lang:ENG},
mendeley-tags = {SML-LIB-BIBLIO,lang:ENG},
title = {{The Role of Local and Global Weighting in Assessing the Semantic Similarity of Texts Using Latent Semantic Analysis}},
year = {2010}
}