Benchmarks

This section lists the benchmarks commonly used to evaluate semantic measures.

Please email the community if you want to add a benchmark that is not listed below.

General Benchmarks

Rubenstein and Goodenough - 1965

Reference: H. Rubenstein and J.B. Goodenough. (1965). "Contextual correlates of synonymy". Communications of the ACM, 8(10): 627-633.

Miller and Charles - 1991

A subset of the Rubenstein and Goodenough benchmark.

Reference: Miller GA, Charles WG: Contextual Correlates of Semantic Similarity. Language & Cognitive Processes 1991, 6:1–28.

Finkelstein et al. - 2002 - WordSimilarity-353

The WordSimilarity-353 Test Collection contains two sets of English word pairs along with human-assigned similarity judgements. The collection can be used to train and/or test computer algorithms implementing semantic similarity measures (i.e., algorithms that numerically estimate similarity of natural language words) (source).

Reference: L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman and E. Ruppin. (2002). "Placing Search in Context: The Concept Revisited". ACM Transactions on Information Systems, 20(1): 116-131.
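A benchmark of this kind is typically used by scoring every word pair with the measure under test and then checking how well those scores agree with the human judgements, often through a rank correlation such as Spearman's. The following is a minimal sketch of that idea in plain Python. The function names (`average_ranks`, `spearman`, `evaluate`), the toy measure, and the toy word pairs with their scores are all invented for illustration; they are not taken from any of the benchmarks listed here, and the sketch does not assume any particular file format.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(measure, gold):
    """Correlate a measure with human scores.

    gold is a list of (word1, word2, human_score) triples;
    measure is any callable taking two words and returning a number.
    """
    predicted = [measure(w1, w2) for w1, w2, _ in gold]
    human = [score for _, _, score in gold]
    return spearman(predicted, human)


def toy_measure(w1, w2):
    """Hypothetical stand-in measure: Jaccard overlap of character sets."""
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b)


# Invented pairs and scores, for illustration only (not benchmark data).
toy_gold = [
    ("walk", "walking", 8.0),
    ("tree", "street", 2.0),
    ("lamp", "cloud", 1.0),
    ("river", "driver", 3.0),
]

if __name__ == "__main__":
    print(evaluate(toy_measure, toy_gold))
```

A real evaluation would replace `toy_gold` with the pairs and judgements loaded from the benchmark file and `toy_measure` with the semantic measure being tested.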

Halawi et al. - 2012 - Mturk-771

The Mturk-771 Test Collection contains 771 English word pairs along with human-assigned relatedness judgements. The collection can be used to train and/or test computer algorithms implementing semantic relatedness measures (i.e., algorithms that numerically estimate relatedness of natural language words) (source).

Reference: Halawi, G., Dror, G., Gabrilovich, E., & Koren, Y. (2012, August). Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1406-1414). ACM.

Domain-specific benchmarks

Biomedical domain

Pedersen et al. - 2007

A subset of 29 medical concept pairs manually rated by medical coders for semantic relatedness with high inter-rater agreement. (source: University of Minnesota Pharmacy Informatics Lab)

Download the benchmark from the University of Minnesota Pharmacy Informatics Lab.

Reference: Measures of semantic similarity and relatedness in the biomedical domain. Pedersen T., Pakhomov S.V.S., Patwardhan S., and Chute C.G. Journal of Biomedical Informatics. 2007;40(3):288-299.

Pakhomov et al. - 2010

Semantic similarity

A set of 566 UMLS concept pairs manually rated for semantic similarity using a continuous response scale. (source: University of Minnesota Pharmacy Informatics Lab)

Download the benchmark from the University of Minnesota Pharmacy Informatics Lab.

Semantic relatedness

A set of 587 UMLS concept pairs manually rated for semantic relatedness using a continuous response scale. (source: University of Minnesota Pharmacy Informatics Lab)

Download the benchmark from the University of Minnesota Pharmacy Informatics Lab.

Reference: Semantic Similarity and Relatedness between Clinical Terms: An Experimental Study. Pakhomov S., McInnes, B., Adams, T., Liu, Y., Pedersen, T. and Melton, G.B. To Appear in the Proceedings of the Annual Symposium of the American Medical Informatics Association. Washington, D.C. November, 2010.

Gene Ontology related benchmarks

Tools

CESSM

CESSM is an online tool for the automated evaluation of GO-based semantic similarity measures. It enables the comparison of new measures against previously published ones in terms of performance against sequence, Pfam and EC similarity. (source)

Reference: Catia Pesquita, Delphine Pessoa, Daniel Faria, Francisco Couto. CESSM: Collaborative Evaluation of Semantic Similarity Measures. JB2009: Challenges in Bioinformatics. November, 2009.