Graph-based Semantic Measures: Technical section
This section is dedicated to technical aspects of semantic measures.
Measures used to estimate the similarity or relatedness of concepts or groups of concepts are detailed.
This section also aims at referencing some of the metrics used by semantic measures e.g. metrics which evaluate the specificity of concepts.
Quick Summary:
- Specificity of concepts: this section is dedicated to the various metrics used to evaluate the specificity of concepts e.g. their information content.
- Pairwise measures: Technical discussion related to measures used to compare a pair of concepts. A listing of existing measures is also provided.
- Groupwise measures: Technical discussion related to measures used to compare a group of concepts. A listing of existing measures is also provided.
Estimation of concept specificity
Two broad categories of measures have been defined to estimate specificity of concepts. The intrinsic approach only takes into account the topological properties of the taxonomic backbone of the semantic graph. The extrinsic approach, introduced by Resnik [1], extends the intrinsic approach also considering the usage of a concept to assess its specificity.
Since taxonomic relationships are transitive, an occurrence of the concept $y$ is also a occurrence of the concept $x$ if $y \prec x$, i.e. $x$ subsumes $y$. In other word, considering that Mammal subsumes Human, every time you read about Human you also read about Mammal. The concept Mammal is therefore more specific than all its subsumers. Otherwise stated, we can say that Mammal is also more informative. Think about a detective story; if I say 'A Mammal killed him', it would be less informative (discriminating) for the detective than saying 'A Human killed him'. The intrinsic approach rely on topological information to assess concept specificity e.g. depth of concepts in the semantic graph.
The extrinsic approach takes into account of the occurrences of concepts. Indeed, the intrinsic approach, in some way, only consider the schema associated to the ontology. However, adopting Shannon's theory of information, Resnik proposed to define the specificity of a concept as a function of its information content, i.e. the amount of information a concept carries. Therefore, more a concept is used or represented in a collection, less informative (specific) it is.
In the following documentation, we consider $\theta$ as a function estimating concept specificity, with:
- $\theta : \mathcal{C} \rightarrow \mathbb{R}^+$
In all cases requirement on measures evaluating the specificity of concepts.
Due to the partial order induced by the taxonomic subgraph, a concept must always be more specific than its ancestors (subsumers).
A constraint of $\theta$ functions is thus:
- $y \prec x \rightarrow \theta(y) > \theta(x)$
Both, intrinsic and extrinsic approaches are detailed below.
Intrinsic estimation
Depth of concepts
The depth of a concept can be used to estimate its specificity, i.e. deeper is a concept in the ontology, more specific it is expected to be. Numerous $\theta$ function can be defined based on the depth of a concept.
Since a concept can have multiple depth in a taxonomic graph, i.e. depending on the path linking the concept to the root which is considered, a min or max (among others) strategy can be used.
In some cases, a logarithm function can be used to obtain non-linear decreasing function from the leaves to the root of the ontology:
Note the equation presented above is a special case of Zhou function presented below (i.e. Zhou function with k set to nil). Talk about approximations proposed by Blanchard et al.
Number of ancestors
Number of ancestors can also be used as an estimator of specificity. This estimator follows the same strategy as the depth but is theoretically better suited for graph as aggregations of multiple depth are not required anymore.
Resnik - 1995
Seco et al. - 2004
The specificity of concept can also be evaluated according to the number of concepts it subsumes.
Reference: Seco N, Veale T, Hayes J: An Intrinsic Information Content Metric for Semantic Similarity in WordNet. In 16th European Conference on Artificial Intelligence. IOS Press; 2004, 16:1-5.
Zhou et al. - 2008
Multiple estimators can also be mixed.
With k a parameter enabling to tune the contribution of each estimators.
Reference: Zhou Z, Wang Y, Gu J: A New Model of Information Content for Semantic Similarity in WordNet. In FGCNS ’08 Proceedings of the 2008 Second International Conference on Future Generation Communication and Networking Symposia Volume 03. IEEE Computer Society; 2008:85-89.
Sanchez et al. - 2011
Sanchez D, Batet M, Isern D: Ontology-based information content computation. Knowledge-Based Systems 2011, 24:297-303.
Harispe et al. 2012
Modification of Sanchez et al. IC in order to distinguish specificity of leaves according to their number of ancestors.
Mazandu et al. - 2012
Gaston K. Mazandu and Nicola J. Mulder, “A Topology-Based Metric for Measuring Term Similarity in the Gene Ontology,” Advances in Bioinformatics, vol. 2012, Article ID 975783, 17 pages, 2012. doi:10.1155/2012/975783 http://www.hindawi.com/journals/abi/2012/975783/
Extrinsic estimation
Resnik Information Content - 1995
Authors | Resnik |
Year | 1995 |
Type | Extrinsic |
Context | Semantic similarity on WordNet |
SML Flag | |
Dependencies | Requires an annotation repository or number of occurences per concepts |
Summary | The information content is defined as the inverse logarithm of the probability to encounter an instance annotated by the concept. |
Reference | [1] |
Original definition of information content initially defined by Resnik as $-log(p(u))$ for a concept $u$, which can be reformulated by:
Pairwise Semantic measures
Rada et al. - 1989
Rada defined the similarity of two concepts as a function of the shortest path linking the two concepts. Note that Rada originally defined a semantic distance between two concepts.
With sp(u,v) the shortest path linking the two concepts. The similarity is therefore defined by:
The measure proposed by Rada et al. was defined in the context of trees, i.e. a data structures in which there is no multi inheritances, a concept only have a unique parent. A common adaptation of the measure to DAG is to force the shortest path to contain a common ancestor of the compared concept.
Reference: Rada R, Mili H, Bicknell E, Blettner M: Development and application of a metric on semantic nets. Ieee Transactions On Systems Man And Cybernetics 1989, 19:17–30.
Wu & Palmer - 1994
The measure proposed by Wu & Palmer was defined in the context of semantic trees. The similarity is defined as a function of the depth of the compared concepts and the depth of their most specific common ancestor (MSCA), their common ancestor which has the maximal depth.
Reference: Wu Z, Palmer M: Verb semantics and lexical selection. In 32nd. Annual Meeting of the Association for Computational Linguistics. 1994:133–138.
Resnik - 1995
Authors | Resnik |
Year | 1995 |
Type | Pairwise |
Context | Semantic similarity on WordNet |
SML Flag | |
Dependencies | Requires an IC function to be defined |
Summary | The similarity is defined as the information content of the most informative common ancestors of the compared concepts |
Reference | [1] |
The similarity of two concepts $u$, $v$ is defined as the information content of their Most Informative Common Ancestor (MICA):
Note that the measure can be abstracted in order to tune it with other function estimating the specificity of concepts:
Resnik - 1995 (B)
Leacock and Chodorow - 1998
Adaptation of Rada et al. measure in order to take into account of the depth of the ontology.
Reference: Leacock C, Chodorow M: Combining Local Context and WordNet Similarity for Word Sense Identification. In WordNet: An electronic lexical database. edited by Fellbaum C MIT Press; 1998:265 – 283.
Lin - 1998
Adaptation of Lin measure in order to take into account of the specificity of compared concepts.
Stojanovic - 2001
Pekar & Staab - 2002
Li et al. - 2003
Slimani et al. - 2006
Al Mubaid - 2006
Ranwez et al. - 2006
Schlicker et al. - 2006
Wang et al. - 2007
G-Sesame - 2007
Jiang and Conrath - 2007
Li et al. - 2010
Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins. Bo Li, James Z. Wang, F. Alex Feltus, Jizhong Zhou, Feng Luo
Kyogoku - 2011
Mazandu et al. - 2012
Groupwise Semantic measures
Existing measures
RDF Compatibility
The graph data model in use is not fully RDF-compatible in the sense that graph vertices can only represent URI (the SML takes advantage of the Sesame API [2] to process RDF data). Therefore, no literals and blank nodes can be loaded in the graph which are currently processed by the library. In addition, the graph representation is not triple-oriented and therefore does not fully embrace rigorous definition of RDF-graphs [3].
References
[1] | Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence IJCAI, 1:448-453, 1995. |
[2] | Sesame: A generic architecture for storing and querying rdf and rdf schema. In The Semantic Web — ISWC 2002, pages 54-68, Springer Berlin Heidelberg, 2002. |
[3] | Resource Description Framework (RDF): Concepts and Abstract Syntax. , 2004. |