Medical Subject Headings (MeSH)

Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings (source: Wikipedia).
The latest version of MeSH can be downloaded from the dedicated web site.

This web page is dedicated to the processing of MeSH using the Semantic Measures Library (SML).
We discuss:

  • MeSH Graph generation
  • MeSH loading in the SML
  • Computation of semantic similarities over MeSH

MeSH Graph generation

MeSH is a thesaurus, i.e. a controlled vocabulary. We distinguish (i) MeSH descriptors, which represent concepts, and (ii) entry terms, which correspond to the many synonyms, near-synonyms, and closely related concepts that can be associated with a MeSH descriptor.

MeSH descriptors are organized in several MeSH tree structures, one per category used to organize the descriptors (e.g. Diseases, Chemicals and Drugs). Those trees can be found here. A MeSH descriptor can appear in multiple trees through various tree nodes, which are identified by tree numbers (e.g. C16.131.42). Below is an example of a descriptor hierarchy (source) found in the tree related to Diseases (Tree C).

Congenital Abnormalities C16.131
    Abnormalities, Drug Induced C16.131.42
    Abnormalities, Multiple C16.131.77
        Alagille Syndrome C16.131.77.65
        Alstrom Syndrome C16.131.77.80
        Angelman Syndrome C16.131.77.95

The figure presented below shows a graphical representation of two trees. Notice that the roots of the trees (e.g. [C] for Tree C) are not explicitly defined in MeSH, i.e. no descriptors are associated with them. It is nevertheless important to include them in order to make explicit that the top-level concepts of a tree (e.g. C16, ...) belong to the same hierarchy: the root provides a common ancestor, which most semantic similarity measures require.
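The way a tree number encodes its position in a tree can be sketched in a few lines of Python. This is only an illustration, not SML code: the function names are hypothetical, and it assumes that a top-level node such as C16 is attached to an added root named after its category letter.

```python
from typing import List, Optional


def parent_tree_number(tree_number: str) -> Optional[str]:
    """Return the tree number of the parent node, or None for a tree root.

    Assumption: a top-level node such as "C16" is attached to an explicit
    root named after its category letter ("C"), which MeSH does not define.
    """
    if "." in tree_number:
        # Drop the last dotted component: C16.131.77.65 -> C16.131.77
        return tree_number.rsplit(".", 1)[0]
    if len(tree_number) > 1:
        # Top-level node: attach it to the added tree root, C16 -> C
        return tree_number[0]
    # A single letter is already a tree root.
    return None


def ancestors(tree_number: str) -> List[str]:
    """List the tree numbers from the direct parent up to the tree root."""
    chain = []
    parent = parent_tree_number(tree_number)
    while parent is not None:
        chain.append(parent)
        parent = parent_tree_number(parent)
    return chain


print(ancestors("C16.131.77.65"))
# ['C16.131.77', 'C16.131', 'C16', 'C']
```

With the added root, any two nodes of Tree C share at least the ancestor C.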

As noted above, a MeSH descriptor (concept) can be associated with multiple tree nodes. In other words, a concept can be used in multiple trees. Thus, the next step is to merge the trees according to the usage of the concepts in the trees. The figure below presents the merging process. First, the 'mappings' between the tree nodes are identified (based on the MeSH specification) [A]. Second, the trees are merged according to the identified mappings [B]. Finally, the trees are rooted by a fictive root in order to obtain a global hierarchy.
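The merging process can be sketched as follows. This is a minimal illustration and not the SML implementation: the descriptors X, Y and Z, their tree numbers, and the ROOT label are hypothetical toy data, and the sketch assumes that the parent of every dotted tree number is present in the input.

```python
from collections import defaultdict

# Hypothetical toy data (not real MeSH content): descriptor -> tree numbers.
TREE_NUMBERS = {
    "X": ["A01"],
    "Y": ["B01"],
    "Z": ["A01.10", "B01.20"],  # Z is used in both tree A and tree B
}

ROOT = "ROOT"  # fictive root subsuming every tree (hypothetical label)


def merge_trees(tree_numbers):
    """Return {concept: set of direct parents} for the merged graph."""
    # Step A: map every tree node back to the descriptor that owns it.
    owner = {tn: d for d, tns in tree_numbers.items() for tn in tns}
    parents = defaultdict(set)
    for descriptor, tns in tree_numbers.items():
        for tn in tns:
            if "." in tn:
                # Step B: the descriptor owning the parent tree node
                # becomes a parent concept in the merged graph.
                parents[descriptor].add(owner[tn.rsplit(".", 1)[0]])
            else:
                # Top-level node: attach it to its tree root (category
                # letter), and attach that tree root to the fictive root.
                tree_root = tn[0]
                parents[descriptor].add(tree_root)
                parents[tree_root].add(ROOT)
    return dict(parents)


graph = merge_trees(TREE_NUMBERS)
print(sorted(graph["Z"]))
# ['X', 'Y']
```

After merging, Z has one parent per tree in which it appears, so the result is a graph rather than a tree.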

It is important to understand that MeSH provides a hierarchical organization of concepts within each specific tree. However, to compute semantic similarity measures based on the hierarchical ordering of concepts, the trees must be merged in order to obtain a rooted Directed Acyclic Graph (rDAG, figure above [B]). To be an rDAG, the graph obtained by merging the various trees must not contain cycles. However, due to some choices in the definition of the individual trees, cycles are introduced during the merging phase. Indeed, MeSH is a thesaurus, and its concepts (descriptors) are organized in hierarchies using broader/narrower relationships, as expressed e.g. in SKOS. Such hierarchies are not transitive, i.e. ''this means that their semantics do not support inferences of the type: if "animals" is broader than "mammals" and "mammals" is broader than "cats", then "animals" is broader than "cats"'' (source). Since most semantic similarity measures rely on the hierarchical ordering of the concepts, the MeSH graph obtained by merging the trees must be modified to ensure that it is cycle-free before it can be used to compare MeSH concepts.

Notice that the original specification should be modified as little as possible. However, since most semantic similarity measures require an rDAG (and numerous scientific contributions are based on them), we detail how to obtain a MeSH graph which is an rDAG.
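Whether a merged graph is cycle-free can be checked with a standard topological-sort test. The sketch below uses Kahn's algorithm on a child-to-parents mapping; it is a generic illustration rather than the check performed by the SML, and the two small graphs are hypothetical.

```python
from collections import deque


def is_acyclic(parents):
    """Kahn's algorithm on a {child: set of parents} mapping."""
    nodes = set(parents)
    for ps in parents.values():
        nodes |= set(ps)
    # Number of parents not yet processed for each node.
    remaining = {n: len(parents.get(n, ())) for n in nodes}
    children = {n: [] for n in nodes}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    # Start from the nodes without parents (the roots).
    queue = deque(n for n in nodes if remaining[n] == 0)
    processed = 0
    while queue:
        node = queue.popleft()
        processed += 1
        for child in children[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                queue.append(child)
    # Nodes on a cycle never reach zero remaining parents.
    return processed == len(nodes)


# Hypothetical examples: a small hierarchy and a two-node cycle.
hierarchy = {"cats": {"mammals"}, "mammals": {"animals"}}
conflict = {"a": {"b"}, "b": {"a"}}
print(is_acyclic(hierarchy), is_acyclic(conflict))
# True False
```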

In the 2013 MeSH version the cycles are due to conflicts between:

  • Ethics / Morals concepts, i.e. D004989 / D009014
  • Hydroxybutyrates / 3-Hydroxybutyric Acid concepts, i.e. D006885 / D020155

Ethics and Morals concept definitions introduce a cycle. Indeed, according to the XML format of the MeSH:

Ethics: D004989
<DescriptorRecord DescriptorClass = "1">
  <DescriptorUI>D004989</DescriptorUI>
  <DescriptorName>
   <String>Ethics</String>
  </DescriptorName>
  ...
  <TreeNumberList>
   <TreeNumber>F01.829.500.519</TreeNumber>
   <TreeNumber>K01.316</TreeNumber>
   <TreeNumber>K01.752.256</TreeNumber>
   <TreeNumber>N05.350</TreeNumber>
  </TreeNumberList>
  ...
Morals: D009014
  <DescriptorRecord DescriptorClass = "1">
  <DescriptorUI>D009014</DescriptorUI>
  <DescriptorName>
   <String>Morals</String>
  </DescriptorName>
  ...
  <TreeNumberList>
   <TreeNumber>F01.829.500</TreeNumber>
   <TreeNumber>K01.316.630</TreeNumber>
   <TreeNumber>K01.752.256.547</TreeNumber>
  </TreeNumberList>
  ...

Ethics is associated with, among others, the tree nodes F01.829.500.519 and K01.316. Morals is associated with, among others, F01.829.500 and K01.316.630.
According to F01.829.500.519 (Ethics) and F01.829.500 (Morals), MeSH Tree F defines Ethics as subsumed by (narrower than) Morals. However, according to K01.316 (Ethics) and K01.316.630 (Morals), MeSH Tree K defines the opposite, i.e. Morals is subsumed by (narrower than) Ethics.
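The conflict can be reproduced directly from the tree numbers listed in the two records above. The following sketch is illustrative only (the function name is hypothetical): it reports every pair of descriptors whose tree nodes are in a direct parent/child position.

```python
# Tree numbers copied from the two descriptor records above.
TREE_NUMBERS = {
    "D004989": ["F01.829.500.519", "K01.316", "K01.752.256", "N05.350"],  # Ethics
    "D009014": ["F01.829.500", "K01.316.630", "K01.752.256.547"],  # Morals
}


def narrower_pairs(tree_numbers):
    """Return (narrower, broader, tree node) triples implied by the tree numbers."""
    owner = {tn: d for d, tns in tree_numbers.items() for tn in tns}
    pairs = []
    for descriptor, tns in tree_numbers.items():
        for tn in tns:
            # Parent tree node: everything before the last dot
            # (empty string for a top-level node).
            parent_tn = tn.rpartition(".")[0]
            broader = owner.get(parent_tn)
            if broader is not None:
                pairs.append((descriptor, broader, tn))
    return pairs


for narrower, broader, tn in narrower_pairs(TREE_NUMBERS):
    print(f"{narrower} is narrower than {broader} (tree node {tn})")
```

On these tree numbers, one pair in Tree F places Ethics under Morals, while two pairs in Tree K (K01.316.630 and K01.752.256.547) place Morals under Ethics.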

Hydroxybutyrates: D006885
  <DescriptorRecord DescriptorClass = "1">
  <DescriptorUI>D006885</DescriptorUI>
  <DescriptorName>
   <String>Hydroxybutyrates</String>
  </DescriptorName>
  ...
  <TreeNumberList>
   <TreeNumber>D02.241.081.114.937</TreeNumber>
   <TreeNumber>D02.241.511.400.410</TreeNumber>
   <TreeNumber>D10.251.400.143.781</TreeNumber>
  </TreeNumberList>
  ...
3-Hydroxybutyric Acid: D020155
<DescriptorRecord DescriptorClass = "1">
  <DescriptorUI>D020155</DescriptorUI>
  <DescriptorName>
   <String>3-Hydroxybutyric Acid</String>
  </DescriptorName>
  ...
  <TreeNumberList>
   <TreeNumber>D02.241.081.114.937.349</TreeNumber>
   <TreeNumber>D02.241.511.400</TreeNumber>
   <TreeNumber>D02.522.585.087</TreeNumber>
   <TreeNumber>D10.251.400.143.781.500</TreeNumber>
  </TreeNumberList>
  ...

Hydroxybutyrates is associated with, among others, the tree nodes D02.241.081.114.937 and D02.241.511.400.410. 3-Hydroxybutyric Acid is associated with, among others, D02.241.081.114.937.349 and D02.241.511.400.
According to D02.241.081.114.937 (Hydroxybutyrates) and D02.241.081.114.937.349 (3-Hydroxybutyric Acid), MeSH Tree D defines 3-Hydroxybutyric Acid as subsumed by (narrower than) Hydroxybutyrates. However, the pair of tree nodes D02.241.511.400.410 (Hydroxybutyrates) and D02.241.511.400 (3-Hydroxybutyric Acid) expresses the opposite.

In order to remove those cycles, we delete some relationships so that Morals subsumes Ethics and Hydroxybutyrates subsumes 3-Hydroxybutyric Acid. This process is done programmatically. Examples can be found in the MeSH section of the SML examples.
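The effect of this deletion can be illustrated on the four conflicting relationships alone. The sketch below is not the SML code: the edge representation as (narrower, broader) pairs and the helper name are assumptions made for the example.

```python
def reciprocal_pairs(edges):
    """Return the pairs of concepts linked in both directions (two-node cycles)."""
    return {frozenset(e) for e in edges if (e[1], e[0]) in edges}


# Edges are (narrower, broader) pairs between descriptor identifiers.
EDGES = {
    ("D004989", "D009014"),  # Ethics narrower than Morals
    ("D009014", "D004989"),  # Morals narrower than Ethics (conflict)
    ("D020155", "D006885"),  # 3-Hydroxybutyric Acid narrower than Hydroxybutyrates
    ("D006885", "D020155"),  # Hydroxybutyrates narrower than 3-Hydroxybutyric Acid (conflict)
}

# Relationships dropped so that Morals subsumes Ethics and
# Hydroxybutyrates subsumes 3-Hydroxybutyric Acid.
DROPPED = {("D009014", "D004989"), ("D006885", "D020155")}

cleaned = EDGES - DROPPED
print(len(reciprocal_pairs(EDGES)), len(reciprocal_pairs(cleaned)))
# 2 0
```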

MeSH Loading using the SML

XML Loader

The SML provides a loader to build the MeSH graph from the XML version of MeSH. The loader was developed for the 2013 version; compatibility with prior versions is not currently supported.
For each tree, a vertex is created as the tree root. Moreover, during the tree merging process, the virtual root of the graph (i.e. the vertex subsuming each tree) is defined as http://www.w3.org/2002/07/owl#Thing. The resulting MeSH graph corresponds to the graph presented in the figure above ([B]). Only the concept hierarchy is loaded. The concepts are linked by rdfs:subClassOf relationships. All the vertices are given the type CLASS in order to be processed by the semantic measures engine. The URI of a concept is built from the URI associated with the graph, in the form graphURI + DescriptorUI.
Source code example.
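Independently of the SML loader, the information it relies on can be read with a few lines of Python. The sketch below is an illustration only: it assumes that the records are wrapped in a DescriptorRecordSet element, it uses a completed copy of the Morals record shown earlier (the elided parts are simply left out), and the graph URI is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET

# Completed copy of the Morals record shown earlier; the enclosing
# DescriptorRecordSet element is an assumption about the file layout.
XML = """<DescriptorRecordSet>
 <DescriptorRecord DescriptorClass="1">
  <DescriptorUI>D009014</DescriptorUI>
  <DescriptorName><String>Morals</String></DescriptorName>
  <TreeNumberList>
   <TreeNumber>F01.829.500</TreeNumber>
   <TreeNumber>K01.316.630</TreeNumber>
   <TreeNumber>K01.752.256.547</TreeNumber>
  </TreeNumberList>
 </DescriptorRecord>
</DescriptorRecordSet>"""

GRAPH_URI = "http://example.org/mesh/"  # hypothetical graph URI


def read_descriptors(xml_text, graph_uri):
    """Return {concept URI: (name, tree numbers)} for every descriptor record."""
    root = ET.fromstring(xml_text)
    descriptors = {}
    for record in root.iter("DescriptorRecord"):
        # findtext only looks at direct children here, so a DescriptorUI
        # nested deeper in the record is not picked up by mistake.
        ui = record.findtext("DescriptorUI")
        name = record.findtext("DescriptorName/String")
        tree_numbers = [
            tn.text for tn in record.findall("TreeNumberList/TreeNumber")
        ]
        # Concept URI in the form graphURI + DescriptorUI.
        descriptors[graph_uri + ui] = (name, tree_numbers)
    return descriptors


descriptors = read_descriptors(XML, GRAPH_URI)
print(descriptors)
```

For a full MeSH file, an incremental parser such as xml.etree.ElementTree.iterparse would avoid loading the whole document into memory.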

SKOS Loader

The MeSH is not officially distributed in SKOS. However, some tools have been developed to generate a SKOS format of the MeSH (MeshToSKOS, another tool).
This section describes how to use those SKOS MeSH versions in the SML.

The SML contains various classes and source code to process classes that are structured taxonomically. In SKOS, concepts are not classes but instances of type skos:Concept. Moreover, the concepts are not interlinked by rdfs:subClassOf relationships; instead, skos:broader and skos:narrower are used to define the concept hierarchy. Notice that skos:broader and skos:narrower are not transitive relationships. However, if the graph induced by these relationships is a Directed Acyclic Graph, we can tune the library in order to use the Semantic Measures Engine on SKOS hierarchies. This approach is adopted to use semantic similarity measures based on the MeSH concept hierarchy.

Notice that the SKOS format of MeSH is loaded using an RDF loader (e.g. the RDF/XML loader). Concepts are instances of the class skos:Concept. They are typed as CLASS in order to use the Semantic Measures Engine (originally defined to compare classes). Notice also that the MeSH graph will not contain the tree roots (e.g. the concept Diseases [C] in Tree C). Indeed, those concepts are not specified in MeSH (no descriptors are associated with them). To create those tree roots, we must know the tree numbers associated with each concept, which is not the case in MeSH expressed in SKOS. The virtual root must also be created programmatically.
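The conversion from SKOS relationships to a taxonomy can be sketched without any RDF library. This is an illustration under stated assumptions, not the SML procedure: triples are plain tuples using prefixed names, the ex: concepts are hypothetical, and every concept left without a broader concept is attached directly to the virtual root, since the tree roots cannot be recovered.

```python
BROADER = "skos:broader"
NARROWER = "skos:narrower"
VIRTUAL_ROOT = "http://www.w3.org/2002/07/owl#Thing"


def skos_to_taxonomy(triples):
    """Return {concept: set of parents}, rooted by the virtual root."""
    parents = {}
    for subject, predicate, obj in triples:
        if predicate == BROADER:
            # "s skos:broader o": o is broader, so s is placed under o.
            parents.setdefault(subject, set()).add(obj)
            parents.setdefault(obj, set())
        elif predicate == NARROWER:
            # "s skos:narrower o": o is narrower, so o is placed under s.
            parents.setdefault(obj, set()).add(subject)
            parents.setdefault(subject, set())
    # Concepts without any broader concept are attached to the virtual root.
    for concept_parents in parents.values():
        if not concept_parents:
            concept_parents.add(VIRTUAL_ROOT)
    return parents


# Hypothetical triples, written with prefixed names for brevity.
TRIPLES = [
    ("ex:cats", "skos:broader", "ex:mammals"),
    ("ex:animals", "skos:narrower", "ex:mammals"),
]

taxonomy = skos_to_taxonomy(TRIPLES)
print(taxonomy)
```

Concepts that take part in no skos:broader or skos:narrower statement are ignored by this sketch; the result should still be checked for cycles before similarity measures are computed on it.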