SML Performance

SML compared to other tools

The aim of this section is to evaluate the performance of the SML considering specific usage contexts and other existing domain-specific solutions.

Important: This section is dedicated to the evaluation of the SML and contains information about other existing systems and tools.
All comparisons were made rigorously, following detailed materials and methods. The source code on which the evaluations are based is open source and freely available at: https://github.com/sharispe/sm-tools-evaluation
It can be used to reproduce the results presented in this section. Absolute timings may vary with the hardware, but the ordering of the compared systems should remain the same. The source code can also be updated to integrate the latest versions of the compared tools or to integrate new tools. Contributions are welcome!

SML and the Gene Ontology

Please refer to the complete evaluation for details. The benchmark used for the evaluation can be downloaded from /downloads/evaluations/sm-tools-evaluation/resources/data/go/benchmarks.

The detailed results for each test can be consulted at https://github.com/sharispe/sm-tools-evaluation. A digest of the results is given below. 'X' means that the test was not performed (due to the performance of the tool). '!' means that the constraints were reached and that the computation failed. SML Par(4) corresponds to the SML configured with 4 threads, i.e. with parallel computation enabled on a multi-core CPU (the only change is adding -threads 4 to the standard SML command line).

Term to Term comparison

This test compares the tools on the computation of semantic similarities between pairs of terms defined in the Gene Ontology. Four tests of different sizes have been generated: 1k, 10k, 1M and 100M pairs of terms. Each test is composed of a set of pairs of terms for which the semantic similarity is to be computed.

Tool        Version  1k        10k        1M        100M
SML Par(4)  0.7      0m9.22    0m9.56     0m14.47   8m58.29
SML         0.7      0m9.23    0m9.76     0m19.55   16m30.24
FastSemSim  0.7.1    0m12.3    0m12.83    0m31.68   ! (> 6 GB memory)
GOSim       1.2.7.7  0m49.46   3m21.5     X         X
GOSemSim    1.18.0   1m34.69   16m21.34   X         X
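
To read the effect of the 4-thread configuration from the table above, the timings can be converted to seconds and compared. The following is a minimal sketch, not part of the evaluation code: the function names to_seconds and speedups are illustrative, and the only data used are the two SML rows copied from the table.

```python
import re


def to_seconds(timing: str) -> float:
    """Convert a timing written as '<minutes>m<seconds>' (e.g. '8m58.29') to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)", timing)
    if match is None:
        raise ValueError(f"unrecognised timing: {timing!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Timings copied from the term-to-term table (SML 0.7).
SIZES = ["1k", "10k", "1M", "100M"]
SML_PAR4 = ["0m9.22", "0m9.56", "0m14.47", "8m58.29"]
SML_SINGLE = ["0m9.23", "0m9.76", "0m19.55", "16m30.24"]


def speedups(single, parallel):
    """Ratio single-thread time / multi-thread time for each test size."""
    return [to_seconds(s) / to_seconds(p) for s, p in zip(single, parallel)]


if __name__ == "__main__":
    for size, ratio in zip(SIZES, speedups(SML_SINGLE, SML_PAR4)):
        print(f"{size:>5}: x{ratio:.2f}")
```

On these figures the ratio stays close to 1 for the 1k and 10k tests, and is about 1.35 for the 1M test and about 1.84 for the 100M test.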

Comparison of Gene Products

This test compares the tools on the computation of semantic similarities between pairs of gene products annotated by terms defined in the Gene Ontology. The protocol is similar to the one used for the comparison based on term similarity computation. Four tests have been designed, each composed of a set of pairs of gene products for which the semantic similarity is to be computed. Four sizes have been considered: 10k, 100k, 1M and 100M pairs of gene products.

Tool        Version  1k         10k       1M        100M
SML Par(4)  0.7      0m9.80     0m10.24   0m47.62   58m00.74
SML         0.7      0m10.01    0m11.18   1m38.87   133m27.44
FastSemSim  0.7.1    0m13.36    0m16.79   7m8.14    !
GOSim       1.2.7.7  !          !         !         !
GOSemSim    1.18.0   27m02.66   !         !         !

Versions
SML 0.7
FastSemSim 0.7.1 2012-11-28
GOSim 1.2.7.7 2012-10-09
GOSemSim 1.18.0 2013-04

SML vs DOSim package

date: 08/13

In this use case we reproduced the treatment performed in (Osborne et al., 2009), in which human genes were studied according to their annotations to the Disease Ontology (DO). The DO is an ontology structuring information related to human diseases, e.g. associated vocabulary and phenotype characteristics. The study relies on the clustering of genes according to their relationships to diseases.
The treatment can be split into two steps: first, the computation of the semantic similarity matrix containing the groupwise similarity (distance) of each pair of genes; second, the application of a clustering algorithm to the matrix. We only discuss the matrix computation, which directly involves the computation of semantic measures.
The dataset used for the experiment was based on the DO and the human gene annotations distributed with the DOSim package (version 2.2, 09-12).
The dataset contained 4054 human genes for which DO annotations are provided. As a preprocessing step, we cleaned the knowledge base by removing redundancies, i.e. 47% of the 26179 annotations available, which implied the modification of 42% of the gene annotations. In order to cluster the 4054 genes with a symmetric semantic measure, a total of about 8*10^6 evaluations were required.
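
The figure of about 8*10^6 evaluations follows from the number of unordered gene pairs: with a symmetric measure, sim(a, b) = sim(b, a), so only n(n-1)/2 pairs need to be evaluated. The short sketch below checks this arithmetic with the numbers quoted above; the constant and function names are illustrative only.

```python
from math import comb

N_GENES = 4054           # genes with DO annotations (from the text)
N_ANNOTATIONS = 26179    # annotations before cleaning (from the text)
REDUNDANT_SHARE = 0.47   # share of annotations removed as redundant (from the text)


def pairwise_evaluations(n: int, include_self: bool = False) -> int:
    """Number of evaluations needed to fill a symmetric n x n matrix."""
    pairs = comb(n, 2)  # n * (n - 1) / 2 unordered pairs
    return pairs + n if include_self else pairs


if __name__ == "__main__":
    print(pairwise_evaluations(N_GENES))            # 8215431
    print(round(N_ANNOTATIONS * REDUNDANT_SHARE))   # roughly 12304 annotations removed
```

A non-symmetric measure would require about twice as many evaluations, since both sim(a, b) and sim(b, a) would have to be computed.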

The semantic similarity selected relies on an indirect groupwise measure based on the best match average strategy. The measure proposed by Schlicker et al. was used as the pairwise measure to score the similarity between two terms (Schlicker et al., 2006). The matrix computation took 110 minutes on a dual-core CPU with 8 GB of RAM using the DOSim package, the dedicated library for the computation of semantic measures based on the DO. Thanks to both the multi-threading and the caching system supported by the SML, the same treatment was performed in under 2 minutes using the SML-Toolkit (version 0.6).
Showing highly competitive performance, the generic SML and its associated toolkit have proved to be solutions of choice for the large-scale computation of semantic measures (SMs) in highly specific usage contexts.
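
As an illustration of the best match average strategy mentioned above, the sketch below implements one common formulation: for each term of one group, keep its best pairwise score against the other group, average these best scores in each direction, then average the two directions. This is an assumption about the formulation, not the SML or DOSim implementation. The pairwise scores are made-up values for hypothetical terms t1, t2 and t3, standing in for a real pairwise measure such as Schlicker et al.'s, and lru_cache only hints at why caching pairwise scores helps when the same term pairs recur across many gene pairs.

```python
from functools import lru_cache
from statistics import mean

# Hypothetical pairwise scores between made-up terms (not real DO data).
TOY_SCORES = {
    ("t1", "t2"): 0.8,
    ("t1", "t3"): 0.2,
    ("t2", "t3"): 0.5,
}


@lru_cache(maxsize=None)
def _cached_pairwise(a: str, b: str) -> float:
    """Pairwise score for an ordered key (a <= b); results are memoised."""
    return 1.0 if a == b else TOY_SCORES[(a, b)]


def pairwise_sim(a: str, b: str) -> float:
    """Symmetric pairwise similarity: the key is sorted before the cached lookup."""
    first, second = sorted((a, b))
    return _cached_pairwise(first, second)


def best_match_average(group_a, group_b) -> float:
    """Average of the two directional means of best pairwise matches."""
    best_a = [max(pairwise_sim(a, b) for b in group_b) for a in group_a]
    best_b = [max(pairwise_sim(a, b) for a in group_a) for b in group_b]
    return (mean(best_a) + mean(best_b)) / 2


if __name__ == "__main__":
    # Two hypothetical genes annotated by {t1, t2} and {t3}.
    print(best_match_average(["t1", "t2"], ["t3"]))
```

Other variants of the strategy exist (for example dividing the sum of all best matches by the total number of terms), so the exact scores depend on the formulation chosen.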

Versions
DOSim package 2.2 02/12

Limitations

Known Limitations

No persistent storage support

The SML computes semantic measures based on an in-memory data model, i.e. the knowledge base (TBox and ABox of the ontology) is loaded in memory. This ensures rapid access to the information required to compute semantic measures, and graphs with millions of nodes/relationships can be processed depending on the amount of memory allocated to the SML. However, this can be a problem when processing very large graphs composed of hundreds of millions of relationships.
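
To make the trade-off concrete, here is a minimal sketch of what an in-memory data model implies: every relationship is held in a structure such as an adjacency index, so lookups like "all ancestors of a term" are fast, but memory use grows with the number of relationships. This is a generic illustration, not the SML's actual data structures; the triples, the relation name subClassOf and the function names are assumptions made for the example.

```python
from collections import defaultdict


def build_parent_index(triples):
    """Index (child, relation, parent) triples as child -> set of direct parents."""
    parents = defaultdict(set)
    for child, relation, parent in triples:
        if relation == "subClassOf":
            parents[child].add(parent)
    return parents


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links (term excluded)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    # Hypothetical toy taxonomy: c has parents b and d, both children of a.
    toy_triples = [
        ("c", "subClassOf", "b"),
        ("b", "subClassOf", "a"),
        ("c", "subClassOf", "d"),
        ("d", "subClassOf", "a"),
    ]
    index = build_parent_index(toy_triples)
    print(sorted(ancestors("c", index)))
```

Since the whole index lives in memory, a graph with hundreds of millions of relationships needs a correspondingly large amount of memory, which is the limitation described above.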

Partial RDF Compliance