Toolkit tutorial: Gene Ontology

Please refer to the general SML-Toolkit documentation before starting the tutorial.

This tutorial shows how to use the SML-ToolKit to compute the semantic similarity of human genes according to their annotations to the Gene Ontology. Only Biological Process (BP) annotations are considered here, i.e. semantic similarity is evaluated solely from the biological processes the genes are involved in.

Prerequisites

We first cover the prerequisites for using the SML-ToolKit and for preparing the workspace and data used in this tutorial.

Java

The SML-ToolKit requires Java. Java is widely used and is therefore already installed on most computers. However, make sure that Java version 1.7 or later is installed. You can check your Java installation (availability and version) on the Java website.
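
The version requirement can also be checked locally. The following Python sketch is not part of the SML-ToolKit; the function names are hypothetical. It assumes that the java executable is on your PATH and that `java -version` reports a string of the form version "1.7.0_80" (legacy numbering) or version "11.0.2" (recent numbering).

```python
import re
import subprocess

def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) else 0
    # Legacy scheme: "1.7.0_80" means Java 7; recent scheme: "11.0.2" means Java 11.
    return second if first == 1 else first

def java_is_recent_enough(minimum_major=7):
    """Run `java -version` and compare the result with the required major version."""
    try:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except OSError:
        return False  # no usable `java` executable on the PATH
    major = parse_java_major(result.stderr + result.stdout)
    return major is not None and major >= minimum_major

if __name__ == "__main__":
    print("Java 1.7 or later available:", java_is_recent_enough())
```

Running `java -version` by hand in a terminal gives the same information.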

SML-ToolKit

Download the latest version of the Semantic Measure Library here.

Data

Download the data we will use in this tutorial:

You can also reproduce the following tutorial using an up-to-date version of the data:

Workspace preparation

Unzip the tutorial archive (see above) into a directory named 'workspace'. We refer to this directory as the workspace in the rest of this tutorial.

We also assume that command lines are executed from the workspace.
Move the SML-ToolKit into the workspace directory.
The workspace directory must have the following structure:

your_workspace/
	sml-toolkit-<version>.jar
	go_tutorial/
		data/
			gene_ontology_ext.obo
			gene_association.goa_human
			queries.csv
			results/
		conf/
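
Before going further, it can be useful to verify that the workspace matches the structure listed above. The sketch below is a hypothetical helper, not a feature of the toolkit; it assumes data/ and conf/ sit directly under go_tutorial/ and that the script is run from the workspace directory.

```python
from pathlib import Path

# Relative paths taken from the workspace structure listed above. The toolkit
# jar is checked separately because its name depends on the version.
EXPECTED = [
    "go_tutorial/data/gene_ontology_ext.obo",
    "go_tutorial/data/gene_association.goa_human",
    "go_tutorial/data/queries.csv",
    "go_tutorial/data/results",
    "go_tutorial/conf",
]

def missing_entries(workspace):
    """Return the expected files and directories that are absent from workspace."""
    root = Path(workspace)
    missing = [rel for rel in EXPECTED if not (root / rel).exists()]
    if not list(root.glob("sml-toolkit-*.jar")):
        missing.append("sml-toolkit-<version>.jar")
    return missing

if __name__ == "__main__":
    for entry in missing_entries("."):
        print("missing:", entry)
```

An empty output means that every expected entry was found.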

Command line execution

The SML-ToolKit is used through a command-line interface.
Launch a terminal/console and move into the workspace directory.
Note that you only need to know how to navigate your directory structure (cd to change directory; dir on Windows or ls on Linux and Mac to list the content of a directory).
Documentation: Linux, Windows, Mac.

Execute the tool used to compute Semantic Measures

In the workspace, i.e. where the toolkit sml-toolkit-<version>.jar is located, execute the command line below. Replace <version> with the version of the toolkit, e.g. 0.0.3.

	java -jar sml-toolkit-<version>.jar -t sm 

The tool reports an error asking you to provide an XML configuration file.
As explained in the general documentation, the SML-ToolKit must be configured through an XML file. We will create such a configuration file in the next section. Note that, despite the error, this command also prints some information, such as the version of the tool and the developer to contact if you encounter a problem using the tool.

XML configuration

Note that the way paths are written depends on the operating system you use (e.g. Linux, Windows, Mac). This tutorial was written on a Linux-based distribution, so paths use the '/' separator. Replace it with the separator used by your operating system, e.g. Windows paths use '\' as separator and "sm/go_tutorial/conf/" must therefore be replaced by "sm\go_tutorial\conf\".

The configuration file will be created in the directory workspace/go_tutorial/conf/. We name it sm_conf_human_bp.xml.
We first define a variable corresponding to the workspace directory; replace path_to_your_working_directory with the complete path of the directory you use as workspace.

We also create some variables to easily locate annotation and Gene Ontology files.
Other variables are created to facilitate the definition of the configuration file.

We also add a tag defining the global options of the SML-ToolKit. The number of threads is set to its default value, i.e. 1.



We then define information related to the Gene Ontology.

Since the Gene Ontology (GO) is expressed in OBO format, we use the OBO parser.

Moreover, because namespaces are not always specified in OBO specifications, we manually define the namespace associated to the GO prefix.
This prefix is encountered in the GO specification because GO terms are identified by unique ids following the pattern GO:number.
The parser converts all GO term ids using the specified namespace, i.e. GO:0008150 will be loaded as {URI_GO}0008150 which, given the value of the variable {URI_GO}, yields a URI of the form http://biograph/go/0008150.
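
The id-to-URI conversion described above can be illustrated with a few lines of Python. This is only a sketch of the mapping, not the toolkit's parser; the function name go_id_to_uri is hypothetical, and the namespace value is the one given in this tutorial for {URI_GO}.

```python
URI_GO = "http://biograph/go/"  # value of the {URI_GO} variable used in this tutorial

def go_id_to_uri(go_id, namespace=URI_GO):
    """Map an id following the pattern GO:number to a URI in the given namespace."""
    prefix, sep, number = go_id.partition(":")
    if prefix != "GO" or not sep or not number.isdigit():
        raise ValueError(f"not a GO identifier: {go_id!r}")
    # The GO: prefix is replaced by the namespace; the number is kept as is.
    return namespace + number

print(go_id_to_uri("GO:0008150"))  # http://biograph/go/0008150
```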

We also specify:

  • a URI associated to the graph, here set as {URI_GRAPH},
  • the file corresponding to the GO,
  • the file corresponding to the UniprotKB proteins and GO annotations. Proteins will be loaded with the URI prefix http://biograph/uniprotkb/,
  • the root considered (a reduction is performed in order to only consider the biological process GO subontology).
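
To give an intuition of the reduction to a root, the sketch below keeps only the terms that reach a chosen root by following parent links. It is an illustration on a tiny hand-written graph, not the toolkit's implementation: the edges are a toy excerpt typed by hand rather than loaded from the OBO file, and the function name is hypothetical.

```python
from collections import defaultdict

def descendants_inclusive(parents_of, root):
    """Return root plus every term that reaches root by following parent links."""
    children_of = defaultdict(set)
    for child, parents in parents_of.items():
        for parent in parents:
            children_of[parent].add(child)
    kept, stack = {root}, [root]
    while stack:
        term = stack.pop()
        for child in children_of[term]:
            if child not in kept:
                kept.add(child)
                stack.append(child)
    return kept

# Toy child -> parents edges; illustrative only, not read from the real ontology.
PARENTS = {
    "GO:0008152": {"GO:0008150"},
    "GO:0009987": {"GO:0008150"},
    "GO:0044237": {"GO:0008152", "GO:0009987"},
    "GO:0003824": {"GO:0003674"},  # hangs under another root, so it is dropped
}

bp_terms = descendants_inclusive(PARENTS, "GO:0008150")
print(sorted(bp_terms))
```

Here GO:0008150 is the biological process root mentioned earlier in this tutorial; terms attached to another root are left out of the result.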


The current annotation loading does not apply any restriction on evidence codes; this can be tuned using filters.
The knowledge base contains redundancies as no transitive reduction is performed (see the actions section in the general documentation).
We can check the validity of the configuration file by adding only the empty sml tag.

Execute the command line:

	java -jar sml-toolkit-<version>.jar -t sm -xmlconf go_tutorial/conf/sm_conf_human_bp.xml

Computing gene similarities for one measure

We define the sm tag in order to compute the similarity of entities using an indirect groupwise measure, which aggregates the scores of Resnik's pairwise measure using a Max strategy.
We first define how to compute the information content (IC) of terms.
Here we use a corpus-based information content, i.e. the normalized Resnik definition of the information content. The id of the IC will be used to refer to this IC.
We then define the Resnik pairwise measure, linking it to the IC we want to use, i.e. icCorpus.
Finally, we define an indirect groupwise measure, which specifies how the pairwise scores are aggregated. We select the MAX approach.
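
The three steps above (IC, pairwise measure, groupwise aggregation) can be sketched on a toy example. The code below is an illustration of the general definitions, not the toolkit's code: the ontology, the entities and all names are hypothetical, and the IC is left unnormalized (IC(t) = -log p(t)) because the exact normalization applied by the toolkit is not reproduced here.

```python
import math

# Toy ontology: term -> set of its ancestors, the term itself included (hypothetical ids).
ANCESTORS = {
    "root": {"root"},
    "a": {"a", "root"},
    "b": {"b", "root"},
    "a1": {"a1", "a", "root"},
    "a2": {"a2", "a", "root"},
}

# Toy corpus: entity -> directly annotated terms (hypothetical entities).
ANNOTATIONS = {
    "g1": {"a1"},
    "g2": {"a2"},
    "g3": {"b"},
    "g4": {"a1", "b"},
}

def corpus_ic(annotations, ancestors):
    """IC(t) = -log(p(t)); p(t) is the share of entities annotated to t or a descendant."""
    total = len(annotations)
    counts = dict.fromkeys(ancestors, 0)
    for terms in annotations.values():
        # An annotation to a term also counts for all of its ancestors.
        implied = set().union(*(ancestors[t] for t in terms))
        for term in implied:
            counts[term] += 1
    return {t: -math.log(n / total) for t, n in counts.items() if n > 0}

def resnik(term_a, term_b, ic, ancestors):
    """Pairwise Resnik score: IC of the most informative common ancestor."""
    common = ancestors[term_a] & ancestors[term_b]
    return max((ic[t] for t in common if t in ic), default=0.0)

def groupwise_max(terms_a, terms_b, ic, ancestors):
    """Indirect groupwise score: maximum over all pairwise scores."""
    return max(resnik(a, b, ic, ancestors) for a in terms_a for b in terms_b)

ic = corpus_ic(ANNOTATIONS, ANCESTORS)
score = groupwise_max(ANNOTATIONS["g1"], ANNOTATIONS["g4"], ic, ANCESTORS)
print(round(score, 4))  # 0.6931
```

In this toy corpus g1 and g4 share the term a1, which annotates two of the four entities, so the best pairwise score is -log(2/4).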

We have configured the data, defining the knowledge to take into account, and the measure we want to use.
We also express our queries, i.e. the pairs of genes we want to compare, using the queries tag.
The queries are defined in the file /go_tutorial/data/queries/queries.csv, which contains pairs of entities in a tabular format (see file example below).
For each row, a particular query will be performed.

P16591	Q00839
Q00839	E2QRD5
E2QRD5	Q9H9B1
Q9H9B1	Q9H9B4
Q9H9B4	B0QZK4
B0QZK4	E2QRC0
E2QRC0	Q9H9A7
Q9H9A7	Q9H9A5
Q9H9A5	E2QRB9
E2QRB9	A8MUM7
A8MUM7	E2QRB3
E2QRB3	Q5BJH2
...
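
To inspect such a query file outside the toolkit, a short Python sketch can read the pairs. It assumes the two identifiers of a row are separated by a tab, as in the excerpt above; the function name read_queries is hypothetical and the sample data is embedded for illustration.

```python
import csv
import io

def read_queries(handle):
    """Yield (entity_a, entity_b) pairs from a tab-separated file, skipping incomplete rows."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue
        yield row[0].strip(), row[1].strip()

# Two rows copied from the excerpt above; a real run would open the query file instead.
sample = io.StringIO("P16591\tQ00839\nQ00839\tE2QRD5\n")
print(list(read_queries(sample)))  # [('P16591', 'Q00839'), ('Q00839', 'E2QRD5')]
```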

The final configuration file is (replace the workspace path, i.e. {HOME}, with your own):



Finally, the execution can be launched:
	java -jar sml-toolkit-<version>.jar -t sm -xmlconf go_tutorial/conf/sm_conf_human_bp.xml
As specified in the configuration, results are generated in {HOME}/go_tutorial/data/results/queries_results.csv

Computing gene similarities for several measures

To compute multiple similarity scores in the same run, you can add measures to the configuration file; the tool will handle them automatically (replace the workspace path, i.e. {HOME}, with your own).