Improving Semantic Similarity for Proteins based on the Gene Ontology

Pesquita, Catia

by Catia Pesquita

Abstract:

One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data that the genomic and post-genomic techniques are producing. One of the major efforts in this area was the development of the Gene Ontology (GO), a BioOntology that contains terms that describe gene products, organized in a graph structure. Gene products annotated with ontology terms can be compared according to them. This process is called semantic similarity and it is based on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or more frequently on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than the previously existing measures which focused mostly on a single aspect (graph structure or term similarity). These two novel measures were evaluated against several previously proposed measures, using two strategies: relationship with sequence similarity and correlation with family similarity. The evaluation metric in the sequence similarity studies was the resolution of the measures, i.e. the range of semantic similarity values they cover, since most measures showed the same behaviour and similar correlation values. Overall simGIC was shown to be the best performer, with both the highest resolutions in the sequence similarity evaluation and highest correlation to family similarity, while simGED obtained above average results. The influence of electronic annotations was also investigated but I found no conclusive evidence to support the general view that these are unreliable to use in semantic similarity studies.
Keywords: Semantic Similarity, BioOntologies, Gene Ontology, Genome Annotation.

Reference:

Improving Semantic Similarity for Proteins based on the Gene Ontology (Catia Pesquita), PhD thesis, 2007.

Bibtex Entry:

@phdthesis{Pesquita2008a,
abstract = {One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data that the genomic and post-genomic techniques are producing. One of the major efforts in this area was the development of the Gene Ontology (GO), a BioOntology that contains terms that describe gene products, organized in a graph structure. Gene products annotated with ontology terms can be compared according to them. This process is called semantic similarity and it is based on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or more frequently on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than the previously existing measures which focused mostly on a single aspect (graph structure or term similarity). These two novel measures were evaluated against several previously proposed measures, using two strategies: relationship with sequence similarity and correlation with family similarity. The evaluation metric in the sequence similarity studies was the resolution of the measures, i.e. the range of semantic similarity values they cover, since most measures showed the same behaviour and similar correlation values. Overall simGIC was shown to be the best performer, with both the highest resolutions in the sequence similarity evaluation and highest correlation to family similarity, while simGED obtained above average results. The influence of electronic annotations was also investigated but I found no conclusive evidence to support the general view that these are unreliable to use in semantic similarity studies.
Keywords: Semantic Similarity, BioOntologies, Gene Ontology, Genome Annotation.},
author = {Pesquita, Catia},
booktitle = {Genome},
keywords = {SML-LIB-BIBLIO,lang:ENG},
mendeley-tags = {SML-LIB-BIBLIO,lang:ENG},
title = {{Improving Semantic Similarity for Proteins based on the Gene Ontology}},
year = {2007}
}