Using GO terms to evaluate protein clustering

by Hugo Bastos, Daniel Faria, Catia Pesquita, André O. Falcão
Abstract:
Motivation: Protein sequence data is growing at an exponential rate. However, a considerable portion of this data is redundant, with many new sequences being very similar to others already in the databases. While clustering has been used to reduce this redundancy, the influence of sequence similarity on the functional quality of the clusters is still unclear. Results: In this work, we introduce a greedy graph-based clustering algorithm, which is tested on the Swiss-Prot database. We study the topology of the protein space as a function of the BLAST e-value threshold, and the functional characterization of the clusters using the Gene Ontology. Initial results suggest that the cluster centers alone can capture a large portion of the information content of the database, thereby largely reducing its redundancy. As expected, cluster functional coherence and characterization increase with the stringency of the threshold, as does the amount of information captured by the cluster centers.
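The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of one plausible greedy graph-based scheme, not the authors' algorithm. It assumes pairwise BLAST hits are available as (protein_a, protein_b, evalue) tuples, links two proteins when their e-value is at or below the threshold, and repeatedly picks the unassigned protein with the most unassigned neighbours as a cluster center. The function name greedy_cluster, the input layout, and the tie-breaking rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def greedy_cluster(hits, threshold):
    """Greedy center-based clustering of a similarity graph (illustrative only).

    hits: iterable of (protein_a, protein_b, evalue) tuples (assumed layout).
    threshold: an edge is kept when evalue <= threshold.
    Returns a dict mapping each cluster center to the set of its members.
    """
    neighbors = defaultdict(set)
    for a, b, evalue in hits:
        # Touch both keys so every protein becomes a node, even without edges.
        neighbors[a]
        neighbors[b]
        if a != b and evalue <= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    unassigned = set(neighbors)
    clusters = {}
    while unassigned:
        # Choose the node with the most unassigned neighbours;
        # iterating in sorted order makes ties resolve by name.
        center = max(sorted(unassigned),
                     key=lambda p: len(neighbors[p] & unassigned))
        members = (neighbors[center] & unassigned) | {center}
        clusters[center] = members
        unassigned -= members
    return clusters
```

Under these assumptions, a more stringent (smaller) e-value threshold removes edges from the graph, so the same input tends to split into more, smaller clusters, each represented by its center.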
Reference:
Using GO terms to evaluate protein clustering (Hugo Bastos, Daniel Faria, Catia Pesquita, André O. Falcão), 2007.
Bibtex Entry:
@misc{Bastos,
abstract = {Motivation: Protein sequence data is growing at an exponential rate. However, a considerable portion of this data is redundant, with many new sequences being very similar to others already in the databases. While clustering has been used to reduce this redundancy, the influence of sequence similarity on the functional quality of the clusters is still unclear. Results: In this work, we introduce a greedy graph-based clustering algorithm, which is tested on the Swiss-Prot database. We study the topology of the protein space as a function of the BLAST e-value threshold, and the functional characterization of the clusters using the Gene Ontology. Initial results suggest that the cluster centers alone can capture a large portion of the information content of the database, thereby largely reducing its redundancy. As expected, cluster functional coherence and characterization increase with the stringency of the threshold, as does the amount of information captured by the cluster centers.},
author = {Bastos, Hugo and Faria, Daniel and Pesquita, Catia and Falc\~{a}o, Andr\'{e} O.},
keywords = {SML-LIB-BIBLIO,lang:ENG},
mendeley-tags = {SML-LIB-BIBLIO,lang:ENG},
title = {{Using GO terms to evaluate protein clustering}},
url = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.5259},
year = {2007}
}