Modeling the percolation of annotation errors in a database of protein sequences

Walter R. Gilks*, Benjamin Audit, Daniela De Angelis, Sophia Tsoka, Christos A. Ouzounis

*Corresponding author for this work

    Research output: Contribution to journal, Article, peer-review

    132 Citations (Scopus)


    Public sequence databases contain information on the sequence, structure and function of proteins. Genome sequencing projects have led to a rapid increase in protein sequence information, but reliable, experimentally verified information on protein function lags far behind. To address this deficit, functional annotation in protein databases is often inferred by sequence similarity to homologous, annotated proteins, with the attendant possibility of error. The functional annotation of these homologous proteins may itself have been acquired through sequence similarity to yet other proteins, and it is generally not possible to determine how the functional annotation of any given protein has been acquired. Thus the possibility of chains of misannotation arises, a process we term 'error percolation'. With some simple assumptions, we develop a dynamical probabilistic model for these misannotation chains. Exploring the consequences of the model for annotation quality shows that this iterative approach leads to a systematic deterioration of database quality.
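The idea of error percolation can be illustrated with a toy simulation. The sketch below is not the model developed in the article: the growth rule, the function name `simulate_error_percolation`, and the parameters `p_experimental` and `p_transfer_error` are all assumptions made here for illustration. It grows a database one entry at a time; each new entry is either experimentally verified (assumed correct) or copies its annotation from a randomly chosen existing entry, inheriting any error already present and possibly introducing a new one.

```python
import random


def simulate_error_percolation(n_entries, p_experimental, p_transfer_error, seed=0):
    """Toy illustration only (hypothetical rules, not the article's model).

    Returns a list whose i-th element is the fraction of erroneous
    annotations once the database holds i + 1 entries.
    """
    rng = random.Random(seed)
    correct = [True]   # assumed: the first entry is experimentally verified
    n_wrong = 0
    error_fraction = [0.0]

    for _ in range(n_entries - 1):
        if rng.random() < p_experimental:
            # Assumption: experimental annotation is always correct.
            is_correct = True
        else:
            # Annotation transferred from a randomly chosen existing entry.
            source_is_correct = rng.choice(correct)
            # A wrong source always yields a wrong copy; a correct source
            # yields a wrong copy with probability p_transfer_error.
            is_correct = source_is_correct and rng.random() >= p_transfer_error
        correct.append(is_correct)
        if not is_correct:
            n_wrong += 1
        error_fraction.append(n_wrong / len(correct))

    return error_fraction


if __name__ == "__main__":
    # Illustrative parameter values, chosen arbitrarily.
    trajectory = simulate_error_percolation(
        n_entries=10_000, p_experimental=0.05, p_transfer_error=0.02
    )
    for size in (10, 100, 1_000, 10_000):
        print(size, round(trajectory[size - 1], 3))
```

In this toy setting, an erroneous entry can never be repaired by copying, so errors accumulate along chains of transfers unless enough experimentally verified entries are added. The parameter values in the usage example are arbitrary and carry no empirical meaning.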

    Original language: English
    Pages (from-to): 1641-1649
    Number of pages: 9
    Issue number: 12
    Publication status: Published - 1 Dec 2002

    Bibliographical note

    Funding Information:
    We thank Peter Karp (SRI International) for discussions, Anton Enright for the Tribes database and members of the Computational Genomics Group for comments. BA is currently supported by a Marie Curie Fellowship of the European Community programme ‘Improving Human Research Potential and the Socio-economics Knowledge Base’ under contract number HPMF-CT-2001-01321 and ST by the Medical Research Council (UK). Additional support was provided by the European Molecular Biology Laboratory.

