Percolation of annotation errors through hierarchically structured protein sequence databases

Walter R. Gilks*, Benjamin Audit, Daniela De Angelis, Sophia Tsoka, Christos A. Ouzounis

*Corresponding author for this work

    Research output: Contribution to journal, Article, peer-review

    54 Citations (Scopus)


    Databases of protein sequences have grown rapidly in recent years as a result of genome sequencing projects. Annotating protein sequences with descriptions of their biological function ideally requires careful experimentation, but this work lags far behind. Instead, biological function is often imputed by copying annotations from similar protein sequences. This gives rise to annotation errors and, more seriously, to chains of misannotation. An earlier paper, "Percolation of annotation errors in a database of protein sequences" (2002), developed a probabilistic framework for exploring the consequences of this percolation of errors through protein databases and applied it to a simple database model. Here we apply the theory to hierarchically structured protein sequence databases, and draw conclusions about database quality at different levels of the hierarchy.

    Original language: English
    Pages (from-to): 223-234
    Number of pages: 12
    Journal: Mathematical Biosciences
    Issue number: 2
    Publication status: Published - Feb 2005


    • Annotation errors
    • Biological function
    • Database quality
    • Hierarchical classification
    • Homology
    • Probability model
    • Protein database
    • Protein sequence

