Optimal Computation of Overabundant Words

Authors: Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, Dimitris Polychronopoulos

  • 14 pages


Cite As

Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. Optimal Computation of Overabundant Words. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 4:1-4:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.
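The statistical model in the abstract can be illustrated with a brute-force sketch. It assumes the usual form of the Brendel et al. definitions, which the abstract itself does not spell out: for a word w of length m > 2, the expectation is E(w) = f(prefix) · f(suffix) / f(infix), the deviation is dev(w) = (f(w) − E(w)) / max(√E(w), 1), and w is overabundant when dev(w) ≥ ρ for a chosen threshold ρ > 0. The function names and the `max_len` parameter are hypothetical. This is a naive enumeration of factors, not the O(n)-time suffix-tree algorithm of the paper.

```python
from collections import Counter
from math import sqrt


def factor_counts(x, max_len):
    """Count occurrences (overlaps included) of every factor of x
    whose length is at most max_len. Naive, not linear time."""
    counts = Counter()
    n = len(x)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            counts[x[i:j]] += 1
    return counts


def deviation(w, counts):
    """Deviation of w under the assumed model:
    E(w) = f(prefix) * f(suffix) / f(infix), or 0 if f(infix) = 0,
    dev(w) = (f(w) - E(w)) / max(sqrt(E(w)), 1)."""
    prefix = counts[w[:-1]]   # longest proper prefix
    suffix = counts[w[1:]]    # longest proper suffix
    infix = counts[w[1:-1]]   # longest infix
    expected = prefix * suffix / infix if infix > 0 else 0.0
    return (counts[w] - expected) / max(sqrt(expected), 1.0)


def overabundant_words(x, rho, max_len=None):
    """Return, in sorted order, the factors of x of length > 2
    whose deviation is at least rho (assumed rho > 0)."""
    if max_len is None:
        max_len = len(x)
    counts = factor_counts(x, max_len)
    return sorted(
        w for w in counts
        if len(w) > 2 and deviation(w, counts) >= rho
    )
```

For example, in x = "ABCBBBBBBB" the word "ABC" occurs once, its prefix "AB" and suffix "BC" occur once each, and its infix "B" occurs 8 times, so E = 1/8 and dev = 0.875. With ρ = 0.5 the word is reported as overabundant, whereas "BBB" (dev ≈ 0.24) is not.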
Keywords

  • overabundant words
  • avoided words
  • suffix tree
  • DNA sequence analysis




References

  1. Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology, 12(1):5, 2017.
  2. Alberto Apostolico, Mary Ellen Bock, and Stefano Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 10(3-4):283-311, 2003.
  3. Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, and Xuyan Xu. Efficient detection of unusual words. Journal of Computational Biology, 7(1-2):71-94, 2000.
  4. Alberto Apostolico, Fang-Cheng Gong, and Stefano Lonardi. Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology, 19(1):22-41, 2004.
  5. Djamal Belazzougui and Fabio Cunial. Space-efficient detection of unusual words. In SPIRE, volume 9309 of LNCS, pages 222-233. Springer, 2015.
  6. Volker Brendel, Jacques S Beckmann, and Edward N Trifonov. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. Journal of Biomolecular Structure and Dynamics, 4(1):11-21, 1986.
  7. Chris Burge, Allan M. Campbell, and Samuel Karlin. Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci USA, 89(4):1358-1362, 1992.
  8. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. 2007.
  9. Alain Denise, Mireille Régnier, and Mathias Vandenbogaert. Assessing the statistical significance of overrepresented oligonucleotides. In WABI, volume 2149 of LNCS, pages 85-97. Springer Berlin Heidelberg, 2001.
  10. Martin Farach. Optimal suffix tree construction with large alphabets. In FOCS, pages 137-143. IEEE, 1997.
  11. Mikhail S. Gelfand and Eugene V. Koonin. Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes. Nucleic Acids Research, 25(12):2430-2439, 1997.
  12. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In SEA, volume 8504 of LNCS, pages 326-337. Springer, 2014.
  13. Nathan Harmston, Anja Barešić, and Boris Lenhard. The mystery of extreme non-coding conservation. Phil. Trans. R. Soc. B, 368(1632):20130021, 2013.
  14. Suzanne E. Hile and Kristin A. Eckert. Positive correlation between DNA polymerase α-primase pausing and mutagenesis within polypyrimidine/polypurine microsatellite sequences. Journal of Molecular Biology, 335(3):745-759, 2004.
  15. G. Levinson and G. A. Gutman. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Molecular Biology and Evolution, 4(3):203-221, 1987.
  16. Dimitris Polychronopoulos, Diamantis Sellis, and Yannis Almirantis. Conserved noncoding elements follow power-law-like distributions in several genomes as a result of genome dynamics. PloS One, 9(5):e95437, 2014.
  17. Dimitris Polychronopoulos, Emanuel Weitschek, Slavica Dimitrieva, Philipp Bucher, Giovanni Felici, and Yannis Almirantis. Classification of selectively constrained DNA elements using feature vectors and rule-based classifiers. Genomics, 104(2):79-86, 2014.
  18. Ivan Rusinov, Anna Ershova, Anna Karyagina, Sergey Spirin, and Andrei Alexeevski. Lifespan of restriction-modification systems critically affects avoidance of their recognition sites in host genomes. BMC Genomics, 16(1):1, 2015.