Optimal Computation of Overabundant Words

Almirantis, Yannis; Charalampopoulos, Panagiotis; Gao, Jia; Iliopoulos, Costas S.; Mohamed, Manal; Pissis, Solon P.; Polychronopoulos, Dimitris

doi:10.4230/LIPIcs.WABI.2017.4

Abstract

The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.
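
To make the statistical model concrete, the following is a minimal brute-force sketch, not the O(n)-time suffix-tree algorithm of the paper. The abstract does not state the formulas, so the ones used here are assumptions recalled from the cited Brendel et al. model: expectation E(w) = f(w_p) · f(w_s) / f(w_i), and deviation (f(w) − E(w)) / max(√E(w), 1), with w called overabundant when its deviation is at least a threshold rho > 0. The function names, the restriction to words of length at least 3, and the threshold parameter are all hypothetical choices for illustration.

```python
from math import sqrt


def count_occurrences(x: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of non-empty w in x."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x.startswith(w, i))


def deviation(x: str, w: str) -> float:
    """Deviation of w in x under the assumed model (see lead-in).

    Assumption: E(w) = f(prefix) * f(suffix) / f(infix), or 0 if the
    infix does not occur; dev(w) = (f(w) - E(w)) / max(sqrt(E(w)), 1).
    """
    if len(w) < 3:
        raise ValueError("this sketch only handles words of length >= 3")
    f_w = count_occurrences(x, w)
    f_prefix = count_occurrences(x, w[:-1])   # longest proper prefix
    f_suffix = count_occurrences(x, w[1:])    # longest proper suffix
    f_infix = count_occurrences(x, w[1:-1])   # longest infix
    expected = f_prefix * f_suffix / f_infix if f_infix > 0 else 0.0
    return (f_w - expected) / max(sqrt(expected), 1.0)


def overabundant_words(x: str, rho: float) -> list[str]:
    """All distinct factors of x (length >= 3) with deviation >= rho.

    Brute force: enumerates every factor, so it is far slower than the
    linear-time approach described in the abstract.
    """
    factors = {x[i:j] for i in range(len(x)) for j in range(i + 3, len(x) + 1)}
    return sorted(w for w in factors if deviation(x, w) >= rho)
```

For example, `overabundant_words(x, rho)` enumerates every factor explicitly, which is only feasible for short sequences; it is meant for checking small cases by hand, whereas the paper's result is that the same set can be computed in O(n) time and space.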

References

Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology, 12(1):5, 2017.
Alberto Apostolico, Mary Ellen Bock, and Stefano Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 10(3-4):283-311, 2003.
Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, and Xuyan Xu. Efficient detection of unusual words. Journal of Computational Biology, 7(1-2):71-94, 2000.
Alberto Apostolico, Fang-Cheng Gong, and Stefano Lonardi. Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology, 19(1):22-41, 2004.
Djamal Belazzougui and Fabio Cunial. Space-efficient detection of unusual words. In SPIRE, volume 9309 of LNCS, pages 222-233. Springer, 2015.
Volker Brendel, Jacques S Beckmann, and Edward N Trifonov. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. Journal of Biomolecular Structure and Dynamics, 4(1):11-21, 1986.
Chris Burge, Allan M. Campbell, and Samuel Karlin. Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci USA, 89(4):1358-1362, 1992.
Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. 2007.
Alain Denise, Mireille Régnier, and Mathias Vandenbogaert. Assessing the statistical significance of overrepresented oligonucleotides. In WABI, volume 2149 of LNCS, pages 85-97. Springer Berlin Heidelberg, 2001.
Martin Farach. Optimal suffix tree construction with large alphabets. In FOCS, pages 137-143. IEEE, 1997.
Mikhail S. Gelfand and Eugene V. Koonin. Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes. Nucleic Acids Research, 25(12):2430-2439, 1997.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In SEA, volume 8504 of LNCS, pages 326-337. Springer, 2014.
Nathan Harmston, Anja Barešić, and Boris Lenhard. The mystery of extreme non-coding conservation. Phil. Trans. R. Soc. B, 368(1632):20130021, 2013.
Suzanne E. Hile and Kristin A. Eckert. Positive correlation between DNA polymerase α-primase pausing and mutagenesis within polypyrimidine/polypurine microsatellite sequences. Journal of Molecular Biology, 335(3):745-759, 2004.
G. Levinson and G. A. Gutman. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Molecular Biology and Evolution, 4(3):203-221, 1987.
Dimitris Polychronopoulos, Diamantis Sellis, and Yannis Almirantis. Conserved noncoding elements follow power-law-like distributions in several genomes as a result of genome dynamics. PloS One, 9(5):e95437, 2014.
Dimitris Polychronopoulos, Emanuel Weitschek, Slavica Dimitrieva, Philipp Bucher, Giovanni Felici, and Yannis Almirantis. Classification of selectively constrained DNA elements using feature vectors and rule-based classifiers. Genomics, 104(2):79-86, 2014.
Ivan Rusinov, Anna Ershova, Anna Karyagina, Sergey Spirin, and Andrei Alexeevski. Lifespan of restriction-modification systems critically affects avoidance of their recognition sites in host genomes. BMC Genomics, 16(1):1, 2015.
