Degenerate String Comparison and Applications

Authors Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone



PDF
Thumbnail PDF

File

LIPIcs.WABI.2018.21.pdf
  • Filesize: 0.53 MB
  • 14 pages

Document Identifiers

Author Details

Mai Alzamel
  • Department of Informatics, King’s College London, UK and Department of Computer Science, King Saud University, KSA
Lorraine A. K. Ayad
  • Department of Informatics, King’s College London, UK
Giulia Bernardini
  • Department of Informatics, Systems and Communication (DISCo), University of Milan-Bicocca, Italy
Roberto Grossi
  • Department of Computer Science, University of Pisa, Italy and ERABLE Team, INRIA, France
Costas S. Iliopoulos
  • Department of Informatics, King’s College London, UK
Nadia Pisanti
  • Department of Computer Science, University of Pisa, Italy and ERABLE Team, INRIA, France
Solon P. Pissis
  • Department of Informatics, King’s College London, UK
Giovanna Rosone
  • Department of Computer Science, University of Pisa, Italy

Cite AsGet BibTex

Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Degenerate String Comparison and Applications. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 21:1-21:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.WABI.2018.21

Abstract

A generalised degenerate string (GD string) S^ is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length k_i but this length can vary between different sets. We denote the sum of these lengths k_0, k_1,...,k_{n-1} by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N+M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an O(min{W,n^2}N)-time algorithm for computing all palindromes in S^. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in S^. Finally, proof-of-concept experimental results are presented using real protein datasets.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • degenerate strings
  • generalised degenerate strings
  • elastic-degenerate strings
  • string comparison
  • palindromes

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Karl Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039-1051, 1987. Google Scholar
  2. Michał Adamczyk, Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Jakub Radoszewski. Palindromic decompositions with gaps and errors. In CSR, volume 10304 of LNCS, pages 48-61. Springer International Publishing, 2017. Google Scholar
  3. Ali Alatabbi, Costas S. Iliopoulos, and M. Sohel Rahman. Maximal palindromic factorization. In PSC, pages 70-77, 2013. Google Scholar
  4. Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology, 12(1):5, 2017. Google Scholar
  5. Mai Alzamel, Jia Gao, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Efficient computation of palindromes in sequences with uncertainties. In EANN, volume 744 of CCIS, pages 620-629. Springer, 2017. Google Scholar
  6. Alberto Apostolico, Dany Breslauer, and Zvi Galil. Parallel detection of all palindromes in a string. Theoretical Computer Science, 141(1):163-173, 1995. Google Scholar
  7. Michael A. Bender and Martín Farach-Colton. The LCA problem revisited. In LATIN, volume 1776 of LNCS, pages 88-94. Springer, 2000. Google Scholar
  8. Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Pattern matching on elastic-degenerate text with errors. In SPIRE, volume 10508 of LNCS, pages 74-90. Springer, 2017. Google Scholar
  9. Kirill Borozdin, Dmitry Kosolobov, Mikhail Rubinchik, and Arseny M. Shur. Palindromic Length in Linear Time. In CPM, volume 78 of LIPIcs, pages 23:1-23:12. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017. Google Scholar
  10. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, pages 1-18, 2016. Google Scholar
  11. Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Covering problems for partial words and for indeterminate strings. Theoretical Computer Science, 698:25-39, 2017. Google Scholar
  12. Martin Farach. Optimal suffix tree construction with large alphabets. In FOCS, pages 137-143. IEEE, 1997. Google Scholar
  13. Gabriele Fici, Travis Gagie, Juha Kärkkäinen, and Dominik Kempa. A subquadratic algorithm for minimum palindromic factorization. Journal of Discrete Algorithms, 28:41-48, 2014. Google Scholar
  14. Martin C. Frith, Ulla Hansen, John L. Spouge, and Zhiping Weng. Finding functional sequence elements by multiple local alignment. Nucleic Acids Res., 32(1):189-200, 2004. Google Scholar
  15. J. A. Gally and G. M. Edelman. The genetic control of immunoglobulin synthesis. Annual Review of Genetics, 6(1):1-46, 1972. Google Scholar
  16. Jiawei Gao and Russell Impagliazzo. Orthogonal vectors is hard for first-order properties on sparse graphs. Electronic Colloquium on Computational Complexity (ECCC), 23:53, 2016. Google Scholar
  17. Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-line pattern matching on a set of similar texts. In CPM, LIPIcs. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017. Google Scholar
  18. Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA, 1997. Google Scholar
  19. Costas S. Iliopoulos, Ritu Kundu, and Solon P. Pissis. Efficient pattern matching in elastic-degenerate texts. In LATA, volume 10168 of LNCS, pages 131-142. Springer International Publishing, 2017. Google Scholar
  20. Costas S. Iliopoulos and Jakub Radoszewski. Truly Subquadratic-Time Extension Queries and Periodicity Detection in Strings with Uncertainties. In CPM, volume 54 of LIPIcs, pages 8:1-8:12, Dagstuhl, Germany, 2016. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. Google Scholar
  21. Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. J. Comput. Syst. Sci., 62(2):367-375, 2001. Google Scholar
  22. Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 63(4):512-530, 2001. Google Scholar
  23. IUPAC-IUB Commission on Biochemical Nomenclature. Abbreviations and symbols for nucleic acids, polynucleotides, and their constituents. Biochemistry, 9(20):4022-4027, 1970. Google Scholar
  24. Richard J. Lipton. On The Intersection of Finite Automata, pages 145-148. Springer US, Boston, MA, 2010. Google Scholar
  25. Glenn Manacher. A new linear-time "on-line" algorithm for finding the smallest initial palindrome of a string. Journal of the ACM, 22(3):346-351, 1975. Google Scholar
  26. Lee Ann McCue, William Thompson, Steven Carmack, Michael P. Ryan, Jun S. Liu, Victoria Derbyshire, and Charles E. Lawrence. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res., 29(3):774-782, 2001. Google Scholar
  27. Brejnev Muhizi Muhire, Michael Golden, Ben Murrell, Pierre Lefeuvre, Jean-Michel Lett, Alistair Gray, Art YF Poon, Nobubelo Kwanele Ngandu, Yves Semegni, Emil Pavlov Tanov, et al. Evidence of pervasive biologically functional secondary structures within the genomes of eukaryotic single-stranded DNA viruses. Journal of virology, 88(4):1972-1989, 2014. Google Scholar
  28. Eugene W Myers. Approximate matching of network expressions with spacers. Journal of Computational Biology, 3(1):33-51, 1996. Google Scholar
  29. Nadia Pisanti, Henry Soldano, Mathilde Carpentier, and Joël Pothier. A relational extension of the notion of motifs: Application to the common 3d protein substructures searching problem. Journal of Computational Biology, 16(12):1635-1660, 2009. Google Scholar
  30. Mikhail Rubinchik and Arseny M. Shur. Eertree: An efficient data structure for processing palindromes in strings. In IWOCA, volume 9538 of LNCS, pages 321-333. Springer International Publishing, 2016. Google Scholar
  31. Randall T. Schuh. Major patterns in vertebrate evolution. Systematic Biology, 27(2):172, 1978. Google Scholar
  32. Henry Soldano, Alain Viari, and Marc Champesme. Searching for flexible repeated patterns using a non-transitive similarity relation. Pattern Recognition Letters, 16(3):233-246, 1995. Google Scholar
  33. Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci, 348(2-3):357-365, 2005. Google Scholar
  34. C. Wuilmart, J. Urbain, and D. Givol. On the location of palindromes in immunoglobulin genes. Proceedings of the National Academy of Sciences of the United States of America, 74(6):2526-2530, 1977. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail