Suffix-Prefix Queries on a Dictionary

Authors Grigorios Loukides , Solon P. Pissis , Sharma V. Thankachan , Wiktor Zuba



PDF
Thumbnail PDF

File

LIPIcs.CPM.2023.21.pdf
  • Filesize: 1.12 MB
  • 20 pages

Document Identifiers

Author Details

Grigorios Loukides
  • Department of Informatics, King’s College London, UK
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Sharma V. Thankachan
  • North Carolina State University, Raleigh, NC, USA
Wiktor Zuba
  • CWI, Amsterdam, The Netherlands

Cite As Get BibTex

Grigorios Loukides, Solon P. Pissis, Sharma V. Thankachan, and Wiktor Zuba. Suffix-Prefix Queries on a Dictionary. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 21:1-21:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.21

Abstract

In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S_1,…,S_k, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1,k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ n^𝒪(1), APSP can be solved in the optimal 𝒪(n+k²) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992].
In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k² usually dominates n, and thus the k² factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries:  
- One-to-One(i,j): output SPL_{i,j}. 
- One-to-All(i): output SPL_{i,j} for every j ∈ [1,k]. 
- Report(i,𝓁): output all distinct j ∈ [1,k] such that SPL_{i,j} ≥ 𝓁, where 𝓁 ≥ 0 is an integer. 
- Count(i,𝓁): output the number of distinct j ∈ [1,k] such that SPL_{i,j} ≥ 𝓁, where 𝓁 ≥ 0 is an integer. 
- Top(i,K): output K distinct j ∈ [1,k] with the highest values of SPL_{i,j} breaking ties arbitrarily.
We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ n^𝒪(1). We show the following upper bounds:

Query           | Space (words) | Query time                | Note
One-to-One(i,j) | 𝒪(n)          | 𝒪(log log k)              | Theorem 11
One-to-All(i)   | 𝒪(n)          | 𝒪(k)                      | Theorem 14
Report(i,𝓁)     | 𝒪(n)          | 𝒪(log n/log log n+output) | Theorem 19(i)
Count(i,𝓁)      | 𝒪(n)          | 𝒪(log n/log log n)        | Theorem 19(ii)
Top(i,K)        | 𝒪(n)          | 𝒪(log² n/log log n+K)     | Theorem 22

We also present efficient algorithms for constructing these data structures.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • all-pairs suffix-prefix
  • suffix-prefix queries
  • internal pattern matching

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333-340, 1975. URL: https://doi.org/10.1145/360825.360855.
  2. Stephen Alstrup, Jacob Holm, Kristian de Lichtenberg, and Mikkel Thorup. Minimizing diameters of dynamic trees. In Pierpaolo Degano, Roberto Gorrieri, and Alberto Marchetti-Spaccamela, editors, Automata, Languages and Programming, 24th International Colloquium, ICALP'97, Bologna, Italy, 7-11 July 1997, Proceedings, volume 1256 of Lecture Notes in Computer Science, pages 270-280. Springer, 1997. URL: https://doi.org/10.1007/3-540-63165-8_184.
  3. Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Dynamic and internal longest common substring. Algorithmica, 82(12):3707-3743, 2020. URL: https://doi.org/10.1007/s00453-020-00744-0.
  4. Golnaz Badkobeh, Panagiotis Charalampopoulos, Dmitry Kosolobov, and Solon P. Pissis. Internal shortest absent word queries in constant time and linear space. Theor. Comput. Sci., 922:271-282, 2022. URL: https://doi.org/10.1016/j.tcs.2022.04.029.
  5. Carl Barton, Costas S. Iliopoulos, Solon P. Pissis, and William F. Smyth. Fast and simple computations using prefix tables under hamming and edit distance. In Jan Kratochvíl, Mirka Miller, and Dalibor Froncek, editors, Combinatorial Algorithms - 25th International Workshop, IWOCA 2014, Duluth, MN, USA, October 15-17, 2014, Revised Selected Papers, volume 8986 of Lecture Notes in Computer Science, pages 49-61. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-19315-1_5.
  6. Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, and Rajeev Raman. Weighted ancestors in suffix trees revisited. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 8:1-8:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.CPM.2021.8.
  7. Djamal Belazzougui and Gonzalo Navarro. Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms, 11(4):31:1-31:21, 2015. URL: https://doi.org/10.1145/2629339.
  8. Ilan Ben-Bassat and Benny Chor. String graph construction using incremental hashing. Bioinform., 30(24):3515-3523, 2014. URL: https://doi.org/10.1093/bioinformatics/btu578.
  9. Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario, and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, April 10-14, 2000, Proceedings, volume 1776 of Lecture Notes in Computer Science, pages 88-94. Springer, 2000. URL: https://doi.org/10.1007/10719839_9.
  10. Philip Bille and Inge Li Gørtz. The tree inclusion problem: In linear space and faster. ACM Trans. Algorithms, 7(3):38:1-38:47, 2011. URL: https://doi.org/10.1145/1978782.1978793.
  11. Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. FSG: fast string graph construction for de novo assembly. J. Comput. Biol., 24(10):953-968, 2017. URL: https://doi.org/10.1089/cmb.2017.0089.
  12. Bastien Cazaux and Eric Rivals. Hierarchical overlap graph. Inf. Process. Lett., 155, 2020. URL: https://doi.org/10.1016/j.ipl.2019.105862.
  13. Panagiotis Charalampopoulos, Pawel Gawrychowski, Shay Mozes, and Oren Weimann. Almost optimal distance oracles for planar graphs. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 138-151. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316316.
  14. Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Internal dictionary matching. Algorithmica, 83(7):2142-2169, 2021. URL: https://doi.org/10.1007/s00453-021-00821-y.
  15. Bernard Chazelle. A functional approach to data structures and its use in multidimensional searching. SIAM J. Comput., 17(3):427-462, 1988. URL: https://doi.org/10.1137/0217026.
  16. Shiri Chechik. Approximate distance oracles with constant query time. In David B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 654-663. ACM, 2014. URL: https://doi.org/10.1145/2591796.2591801.
  17. Shiri Chechik. Approximate distance oracles with improved bounds. In Rocco A. Servedio and Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 1-10. ACM, 2015. URL: https://doi.org/10.1145/2746539.2746562.
  18. Nicola Cotumaccio, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages. part I. CoRR, abs/2208.04931, 2022. URL: https://doi.org/10.48550/arXiv.2208.04931.
  19. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007. Google Scholar
  20. Shiri Dori and Gad M. Landau. Construction of aho corasick automaton in linear time for integer alphabets. Inf. Process. Lett., 98(2):66-72, 2006. URL: https://doi.org/10.1016/j.ipl.2005.11.019.
  21. Herbert Edelsbrunner and Mark H. Overmars. On the equivalence of some rectangle problems. Inf. Process. Lett., 14(3):124-127, 1982. URL: https://doi.org/10.1016/0020-0190(82)90068-0.
  22. Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143. IEEE Computer Society, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
  23. Martin Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In Daniel S. Hirschberg and Eugene W. Myers, editors, Combinatorial Pattern Matching, 7th Annual Symposium, CPM 96, Laguna Beach, California, USA, June 10-12, 1996, Proceedings, volume 1075 of Lecture Notes in Computer Science, pages 130-140. Springer, 1996. URL: https://doi.org/10.1007/3-540-61258-0_11.
  24. Pawel Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Andreas S. Schulz and Dorothea Wagner, editors, Algorithms - ESA 2014 - 22th Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, volume 8737 of Lecture Notes in Computer Science, pages 455-466. Springer, 2014. URL: https://doi.org/10.1007/978-3-662-44777-2_38.
  25. Giorgio Gonnella and Stefan Kurtz. Readjoiner: a fast and memory efficient string graph-based sequence assembler. BMC Bioinform., 13:82, 2012. URL: https://doi.org/10.1186/1471-2105-13-82.
  26. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/cbo9780511574931.
  27. Dan Gusfield, Gad M. Landau, and Baruch Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett., 41(4):181-185, 1992. URL: https://doi.org/10.1016/0020-0190(92)90176-V.
  28. Joseph F. JáJá, Christian Worm Mortensen, and Qingmin Shi. Space-efficient and fast algorithms for multidimensional dominance reporting and counting. In Rudolf Fleischer and Gerhard Trippen, editors, Algorithms and Computation, 15th International Symposium, ISAAC 2004, Hong Kong, China, December 20-22, 2004, Proceedings, volume 3341 of Lecture Notes in Computer Science, pages 558-568. Springer, 2004. URL: https://doi.org/10.1007/978-3-540-30551-4_49.
  29. Shahbaz Khan. Optimal construction of hierarchical overlap graphs. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 17:1-17:11. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.CPM.2021.17.
  30. Tomasz Kociumaka. Efficient data structures for internal queries in texts. PhD thesis, University of Warsaw, October 2018., 2018. URL: https://www.mimuw.edu.pl/~kociumaka/files/phd.pdf.
  31. Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Internal pattern matching queries in a text and applications. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 532-551. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.36.
  32. Gregory Kucherov and Dekel Tsur. Improved filters for the approximate suffix-prefix overlap problem. In Edleno Silva de Moura and Maxime Crochemore, editors, String Processing and Information Retrieval - 21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014. Proceedings, volume 8799 of Lecture Notes in Computer Science, pages 139-148. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-11918-2_14.
  33. Jihyuk Lim and Kunsoo Park. A fast algorithm for the all-pairs suffix-prefix problem. Theor. Comput. Sci., 698:14-24, 2017. URL: https://doi.org/10.1016/j.tcs.2017.07.013.
  34. Grigorios Loukides and Solon P. Pissis. All-pairs suffix/prefix in optimal time using Aho-Corasick space. Inf. Process. Lett., 178:106275, 2022. URL: https://doi.org/10.1016/j.ipl.2022.106275.
  35. Felipe A. Louza, Simon Gog, Leandro Zanotto, Guido Araujo, and Guilherme P. Telles. Parallel computation for the all-pairs suffix-prefix problem. In Shunsuke Inenaga, Kunihiko Sadakane, and Tetsuya Sakai, editors, String Processing and Information Retrieval - 23rd International Symposium, SPIRE 2016, Beppu, Japan, October 18-20, 2016, Proceedings, volume 9954 of Lecture Notes in Computer Science, pages 122-132, 2016. URL: https://doi.org/10.1007/978-3-319-46049-9_12.
  36. Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
  37. Eugene W. Myers. The fragment assembly string graph. Bioinformatics, 21(suppl_2):ii79-ii85, September 2005. URL: https://doi.org/10.1093/bioinformatics/bti1114.
  38. Gonzalo Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016. URL: http://www.cambridge.org/de/academic/subjects/computer-science/algorithmics-complexity-computer-algebra-and-computational-g/compact-data-structures-practical-approach?format=HB.
  39. Enno Ohlebusch and Simon Gog. Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett., 110(3):123-128, 2010. URL: https://doi.org/10.1016/j.ipl.2009.10.015.
  40. Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, and Eric Rivals. A linear time algorithm for constructing hierarchical overlap graphs. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 22:1-22:9. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.CPM.2021.22.
  41. Mihai Patrascu and Liam Roditty. Distance oracles beyond the thorup-zwick bound. SIAM J. Comput., 43(1):300-311, 2014. URL: https://doi.org/10.1137/11084128X.
  42. Maan Haj Rachid and Qutaibah Malluhi. A practical and scalable tool to find overlaps between sequences. BioMed Res. Int., 2015(905261), 2015. URL: https://doi.org/10.1155/2015/905261.
  43. Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In David Eppstein, editor, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 233-242. ACM/SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545411.
  44. Kim R. Rasmussen, Jens Stoye, and Eugene W. Myers. Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol., 13(2):296-308, 2006. URL: https://doi.org/10.1089/cmb.2006.13.296.
  45. Qingmin Shi and Joseph F. JáJá. Novel transformation techniques using q-heaps with applications to computational geometry. SIAM J. Comput., 34(6):1474-1492, 2005. URL: https://doi.org/10.1137/S0097539703435728.
  46. Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the fm-index. Bioinform., 26(12):367-373, 2010. URL: https://doi.org/10.1093/bioinformatics/btq217.
  47. Sharma V. Thankachan, Chaitanya Aluru, Sriram P. Chockalingam, and Srinivas Aluru. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In Benjamin J. Raphael, editor, Research in Computational Molecular Biology - 22nd Annual International Conference, RECOMB 2018, Paris, France, April 21-24, 2018, Proceedings, volume 10812 of Lecture Notes in Computer Science, pages 211-224. Springer, 2018. URL: https://doi.org/10.1007/978-3-319-89929-9_14.
  48. Mikkel Thorup and Uri Zwick. Approximate distance oracles. J. ACM, 52(1):1-24, 2005. URL: https://doi.org/10.1145/1044731.1044732.
  49. William H. A. Tustumi, Simon Gog, Guilherme P. Telles, and Felipe A. Louza. An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms, 37:34-43, 2016. URL: https://doi.org/10.1016/j.jda.2016.04.002.
  50. Esko Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5(3):313-323, 1990. URL: https://doi.org/10.1007/BF01840391.
  51. Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. URL: https://doi.org/10.1007/BF01206331.
  52. Niko Välimäki, Susana Ladra, and Veli Mäkinen. Approximate all-pairs suffix/prefix overlaps. Inf. Comput., 213:49-58, 2012. URL: https://doi.org/10.1016/j.ic.2012.02.002.
  53. Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11. IEEE Computer Society, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail