Approximate Suffix-Prefix Dictionary Queries

Authors: Wiktor Zuba, Grigorios Loukides, Solon P. Pissis, Sharma V. Thankachan




Author Details

Wiktor Zuba
  • CWI, Amsterdam, The Netherlands
Grigorios Loukides
  • King’s College London, UK
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Sharma V. Thankachan
  • North Carolina State University, Raleigh, NC, USA

Cite As

Wiktor Zuba, Grigorios Loukides, Solon P. Pissis, and Sharma V. Thankachan. Approximate Suffix-Prefix Dictionary Queries. In 49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 306, pp. 85:1-85:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.MFCS.2024.85

Abstract

In the all-pairs suffix-prefix (APSP) problem [Gusfield et al., Inf. Process. Lett. 1992], we are given a dictionary R of r strings, S₁,…,S_r, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1..r]. APSP is a classic problem in string algorithms with applications in bioinformatics, especially in sequence assembly. Since r = |R| is typically very large in real-world applications, considering all r² pairs of strings explicitly is prohibitive. This is where the data structure variant of APSP makes sense, in the same spirit as distance oracles, which compute shortest paths between any two vertices given online. We show how to quickly locate k-approximate matches (under the Hamming or the edit distance) in R using a version of the k-errata tree [Cole et al., STOC 2004] that we introduce. Let SPL^k_{i,j} be the length of the longest suffix of S_i that is at distance at most k from a prefix of S_j. In particular, for any k = 𝒪(1), we show an 𝒪(n log^k n)-sized data structure to support the following queries:

  • One-to-One^k(i,j): output SPL^k_{i,j} in 𝒪(log^k n log log n) time.
  • Report^k(i,d): output all j ∈ [1..r] such that SPL^k_{i,j} ≥ d, in 𝒪(log^k n (log n / log log n + output)) time, where output denotes the size of the output.

In fact, our algorithms work for any value of k, not just for k = 𝒪(1), but the formulas bounding the complexities get much more complicated for larger values of k.
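To make the query definitions concrete, the following is a minimal brute-force sketch of SPL^k_{i,j} and of the two queries, restricted to the Hamming distance. It is not the k-errata-tree data structure of the paper and comes with none of its time or space bounds; it only serves as a reference for what the queries return. The function names (`hamming`, `spl_k`, `one_to_one`, `report`) and the example dictionary are assumptions made for illustration. Taking the definition literally, a string is compared with itself when i = j.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def spl_k(s_i: str, s_j: str, k: int = 0) -> int:
    """Length of the longest suffix of s_i within Hamming distance k
    of the equal-length prefix of s_j (naive scan, longest first)."""
    for length in range(min(len(s_i), len(s_j)), 0, -1):
        suffix = s_i[len(s_i) - length:]
        prefix = s_j[:length]
        if hamming(suffix, prefix) <= k:
            return length
    return 0


def one_to_one(dictionary: List[str], i: int, j: int, k: int = 0) -> int:
    """One-to-One^k(i, j) with 1-based indices, as in the abstract."""
    return spl_k(dictionary[i - 1], dictionary[j - 1], k)


def report(dictionary: List[str], i: int, d: int, k: int = 0) -> List[int]:
    """Report^k(i, d): all 1-based j with SPL^k_{i,j} >= d."""
    return [
        j
        for j in range(1, len(dictionary) + 1)
        if one_to_one(dictionary, i, j, k) >= d
    ]


# Hypothetical example dictionary R = {S_1, S_2, S_3}.
R = ["abcab", "cabba", "abxab"]
print(one_to_one(R, 1, 2))        # 3: "cab" is a suffix of S_1 and a prefix of S_2
print(one_to_one(R, 1, 3))        # 2: exact overlap "ab"
print(one_to_one(R, 1, 3, k=1))   # 5: "abcab" vs "abxab" differ in one position
print(report(R, 1, 3))            # [1, 2]
print(report(R, 1, 3, k=1))       # [1, 2, 3]
```

Each call to `spl_k` takes time quadratic in the string lengths in the worst case, which is the kind of per-pair cost that the data structure variant is meant to avoid.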

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • all-pairs suffix-prefix
  • suffix-prefix queries
  • suffix tree
  • k-errata tree


References

  1. Paniz Abedin, Arnab Ganguly, Solon P. Pissis, and Sharma V. Thankachan. Efficient data structures for range shortest unique substring queries. Algorithms, 13(11):276, 2020. URL: https://doi.org/10.3390/A13110276.
  2. Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Dynamic and internal longest common substring. Algorithmica, 82(12):3707-3743, 2020. URL: https://doi.org/10.1007/s00453-020-00744-0.
  3. Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Trans. Algorithms, 3(2):19, 2007. URL: https://doi.org/10.1145/1240233.1240242.
  4. Golnaz Badkobeh, Panagiotis Charalampopoulos, Dmitry Kosolobov, and Solon P. Pissis. Internal shortest absent word queries in constant time and linear space. Theor. Comput. Sci., 922:271-282, 2022. URL: https://doi.org/10.1016/j.tcs.2022.04.029.
  5. Carl Barton, Costas S. Iliopoulos, Solon P. Pissis, and William F. Smyth. Fast and simple computations using prefix tables under Hamming and edit distance. In Jan Kratochvíl, Mirka Miller, and Dalibor Froncek, editors, Combinatorial Algorithms - 25th International Workshop, IWOCA 2014, Duluth, MN, USA, October 15-17, 2014, Revised Selected Papers, volume 8986 of Lecture Notes in Computer Science, pages 49-61. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-19315-1_5.
  6. Djamal Belazzougui and Gonzalo Navarro. Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms, 11(4):31:1-31:21, 2015. URL: https://doi.org/10.1145/2629339.
  7. Ilan Ben-Bassat and Benny Chor. String graph construction using incremental hashing. Bioinform., 30(24):3515-3523, 2014. URL: https://doi.org/10.1093/bioinformatics/btu578.
  8. Michael A. Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini. Iceberg hashing: Optimizing many hash-table criteria at once. J. ACM, 70(6):40:1-40:51, 2023. URL: https://doi.org/10.1145/3625817.
  9. Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario, and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, April 10-14, 2000, Proceedings, volume 1776 of Lecture Notes in Computer Science, pages 88-94. Springer, 2000. URL: https://doi.org/10.1007/10719839_9.
  10. Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. FSG: fast string graph construction for de novo assembly. J. Comput. Biol., 24(10):953-968, 2017. URL: https://doi.org/10.1089/cmb.2017.0089.
  11. Panagiotis Charalampopoulos, Pawel Gawrychowski, Yaowei Long, Shay Mozes, Seth Pettie, Oren Weimann, and Christian Wulff-Nilsen. Almost optimal exact distance oracles for planar graphs. J. ACM, 70(2):12:1-12:50, 2023. URL: https://doi.org/10.1145/3580474.
  12. Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Internal dictionary matching. Algorithmica, 83(7):2142-2169, 2021. URL: https://doi.org/10.1007/s00453-021-00821-y.
  13. Shiri Chechik. Approximate distance oracles with constant query time. In David B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 654-663. ACM, 2014. URL: https://doi.org/10.1145/2591796.2591801.
  14. Shiri Chechik. Approximate distance oracles with improved bounds. In Rocco A. Servedio and Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 1-10. ACM, 2015. URL: https://doi.org/10.1145/2746539.2746562.
  15. Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91-100. ACM, 2004. URL: https://doi.org/10.1145/1007352.1007374.
  16. Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143. IEEE Computer Society, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
  17. Sebastian Forster, Gramoz Goranci, Yasamin Nazari, and Antonis Skarlatos. Bootstrapping dynamic distance oracles. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman, editors, 31st Annual European Symposium on Algorithms, ESA 2023, September 4-6, 2023, Amsterdam, The Netherlands, volume 274 of LIPIcs, pages 50:1-50:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPIcs.ESA.2023.50.
  18. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/cbo9780511574931.
  19. Dan Gusfield, Gad M. Landau, and Baruch Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett., 41(4):181-185, 1992. URL: https://doi.org/10.1016/0020-0190(92)90176-V.
  20. Tomasz Kociumaka. Efficient data structures for internal queries in texts. PhD thesis, University of Warsaw, October 2018. URL: https://www.mimuw.edu.pl/~kociumaka/files/phd.pdf.
  21. Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Internal pattern matching queries in a text and applications. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 532-551. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.36.
  22. Gregory Kucherov and Dekel Tsur. Improved filters for the approximate suffix-prefix overlap problem. In Edleno Silva de Moura and Maxime Crochemore, editors, String Processing and Information Retrieval - 21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014. Proceedings, volume 8799 of Lecture Notes in Computer Science, pages 139-148. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-11918-2_14.
  23. Gad M. Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. J. Algorithms, 10(2):157-169, 1989. URL: https://doi.org/10.1016/0196-6774(89)90010-2.
  24. Grigorios Loukides and Solon P. Pissis. All-pairs suffix/prefix in optimal time using Aho-Corasick space. Inf. Process. Lett., 178:106275, 2022. URL: https://doi.org/10.1016/j.ipl.2022.106275.
  25. Grigorios Loukides, Solon P. Pissis, Sharma V. Thankachan, and Wiktor Zuba. Suffix-prefix queries on a dictionary. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 21:1-21:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPIcs.CPM.2023.21.
  26. Eugene W. Myers. The fragment assembly string graph. Bioinformatics, 21(suppl_2):ii79-ii85, September 2005. URL: https://doi.org/10.1093/bioinformatics/bti1114.
  27. Enno Ohlebusch and Simon Gog. Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett., 110(3):123-128, 2010. URL: https://doi.org/10.1016/j.ipl.2009.10.015.
  28. Mihai Patrascu and Liam Roditty. Distance oracles beyond the Thorup-Zwick bound. SIAM J. Comput., 43(1):300-311, 2014. URL: https://doi.org/10.1137/11084128X.
  29. Kim R. Rasmussen, Jens Stoye, and Eugene W. Myers. Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol., 13(2):296-308, 2006. URL: https://doi.org/10.1089/cmb.2006.13.296.
  30. Qingmin Shi and Joseph F. JáJá. Novel transformation techniques using q-heaps with applications to computational geometry. SIAM J. Comput., 34(6):1474-1492, 2005. URL: https://doi.org/10.1137/S0097539703435728.
  31. Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinform., 26(12):367-373, 2010. URL: https://doi.org/10.1093/bioinformatics/btq217.
  32. Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362-391, 1983. URL: https://doi.org/10.1016/0022-0000(83)90006-5.
  33. Saumya Talera, Parth Bansal, Shabnam Khan, and Shahbaz Khan. Practical algorithms for hierarchical overlap graphs. CoRR, abs/2402.13920, 2024. URL: https://doi.org/10.48550/arXiv.2402.13920.
  34. Sharma V. Thankachan, Chaitanya Aluru, Sriram P. Chockalingam, and Srinivas Aluru. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In Benjamin J. Raphael, editor, Research in Computational Molecular Biology - 22nd Annual International Conference, RECOMB 2018, Paris, France, April 21-24, 2018, Proceedings, volume 10812 of Lecture Notes in Computer Science, pages 211-224. Springer, 2018. URL: https://doi.org/10.1007/978-3-319-89929-9_14.
  35. Mikkel Thorup and Uri Zwick. Approximate distance oracles. J. ACM, 52(1):1-24, 2005. URL: https://doi.org/10.1145/1044731.1044732.
  36. William H. A. Tustumi, Simon Gog, Guilherme P. Telles, and Felipe A. Louza. An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms, 37:34-43, 2016. URL: https://doi.org/10.1016/j.jda.2016.04.002.
  37. Niko Välimäki, Susana Ladra, and Veli Mäkinen. Approximate all-pairs suffix/prefix overlaps. Inf. Comput., 213:49-58, 2012. URL: https://doi.org/10.1016/j.ic.2012.02.002.