Gapped String Indexing in Subquadratic Space and Sublinear Query Time

Authors: Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, Teresa Anna Steiner



File

LIPIcs.STACS.2024.16.pdf
  • Filesize: 0.97 MB
  • 21 pages

Author Details

Philip Bille
  • Technical University of Denmark, Lyngby, Denmark
Inge Li Gørtz
  • Technical University of Denmark, Lyngby, Denmark
Moshe Lewenstein
  • Bar-Ilan University, Ramat-Gan, Israel
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Eva Rotenberg
  • Technical University of Denmark, Lyngby, Denmark
Teresa Anna Steiner
  • Technical University of Denmark, Lyngby, Denmark

Cite As

Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, and Teresa Anna Steiner. Gapped String Indexing in Subquadratic Space and Sublinear Query Time. In 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 289, pp. 16:1-16:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.STACS.2024.16

Abstract

In Gapped String Indexing, the goal is to compactly represent a string S of length n such that for any query consisting of two strings P₁ and P₂, called patterns, and an integer interval [α, β], called the gap range, we can quickly find occurrences of P₁ and P₂ in S with distance in [α, β]. Gapped String Indexing is a central problem in computational biology and text mining and has thus received significant research interest, including parameterized and heuristic approaches. Despite this interest, the best-known time-space trade-offs for Gapped String Indexing are the straightforward 𝒪(n) space and 𝒪(n + occ) query time or Ω(n²) space and Õ(|P₁| + |P₂| + occ) query time. We break through this barrier, obtaining the first interesting trade-offs with polynomially subquadratic space and polynomially sublinear query time. In particular, we show that, for every 0 ≤ δ ≤ 1, there is a data structure for Gapped String Indexing with either Õ(n^{2-δ/3}) or Õ(n^{3-2δ}) space and Õ(|P₁| + |P₂| + n^{δ} ⋅ (occ + 1)) query time, where occ is the number of reported occurrences. As a new fundamental tool towards obtaining our main result, we introduce the Shifted Set Intersection problem: preprocess a collection of sets S₁, …, S_k of integers such that for any query consisting of three integers i, j, s, we can quickly output YES if and only if there exist a ∈ S_i and b ∈ S_j with a + s = b. We start by showing that the Shifted Set Intersection problem is equivalent to the indexing variant of 3SUM (3SUM Indexing) [Golovnev et al., STOC 2020]. We then give a data structure for Shifted Set Intersection with gaps, which entails a solution to the Gapped String Indexing problem. Furthermore, we enhance our data structure for deciding Shifted Set Intersection, so that we can support the reporting variant of the problem, i.e., outputting all certificates in the affirmative case.
Via the obtained equivalence to 3SUM Indexing, we thus give new improved data structures for the reporting variant of 3SUM Indexing, and we show how this improves upon the state-of-the-art solution for Jumbled Indexing [Chan and Lewenstein, STOC 2015] for any alphabet of constant size σ > 5.
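
To make the two problem statements in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the data structure from the paper and achieves none of its bounds; it only fixes the query semantics. The function names (`occurrences`, `gapped_query`, `shifted_set_intersection`) are hypothetical, and the sketch assumes that the distance between two occurrences is measured between their starting positions, i.e., a pair (i, j) is reported when α ≤ j − i ≤ β.

```python
from bisect import bisect_left, bisect_right


def occurrences(text, pattern):
    """Return the sorted start positions of pattern in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def gapped_query(text, p1, p2, alpha, beta):
    """Report all pairs (i, j) with p1 at i, p2 at j and alpha <= j - i <= beta.

    Assumption: distance is taken between starting positions.
    """
    occ1 = occurrences(text, p1)
    occ2 = occurrences(text, p2)
    pairs = []
    for i in occ1:
        # occ2 is sorted, so the valid partners of i form a contiguous slice.
        lo = bisect_left(occ2, i + alpha)
        hi = bisect_right(occ2, i + beta)
        pairs.extend((i, j) for j in occ2[lo:hi])
    return pairs


def shifted_set_intersection(sets, i, j, s):
    """Return True iff some a in sets[i] and b in sets[j] satisfy a + s == b."""
    target = sets[j]
    return any(a + s in target for a in sets[i])


text = "abcabcab"
print(gapped_query(text, "ab", "c", 2, 5))

# The occurrence sets of the two patterns form a Shifted Set Intersection instance:
# a pair at distance exactly s exists iff the query (0, 1, s) answers YES.
sets = [set(occurrences(text, "ab")), set(occurrences(text, "c"))]
print(shifted_set_intersection(sets, 0, 1, 2))
```

In this sketch, a gapped query with α = β = s has an answer exactly when the shifted query with shift s on the two occurrence sets returns YES, which hints at why a structure for Shifted Set Intersection with gaps is a natural stepping stone towards Gapped String Indexing.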

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • data structures
  • string indexing
  • indexing with gaps
  • two patterns

References

  1. Peyman Afshani, Ingo van Duijn, Rasmus Killmann, and Jesper Sindahl Nielsen. A lower bound for jumbled indexing. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 592-606, 2020. URL: https://doi.org/10.1137/1.9781611975994.36.
  2. Amihood Amir, Timothy M. Chan, Moshe Lewenstein, and Noa Lewenstein. On hardness of jumbled indexing. In Automata, Languages, and Programming - 41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8-11, 2014, Proceedings, Part I, pages 114-125, 2014. URL: https://doi.org/10.1007/978-3-662-43948-7_10.
  3. Boris Aronov, Jean Cardinal, Justin Dallant, and John Iacono. A general technique for searching in implicit sets via function inversion. In 2024 Symposium on Simplicity in Algorithms (SOSA), pages 215-223, 2024. URL: https://doi.org/10.1137/1.9781611977936.20.
  4. Johannes Bader, Simon Gog, and Matthias Petri. Practical variable length gap pattern matching. In Experimental Algorithms - 15th International Symposium, SEA 2016, St. Petersburg, Russia, June 5-8, 2016, Proceedings, pages 1-16, 2016. URL: https://doi.org/10.1007/978-3-319-38851-9_1.
  5. Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69(2):384-396, 2014. URL: https://doi.org/10.1007/S00453-012-9733-4.
  6. Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, and Teresa Anna Steiner. Gapped string indexing in subquadratic space and sublinear query time, 2022. URL: https://doi.org/10.48550/ARXIV.2211.16860.
  7. Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. Algorithmica, 85(4):879-901, 2023. URL: https://doi.org/10.1007/S00453-022-01051-6.
  8. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41-60, 2014. URL: https://doi.org/10.1007/S00224-013-9498-4.
  9. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind. String matching with variable length gaps. Theor. Comput. Sci., 443:25-34, 2012. URL: https://doi.org/10.1016/J.TCS.2012.03.029.
  10. Philip Bille and Mikkel Thorup. Regular expression matching with multi-strings and intervals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 1297-1308, 2010. URL: https://doi.org/10.1137/1.9781611973075.104.
  11. Philipp Bucher and Amos Bairoch. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, August 14-17, 1994, Stanford University, Stanford, California, USA, pages 53-61, 1994. URL: http://www.aaai.org/Library/ISMB/1994/ismb94-007.php.
  12. Manuel Cáceres, Simon J. Puglisi, and Bella Zhukova. Fast indexes for gapped pattern matching. In SOFSEM 2020: Theory and Practice of Computer Science - 46th International Conference on Current Trends in Theory and Practice of Informatics, SOFSEM 2020, Limassol, Cyprus, January 20-24, 2020, Proceedings, pages 493-504, 2020. URL: https://doi.org/10.1007/978-3-030-38919-2_40.
  13. Timothy M. Chan. More logarithmic-factor speedups for 3SUM, (median, +)-convolution, and some geometric 3SUM-hard problems. ACM Trans. Algorithms, 16(1):7:1-7:23, 2020. URL: https://doi.org/10.1145/3363541.
  14. Timothy M. Chan and Moshe Lewenstein. Clustered integer 3SUM via additive combinatorics. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 31-40, 2015. URL: https://doi.org/10.1145/2746539.2746568.
  15. Shucheng Chi, Ran Duan, Tianle Xie, and Tianyi Zhang. Faster min-plus product for monotone instances. In STOC '22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1529-1542, 2022. URL: https://doi.org/10.1145/3519935.3520057.
  16. Ferdinando Cicalese, Gabriele Fici, and Zsuzsanna Lipták. Searching for jumbled patterns in strings. In Proceedings of the Prague Stringology Conference 2009, Prague, Czech Republic, August 31 - September 2, 2009, pages 105-117, 2009. URL: http://www.stringology.org/event/2009/p10.html.
  17. Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91-100, 2004. URL: https://doi.org/10.1145/1007352.1007374.
  18. Richard Cole, Tsvi Kopelowitz, and Moshe Lewenstein. Suffix trays and suffix trists: Structures for faster text indexing. Algorithmica, 72(2):450-466, 2015. URL: https://doi.org/10.1007/S00453-013-9860-6.
  19. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
  20. Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
  21. Amos Fiat and Moni Naor. Rigorous time/space trade-offs for inverting functions. SIAM J. Comput., 29(3):790-803, 1999. URL: https://doi.org/10.1137/S0097539795280512.
  22. Johannes Fischer and Pawel Gawrychowski. Alphabet-dependent string searching with wexponential search trees. In Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings, pages 160-171, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_14.
  23. Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538-544, 1984. URL: https://doi.org/10.1145/828.1884.
  24. Kimmo Fredriksson and Szymon Grabowski. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr., 11(4):335-357, 2008. URL: https://doi.org/10.1007/S10791-008-9054-Z.
  25. Pawel Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Algorithms - ESA 2014 - 22th Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, pages 455-466, 2014. URL: https://doi.org/10.1007/978-3-662-44777-2_38.
  26. Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional lower bounds for space/time tradeoffs. In Algorithms and Data Structures - 15th International Symposium, WADS 2017, St. John’s, NL, Canada, July 31 - August 2, 2017, Proceedings, pages 421-436, 2017. URL: https://doi.org/10.1007/978-3-319-62127-2_36.
  27. Alexander Golovnev, Siyao Guo, Thibaut Horel, Sunoo Park, and Vinod Vaikuntanathan. Data structures meet cryptography: 3SUM with preprocessing. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 294-307, 2020. URL: https://doi.org/10.1145/3357713.3384342.
  28. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/cbo9780511574931.
  29. Tuukka Haapasalo, Panu Silvasti, Seppo Sippu, and Eljas Soisalon-Soininen. Online dictionary matching with variable-length gaps. In Experimental Algorithms - 10th International Symposium, SEA 2011, Kolimpari, Chania, Crete, Greece, May 5-7, 2011. Proceedings, pages 76-87, 2011. URL: https://doi.org/10.1007/978-3-642-20662-7_7.
  30. Kay Hofmann, Philipp Bucher, Laurent Falquet, and Amos Bairoch. The PROSITE database, its status in 1999. Nucleic Acids Res., 27(1):215-219, 1999. URL: https://doi.org/10.1093/NAR/27.1.215.
  31. Costas S. Iliopoulos and M. Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60-70, 2009. URL: https://doi.org/10.1007/S00453-007-9141-3.
  32. Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001 Jerusalem, Israel, July 1-4, 2001 Proceedings, pages 181-192, 2001. URL: https://doi.org/10.1007/3-540-48194-X_17.
  33. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
  34. Tomasz Kociumaka, Jakub Radoszewski, and Wojciech Rytter. Efficient indexes for jumbled pattern matching with constant-sized alphabet. Algorithmica, 77(4):1194-1215, 2017. URL: https://doi.org/10.1007/S00453-016-0140-0.
  35. Tsvi Kopelowitz and Robert Krauthgamer. Color-distance oracles and snippets. In 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel, pages 24:1-24:10, 2016. URL: https://doi.org/10.4230/LIPICS.CPM.2016.24.
  36. Tsvi Kopelowitz and Ely Porat. The strong 3SUM-INDEXING conjecture is false. CoRR, abs/1907.11206, 2019. URL: https://arxiv.org/abs/1907.11206.
  37. Paul R. Kroeger. Analyzing Grammar: An Introduction. Cambridge University Press, Cambridge, 2005.
  38. Moshe Lewenstein. Indexing with gaps. In String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings, pages 135-143, 2011. URL: https://doi.org/10.1007/978-3-642-24583-1_14.
  39. Moshe Lewenstein, J. Ian Munro, Venkatesh Raman, and Sharma V. Thankachan. Less space: Indexing for queries with wildcards. Theor. Comput. Sci., 557:120-127, 2014. URL: https://doi.org/10.1016/j.tcs.2014.09.003.
  40. Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern matching. In 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014), STACS 2014, March 5-8, 2014, Lyon, France, pages 506-517, 2014. URL: https://doi.org/10.4230/LIPICS.STACS.2014.506.
  41. Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
  42. Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
  43. Gerhard Mehldau and Gene Myers. A system for pattern matching applications on biosequences. Bioinformatics, 9(3):299-314, 1993. URL: https://doi.org/10.1093/bioinformatics/9.3.299.
  44. Gary Miner, Dursun Delen, John Elder, Andrew Fast, Thomas Hill, and Robert A. Nisbet. Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Academic Press, Boston, 2012.
  45. Michele Morgante, Alberto Policriti, Nicola Vitacolonna, and Andrea Zuccolo. Structured motifs search. J. Comput. Biol., 12(8):1065-1082, 2005. URL: https://doi.org/10.1089/cmb.2005.12.1065.
  46. Eugene W. Myers. Approximate matching of network expressions with spacers. J. Comput. Biol., 3(1):33-51, 1996. URL: https://doi.org/10.1089/cmb.1996.3.33.
  47. Gonzalo Navarro and Yakov Nekrich. Time-optimal top-k document retrieval. SIAM J. Comput., 46(1):80-113, 2017. URL: https://doi.org/10.1137/140998949.
  48. Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol., 10(6):903-923, 2003. URL: https://doi.org/10.1089/106652703322756140.
  49. Pierre Peterlongo, Julien Allali, and Marie-France Sagot. Indexing gapped-factors using a tree. Int. J. Found. Comput. Sci., 19(1):71-87, 2008. URL: https://doi.org/10.1142/S0129054108005541.
  50. Solon P. Pissis. MoTeX-II: structured MoTif eXtraction from large-scale datasets. BMC Bioinform., 15:235, 2014. URL: https://doi.org/10.1186/1471-2105-15-235.
  51. Ken Thompson. Regular expression search algorithm. Commun. ACM, 11(6):419-422, 1968. URL: https://doi.org/10.1145/363347.363387.
  52. Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
  53. Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81-84, 1983. URL: https://doi.org/10.1016/0020-0190(83)90075-3.
  54. Jens Willkomm, Martin Schäler, and Klemens Böhm. Accurate cardinality estimation of co-occurring words using suffix trees. In Database Systems for Advanced Applications - 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11-14, 2021, Proceedings, Part II, pages 721-737, 2021. URL: https://doi.org/10.1007/978-3-030-73197-7_50.