Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space

Authors Dhrumil Patel, Rahul Shah



PDF
Thumbnail PDF

File

LIPIcs.ISAAC.2021.60.pdf
  • Filesize: 0.94 MB
  • 14 pages

Document Identifiers

Author Details

Dhrumil Patel
  • Division of Computer Science, Louisiana State University, Baton Rouge, LA, USA
Rahul Shah
  • Division of Computer Science, Louisiana State University, Baton Rouge, LA, USA

Cite AsGet BibTex

Dhrumil Patel and Rahul Shah. Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 60:1-60:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ISAAC.2021.60

Abstract

In a 2-dimensional (2D) pattern matching problem, the text is arranged as a matrix 𝖬[1..n, 1..n] and consists of N = n × n symbols drawn from alphabet set Σ of size σ. The query consists of a m × m square matrix 𝖯[1..m, 1..m] drawn from the same alphabet set Σ and the task is to find all the locations in 𝖬 where 𝖯 appears as a (contiguous) submatrix. The patterns can be of any size, but as long as they are square in shape data structures like suffix trees and suffix array exist [Raffaele Giancarlo, 1995; Dong Kyue Kim et al., 1998] for the task of efficient pattern matching. These are essentially 2D counterparts of classic suffix trees and arrays known for traditional 1-dimensional (1D) pattern matching. They work based on linearization of 2D suffixes which would preserve the prefix match property (i.e., every pattern match is a prefix of some suffix). The main limitation of the suffix trees and the suffix arrays (in 1D) was their space utilization of O(N log N) bits, where N is the size of the text. This was suboptimal compared to Nlog σ bits of space, which is information theoretic optimal for the text. With the advent of the field of succinct/compressed data structures, it was possible to develop compressed variants of suffix trees and array based on Burrows-Wheeler Tansform and LF-mapping (or Φ function) [Roberto Grossi and Jeffrey Scott Vitter, 2005; Paolo Ferragina and Giovanni Manzini, 2005; Kunihiko Sadakane, 2007]. These data structures indeed achieve O(N log σ) bits of space or better. This gives rise to the question: analogous to 1D case, can we design a succinct or compressed index for 2D pattern matching? Can there be a 2D compressed suffix tree? Are there analogues of Burrows-Wheeler Transform or LF-mapping? The problem has been acknowledged for over a decade now and there have been a few attempts at applying Φ function [Ankur Gupta, 2004] and achieving entropy based compression [Veli Mäkinen and Gonzalo Navarro, 2008]. However, achieving the complexity breakthrough akin to 1D case has yet to be found. In this paper, we still do not know how to answer suffix array queries in O(N log σ) bits of space - which would have led to efficient pattern matching. However, for the first time, we show an interesting result that it is indeed possible to compute inverse suffix array (ISA) queries in near compact space in O(polylog n) time. Our 2D succinct text index design is based on two 1D compressed suffix trees and it takes O(N log log N + N logσ) bits of space which is much smaller than its naive design that takes O(N log N) bits. Although the main problem is still evasive, this index gives a hope on the existence of a full 2D succinct index with all functionalities similar to that of 1D case.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Pattern Matching
  • Succinct Data Structures

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Jeffrey Scott Vitter Ankur Gupta, Roberto Grossi. Entropy-compressed indexes for multidimensional pattern matching. In DIMACS working group on Burrows-Wheeler Transform, 2004. Google Scholar
  2. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation (now part of Hewlett-Packard, Palo Alto, CA), 1994. Google Scholar
  3. Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. J. ACM, 47(6):987-1011, 2000. URL: https://doi.org/10.1145/355541.355547.
  4. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005. An extended abstract appeared in FOCS 2000 under the title "Opportunistic Data Structures with Applications". URL: https://doi.org/10.1145/1082036.1082039.
  5. Raffaele Giancarlo. A generalization of the suffix tree to square matrices, with applications. SIAM J. Comput., 24(3):520-562, 1995. URL: https://doi.org/10.1137/S0097539792231982.
  6. Raffaele Giancarlo and Roberto Grossi. Suffix tree data structures for matrices. In Alberto Apostolico and Zvi Galil, editors, Pattern Matching Algorithms, pages 293-340. Oxford University Press, 1997. Google Scholar
  7. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378-407, 2005. An extended abstract appeared in STOC 2000. URL: https://doi.org/10.1137/S0097539702402354.
  8. Dong Kyue Kim, Yoo Ah Kim, and Kunsoo Park. Constructing suffix arrays for multi-dimensional matrices. In Martin Farach-Colton, editor, Combinatorial Pattern Matching, 9th Annual Symposium, CPM 98, Piscataway, New Jersey, USA, July 20-22, 1998, Proceedings, volume 1448 of Lecture Notes in Computer Science, pages 126-139. Springer, 1998. URL: https://doi.org/10.1007/BFb0030786.
  9. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
  10. Veli Mäkinen and Gonzalo Navarro. On self-indexing images - image compression with added value. In 2008 Data Compression Conference (DCC 2008), 25-27 March 2008, Snowbird, UT, USA, pages 422-431. IEEE Computer Society, 2008. URL: https://doi.org/10.1109/DCC.2008.47.
  11. Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
  12. Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262-272, 1976. URL: https://doi.org/10.1145/321941.321946.
  13. Gonzalo Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016. URL: http://www.cambridge.org/de/academic/subjects/computer-science/algorithmics-complexity-computer-algebra-and-computational-g/compact-data-structures-practical-approach?format=HB.
  14. Yakov Nekrich. A dynamic stabbing-max data structure with sub-logarithmic query time. In Takao Asano, Shin-Ichi Nakano, Yoshio Okamoto, and Osamu Watanabe, editors, Algorithms and Computation - 22nd International Symposium, ISAAC 2011, Yokohama, Japan, December 5-8, 2011. Proceedings, volume 7074 of Lecture Notes in Computer Science, pages 170-179. Springer, 2011. URL: https://doi.org/10.1007/978-3-642-25591-5_19.
  15. Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589-607, 2007. URL: https://doi.org/10.1007/s00224-006-1198-x.
  16. Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. URL: https://doi.org/10.1007/BF01206331.
  17. Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11. IEEE Computer Society, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail