Compressed Multiple Pattern Matching

Authors Dmitry Kosolobov, Nikita Sivukhin



PDF
Thumbnail PDF

File

LIPIcs.CPM.2019.13.pdf
  • Filesize: 0.78 MB
  • 14 pages

Document Identifiers

Author Details

Dmitry Kosolobov
  • University of Helsinki, Helsinki, Finland
Nikita Sivukhin
  • Ural Federal University, Ekaterinburg, Russia

Cite AsGet BibTex

Dmitry Kosolobov and Nikita Sivukhin. Compressed Multiple Pattern Matching. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 13:1-13:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.CPM.2019.13

Abstract

Given d strings over the alphabet {0,1,...,sigma{-}1}, the classical Aho - Corasick data structure allows us to find all occ occurrences of the strings in any text T in O(|T| + occ) time using O(m log m) bits of space, where m is the number of edges in the trie containing the strings. Fix any constant epsilon in (0, 2). We describe a compressed solution for the problem that, provided sigma <=m^delta for a constant delta < 1, works in O(|T| 1/epsilon log(1/epsilon) + occ) time, which is O(|T| + occ) since epsilon is constant, and occupies mH_k + 1.443 m + epsilon m + O(d log m/d) bits of space, for all 0 <= k <= max{0,alpha log_sigma m - 2} simultaneously, where alpha in (0,1) is an arbitrary constant and H_k is the kth-order empirical entropy of the trie. Hence, we reduce the 3.443m term in the space bounds of previously best succinct solutions to (1.443 + epsilon)m, thus solving an open problem posed by Belazzougui. Further, we notice that L = log binom{sigma (m+1)}{m} - O(log(sigma m)) is a worst-case space lower bound for any solution of the problem and, for d = o(m) and constant epsilon, our approach allows to achieve L + epsilon m bits of space, which gives an evidence that, for d = o(m), the space of our data structure is theoretically optimal up to the epsilon m additive term and it is hardly possible to eliminate the term 1.443m. In addition, we refine the space analysis of previous works by proposing a more appropriate definition for H_k. We also simplify the construction for practice adapting the fixed block compression boosting technique, then implement our data structure, and conduct a number of experiments showing that it is comparable to the state of the art in terms of time and is superior in space.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • multiple pattern matching
  • compressed space
  • Aho--Corasick automaton

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333-340, 1975. URL: http://dx.doi.org/10.1145/360825.360855.
  2. Jarno Alanko and Tuukka Norri. Greedy shortest common superstring approximation in compact space. In Proc. SPIRE, volume 10508 of LNCS, pages 1-13. Springer, 2017. URL: http://dx.doi.org/10.1007/978-3-319-67428-5_1.
  3. Djamal Belazzougui. Succinct dictionary matching with no slowdown. In Proc. CPM, volume 6129 of LNCS, pages 88-100. Springer, 2010. URL: http://dx.doi.org/10.1007/978-3-642-13509-5_9.
  4. Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994. Google Scholar
  5. Ho-Leung Chan, Wing-Kai Hon, Tak-Wah Lam, and Kunihiko Sadakane. Dynamic dictionary matching and compressed suffix trees. In Proc. SODA, pages 13-22. SIAM, 2005. Google Scholar
  6. David Clark. Compact pat trees. PhD thesis, University of Waterloo, 1997. Google Scholar
  7. Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. Dictionary matching in a stream. In Proc. ESA, volume 9294 of LNCS, pages 361-372. Springer, 2015. URL: http://dx.doi.org/10.1007/978-3-662-48350-3_31.
  8. Maxime Crochemore and Wojciech Rytter. Jewels of stringology. World Scientific Publishing Co. Pte. Ltd., 2002. Google Scholar
  9. Vassilis Dimopoulos, Ioannis Papaefstathiou, and Dionisios Pnevmatikatos. A memory-efficient reconfigurable Aho-Corasick FSM implementation for intrusion detection systems. In Proc. IC-SAMOS, pages 186-193. IEEE, 2007. URL: http://dx.doi.org/10.1109/ICSAMOS.2007.4285750.
  10. Guy Feigenblat, Ely Porat, and Ariel Shiftan. Linear time succinct indexable dictionary construction with applications. In Proc. DCC, 2016, pages 13-22. IEEE, 2016. URL: http://dx.doi.org/10.1109/DCC.2016.70.
  11. Guy Feigenblat, Ely Porat, and Ariel Shiftan. A grouping approach for succinct dynamic dictionary matching. Algorithmica, 77(1):134-150, 2017. URL: http://dx.doi.org/10.1007/s00453-015-0056-0.
  12. Paolo Ferragina, Raffaele Giancarlo, Giovanni Manzini, and Marinella Sciortino. Boosting textual compression in optimal linear time. Journal of the ACM, 52(4):688-713, 2005. URL: http://dx.doi.org/10.1145/1082036.1082043.
  13. Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. FOCS, pages 184-193. IEEE, 2005. URL: http://dx.doi.org/10.1109/SFCS.2005.69.
  14. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. FOCS, pages 390-398. IEEE, 2000. URL: http://dx.doi.org/10.1109/SFCS.2000.892127.
  15. Travis Gagie. Large alphabets and incompressibility. Information Processing Letters, 99(6):246-251, 2006. URL: http://dx.doi.org/10.1016/j.ipl.2006.04.008.
  16. Simon Gog. SDSL - succinct data structure library, 2015. URL: https://github.com/simongog/sdsl-lite.
  17. Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Faster, minuter. In Proc. DCC, pages 53-62. IEEE, 2016. URL: http://dx.doi.org/10.1109/DCC.2016.94.
  18. Shay Golan and Ely Porat. Real-time streaming multi-pattern search for constant alphabet. In Proc. ESA, volume 87 of LIPIcs, pages 41:1-41:15. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017. URL: http://dx.doi.org/10.4230/LIPIcs.ESA.2017.41.
  19. Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics. Massachusetts: Addison-Wesley, 1989. Google Scholar
  20. Dan Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, 1997. URL: http://dx.doi.org/10.1017/CBO9780511574931.
  21. Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey S. Vitter. Faster compressed dictionary matching. Theoretical Computer Science, 475:113-119, 2013. URL: http://dx.doi.org/10.1016/j.tcs.2012.10.050.
  22. Wing-Kai Hon, Tak-Wah Lam, Rahul Shah, Siu-Lung Tam, and Jeffrey S. Vitter. Compressed index for dictionary matching. In Proc. DCC, pages 23-32. IEEE, 2008. URL: http://dx.doi.org/10.1109/DCC.2008.62.
  23. Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Ultra-succinct representation of ordered trees with applications. Journal of Computer and System Sciences, 78(2):619-631, 2012. URL: http://dx.doi.org/10.1016/j.jcss.2011.09.002.
  24. Juha Kärkkäinen and Simon J. Puglisi. Fixed block compression boosting in FM-indexes. In Proc. SPIRE, volume 7024 of LNCS, pages 174-184. Springer, 2011. URL: http://dx.doi.org/10.1007/978-3-642-24583-1_18.
  25. Tsvi Kopelowitz, Ely Porat, and Yaron Rozen. Succinct Online Dictionary Matching with Improved Worst-Case Guarantees. In Proc. CPM, volume 54 of LIPIcs. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016. URL: http://dx.doi.org/10.4230/LIPIcs.CPM.2016.6.
  26. S. Rao Kosaraju and Giovanni Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing, 29(3):893-911, 1999. URL: http://dx.doi.org/10.1137/S0097539797331105.
  27. Hung-Jen Liao, Chun-Hung R. Lin, Ying-Chih Lin, and Kuang-Yuan Tung. Intrusion detection system: A comprehensive review. Journal of Network and Computer Applications, 36(1):16-24, 2013. URL: http://dx.doi.org/10.1016/j.jnca.2012.09.004.
  28. Veli Mäkinen and Gonzalo Navarro. Compressed full-text indexes. ACM Computing Surveys, 39(1):2, 2007. URL: http://dx.doi.org/10.1145/1216370.1216372.
  29. Giovanni Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407-430, 2001. URL: http://dx.doi.org/10.1145/382780.382782.
  30. Gonzalo Navarro. Compact data structures: A practical approach. Cambridge University Press, 2016. URL: http://dx.doi.org/10.1017/CBO9781316588284.
  31. Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Transactions on Algorithms, 10(3):16, 2014. URL: http://dx.doi.org/10.1145/2601073.
  32. Derek Pao, Xing Wang, Xiaoran Wang, Cong Cao, and Yuesheng Zhu. String searching engine for virus scanning. IEEE Transactions on Computers, 60(11):1596-1609, 2011. URL: http://dx.doi.org/10.1109/TC.2010.250.
  33. Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. ACM Transactions on Algorithms, 3(4):43, 2007. URL: http://dx.doi.org/10.1145/1290672.1290680.
  34. Dina Sokol and Marcus Shoshana. Engineering small space dictionary matching. arXiv preprint, 2013. URL: http://arxiv.org/abs/1301.6428.
  35. Alan Tam, Edward Wu, Tak-Wah Lam, and Siu-Ming Yiu. Succinct text indexing with wildcards. In Proc. SPIRE, pages 39-50. Springer, 2009. URL: http://dx.doi.org/10.1007/978-3-642-03784-9_5.
  36. Sun Wu and Udi Manber. Fast text searching: allowing errors. Communications of the ACM, 35(10):83-91, 1992. URL: http://dx.doi.org/10.1145/135239.135244.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail