Document Listing on Repetitive Collections with Guaranteed Performance

Author Gonzalo Navarro



PDF
Thumbnail PDF

File

LIPIcs.CPM.2017.4.pdf
  • Filesize: 484 kB
  • 13 pages

Document Identifiers

Author Details

Gonzalo Navarro

Cite As Get BibTex

Gonzalo Navarro. Document Listing on Repetitive Collections with Guaranteed Performance. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 4:1-4:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.CPM.2017.4

Abstract

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size N over alphabet [1,a] is composed of D copies of a string of size n, and s single-character edits are applied on the copies. We introduce the first document listing index with size O~(n + s), precisely O((n lg a + s lg^2 N) lg D) bits, and with useful worst-case time guarantees: Given a pattern of length m, the index reports the ndoc strings where it appears in time O(m^2 + m lg N (lg D + lg^e N) ndoc), for any constant e > 0.

Subject Classification

Keywords
  • repetitive string collections
  • document listing
  • grammar compression
  • range minimum queries
  • succinct data structures

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval: The Concepts and Technology behind Search. Addison-Wesley Professional, 2011. URL: http://www.mir2ed.org/.
  2. Jérémy Barbay, Johannes Fischer, and Gonzalo Navarro. LRM-trees: Compressed indices, adaptive sorting, and compressed permutations. Theor. Comput. Sci., 459:26-41, 2012. URL: http://dx.doi.org/10.1016/j.tcs.2012.08.010.
  3. Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM J. Comput., 44(3):513-539, 2015. URL: http://dx.doi.org/10.1137/130936889.
  4. Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010. URL: http://mitpress.mit.edu/books/information-retrieval.
  5. Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Trans. Inf. Theory, 51(7):2554-2576, 2005. URL: http://dx.doi.org/10.1109/TIT.2005.850116.
  6. Bernard Chazelle. A functional approach to data structures and its use in multidimensional searching. SIAM J. Comput., 17(3):427-462, 1988. URL: http://dx.doi.org/10.1137/0217026.
  7. David R. Clark. Compact PAT Trees. PhD thesis, University of Waterloo, Canada, 1996. URL: http://hdl.handle.net/10012/64.
  8. Francisco Claude and J. Ian Munro. Document listing on versioned documents. In Oren Kurland, Moshe Lewenstein, and Ely Porat, editors, Proceedings of the 20th International Symposium on String Processing and Information Retrieval (SPIRE 2013), volume 8214 of LNCS, pages 72-83. Springer, 2013. URL: http://dx.doi.org/10.1007/978-3-319-02432-5_12.
  9. Francisco Claude and Gonzalo Navarro. Self-indexed grammar-based compression. Fundam. Inform., 111(3):313-337, 2010. URL: http://dx.doi.org/10.3233/FI-2011-565.
  10. Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In Liliana Calderón-Benavides, Cristina N. González-Caro, Edgar Chávez, and Nivio Ziviani, editors, Proceedings of the 19th International Symposium on String Processing and Information Retrieval (SPIRE 2012), volume 7608 of LNCS, pages 180-192. Springer, 2012. URL: http://dx.doi.org/10.1007/978-3-642-34109-0_19.
  11. Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465-492, 2011. URL: http://dx.doi.org/10.1137/090779759.
  12. Travis Gagie, Kalle Karhu, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén. Document listing on repetitive collections. In Johannes Fischer and Peter Sanders, editors, Proceedings of the 24th Annual Symposium on Combinatorial Pattern Matching (CPM 2013), volume 7922 of LNCS, pages 107-119. Springer, 2013. URL: http://dx.doi.org/10.1007/978-3-642-38905-4_12.
  13. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Martin Farach-Colton, editor, Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2003), pages 841-850. ACM/SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
  14. Leszek Gąsieniec, Roman M. Kolpakov, Igor Potapov, and Paul Sant. Real-time traversal in grammar-based compressed files. In James A. Storer and Martin Cohn, editors, Proceedings of the 2005 Data Compression Conference (DCC 2005), page 458. IEEE Computer Society, 2005. URL: http://dx.doi.org/10.1109/DCC.2005.78.
  15. Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Space-efficient framework for top-k string retrieval problems. In Daniel A. Spielman, editor, Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pages 713-722. IEEE Computer Society, 2009. URL: http://dx.doi.org/10.1109/FOCS.2009.19.
  16. Danny Hucke, Markus Lohrey, and Carl Philipp Reh. The smallest grammar problem revisited. In Shunsuke Inenaga, Kunihiko Sadakane, and Tetsuya Sakai, editors, Proceedings of the 23rd International Symposium on String Processing and Information Retrieval (SPIRE 2016), volume 9954 of LNCS, pages 35-49. Springer, 2016. URL: http://dx.doi.org/10.1007/978-3-319-46049-9_4.
  17. Artur Jeż. Approximation of grammar-based compression via recompression. Theor. Comput. Sci., 592:115-134, 2015. URL: http://dx.doi.org/10.1016/j.tcs.2015.05.027.
  18. Artur Jeż. A really simple approximation of smallest grammar. Theor. Comput. Sci., 616:141-150, 2016. URL: http://dx.doi.org/10.1016/j.tcs.2015.12.032.
  19. Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci., 483:115-133, 2013. URL: http://dx.doi.org/10.1016/j.tcs.2012.02.006.
  20. Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory, 22(1):75-81, 1976. URL: http://dx.doi.org/10.1109/TIT.1976.1055501.
  21. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010. URL: http://dx.doi.org/10.1089/cmb.2009.0169.
  22. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In David Eppstein, editor, Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2002), pages 657-666. ACM/SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545469.
  23. Gonzalo Navarro. Indexing highly repetitive collections. In S. Arumugam and W. F. Smyth, editors, Proceedings of the 23rd International Workshop on Combinatorial Algorithms (IWOCA 2012), volume 7643 of LNCS, pages 274-279. Springer, 2012. URL: http://dx.doi.org/10.1007/978-3-642-35926-2_29.
  24. Gonzalo Navarro. Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Comput. Surv., 46(4):52:1-52:47, 2014. URL: http://dx.doi.org/10.1145/2535933.
  25. Gonzalo Navarro. Wavelet trees for all. J. Discrete Algorithms, 25:2-20, 2014. URL: http://dx.doi.org/10.1016/j.jda.2013.07.004.
  26. Gonzalo Navarro. Compact Data Structures: A practical approach. Cambridge University Press, 2016. URL: http://dx.doi.org/10.1017/CBO9781316588284.
  27. Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In David Applegate and Gerth Stølting Brodal, editors, Proceedings of the 9th Workshop on Algorithm Engineering and Experiments (ALENEX 2007). SIAM, 2007. URL: http://dx.doi.org/10.1137/1.9781611972870.6.
  28. Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211-222, 2003. URL: http://dx.doi.org/10.1016/S0304-3975(02)00777-6.
  29. Kunihiko Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms, 5(1):12-22, 2007. URL: http://dx.doi.org/10.1016/j.jda.2006.03.011.
  30. Hiroshi Sakamoto. A fully linear-time approximation algorithm for grammar-based compression. J. Discrete Algorithms, 3(2-4):416-430, 2005. URL: http://dx.doi.org/10.1016/j.jda.2004.08.016.
  31. Elad Verbin and Wei Yu. Data structure lower bounds on random access to grammar-compressed strings. In Johannes Fischer and Peter Sanders, editors, Proceedings of the 24th Annual Symposium on Combinatorial Pattern Matching (CPM 2013), volume 7922 of LNCS, pages 247-258. Springer, 2013. URL: http://dx.doi.org/10.1007/978-3-642-38905-4_24.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail