Document Retrieval Hacks

Authors Simon J. Puglisi, Bella Zhukova




File

LIPIcs.SEA.2021.12.pdf
  • Filesize: 2.08 MB
  • 12 pages


Author Details

Simon J. Puglisi
  • Department of Computer Science, University of Helsinki, Finland
Bella Zhukova
  • Department of Computer Science, University of Helsinki, Finland

Acknowledgements

Our thanks go to Dustin Cobas for prompt help in getting his codebase to compile on our system, and to Massimiliano Rossi for assistance with datasets.

Cite As

Simon J. Puglisi and Bella Zhukova. Document Retrieval Hacks. In 19th International Symposium on Experimental Algorithms (SEA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 190, pp. 12:1-12:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.SEA.2021.12

Abstract

Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) where a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics, to name just two very different sources). The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space-consuming than the best specialized indexes for the problem.
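To make the problem statement concrete, the following is a minimal Python sketch of document listing by scanning occurrences and deduplicating document ids with a hash set. It is only one naive reading of "scanning and hashing" and is not the authors' method: it uses no index and no dictionary compression. The names `SEP`, `build_text`, and `list_documents` are hypothetical, and the separator character is assumed not to occur in any document or pattern.

```python
from bisect import bisect_right
from typing import List, Tuple

# Assumed separator: must not occur in any document or pattern.
SEP = "\x00"


def build_text(docs: List[str]) -> Tuple[str, List[int]]:
    """Concatenate the documents and record where each one starts."""
    starts = []
    pos = 0
    for doc in docs:
        starts.append(pos)
        pos += len(doc) + 1  # +1 for the separator that follows each document
    return SEP.join(docs) + SEP, starts


def list_documents(text: str, starts: List[int], pattern: str) -> List[int]:
    """Return the ids of documents containing `pattern`, each reported once."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    seen = set()   # hash set of document ids already reported
    result = []
    pos = text.find(pattern)
    while pos != -1:
        # Map the occurrence position to the document that contains it.
        doc = bisect_right(starts, pos) - 1
        if doc not in seen:
            seen.add(doc)
            result.append(doc)
        pos = text.find(pattern, pos + 1)  # continue scanning for occurrences
    return result


docs = ["abracadabra", "cadabra", "banana"]
text, starts = build_text(docs)
print(list_documents(text, starts, "bra"))
```

Because the pattern cannot contain the separator, no match can span two documents, so every occurrence maps to exactly one document id. The hash set ensures a document is listed once however many times the pattern occurs in it.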

Subject Classification

ACM Subject Classification
  • Information systems → Data compression

Keywords
  • String Processing
  • Pattern matching
  • Document listing
  • Document retrieval
  • Succinct data structures
  • Repetitive text collections

