The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

Authors Niklas Baumstark, Simon Gog, Tobias Heuer, Julian Labeit



PDF
Thumbnail PDF

File

LIPIcs.SEA.2017.15.pdf
  • Filesize: 0.59 MB
  • 14 pages

Document Identifiers

Author Details

Niklas Baumstark
Simon Gog
Tobias Heuer
Julian Labeit

Cite As Get BibTex

Niklas Baumstark, Simon Gog, Tobias Heuer, and Julian Labeit. The Quantile Index - Succinct Self-Index for Top-k Document Retrieval. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 15:1-15:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.SEA.2017.15

Abstract

One of the central problems in information retrieval is that of finding the k documents in a large text collection that best match a query given by a user. A recent result of Navarro & Nekrich (SODA 2012) showed that single term and phrase queries of length m can be solved in optimal O(m+k) time using a linear word sized index. While a verbatim implementation of the index would be at least an order of magnitude larger than the original collection, various authors incrementally improved the index to a point where the space requirement is currently within a factor of 1.5 to 2.0 of the text size for standard collections.

In this paper, we propose a new time/space trade-off for different top-k indexes. This is achieved by sampling only a quantile of the postings in the original inverted file or suffix array-based index. For those queries that cannot be answered using the sampled version of the index we show how to compute the query results on the fly efficiently. As an example, we apply our method to the top-k framework by Navarro & Nekrich. Under probabilistic assumptions that hold for most standard texts, and for a standard scoring function called term frequency, our index can be represented with only sublinearly many bits plus the space needed for a compressed suffix array of the text, while maintaining poly-logarithmic query times. We evaluate our solution on real-world datasets and compare its practical space usage and performance against state-of-the-art implementations. Our experiments show that our index compresses below the size of the original text. To our knowledge it is the first suffix array-based text index that is able to break this bound in practice even for non-repetitive collections, while still maintaining reasonable query times of under half a millisecond on average for top-10 queries.

Subject Classification

Keywords
  • Text Indexing
  • Succinct Data Structures
  • Top-k Document Retrieval

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, pages 448-461, 1973. Google Scholar
  2. N. Brisaboa, G. de Bernardo, R. Konow, and G. Navarro. K²-Treaps: Range Top-k Queries in Compact Space. In Proc. SPIRE, pages 215-226, 2014. Google Scholar
  3. J. S. Culpepper, G. Navarro, S. J. Puglisi, and A. Turpin. Top-k ranked document search in general text databases. In Proc. ESA, pages 194-205, 2010. Google Scholar
  4. T. Gagie, A. Hartikainen, K. Karhu, J. Kärkkäinen, G. Navarro, S. J. Puglisi, and J. Sirén. Document retrieval on repetitive collections. Information Retrieval, 2017. To appear. Google Scholar
  5. T. Gagie, K. Karhu, G. Navarro, S. J. Puglisi, and J. Sirén. Document listing on repetitive collections. In Proc. CPM, pages 107-119, 2013. Google Scholar
  6. S. Gog, T. Beller, A. Moffat, and M. Petri. From theory to practice: Plug and play with succinct data structures. In Proc. SEA, pages 326-337, 2014. Google Scholar
  7. S. Gog, R. Konow, and G. Navarro. Practical compact indexes for top-k document retrieval. J. Experimental Alg., 22(1):article 1.2, 2017. Google Scholar
  8. S. Gog and G. Navarro. Improved single-term top-k document retrieval. In Proc. ALENEX, pages 24-32, 2015. Google Scholar
  9. W.-K. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string retrieval problems. In Proc. FOCS, pages 713-722, 2009. Google Scholar
  10. R. Konow and G. Navarro. Faster compact top-k document retrieval. In Proc. DCC, pages 5-17, 2013. Google Scholar
  11. J. Labeit and S. Gog. Elias-fano meets single-term top-k document retrieval. In Proc. ALENEX, 2017. Google Scholar
  12. Y. Matias, S. Muthukrishnan, S. C. Sahinalp, and J. Ziv. Augmenting suffix trees, with applications. In Proc. ESA, pages 67-78. Springer, 1998. Google Scholar
  13. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. SODA, pages 657-666, 2002. Google Scholar
  14. G. Navarro. Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences. ACM Comp. Surv., 46(4):article 52, 2014. Google Scholar
  15. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Comp. Surv., 39(1), 2007. Article 2. Google Scholar
  16. G. Navarro and Y. Nekrich. Top-k document retrieval in optimal time and linear space. In Proc. SODA, pages 1066-1078, 2012. Google Scholar
  17. G. Navarro and S. Thankachan. Faster top-k document retrieval in optimal space. In Proc. SPIRE, pages 255-262, 2013. Google Scholar
  18. K. Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Alg., 5(1):12-22, 2007. Google Scholar
  19. W. Szpankowski. A generalized suffix tree and its (un) expected asymptotic behaviors. SIAM Journal on Computing, 22(6):1176-1198, 1993. Google Scholar
  20. The Apache Software Foundation. Apache Lucene. URL: http://lucene.apache.org/.
  21. D. Tsur. Top-k document retrieval in optimal space. Information Processing Letters, 113(12):440-443, 2013. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail