Efficient Tree-Structured Categorical Retrieval

Authors Djamal Belazzougui, Gregory Kucherov





Author Details

Djamal Belazzougui
  • CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria
Gregory Kucherov
  • CNRS and LIGM/Univ Gustave Eiffel, Marne-la-Vallée, France
  • Skolkovo Institute of Science and Technology, Moscow, Russia

Cite As

Djamal Belazzougui and Gregory Kucherov. Efficient Tree-Structured Categorical Retrieval. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 4:1-4:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.CPM.2020.4

Abstract

We study a document retrieval problem in a new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs, e.g., with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ (1+o(1)) + log D + O(h)) + O(Δ) bits of space and O(|p| + t) query time, where n is the total length of the documents, σ is the size of the alphabet used in the documents, and Δ is the total number of nodes in the category tree. Another solution uses n(log σ (1+o(1)) + O(log D)) + O(Δ) + O(D log n) bits of space and O(|p| + t log D) query time. Finally, we propose other solutions that are more space-efficient at the expense of a slight increase in query time.
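
To make the query semantics concrete, the following is a minimal brute-force sketch in Python. It is not one of the data structures proposed in the paper: it scans every document on each query and only illustrates what a query returns. The representation is an assumption made for illustration: the tree is a hypothetical `parent` map, each document is attached to one tree node, a category is identified with a depth in the tree, and a categorical unit is a node at that depth. All names and sample data are invented for this example.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical category tree: node -> parent (the root maps to None).
PARENT: Dict[str, Optional[str]] = {
    "life": None,
    "bacteria": "life",
    "archaea": "life",
    "e_coli": "bacteria",
    "b_subtilis": "bacteria",
    "m_jannaschii": "archaea",
}

# Hypothetical documents: id -> (tree node the document is attached to, text).
DOCS: Dict[str, Tuple[str, str]] = {
    "d1": ("e_coli", "ACGTACGT"),
    "d2": ("b_subtilis", "TTACGA"),
    "d3": ("m_jannaschii", "GGGTTT"),
}


def depth(node: str, parent: Dict[str, Optional[str]]) -> int:
    """Number of edges from the root down to `node`."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def ancestor_at_level(
    node: str, level: int, parent: Dict[str, Optional[str]]
) -> Optional[str]:
    """Ancestor of `node` (or `node` itself) at depth `level`, if one exists."""
    d = depth(node, parent)
    if d < level:
        return None
    for _ in range(d - level):
        node = parent[node]
    return node


def categorical_units(
    pattern: str,
    level: int,
    docs: Dict[str, Tuple[str, str]],
    parent: Dict[str, Optional[str]],
) -> Set[str]:
    """Nodes at depth `level` whose subtree holds a document containing `pattern`.

    Brute force: every document is scanned, so the cost grows with the total
    text length rather than with |p| + t as in the bounds quoted above.
    """
    units: Set[str] = set()
    for node, text in docs.values():
        if pattern in text:
            unit = ancestor_at_level(node, level, parent)
            if unit is not None:
                units.add(unit)
    return units
```

In this toy instance, `categorical_units("ACG", 1, DOCS, PARENT)` reports each depth-1 unit once even though two documents below it match, which is the sense in which the output size t counts categorical units rather than documents or pattern occurrences.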

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Information systems → Document representation
  • Information systems → Information retrieval query processing
  • Theory of computation → Data structures design and analysis
Keywords
  • pattern matching
  • document retrieval
  • category tree
  • space-efficient data structures

