Efficient Tree-Structured Categorical Retrieval

Belazzougui, Djamal; Kucherov, Gregory

doi:10.4230/LIPIcs.CPM.2020.4

Subject Classification

ACM Subject Classification

Theory of computation → Pattern matching
Information systems → Document representation
Information systems → Information retrieval query processing
Theory of computation → Data structures design and analysis

Keywords

pattern matching
document retrieval
category tree
space-efficient data structures

Abstract

We study a document retrieval problem in a new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs, e.g., with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (a level in the category tree), we wish to efficiently retrieve the t categorical units that contain this pattern and belong to the category. We propose several efficient solutions for this problem. One of them uses n(log σ(1+o(1)) + log D + O(h)) + O(Δ) bits of space and O(|p| + t) query time, where n is the total length of the documents, σ is the size of the alphabet used in the documents, and Δ is the total number of nodes in the category tree. Another solution uses n(log σ(1+o(1)) + O(log D)) + O(Δ) + O(D log n) bits of space and O(|p| + t log D) query time. Finally, we propose other solutions that are more space-efficient at the expense of a slight increase in query time.
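To make the query semantics concrete, the following is a minimal brute-force sketch of the problem statement only, not of the compact data structures the paper proposes. It assumes that a category corresponds to a depth in the tree (root at depth 0) and that documents may be attached to any node; the names `Node`, `subtree_contains`, `categorical_units`, and the toy tree are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of the category tree (hypothetical representation)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    documents: List[str] = field(default_factory=list)  # texts stored at this node


def subtree_contains(node: Node, pattern: str) -> bool:
    """True if any document in the subtree rooted at `node` contains `pattern`."""
    if any(pattern in doc for doc in node.documents):
        return True
    return any(subtree_contains(child, pattern) for child in node.children)


def categorical_units(root: Node, pattern: str, level: int) -> List[str]:
    """Names of the nodes at depth `level` whose subtree contains `pattern`.

    Naive baseline: it scans every document, so its cost depends on the
    total text length n rather than on |p| and the output size t.
    """
    frontier = [root]
    for _ in range(level):
        # Descend one level: replace each node by its children.
        frontier = [child for node in frontier for child in node.children]
    return [node.name for node in frontier if subtree_contains(node, pattern)]


# Toy taxonomy with made-up sequences (assumption, for illustration only).
root = Node("life", children=[
    Node("bacteria", children=[
        Node("ecoli", documents=["ACGTAC"]),
        Node("bsub", documents=["TTGACA"]),
    ]),
    Node("archaea", children=[
        Node("halo", documents=["GGACGT"]),
    ]),
])

print(categorical_units(root, "ACGT", 1))  # units at depth 1 containing "ACGT"
print(categorical_units(root, "TTG", 2))   # units at depth 2 containing "TTG"
```

This baseline rescans all documents on every query; the bounds stated in the abstract concern indexes that avoid exactly this rescanning.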

Cite As

Djamal Belazzougui and Gregory Kucherov. Efficient Tree-Structured Categorical Retrieval. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 4:1-4:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.CPM.2020.4

Author Details

Djamal Belazzougui

CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria

Gregory Kucherov

CNRS and LIGM/Univ Gustave Eiffel, Marne-la-Vallée, France
Skolkovo Institute of Science and Technology, Moscow, Russia

