Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
10.4230/LIPIcs.CPM.2020.4
Efficient Tree-Structured Categorical Retrieval
Belazzougui, Djamal
1
Kucherov, Gregory
2
3
https://orcid.org/0000-0001-5899-5424
CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria
CNRS and LIGM/Univ Gustave Eiffel, Marne-la-Vallée, France
Skolkovo Institute of Science and Technology, Moscow, Russia
We study a document retrieval problem in a new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs, e.g., with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ(1+o(1)) + log D + O(h)) + O(Δ) bits of space and O(|p| + t) query time, where n is the total length of the documents, σ is the size of the alphabet used in the documents, and Δ is the total number of nodes in the category tree. Another solution uses n(log σ(1+o(1)) + O(log D)) + O(Δ) + O(D log n) bits of space and O(|p| + t log D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.4/LIPIcs.CPM.2020.4.pdf
pattern matching
document retrieval
category tree
space-efficient data structures