Forbidden Extension Queries

Biswas, Sudip; Ganguly, Arnab; Shah, Rahul; Thankachan, Sharma V.

doi:10.4230/LIPIcs.FSTTCS.2015.320

File

Subject Classification

Keywords

document retrieval
suffix trees
range queries
succinct data structure

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

Document

0

Metadata

Abstract

Document retrieval is one of the most fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem of document retrieval with forbidden extensions. Let D={T_1,T_2,...,T_D} be a collection of D string documents of n characters in total, and P^+ and P^- be two query patterns, where P^+ is a proper prefix of P^-. We call P^- as the forbidden extension of the included pattern P^+. A forbidden extension query < P^+,P^- > asks to report all occ documents in D that contains P^+ as a substring, but does not contain P^- as one. A top-k forbidden extension query < P^+,P^-,k > asks to report those k documents among the occ documents that are most relevant to P^+. We present a linear index (in words) with an O(|P^-| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following results, when the relevance of a document is based on PageRank: - an O(n) space (in words) index with O(|P^-|log sigma+ k) query time, where sigma is the size of the alphabet from which characters in D are chosen. For constant alphabets, this yields an optimal query time of O(|P^-|+ k). - for any constant epsilon > 0, a |CSA| + |CSA^*| + Dlog frac{n}{D} + O(n) bits index with O(search(P)+ k cdot tsa cdot log ^{2+epsilon} n) query time, where search(P) is the time to find the suffix range of a pattern P, tsa is the time to find suffix (or inverse suffix) array value, and |CSA^*| denotes the maximum of the space needed to store the compressed suffix array CSA of the concatenated text of all documents, or the total space needed to store the individual CSA of each document.

Cite As Get BibTex

Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Forbidden Extension Queries. In 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 45, pp. 320-335, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015) https://doi.org/10.4230/LIPIcs.FSTTCS.2015.320

Author Details

Sudip Biswas

Arnab Ganguly

Rahul Shah

Sharma V. Thankachan

References

Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Ranked document retrieval with forbidden pattern. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings, volume 9133 of Lecture Notes in Computer Science, pages 77-88. Springer, 2015.
Gerth Stølting Brodal, Rolf Fagerberg, Mark Greve, and Alejandro López-Ortiz. Online sorted range reporting. In Yingfei Dong, Ding-Zhu Du, and Oscar H. Ibarra, editors, Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings, volume 5878 of Lecture Notes in Computer Science, pages 173-182. Springer, 2009.
Bernard Chazelle and Leonidas J Guibas. Fractional cascading: I. a data structuring technique. Algorithmica, 1(1-4):133-162, 1986.
Hagai Cohen and Ely Porat. Fast set intersection and two-patterns matching. Theoretical Computer Science, 411(40-42):3795-3800, 2010.
Johannes Fischer, Travis Gagie, Tsvi Kopelowitz, Moshe Lewenstein, Veli Mäkinen, Leena Salmela, and Niko Välimäki. Forbidden patterns. In Proceedings of the 10th Latin American International Conference on Theoretical Informatics, LATIN'12, pages 327-337, Berlin, Heidelberg, 2012. Springer-Verlag.
Johannes Fischer and Volker Heun. Theoretical and practical improvements on the rmq-problem, with applications to LCA and LCE. In Moshe Lewenstein and Gabriel Valiente, editors, Combinatorial Pattern Matching, 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings, volume 4009 of Lecture Notes in Computer Science, pages 36-48. Springer, 2006.
Johannes Fischer and Volker Heun. A new succinct representation of rmq-information and improvements in the enhanced suffix array. In Bo Chen, Mike Paterson, and Guochuan Zhang, editors, Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, First International Symposium, ESCAPE 2007, Hangzhou, China, April 7-9, 2007, Revised Selected Papers, volume 4614 of Lecture Notes in Computer Science, pages 459-470. Springer, 2007.
Greg N. Frederickson and Donald B. Johnson. The complexity of selection and ranking in X+Y and matrices with sorted columns. J. Comput. Syst. Sci., 24(2):197-208, 1982.
Michael L. Fredman and Dan E. Willard. Trans-dichotomous algorithms for minimum spanning trees and shortest paths. J. Comput. Syst. Sci., 48(3):533-551, 1994.
Alexander Golynski, J. Ian Munro, and S. Srinivasa Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, pages 368-373. ACM Press, 2006.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. siam Journal on Computing, 13(2):338-355, 1984.
Wing-Kai Hon, R. Shah, and J.S. Vitter. Space-efficient framework for top-k string retrieval problems. In Foundations of Computer Science, 2009. FOCS'09. 50th Annual IEEE Symposium on, pages 713-722, Oct 2009.
Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Space-efficient frameworks for top-k string retrieval. J. ACM, 61(2):9, 2014.
Wing-Kai Hon, Rahul Shah, SharmaV. Thankachan, and JeffreyScott Vitter. String retrieval for multi-pattern queries. In Edgar Chavez and Stefano Lonardi, editors, String Processing and Information Retrieval, volume 6393 of Lecture Notes in Computer Science, pages 55-66. Springer Berlin Heidelberg, 2010.
Wing-Kai Hon, Rahul Shah, SharmaV. Thankachan, and JeffreyScott Vitter. Document listing for queries with excluded pattern. In Juha Kärkkäinen and Jens Stoye, editors, Combinatorial Pattern Matching, volume 7354 of Lecture Notes in Computer Science, pages 185-195. Springer Berlin Heidelberg, 2012.
Wing-Kai Hon, Sharma V. Thankachan, Rahul Shah, and Jeffrey Scott Vitter. Faster compressed top-k document retrieval. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, 2013 Data Compression Conference, DCC 2013, Snowbird, UT, USA, March 20-22, 2013, pages 341-350. IEEE, 2013.
KasperGreen Larsen, J.Ian Munro, JesperSindahl Nielsen, and SharmaV. Thankachan. On hardness of several string indexing problems. In AlexanderS. Kulikov, SergeiO. Kuznetsov, and Pavel Pevzner, editors, Combinatorial Pattern Matching, volume 8486 of Lecture Notes in Computer Science, pages 242-251. Springer International Publishing, 2014.
Yossi Matias, S. Muthukrishnan, SüleymanCenk Sahinalp, and Jacob Ziv. Augmenting suffix trees, with applications. In Gianfranco Bilardi, GiuseppeF. Italiano, Andrea Pietracaprina, and Geppino Pucci, editors, Algorithms — ESA’ 98, volume 1461 of Lecture Notes in Computer Science, pages 67-78. Springer Berlin Heidelberg, 1998.
J. Ian Munro. Tables. In Vijay Chandru and V. Vinay, editors, Foundations of Software Technology and Theoretical Computer Science, 16th Conference, Hyderabad, India, December 18-20, 1996, Proceedings, volume 1180 of Lecture Notes in Computer Science, pages 37-42. Springer, 1996.
J. Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and Sharma V. Thankachan. Top- k term-proximity in succinct space. In Hee-Kap Ahn and Chan-Su Shin, editors, Algorithms and Computation - 25th International Symposium, ISAAC 2014, Jeonju, Korea, December 15-17, 2014, Proceedings, volume 8889 of Lecture Notes in Computer Science, pages 169-180. Springer, 2014.
S. Muthukrishnan. Efficient algorithms for document retrieval problems. In David Eppstein, editor, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA., pages 657-666. ACM/SIAM, 2002.
Gonzalo Navarro and Yakov Nekrich. Top-k document retrieval in optimal time and linear space. In Yuval Rabani, editor, Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012, pages 1066-1077. SIAM, 2012.
Gonzalo Navarro and Yakov Nekrich. Top-k document retrieval in optimal time and linear space. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1066-1077. SIAM, 2012.
Gonzalo Navarro and Sharma V. Thankachan. Faster top-k document retrieval in optimal space. In Oren Kurland, Moshe Lewenstein, and Ely Porat, editors, String Processing and Information Retrieval - 20th International Symposium, SPIRE 2013, Jerusalem, Israel, October 7-9, 2013, Proceedings, volume 8214 of Lecture Notes in Computer Science, pages 255-262. Springer, 2013.
Manish Patil, Sharma V Thankachan, Rahul Shah, Yakov Nekrich, and Jeffrey Scott Vitter. Categorical range maxima queries. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 266-277. ACM, 2014.
Kunihiko Sadakane and Gonzalo Navarro. Fully-functional succinct trees. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 134-149. SIAM, 2010.

Forbidden Extension Queries

Authors Sudip Biswas, Arnab Ganguly, Rahul Shah, Sharma V. Thankachan

File

Document Identifiers

Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message