Document Open Access Logo

Forbidden Extension Queries

Authors Sudip Biswas, Arnab Ganguly, Rahul Shah, Sharma V. Thankachan



PDF
Thumbnail PDF

File

LIPIcs.FSTTCS.2015.320.pdf
  • Filesize: 0.67 MB
  • 16 pages

Document Identifiers

Author Details

Sudip Biswas
Arnab Ganguly
Rahul Shah
Sharma V. Thankachan

Cite AsGet BibTex

Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Forbidden Extension Queries. In 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 45, pp. 320-335, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)
https://doi.org/10.4230/LIPIcs.FSTTCS.2015.320

Abstract

Document retrieval is one of the most fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem of document retrieval with forbidden extensions. Let D={T_1,T_2,...,T_D} be a collection of D string documents of n characters in total, and P^+ and P^- be two query patterns, where P^+ is a proper prefix of P^-. We call P^- as the forbidden extension of the included pattern P^+. A forbidden extension query < P^+,P^- > asks to report all occ documents in D that contains P^+ as a substring, but does not contain P^- as one. A top-k forbidden extension query < P^+,P^-,k > asks to report those k documents among the occ documents that are most relevant to P^+. We present a linear index (in words) with an O(|P^-| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following results, when the relevance of a document is based on PageRank: - an O(n) space (in words) index with O(|P^-|log sigma+ k) query time, where sigma is the size of the alphabet from which characters in D are chosen. For constant alphabets, this yields an optimal query time of O(|P^-|+ k). - for any constant epsilon > 0, a |CSA| + |CSA^*| + Dlog frac{n}{D} + O(n) bits index with O(search(P)+ k cdot tsa cdot log ^{2+epsilon} n) query time, where search(P) is the time to find the suffix range of a pattern P, tsa is the time to find suffix (or inverse suffix) array value, and |CSA^*| denotes the maximum of the space needed to store the compressed suffix array CSA of the concatenated text of all documents, or the total space needed to store the individual CSA of each document.
Keywords
  • document retrieval
  • suffix trees
  • range queries
  • succinct data structure

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Ranked document retrieval with forbidden pattern. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings, volume 9133 of Lecture Notes in Computer Science, pages 77-88. Springer, 2015. Google Scholar
  2. Gerth Stølting Brodal, Rolf Fagerberg, Mark Greve, and Alejandro López-Ortiz. Online sorted range reporting. In Yingfei Dong, Ding-Zhu Du, and Oscar H. Ibarra, editors, Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings, volume 5878 of Lecture Notes in Computer Science, pages 173-182. Springer, 2009. Google Scholar
  3. Bernard Chazelle and Leonidas J Guibas. Fractional cascading: I. a data structuring technique. Algorithmica, 1(1-4):133-162, 1986. Google Scholar
  4. Hagai Cohen and Ely Porat. Fast set intersection and two-patterns matching. Theoretical Computer Science, 411(40-42):3795-3800, 2010. Google Scholar
  5. Johannes Fischer, Travis Gagie, Tsvi Kopelowitz, Moshe Lewenstein, Veli Mäkinen, Leena Salmela, and Niko Välimäki. Forbidden patterns. In Proceedings of the 10th Latin American International Conference on Theoretical Informatics, LATIN'12, pages 327-337, Berlin, Heidelberg, 2012. Springer-Verlag. Google Scholar
  6. Johannes Fischer and Volker Heun. Theoretical and practical improvements on the rmq-problem, with applications to LCA and LCE. In Moshe Lewenstein and Gabriel Valiente, editors, Combinatorial Pattern Matching, 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings, volume 4009 of Lecture Notes in Computer Science, pages 36-48. Springer, 2006. Google Scholar
  7. Johannes Fischer and Volker Heun. A new succinct representation of rmq-information and improvements in the enhanced suffix array. In Bo Chen, Mike Paterson, and Guochuan Zhang, editors, Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, First International Symposium, ESCAPE 2007, Hangzhou, China, April 7-9, 2007, Revised Selected Papers, volume 4614 of Lecture Notes in Computer Science, pages 459-470. Springer, 2007. Google Scholar
  8. Greg N. Frederickson and Donald B. Johnson. The complexity of selection and ranking in X+Y and matrices with sorted columns. J. Comput. Syst. Sci., 24(2):197-208, 1982. Google Scholar
  9. Michael L. Fredman and Dan E. Willard. Trans-dichotomous algorithms for minimum spanning trees and shortest paths. J. Comput. Syst. Sci., 48(3):533-551, 1994. Google Scholar
  10. Alexander Golynski, J. Ian Munro, and S. Srinivasa Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, pages 368-373. ACM Press, 2006. Google Scholar
  11. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  12. Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. siam Journal on Computing, 13(2):338-355, 1984. Google Scholar
  13. Wing-Kai Hon, R. Shah, and J.S. Vitter. Space-efficient framework for top-k string retrieval problems. In Foundations of Computer Science, 2009. FOCS'09. 50th Annual IEEE Symposium on, pages 713-722, Oct 2009. Google Scholar
  14. Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Space-efficient frameworks for top-k string retrieval. J. ACM, 61(2):9, 2014. Google Scholar
  15. Wing-Kai Hon, Rahul Shah, SharmaV. Thankachan, and JeffreyScott Vitter. String retrieval for multi-pattern queries. In Edgar Chavez and Stefano Lonardi, editors, String Processing and Information Retrieval, volume 6393 of Lecture Notes in Computer Science, pages 55-66. Springer Berlin Heidelberg, 2010. Google Scholar
  16. Wing-Kai Hon, Rahul Shah, SharmaV. Thankachan, and JeffreyScott Vitter. Document listing for queries with excluded pattern. In Juha Kärkkäinen and Jens Stoye, editors, Combinatorial Pattern Matching, volume 7354 of Lecture Notes in Computer Science, pages 185-195. Springer Berlin Heidelberg, 2012. Google Scholar
  17. Wing-Kai Hon, Sharma V. Thankachan, Rahul Shah, and Jeffrey Scott Vitter. Faster compressed top-k document retrieval. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, 2013 Data Compression Conference, DCC 2013, Snowbird, UT, USA, March 20-22, 2013, pages 341-350. IEEE, 2013. Google Scholar
  18. KasperGreen Larsen, J.Ian Munro, JesperSindahl Nielsen, and SharmaV. Thankachan. On hardness of several string indexing problems. In AlexanderS. Kulikov, SergeiO. Kuznetsov, and Pavel Pevzner, editors, Combinatorial Pattern Matching, volume 8486 of Lecture Notes in Computer Science, pages 242-251. Springer International Publishing, 2014. Google Scholar
  19. Yossi Matias, S. Muthukrishnan, SüleymanCenk Sahinalp, and Jacob Ziv. Augmenting suffix trees, with applications. In Gianfranco Bilardi, GiuseppeF. Italiano, Andrea Pietracaprina, and Geppino Pucci, editors, Algorithms — ESA’ 98, volume 1461 of Lecture Notes in Computer Science, pages 67-78. Springer Berlin Heidelberg, 1998. Google Scholar
  20. J. Ian Munro. Tables. In Vijay Chandru and V. Vinay, editors, Foundations of Software Technology and Theoretical Computer Science, 16th Conference, Hyderabad, India, December 18-20, 1996, Proceedings, volume 1180 of Lecture Notes in Computer Science, pages 37-42. Springer, 1996. Google Scholar
  21. J. Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and Sharma V. Thankachan. Top- k term-proximity in succinct space. In Hee-Kap Ahn and Chan-Su Shin, editors, Algorithms and Computation - 25th International Symposium, ISAAC 2014, Jeonju, Korea, December 15-17, 2014, Proceedings, volume 8889 of Lecture Notes in Computer Science, pages 169-180. Springer, 2014. Google Scholar
  22. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In David Eppstein, editor, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA., pages 657-666. ACM/SIAM, 2002. Google Scholar
  23. Gonzalo Navarro and Yakov Nekrich. Top-k document retrieval in optimal time and linear space. In Yuval Rabani, editor, Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012, pages 1066-1077. SIAM, 2012. Google Scholar
  24. Gonzalo Navarro and Yakov Nekrich. Top-k document retrieval in optimal time and linear space. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1066-1077. SIAM, 2012. Google Scholar
  25. Gonzalo Navarro and Sharma V. Thankachan. Faster top-k document retrieval in optimal space. In Oren Kurland, Moshe Lewenstein, and Ely Porat, editors, String Processing and Information Retrieval - 20th International Symposium, SPIRE 2013, Jerusalem, Israel, October 7-9, 2013, Proceedings, volume 8214 of Lecture Notes in Computer Science, pages 255-262. Springer, 2013. Google Scholar
  26. Manish Patil, Sharma V Thankachan, Rahul Shah, Yakov Nekrich, and Jeffrey Scott Vitter. Categorical range maxima queries. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 266-277. ACM, 2014. Google Scholar
  27. Kunihiko Sadakane and Gonzalo Navarro. Fully-functional succinct trees. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 134-149. SIAM, 2010. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail