Text Indexing and Searching in Sublinear Time

Authors J. Ian Munro, Gonzalo Navarro, Yakov Nekrich



PDF
Thumbnail PDF

File

LIPIcs.CPM.2020.24.pdf
  • Filesize: 481 kB
  • 15 pages

Document Identifiers

Author Details

J. Ian Munro
  • Cheriton School of Computer Science, University of Waterloo, Canada
Gonzalo Navarro
  • CeBiB - Center of Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile
Yakov Nekrich
  • Department of Computer Science, Michigan Technological University, Houghton, MI, USA

Cite As Get BibTex

J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Text Indexing and Searching in Sublinear Time. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 24:1-24:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.CPM.2020.24

Abstract

We introduce the first index that can be built in o(n) time for a text of length n, and can also be queried in o(q) time for a pattern of length q. On an alphabet of size σ, our index uses O(n log σ) bits, is built in O(n log σ / √{log n}) deterministic time, and computes the number of occurrences of the pattern in time O(q/log_σ n + log n log_σ n). Each such occurrence can then be found in O(log n) time. Other trade-offs between the space usage and the cost of reporting occurrences are also possible.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
Keywords
  • data structures
  • string indexes

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. S. Albers and T. Hagerup. Improved parallel integer sorting without concurrent writing. Information and Computation, 136(1):25-51, 1997. Google Scholar
  2. M. Babenko, P. Gawrychowski, T. Kociumaka, and T. Starikovskaya. Wavelet trees meet suffix trees. In Proc. 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 572-591, 2015. Google Scholar
  3. J. Barbay, F. Claude, T. Gagie, G. Navarro, and Y. Nekrich. Efficient fully-compressed sequence representations. Algorithmica", !PUBLISHER = "Springer, 69(1):232-268, 2014. Google Scholar
  4. D. Belazzougui, P. Boldi, and S. Vigna. Dynamic z-fast tries. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE), pages 159-172, 2010. Google Scholar
  5. D. Belazzougui and G. Navarro. Alphabet-independent compressed text indexing. ACM Transactions on Algorithms, 10(4):article 23, 2014. Google Scholar
  6. D. Belazzougui and G. Navarro. Optimal lower and upper bounds for representing sequences. ACM Transactions on Algorithms, 11(4):article 31, 2015. Google Scholar
  7. D. Belazzougui and S. J. Puglisi. Range predecessor and Lempel-Ziv parsing. In Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2053-2071, 2016. Google Scholar
  8. M. Bender and M. Farach-Colton. The level ancestor problem simplified. Theoretical Computer Science, 321(1):5-12, 2004. Google Scholar
  9. M. A. Bender and M. Farach-Colton. The LCA problem revisited. In Proc. 4th Latin American Symposiumon Theoretical Informatics (LATIN), pages 88-94, 2000. URL: https://doi.org/10.1007/10719839_9.
  10. M. A. Bender, M. Farach-Colton, G. Pemmasani, S. Skiena, and P. Sumazin. Lowest common ancestors in trees and directed acyclic graphs. Journal of Algorithms, 57(2):75-94, 2005. Google Scholar
  11. P. Bille, I. L. Gørtz, and F. R. Skjoldjensen. Deterministic indexing for packed strings. In Proc. 28th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 6:1-6:11, 2017. Google Scholar
  12. O. Birenzwige, S. Golan, and E. Porat. Locally consistent parsing for text indexing in small space. In Proc. 31st ACM-SIAM Symposium on Discrete Algorithms, (SODA), pages 607-626, 2020. Google Scholar
  13. T. M. Chan, K. G. Larsen, and M. Patrascu. Orthogonal range searching on the RAM, revisited. In Proc. 27th ACM Symposium on Computational Geometry (SoCG), pages 1-10, 2011. Google Scholar
  14. B. Chazelle. A functional approach to data structures and its use in multidimensional searching. SIAM Journal on Computing, 17(3):427-462, 1988. URL: https://doi.org/10.1137/0217026.
  15. Y.-F. Chien, W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. Geometric BWT: Compressed text indexing via sparse suffixes and range searching. Algorithmica, 71(2):258-278, 2015. Google Scholar
  16. D. R. Clark. Compact PAT Trees. PhD thesis, University of Waterloo, Canada, 1996. Google Scholar
  17. R. Cole, T. Kopelowitz, and M. Lewenstein. Suffix trays and suffix trists: Structures for faster text indexing. Algorithmica, 72(2):450-466, 2015. Google Scholar
  18. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009. Google Scholar
  19. M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. Journal of the ACM, 47(6):987-1011, 2000. Google Scholar
  20. G. Feigenblat, E. Porat, and A. Shiftan. Linear time succinct indexable dictionary construction with applications. In Proc. 26th Data Compression Conference (DCC), pages 13-22, 2016. Google Scholar
  21. J. Fischer and P. Gawrychowski. Alphabet-dependent string searching with wexponential search trees. In Proc. 26th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 160-171, !series = "LNCS 9133", 2015. Google Scholar
  22. R. González, G. Navarro, and H. Ferrada. Locally compressed suffix arrays. ACM Journal of Experimental Algorithmics, 19(1):article 1, 2014. Google Scholar
  23. R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841-850, 2003. Google Scholar
  24. D. Kempa and T. Kociumaka. String synchronizing sets: Sublinear-time BWT construction and optimal LCE data structure. In Proc. 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 756-767, 2019. Google Scholar
  25. G. M. Landau and U. Vishkin. Fast string matching with k differences. Journal of Computer and System Sciences, 37(1):63-78, 1988. Google Scholar
  26. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. Google Scholar
  27. E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976. Google Scholar
  28. J. I. Munro. Tables. In Proc. 16th FSTTCS, pages 37-42, 1996. Google Scholar
  29. J. I. Munro, G. Navarro, and Y. Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. CoRR, abs/1607.04346, 2016. Google Scholar
  30. J. I. Munro, G. Navarro, and Y. Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Proc. 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 408-424, 2017. Google Scholar
  31. J. I. Munro, G. Navarro, and Y. Nekrich. Text indexing and searching in sublinear time. CoRR, abs/1712.07431, 2017. Google Scholar
  32. J. I. Munro, G. Navarro, and Y. Nekrich. Fast compressed self-indexes with deterministic linear-time construction. Algorithmica, 82(2):316-337, 2020. Google Scholar
  33. J. I. Munro, Y. Nekrich, and J. S. Vitter. Fast construction of wavelet trees. Theoretical Computer Science, 638:91-97, 2016", !note = "Preliminary version appeared in SPIRE'14. Google Scholar
  34. G. Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 25:2-20, 2014. Google Scholar
  35. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007. Google Scholar
  36. G. Navarro and Y. Nekrich. Time-optimal top-k document retrieval. SIAM Journal on Computing, 46(1):89-113, 2017. Google Scholar
  37. Y. Nekrich. Orthogonal range searching in linear and almost-linear space. Computational Geometry, 42(4):342-351, 2009. Google Scholar
  38. S. S. Rao. Time-space trade-offs for compressed suffix arrays. Information Processing Letters, 82(6):307-311, 2002. Google Scholar
  39. M. Ruzic. Constructing efficient dictionaries in close to sorting time. In Proc. 35th International Colloquium on Automata, Languages and Programming (ICALP A), pages 84-95 (part I), 2008, !series = "LNCS 5125". Google Scholar
  40. Y. Tanimura, T. Nishimoto, H. Bannai, S. Inenaga, and M. Takeda. Small-space LCE data structure with constant-time queries. In Proc. 42nd International Symposium on Mathematical Foundations of Computer Science (MFCS), pages 10:1-10:15, 2017. Google Scholar
  41. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. Google Scholar
  42. P. Weiner. Linear pattern matching algorithms. In Proc. 14th IEEE Symposium on Foundations on Computer Science (FOCS), pages 1-11, 1973. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail