Fast Entropy-Bounded String Dictionary Look-Up with Mismatches

Authors: Pawel Gawrychowski, Gad M. Landau, Tatiana Starikovskaya




File

LIPIcs.MFCS.2018.66.pdf
  • Filesize: 395 kB
  • 15 pages

Author Details

Pawel Gawrychowski
  • University of Wrocław, Wrocław, 50-137, Poland
Gad M. Landau
  • University of Haifa, Haifa, 3498838, Israel
Tatiana Starikovskaya
  • DIENS, École normale supérieure, PSL Research University, Paris, 75005, France

Cite As

Pawel Gawrychowski, Gad M. Landau, and Tatiana Starikovskaya. Fast Entropy-Bounded String Dictionary Look-Up with Mismatches. In 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 117, pp. 66:1-66:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.MFCS.2018.66

Abstract

We revisit the fundamental problem of dictionary look-up with mismatches. Given a set (dictionary) of d strings of length m and an integer k, we must preprocess it into a data structure to answer the following queries: Given a query string Q of length m, find all strings in the dictionary that are at Hamming distance at most k from Q. Chan and Lewenstein (CPM 2015) showed a data structure for k = 1 with optimal query time O(m/w + occ), where w is the size of a machine word and occ is the size of the output. The data structure occupies O(w d log^{1+epsilon} d) extra bits of space (beyond the entropy-bounded space required to store the dictionary strings). In this work we give a solution with similar bounds for a much wider range of values k. Namely, we give a data structure that has O(m/w + log^k d + occ) query time and uses O(w d log^k d) extra bits of space.
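
To make the query semantics concrete, here is a minimal sketch of the problem statement only: a naive linear scan that returns every dictionary string within Hamming distance k of the query. It is not the data structure described in the abstract and achieves none of its bounds; it takes O(d m) time per query. The function names `hamming_distance_at_most` and `lookup_with_mismatches` are hypothetical and chosen for illustration.

```python
from typing import List


def hamming_distance_at_most(a: str, b: str, k: int) -> bool:
    """Return True if equal-length strings a and b differ in at most k positions."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            # Stop early once the mismatch budget is exceeded.
            if mismatches > k:
                return False
    return True


def lookup_with_mismatches(dictionary: List[str], query: str, k: int) -> List[str]:
    """Naive baseline: scan all d strings of length m and keep those
    at Hamming distance at most k from the query (O(d * m) time)."""
    return [s for s in dictionary if hamming_distance_at_most(s, query, k)]


if __name__ == "__main__":
    words = ["abcd", "abce", "xbcd", "wxyz"]
    print(lookup_with_mismatches(words, "abcd", 1))
```

Such a scan can serve as a correctness reference when testing a faster index, since both must report the same set of strings for every query.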

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Dictionary look-up
  • Hamming distance
  • compact data structures

References

  1. Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Trans. Algorithms, 3(2), 2007.
  2. Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC'15, pages 793-801, 2015.
  3. Djamal Belazzougui. Faster and space-optimal edit distance "1" dictionary. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'09, pages 154-167, 2009.
  4. Djamal Belazzougui and Rossano Venturini. Compressed string dictionary look-up with edit distance one. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'12, pages 280-292, 2012.
  5. Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen. Deterministic indexing for packed strings. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'17, pages 6:1-6:11, 2017.
  6. Thomas Bocek, Ela Hunt, Burkhard Stiller, and Fabio Hecht. Fast similarity search in large dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, 2007.
  7. Gerth Stølting Brodal and Leszek Gasieniec. Approximate dictionary queries. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'96, pages 65-74, 1996.
  8. Gerth Stølting Brodal and Srinivasan Venkatesh. Improved bounds for dictionary look-up with one error. Inf. Process. Lett., 75:57-59, 2000.
  9. Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. Compressed indexes for approximate string matching. J. Algorithmica, 58:263-281, 2006.
  10. Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. A linear size index for approximate pattern matching. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'06, pages 45-59, 2006.
  11. Timothy Chan and Moshe Lewenstein. Fast string dictionary lookup with one error. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'15, pages 114-123, 2015.
  12. Aleksander Cisłak and Szymon Grabowski. A practical index for approximate dictionary matching with few mismatches. Computing & Informatics, 36(5):1088-1106, 2017.
  13. Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proc. of the 36th Annual ACM Symposium on Theory of Computing, STOC'04, pages 91-100, 2004.
  14. Martin Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'96, pages 130-140, 1996.
  15. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005.
  16. Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372(1):115-121, 2007.
  17. Johannes Fischer and Pawel Gawrychowski. Alphabet-dependent string searching with wexponential search trees. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'15, pages 160-171, 2015.
  18. Johannes Fischer and Volker Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proc. of the Annual Conference on Combinatorial Pattern Matching, CPM'06, pages 36-48, 2006.
  19. Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47(3):424-436, Dec. 1993.
  20. Pawel Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Proc. of the Annual European Symposium on Algorithms, ESA'14, pages 455-466, 2014.
  21. Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338-355, 1984.
  22. Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Compressed dictionary matching with one error. In Proc. of the Data Compression Conference, DCC'11, pages 113-122, 2011.
  23. Daniel Karch, Dennis Luxen, and Peter Sanders. Improved fast similarity search in dictionaries. In Proc. of the International Symposium on String Processing and Information Retrieval, SPIRE'10, pages 173-178, 2010.
  24. Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987.
  25. Tak Wah Lam, Wing-Kin Sung, and Swee-Seong Wong. Improved approximate string matching using compressed suffix data structures. J. Algorithmica, 51:298-314, 2005.
  26. Giovanni Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407-430, May 2001.
  27. Moshe Mor and Aviezri S. Fraenkel. A hash code method for detecting and correcting spelling errors. Commun. ACM, 25(12):935-938, 1982.
  28. Milan Ružić. Uniform deterministic dictionaries. ACM Trans. Algorithms, 4(1):1:1-1:23, Mar. 2008.
  29. Takuya Takagi, Shunsuke Inenaga, Kunihiko Sadakane, and Hiroki Arimura. Packed compact tries: A fast and efficient data structure for online string processing. In Proc. of the 27th International Workshop on Combinatorial Algorithms, volume 9843 of IWOCA'16, pages 213-225. Springer, 2016.
  30. Dan E. Willard. Log-logarithmic worst-case range queries are possible in space o(n). Information Processing Letters, 17(2):81-84, 1983.
  31. Andrew Chi-Chih Yao and Foong Frances Yao. Dictionary look-up with one error. J. Algorithms, 25:194-202, 1997.