Fast Entropy-Bounded String Dictionary Look-Up with Mismatches

Gawrychowski, Pawel; Landau, Gad M.; Starikovskaya, Tatiana

doi:10.4230/LIPIcs.MFCS.2018.66

Author Details

Pawel Gawrychowski

University of Wrocław, Wrocław, 50-137, Poland

Gad M. Landau

University of Haifa, Haifa, 3498838, Israel

Tatiana Starikovskaya

DIENS, École normale supérieure, PSL Research University, Paris, 75005, France

Cite As

Pawel Gawrychowski, Gad M. Landau, and Tatiana Starikovskaya. Fast Entropy-Bounded String Dictionary Look-Up with Mismatches. In 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 117, pp. 66:1-66:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.MFCS.2018.66

Abstract

We revisit the fundamental problem of dictionary look-up with mismatches. Given a set (dictionary) of d strings of length m and an integer k, we must preprocess it into a data structure to answer the following queries: Given a query string Q of length m, find all strings in the dictionary that are at Hamming distance at most k from Q. Chan and Lewenstein (CPM 2015) showed a data structure for k = 1 with optimal query time O(m/w + occ), where w is the size of a machine word and occ is the size of the output. The data structure occupies O(w d log^{1+epsilon} d) extra bits of space (beyond the entropy-bounded space required to store the dictionary strings). In this work we give a solution with similar bounds for a much wider range of values k. Namely, we give a data structure that has O(m/w + log^k d + occ) query time and uses O(w d log^k d) extra bits of space.
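To make the query in the abstract concrete, here is a minimal sketch of the naive baseline: it scans every dictionary string and counts mismatches against Q, stopping early once more than k are found. This is not the data structure from the paper. It does no preprocessing and takes O(d m) time per query, which is the cost the entropy-bounded structure is designed to avoid. The function names `hamming_distance_at_most` and `naive_lookup` and the sample words are illustrative assumptions.

```python
from typing import Iterable, List


def hamming_distance_at_most(a: str, b: str, k: int) -> bool:
    """Return True if equal-length strings a and b differ in at most k positions."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                # Early exit: the budget of k mismatches is already exceeded.
                return False
    return True


def naive_lookup(dictionary: Iterable[str], query: str, k: int) -> List[str]:
    """Report every dictionary string within Hamming distance k of query.

    Naive baseline (hypothetical helper, not the paper's method):
    no preprocessing, O(d * m) time per query.
    """
    return [s for s in dictionary if hamming_distance_at_most(s, query, k)]


# Illustrative dictionary of d = 5 strings of length m = 4.
words = ["abcd", "abce", "xbcd", "xycd", "wxyz"]
print(naive_lookup(words, "abcd", 1))  # ['abcd', 'abce', 'xbcd']
```

With k = 1 the query "abcd" reports three strings (occ = 3); raising k to 2 also reports "xycd".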

Subject Classification

ACM Subject Classification

Theory of computation → Pattern matching

Keywords

Dictionary look-up
Hamming distance
compact data structures

References

Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Trans. Algorithms, 3(2), 2007.
Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC'15, pages 793-801, 2015.
Djamal Belazzougui. Faster and space-optimal edit distance "1" dictionary. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'09, pages 154-167, 2009.
Djamal Belazzougui and Rossano Venturini. Compressed string dictionary look-up with edit distance one. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'12, pages 280-292, 2012.
Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen. Deterministic indexing for packed strings. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'17, pages 6:1-6:11, 2017.
Thomas Bocek, Ela Hunt, Burkhard Stiller, and Fabio Hecht. Fast similarity search in large dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, 2007.
Gerth Stølting Brodal and Leszek Gąsieniec. Approximate dictionary queries. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'96, pages 65-74, 1996.
Gerth Stølting Brodal and Srinivasan Venkatesh. Improved bounds for dictionary look-up with one error. Inf. Process. Lett., 75:57-59, 2000.
Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. Compressed indexes for approximate string matching. J. Algorithmica, 58:263-281, 2006.
Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. A linear size index for approximate pattern matching. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'06, pages 45-59, 2006.
Timothy Chan and Moshe Lewenstein. Fast string dictionary lookup with one error. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'15, pages 114-123, 2015.
Aleksander Cisłak and Szymon Grabowski. A practical index for approximate dictionary matching with few mismatches. Computing & Informatics, 36(5):1088-1106, 2017.
Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proc. of the 36th Annual ACM Symposium on Theory of Computing, STOC'04, pages 91-100, 2004.
Martin Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'96, pages 130-140, 1996.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005.
Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372(1):115-121, 2007.
Johannes Fischer and Pawel Gawrychowski. Alphabet-dependent string searching with wexponential search trees. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'15, pages 160-171, 2015.
Johannes Fischer and Volker Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM'06, pages 36-48, 2006.
Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47(3):424-436, Dec. 1993.
Pawel Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Proc. of the Annual European Symposium on Algorithms, ESA'14, pages 455-466, 2014.
Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338-355, 1984.
Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Compressed dictionary matching with one error. In Proc. of the Data Compression Conference, DCC'11, pages 113-122, 2011.
Daniel Karch, Dennis Luxen, and Peter Sanders. Improved fast similarity search in dictionaries. In Proc. of the International Symposium on String Processing and Information Retrieval, SPIRE'10, pages 173-178, 2010.
Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987.
Tak Wah Lam, Wing-Kin Sung, and Swee-Seong Wong. Improved approximate string matching using compressed suffix data structures. J. Algorithmica, 51:298-314, 2005.
Giovanni Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407-430, May 2001.
Moshe Mor and Aviezri S. Fraenkel. A hash code method for detecting and correcting spelling errors. Commun. ACM, 25(12):935-938, 1982.
Milan Ružić. Uniform deterministic dictionaries. ACM Trans. Algorithms, 4(1):1:1-1:23, Mar. 2008.
Takuya Takagi, Shunsuke Inenaga, Kunihiko Sadakane, and Hiroki Arimura. Packed compact tries: A fast and efficient data structure for online string processing. In Proc. of the 27th International Workshop on Combinatorial Algorithms, IWOCA'16, volume 9843 of LNCS, pages 213-225. Springer, 2016.
Dan E. Willard. Log-logarithmic worst-case range queries are possible in space o(n). Information Processing Letters, 17(2):81-84, 1983.
Andrew Chi-Chih Yao and Foong Frances Yao. Dictionary look-up with one error. J. Algorithms, 25:194-202, 1997.
