Unary Words Have the Smallest Levenshtein k-Neighbourhoods

Charalampopoulos, Panagiotis; Pissis, Solon P.; Radoszewski, Jakub; Waleń, Tomasz; Zuba, Wiktor

doi:10.4230/LIPIcs.CPM.2020.10

Abstract

The edit distance (a.k.a. the Levenshtein distance) between two words is defined as the minimum number of insertions, deletions or substitutions of letters needed to transform one word into the other. The Levenshtein k-neighbourhood of a word w is the set of words that are at edit distance at most k from w. This is perhaps the most important concept underlying BLAST, a widely used tool for comparing biological sequences. A natural combinatorial question is to ask for upper and lower bounds on the size of this set. The answer to this question has important algorithmic implications as well. Myers notes that "such bounds would give a tighter characterisation of the running time of the algorithm" behind BLAST. We show that the size of the Levenshtein k-neighbourhood of any word of length n over an arbitrary alphabet is not smaller than the size of the Levenshtein k-neighbourhood of a unary word of length n, thus providing a tight lower bound on the size of the Levenshtein k-neighbourhood. We remark that this result was posed as a conjecture by Dufresne at WCTA 2019.
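The definitions in the abstract can be checked on tiny inputs with a brute-force sketch. The code below is illustrative only and is not the authors' method: the function names (`edit_distance`, `neighbourhood`, `smallest_is_unary`) are hypothetical, the alphabet is passed as a string of distinct letters, and the enumeration is exponential in the word length, so it is only suitable for very small n and k.

```python
from itertools import product


def edit_distance(u, v):
    """Levenshtein distance via the standard row-by-row dynamic programme."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        cur = [i]
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,              # delete a from u
                           cur[j - 1] + 1,           # insert b into u
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def neighbourhood(w, k, alphabet):
    """All words over `alphabet` at edit distance at most k from w.

    A word within distance k of w differs in length from w by at most k,
    so only those lengths are enumerated.
    """
    result = set()
    for length in range(max(0, len(w) - k), len(w) + k + 1):
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            if edit_distance(w, candidate) <= k:
                result.add(candidate)
    return result


def smallest_is_unary(n, k, alphabet):
    """Check, by exhaustive search, that no word of length n has a smaller
    k-neighbourhood than the unary word of length n."""
    unary_size = len(neighbourhood(alphabet[0] * n, k, alphabet))
    return all(
        len(neighbourhood("".join(w), k, alphabet)) >= unary_size
        for w in product(alphabet, repeat=n)
    )


if __name__ == "__main__":
    print(len(neighbourhood("aa", 1, "ab")))  # 8
    print(len(neighbourhood("ab", 1, "ab")))  # 9
    print(smallest_is_unary(3, 1, "ab"))      # True
```

For the binary alphabet {a, b} and k = 1, the unary word aa has 8 neighbours (aa itself, a, ab, ba, and the four words obtained by a single insertion), while ab has 9, which is consistent with the lower bound stated in the abstract.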

Cite As

Panagiotis Charalampopoulos, Solon P. Pissis, Jakub Radoszewski, Tomasz Waleń, and Wiktor Zuba. Unary Words Have the Smallest Levenshtein k-Neighbourhoods. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 10:1-10:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.CPM.2020.10

Author Details

Panagiotis Charalampopoulos

Department of Informatics, King’s College London, UK
Institute of Informatics, University of Warsaw, Poland

Solon P. Pissis

CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
ERABLE Team, Lyon, France

Jakub Radoszewski

Institute of Informatics, University of Warsaw, Poland
Samsung R&D, Warsaw, Poland

Tomasz Waleń

Institute of Informatics, University of Warsaw, Poland

Wiktor Zuba

Institute of Informatics, University of Warsaw, Poland

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 872539.

Charalampopoulos, Panagiotis: Supported by ERC grant TOTAL under the European Union’s Horizon 2020 Research and Innovation Programme (agreement no. 677651).
Radoszewski, Jakub: Supported by the Polish National Science Center, grant number 2018/31/D/ST6/03991.
Waleń, Tomasz: Supported by the Polish National Science Center, grant number 2018/31/D/ST6/03991.
Zuba, Wiktor: Supported by the Polish National Science Center, grant number 2018/31/D/ST6/03991.

References

Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990. URL: https://doi.org/10.1016/S0022-2836(05)80360-2.
Leonor Becerra-Bonache, Colin de la Higuera, Jean-Christophe Janodet, and Frédéric Tantini. Learning balls of strings from edit corrections. The Journal of Machine Learning Research, 9:1841-1870, 2008. URL: https://dl.acm.org/citation.cfm?id=1442793.
Edgar Chávez, Gonzalo Navarro, Ricardo A. Baeza-Yates, and José L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273-321, 2001. URL: https://doi.org/10.1145/502807.502808.
Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
Yoann Dufresne. An exploration of Levenshtein neighborhood densities. 14th Workshop on Compression, Text and Algorithms (WCTA), 2019. Talk.
Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
Stoyan Mihov and Klaus U. Schulz. Fast approximate search in large dictionaries. Computational Linguistics, 30(4):451-477, December 2004. URL: https://doi.org/10.1162/0891201042544938.
Gene Myers. What’s behind BLAST. In Cédric Chauve, Nadia El-Mabrouk, and Eric Tannier, editors, Models and Algorithms for Genome Evolution, pages 3-15. Springer, 2013. URL: https://doi.org/10.1007/978-1-4471-5298-9_1.
Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88, 2001. URL: https://doi.org/10.1145/375360.375365.
Marie-France Sagot and Yoshiko Wakabayashi. Pattern inference under many guises. In Bruce A. Reed and Cláudia L. Sales, editors, Recent Advances in Algorithms and Combinatorics, pages 245-287. Springer New York, 2003. URL: https://doi.org/10.1007/0-387-22444-0_8.
Hélène Touzet. On the Levenshtein automaton and the size of the neighbourhood of a word. In Language and Automata Theory and Applications - 10th International Conference, LATA 2016, Proceedings, pages 207-218, 2016. URL: https://doi.org/10.1007/978-3-319-30000-9_16.
