On Distances Between Words with Parameters

Authors Pierre Bourhis , Aaron Boussidan, Philippe Gambette



PDF
Thumbnail PDF

File

LIPIcs.CPM.2023.6.pdf
  • Filesize: 0.91 MB
  • 23 pages

Document Identifiers

Author Details

Pierre Bourhis
  • Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
Aaron Boussidan
  • LIGM, Université Gustave Eiffel, CNRS, Marne-la-Vallée, France
Philippe Gambette
  • LIGM, Université Gustave Eiffel, CNRS, Marne-la-Vallée, France

Cite AsGet BibTex

Pierre Bourhis, Aaron Boussidan, and Philippe Gambette. On Distances Between Words with Parameters. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 6:1-6:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.CPM.2023.6

Abstract

The edit distance between parameterized words is a generalization of the classical edit distance where it is allowed to map particular letters of the first word, called parameters, to parameters of the second word before computing the distance. This problem has been introduced in particular for detection of code duplication, and the notion of words with parameters has also been used with different semantics in other fields. The complexity of several variants of edit distances between parameterized words has been studied, however, the complexity of the most natural one, the Levenshtein distance, remained open. In this paper, we solve this open question and close the exhaustive analysis of all cases of parameterized word matching and function matching, showing that these problems are np-complete. To this aim, we also provide a comparison of the different problems, exhibiting several equivalences between them. We also provide and implement a MaxSAT encoding of the problem, as well as a simple FPT algorithm in the alphabet size, and study their efficiency on real data in the context of theater play structure comparison.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • String matching
  • edit distance
  • Levenshtein
  • parameterized matching
  • parameterized words
  • parameter words
  • instantiable words
  • NP-completeness
  • MAX-SAT

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Rakesh Agrawal, Christos Faloutsos, and Arun Swami. Efficient similarity search in sequence databases. In International conference on foundations of data organization and algorithms, pages 69-84. Springer, 1993. Google Scholar
  2. Amihood Amir, Yonatan Aumann, Richard Cole, Moshe Lewenstein, and Ely Porat. Function matching: Algorithms, applications, and a lower bound. In Proceedings of the 30th International Conference on Automata, Languages and Programming, pages 929-942, 2003. URL: https://doi.org/10.1007/3-540-45061-0_72.
  3. Amihood Amir, Martin Farach, and S. Muthukrishnan. Alphabet dependence in parameterized matching. Information Processing Letters, 49(3):111-115, 1994. URL: https://doi.org/10.1016/0020-0190(94)90086-8.
  4. Dana Angluin. Finding patterns common to a set of strings. J. Comput. Syst. Sci., 21(1):46-62, 1980. URL: https://doi.org/10.1016/0022-0000(80)90041-0.
  5. Alberto Apostolico, Péter L. Erdős, and Moshe Lewenstein. Parameterized matching with mismatches. Journal of Discrete Algorithms, 5(1):135-140, 2007. URL: https://doi.org/10.1016/j.jda.2006.03.014.
  6. Brenda S. Baker. A theory of parameterized pattern matching: Algorithms and applications. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, STOC '93, pages 71-80, New York, NY, USA, 1993. Association for Computing Machinery. URL: https://doi.org/10.1145/167088.167115.
  7. Brenda S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing, 26:1343-1362, 1997. Google Scholar
  8. Brenda S. Baker. Parameterized diff. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '99, pages 854-855, USA, 1999. Society for Industrial and Applied Mathematics. Google Scholar
  9. Pablo Barceló, Leonid Libkin, and Juan L. Reutter. Querying graph patterns. In Maurizio Lenzerini and Thomas Schwentick, editors, Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2011), pages 199-210. ACM, 2011. URL: https://doi.org/10.1145/1989284.1989307.
  10. Pablo Barceló, Juan Reutter, and Leonid Libkin. Parameterized regular expressions and their languages. Theoretical Computer Science, 474:21-45, 2013. URL: https://doi.org/10.4230/LIPIcs.FSTTCS.2011.351.
  11. Pablo Barceló, Leonid Libkin, and Juan Reutter. Parameterized regular expressions and their languages. Theoretical Computer Science, 474:21-45, 2011. URL: https://doi.org/10.1016/j.tcs.2012.12.036.
  12. Michael Benedikt, Gabriele Puppis, and Cristian Riveros. The cost of traveling between languages. In Luca Aceto, Monika Henzinger, and Jiří Sgall, editors, Automata, Languages and Programming, pages 234-245, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. Google Scholar
  13. Michael Benedikt, Gabriele Puppis, and Cristian Riveros. Regular repair of specifications. In 2011 IEEE 26th Annual Symposium on Logic in Computer Science, pages 335-344, 2011. URL: https://doi.org/10.1109/LICS.2011.43.
  14. Michael Benedikt, Gabriele Puppis, and Cristian Riveros. Bounded repairability of word languages. Journal of Computer and System Sciences, 79(8):1302-1321, 2013. URL: https://doi.org/10.1016/j.jcss.2013.06.001.
  15. Francine Blanchet-Sadri. Algorithmic Combinatorics on Partial Words (Discrete Mathematics and Its Applications). Chapman, Hall/CRC, 2007. Google Scholar
  16. William W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. SIGMOD Rec., 27(2):201-212, 1998. URL: https://doi.org/10.1145/276305.276323.
  17. Richard Cole, Carmit Hazay, Moshe Lewenstein, and Dekel Tsur. Two-dimensional parameterized matching. ACM Trans. Algorithms, 11(2), October 2014. URL: https://doi.org/10.1145/2650220.
  18. Jessica Davies. Solving MAXSAT by Decoupling Optimization and Satisfaction. PhD thesis, University of Toronto, Canada, 2014. URL: http://hdl.handle.net/1807/43539.
  19. Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Pbwt: Achieving succinct data structures for parameterized pattern matching and related problems. In Proceedings of the 2017 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 397-407, 2017. URL: https://doi.org/10.1137/1.9781611974782.25.
  20. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness (Series of Books in the Mathematical Sciences). W. H. Freeman, first edition edition, 1979. Google Scholar
  21. Pawel Gawrychowski, Florin Manea, and Stefan Siemer. Matching patterns with variables under hamming distance. In Filippo Bonchi and Simon J. Puglisi, editors, 46th International Symposium on Mathematical Foundations of Computer Science, MFCS 2021, August 23-27, 2021, Tallinn, Estonia, volume 202 of LIPIcs, pages 48:1-48:24. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.MFCS.2021.48.
  22. Pawel Gawrychowski, Florin Manea, and Stefan Siemer. Matching patterns with variables under edit distance. In Diego Arroyuelo and Barbara Poblete, editors, String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepción, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science, pages 275-289. Springer, 2022. URL: https://doi.org/10.1007/978-3-031-20643-6_20.
  23. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://www.wikidata.org/entity/Q55980413.
  24. Carmit Hazay, Moshe Lewenstein, and Dina Sokol. Approximate parameterized matching. ACM Trans. Algorithms, 3(3):29-es, 2007. URL: https://doi.org/10.1145/1273340.1273345.
  25. Wilbert Jan Heeringa. Measuring dialect pronunciation differences using Levenshtein distance. PhD thesis, University of Groningen, 2004. Google Scholar
  26. Orgad Keller, Tsvi Kopelowitz, and Moshe Lewenstein. On the longest common parameterized subsequence. Theoretical Computer Science, 410(51):5347-5353, 2009. URL: https://doi.org/10.1016/j.tcs.2009.09.011.
  27. Lillian Jane Lee. Similarity-based approaches to natural language processing. PhD thesis, Harvard University, 1997. Google Scholar
  28. Vladimir I Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10(8), pages 707-710. Soviet Union, 1966. Google Scholar
  29. Moshe Lewenstein. Parameterized Matching, pages 635-638. Springer US, Boston, MA, 2008. URL: https://doi.org/10.1007/978-0-387-30162-4_282.
  30. Florin Manea and Markus L. Schmid. Matching patterns with variables. In Robert Mercas and Daniel Reidenbach, editors, Combinatorics on Words - 12th International Conference, WORDS 2019, Loughborough, UK, September 9-13, 2019, Proceedings, volume 11682 of Lecture Notes in Computer Science, pages 1-27. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-28796-2_1.
  31. Juan Mendivelso, Sharma V. Thankachan, and Yoan Pinzón. A brief history of parameterized matching problems. Discrete Applied Mathematics, 274:103-115, 2020. Stringology Algorithms. URL: https://doi.org/10.1016/j.dam.2018.07.017.
  32. Mehryar Mohri. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(06):957-982, 2003. Google Scholar
  33. Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970. Google Scholar
  34. Sascha Schimke, Claus Vielhauer, and Jana Dittmann. Using adapted levenshtein distance for on-line signature authentication. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 2, pages 931-934. IEEE, 2004. Google Scholar
  35. T. Shibuya. Generalization of a suffix tree for RNA structural pattern matching. Algorithmica (New York), 39(1):1-19, 2004. URL: https://doi.org/10.1007/s00453-003-1067-9.
  36. Bernd Voigt. The partition problem for finite abelian groups. Journal of Combinatorial Theory, Series A, 28(3):257-271, 1980. Google Scholar
  37. Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168-173, 1974. URL: https://doi.org/10.1145/321796.321811.
  38. Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245-1262, 1989. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail