Finding Diverse Strings and Longest Common Subsequences in a Graph

Authors: Yuto Shida, Giulia Punzi, Yasuaki Kobayashi, Takeaki Uno, Hiroki Arimura




File

LIPIcs.CPM.2024.27.pdf
  • Filesize: 1.29 MB
  • 19 pages

Author Details

Yuto Shida
  • Hokkaido University, Japan
Giulia Punzi
  • National Institute of Informatics, Tokyo, Japan
Yasuaki Kobayashi
  • Hokkaido University, Japan
Takeaki Uno
  • National Institute of Informatics, Tokyo, Japan
Hiroki Arimura
  • Hokkaido University, Japan

Acknowledgements

The authors express sincere thanks to the anonymous reviewers for their valuable comments, which significantly improved the presentation and quality of this paper. The last author would like to thank Norihito Yasuda, Tesshu Hanaka, Kazuhiro Kurita, and Hirotaka Ono of the AFSA project, and Shinji Ito, for fruitful discussions and helpful comments.

Cite As

Yuto Shida, Giulia Punzi, Yasuaki Kobayashi, Takeaki Uno, and Hiroki Arimura. Finding Diverse Strings and Longest Common Subsequences in a Graph. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 27:1-27:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.27

Abstract

In this paper, we study for the first time the Diverse Longest Common Subsequences (LCSs) problem under Hamming distance. Given a constant number of input strings, the problem asks to decide whether there exists a set X of K longest common subsequences whose diversity is no less than a specified threshold Δ. We consider two types of diversity for a set X of strings of equal length: the Sum diversity and the Min diversity, defined as the sum and the minimum, respectively, of the pairwise Hamming distances between strings in X. We analyze the computational complexity of the corresponding problems, called Max-Sum Diverse LCSs and Max-Min Diverse LCSs, considering both approximation algorithms and parameterized complexity. Our results are summarized as follows. When K is bounded, both problems are polynomial-time solvable. In contrast, when K is unbounded, both problems become NP-hard, while the Max-Sum Diverse LCSs problem admits a PTAS. Furthermore, we analyze the parameterized complexity of both problems with combinations of the parameters K and r, where r is the length of the candidate strings to be selected. Importantly, all positive results above are proven in a more general setting, where the input is an edge-labeled directed acyclic graph (DAG) that succinctly represents a set of strings of the same length. Negative results are proven in the setting where the input is given explicitly as a set of strings. The latter results rely on an encoding of such a set as the longest common subsequences of a specific set of input strings.
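
The two diversity measures defined in the abstract can be made concrete with a short sketch. The function names below (`hamming`, `sum_diversity`, `min_diversity`, `exists_diverse_subset`) are illustrative assumptions and do not come from the paper. The exhaustive search only illustrates the decision problem on an explicitly given candidate set; it is not the authors' algorithm and does not use the DAG representation.

```python
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def sum_diversity(strings) -> int:
    """Sum of the Hamming distances over all unordered pairs."""
    return sum(hamming(x, y) for x, y in combinations(strings, 2))


def min_diversity(strings) -> int:
    """Minimum Hamming distance over all unordered pairs (needs >= 2 strings)."""
    return min(hamming(x, y) for x, y in combinations(strings, 2))


def exists_diverse_subset(candidates, k, delta, diversity) -> bool:
    """Decide by exhaustive search whether some k-subset of the candidates
    has diversity at least delta. Illustration only: this tries every
    k-subset, so it is only practical for small candidate sets."""
    return any(diversity(subset) >= delta
               for subset in combinations(candidates, k))


# Hypothetical candidate set of equal-length strings (not taken from the paper).
candidates = ["abab", "abba", "baba"]
print(sum_diversity(candidates))  # pairwise distances 2 + 4 + 2
print(min_diversity(candidates))
print(exists_diverse_subset(candidates, 2, 4, min_diversity))
```

Passing `sum_diversity` or `min_diversity` as the `diversity` argument switches between the Max-Sum and Max-Min variants of the decision question.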

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • Sequence analysis
  • longest common subsequence
  • Hamming distance
  • dispersion
  • approximation algorithms
  • parameterized complexity

