The Fine-Grained Complexity of Episode Matching

Bille, Philip; Gørtz, Inge Li; Mozes, Shay; Steiner, Teresa Anna; Weimann, Oren

doi:10.4230/LIPIcs.CPM.2022.4

Abstract

Given two strings S and P, the Episode Matching problem is to find the shortest substring of S that contains P as a subsequence. The best known upper bound for this problem is Õ(nm) by Das et al. (1997), where n and m are the lengths of S and P, respectively. Although the problem is well studied and has many applications in data mining, this bound has never been improved. In this paper we show why this is the case by proving that no O((nm)^{1-ε}) algorithm exists, even for binary strings, unless the Strong Exponential Time Hypothesis (SETH) is false. We then consider the indexing version of the problem, where S is preprocessed into a data structure for answering episode matching queries P. We show that for any τ, there is a data structure using O(n + (n/τ)^k) space that answers episode matching queries for any P of length k in O(k · τ · log log n) time. We complement this upper bound with an almost matching lower bound, showing that any data structure that answers episode matching queries for patterns of length k in time O(n^δ) must use Ω(n^{k-kδ-o(1)}) space, unless the Strong k-Set Disjointness Conjecture is false. Finally, for the special case of k = 2, we present a faster construction of the data structure using fast min-plus multiplication of bounded integer matrices.
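To make the problem statement concrete, the following is a minimal sketch of a textbook-style O(nm) dynamic program for Episode Matching. It is an illustration only and is not claimed to be the algorithm of Das et al. or any construction from the paper; the function name `shortest_episode` and the convention of returning a half-open index pair are assumptions made here.

```python
from typing import Optional, Tuple


def shortest_episode(S: str, P: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) such that S[start:end] is a shortest substring of S
    containing P as a subsequence, or None if no such substring exists.

    Hypothetical helper for illustration; runs in O(len(S) * len(P)) time.
    """
    m = len(P)
    if m == 0:
        return (0, 0)  # the empty pattern is contained in the empty substring

    # last[i] = largest start s such that P[:i+1] is a subsequence of
    # S[s:j+1] for the current scan position j, or -1 if there is none.
    last = [-1] * m
    best: Optional[Tuple[int, int]] = None

    for j, c in enumerate(S):
        # Go through pattern positions from right to left so that last[i-1]
        # still describes windows ending strictly before position j.
        for i in range(m - 1, -1, -1):
            if P[i] != c:
                continue
            if i == 0:
                last[0] = j
            elif last[i - 1] != -1:
                last[i] = last[i - 1]
        # A new candidate window can only appear when the last pattern
        # character was just matched at position j.
        if P[m - 1] == c and last[m - 1] != -1:
            start = last[m - 1]
            if best is None or (j + 1 - start) < (best[1] - best[0]):
                best = (start, j + 1)
    return best
```

For example, with S = "axbxxab" and P = "ab", the window S[0:3] = "axb" is found first and is later replaced by the shorter window S[5:7] = "ab". The conditional lower bound stated in the abstract suggests that the quadratic behaviour of such a scan cannot be improved to O((nm)^{1-ε}) unless SETH is false.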

References

Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In Proc. 56th FOCS, pages 59-78, 2015.
Amir Abboud and Virginia Vassilevska Williams. Fine-grained hardness for edit distance to a fixed sequence. In Proc. 48th ICALP, volume 198, pages 7:1-7:14, 2021.
Amir Abboud, Virginia Vassilevska Williams, and Oren Weimann. Consequences of faster alignment of sequences. In Proc. 41st ICALP, pages 39-51, 2014.
Avinash Achar, A. Ibrahim, and P. S. Sastry. Pattern-growth based frequent serial episode discovery. Data Knowl. Eng., 87:91-108, 2013.
Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proc. 32nd SODA, pages 522-539, 2021.
Alberto Apostolico and Mikhail J. Atallah. Compact recognizers of episode sequences. Inf. Comput., 174(2):180-192, 2002.
Mikhail J. Atallah, Robert Gwadera, and Wojciech Szpankowski. Detection of significant sets of episodes in event sequences. In Proc. 4th ICDM, pages 3-10, 2004.
Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In Proc. 57th FOCS, pages 457-466, 2016.
Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J. Comput., 47(3):1087-1097, 2018.
Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. In Proc. 32nd CPM, pages 10:1-10:19, 2021.
Luc Boasson, Patrick Cégielski, Irène Guessarian, and Yuri V. Matiyasevich. Window-accumulated subsequence matching problem is linear. Ann. Pure Appl. Log., 113(1-3):59-80, 2001.
Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Proc. 56th FOCS, pages 79-97, 2015.
Patrick Cégielski, Irène Guessarian, and Yuri V. Matiyasevich. Multiple serial episodes matching. Inf. Process. Lett., 98(6):211-218, 2006.
Maxime Crochemore, Costas S. Iliopoulos, Christos Makris, Wojciech Rytter, Athanasios K. Tsakalidis, and T. Tsichlas. Approximate string matching with gaps. Nord. J. Comput., 9(1):54-65, 2002.
Gautam Das, Rudolf Fleischer, Leszek Gasieniec, Dimitrios Gunopulos, and Juha Kärkkäinen. Episode matching. In Proc. 8th CPM, pages 12-27, 1997.
Lech Duraj, Marvin Künnemann, and Adam Polak. Tight conditional lower bounds for longest common increasing subsequence. Algorithmica, 81(10):3968-3992, 2019.
Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. On the complexity of string matching for graphs. In Proc. 46th ICALP, pages 55:1-55:15, 2019.
Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. In Proc. 27th SOFSEM, volume 12607, pages 608-622, 2021.
Daniel Gibney. An efficient elastic-degenerate text index? Not likely. In Proc. 27th SPIRE, pages 76-88, 2020.
Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional lower bounds for space/time tradeoffs. In Proc. 15th WADS, pages 421-436, 2017.
Robert Gwadera, Mikhail J. Atallah, and Wojciech Szpankowski. Markov models for identification of significant episodes. In Proc. 5th SDM, pages 404-414, 2005.
Robert Gwadera, Mikhail J. Atallah, and Wojciech Szpankowski. Reliable detection of episodes in event sequences. Knowl. Inf. Syst., 7(4):415-437, 2005.
Masahiro Hirao, Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, and Setsuo Arikawa. A practical algorithm to find the best episode patterns. In Proc. 4th DS, pages 435-440, 2001.
Daniel S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24(4):664-675, 1977.
Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. J. Comput. Syst. Sci., 62(2):367-375, 2001.
Mika Klemettinen, Heikki Mannila, and Hannu Toivonen. Rule discovery in telecommunication alarm data. J. Netw. Syst. Manag., 7(4):395-423, 1999.
Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Longest common substring with approximately k mismatches. Algorithmica, 81(6):2633-2652, 2019.
Tsvi Kopelowitz and Robert Krauthgamer. Color-distance oracles and snippets. In Proc. 27th CPM, pages 24:1-24:10, 2016.
Josué Kuri, Gonzalo Navarro, and Ludovic Mé. Fast multipattern search algorithms for intrusion detection. Fundam. Informaticae, 56(1-2):23-49, 2003.
Veli Mäkinen, Gonzalo Navarro, and Esko Ukkonen. Transposition invariant string matching. J. Algorithms, 56(2):124-153, 2005.
Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1(3):259-289, 1997.
Elżbieta Nowicka and Marcin Zawada. On the complexity of matching non-injective general episodes. Computation and Logic in the Real World, pages 288-296, 2007.
Adam Polak. Why is it hard to beat O(n^2) for longest common weakly increasing subsequence? Inf. Process. Lett., 132:1-5, 2018.
Shinichiro Tago, Tatsuya Asai, Takashi Katoh, Hiroaki Morikawa, and Hiroya Inakoshi. EVIS: A fast and scalable episode matching engine for massively parallel data streams. In Proc. 17th DASFAA, pages 213-223, 2012.
D. E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81-84, 1983.
Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci., 348(2-3):357-365, 2005.
Virginia Vassilevska Williams and Yinzhan Xu. Truly subcubic min-plus product for less structured matrices, with applications. In Proc. 31st SODA, pages 12-29, 2020.
