Pattern Discovery in Colored Strings

Lipták, Zsuzsanna; Puglisi, Simon J.; Rossi, Massimiliano

doi:10.4230/LIPIcs.SEA.2020.12

File

Author Details

Zsuzsanna Lipták

Department of Computer Science, University of Verona, Italy

Simon J. Puglisi

Department of Computer Science, University of Helsinki, Finland

Massimiliano Rossi

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

Cite AsGet BibTex

Zsuzsanna Lipták, Simon J. Puglisi, and Massimiliano Rossi. Pattern Discovery in Colored Strings. In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 12:1-12:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.SEA.2020.12

Abstract

We consider the problem of identifying patterns of interest in colored strings. A colored string is a string in which each position is colored with one of a finite set of colors. Our task is to find substrings that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically infer properties of the embedded system from the analysis of its simulation traces. We show that the number of interesting patterns is upper-bounded by 𝒪(n²) where n is the length of the string. We introduce a baseline algorithm with 𝒪(n²) running time which identifies all interesting patterns for all colors in the string satisfying certain minimality conditions. When one is interested in patterns related to only one color, we provide an algorithm that identifies patterns in 𝒪(n²log n) time, but is faster than the first algorithm in practice, both on simulated and on real-world patterns.

Subject Classification

ACM Subject Classification

Theory of computation → Design and analysis of algorithms

Keywords

property testing
suffix tree
pattern mining

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. 20th International Conference on Very Large Data Bases (VLDB), volume 1215, pages 487-499. Morgan Kaufmann, 1994.
Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proc. Eleventh International Conference on Data Engineering (ICDE), volume 95, pages 3-14. IEEE Computer Society, 1995.
Franc Brglez, David Bryan, and Krzysztof Kozminski. Combinational profiles of sequential benchmark circuits. In Proc. 1989 IEEE international symposium on circuits and systems (ISACS), volume 3, pages 1929-1934. IEEE, 1989.
Chung-Wen Cho, Ying Zheng, Yi-Hung Wu, and Arbee LP Chen. A tree-based approach for event prediction using episode rules over event streams. In Proc. International Conference on Database and Expert Systems Applications (DEXA 2008), volume 5181 of LNCS, pages 225-240. Springer, 2008.
David Clark. Compact PAT trees. PhD thesis, University of Waterloo, 1997.
Fulvio Corno, Matteo Sonza Reorda, and Giovanni Squillero. Rt-level itc'99 benchmarks and first atpg results. IEEE Design & Test of computers, 17(3):44-53, 2000.
Alessandro Danese, Nicolò Dalla Riva, and Graziano Pravadelli. A-team: Automatic template-based assertion miner. In Proc. 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1-6. IEEE, 2017.
Alessandro Danese, Tara Ghasempouri, and Graziano Pravadelli. Automatic extraction of assertions from execution traces of behavioural models. In Proc. 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 67-72. IEEE, 2015.
Jasbir Dhaliwal, Simon J Puglisi, and Andrew Turpin. Practical efficient string mining. IEEE Transactions on Knowledge and Data Engineering, 24(4):735-744, 2010.
Lina Fahed, Armelle Brun, and Anne Boyer. DEER: Distant and Essential Episode Rules for early prediction. Expert Systems with Applications, 93:283-298, 2018.
Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465-492, 2011.
Johannes Fischer, Volker Heun, and Stefan Kramer. Fast frequent string mining using suffix arrays. In Proc. Fifth IEEE International Conference on Data Mining (ICDM 2005), pages 609-612. IEEE, 2005.
Johannes Fischer, Volker Heun, and Stefan Kramer. Optimal string mining under frequency constraints. In Proc. European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2006), volume 4213 of LNCS, pages 139-150. Springer, 2006.
Johannes Fischer, Veli Mäkinen, and Niki Välimäki. Space efficient string mining under frequency constraints. In Proc. Eighth IEEE International Conference on Data Mining (ICDM 2008), pages 193-202. IEEE, 2008.
Harry D Foster, Adam C Krolnik, and David J Lacey. Assertion-based design. Springer Science & Business Media, 2004.
Philippe Fournier-Viger, Jerry Chun-Wei Lin, Rage Uday Kiran, Yun Sing Koh, and Rincy Thomas. A survey of sequential pattern mining. Data Science and Pattern Recognition, 1(1):54-77, 2017.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From Theory to Practice: Plug and Play with Succinct Data Structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), volume 8504 of LNCS, pages 326-337. Springer, 2014.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences : Computer Science and Computational Biology. Cambridge University Press, Cambridge, United Kingdom, 1997.
Koji Iwanuma, Ryuichi Ishihara, Yoh Takano, and Hidetomo Nabeshima. Extracting frequent subsequences from a single long data sequence a novel anti-monotonic measure and a simple on-line algorithm. In Proc. Fifth IEEE International Conference on Data Mining (ICDM 2005), pages 186-195. IEEE, 2005.
Xiaonan Ji, James Bailey, and Guozhu Dong. Mining minimal distinguishing subsequence patterns with gap constraints. Knowledge and Information Systems, 11(3):259-286, 2007.
Srivatsan Laxman, Vikram Tankasali, and Ryen W White. Stream prediction using a generative model based on frequent episodes in event sequences. In Proc. 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD 2008), pages 453-461. ACM, 2008.
Lingyi Liu, David Sheridan, Viraj Athavale, and Shobha Vasudevan. Automatic generation of assertions from system level design using data mining. In Proc. Ninth ACM/IEEE International Conference on Formal Methods and Models for Codesign, pages 191-200. IEEE Computer Society, 2011.
Nizar R Mabroukeh and Christie I Ezeife. A taxonomy of sequential pattern mining algorithms. ACM Computing Surveys (CSUR), 43(1):3, 2010.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design. CUP, 2015.
Heikki Mannila, Hannu Toivonen, and A Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data mining and knowledge discovery, 1(3):259-289, 1997.
OpenCores. Available at https://opencores.org/. Accessed 05-03-2019. URL: https://opencores.org/.
Tinghai Pang, Lei Duan, Jesse Li-Ling, and Guozhu Dong. Mining Similarity-Aware Distinguishing Sequential Patterns from Biomedical Sequences. In Proc. Second International Conference on Data Science in Cyberspace (DSC 2017), pages 43-52. IEEE, 2017.
Jian Pei, Jiawei Han, and Wei Wang. Constraint-based sequential pattern mining: the pattern-growth methods. Journal of Intelligent Information Systems, 28(2):133-160, 2007.
Robert Sedgewick and Kevin Wayne. Algorithms. Addison-Wesley Professional, 2011.
Bill Smyth. Computing Patterns in Strings. Pearson Addison-Wesley, Essex, England, 2003.
Niko Välimäki and Simon J Puglisi. Distributed string mining for high-throughput sequencing data. In Proc. 12th International Workshop on Algorithms in Bioinformatics (WABI 2012), volume 7534 of LNCS, pages 441-452. Springer, 2012.
Shobha Vasudevan, David Sheridan, Sanjay Patel, David Tcheng, Bill Tuohy, and Daniel Johnson. Goldmine: Automatic assertion generation using data mining and static analysis. In Proc. 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), pages 626-629. IEEE, 2010.
Xianming Wang, Lei Duan, Guozhu Dong, Zhonghua Yu, and Changjie Tang. Efficient mining of density-aware distinguishing sequential patterns with gap constraints. In International Conference on Database Systems for Advanced Applications (DASFAA 2014), volume 8421 of LNCS, pages 372-387. Springer, 2014.
Xindong Wu, Xingquan Zhu, Yu He, and Abdullah N Arslan. PMBC: Pattern mining from biological sequences with wildcard constraints. Computers in Biology and Medicine, 43(5):481-492, 2013.

Pattern Discovery in Colored Strings

Authors Zsuzsanna Lipták , Simon J. Puglisi , Massimiliano Rossi

File

Document Identifiers

Author Details

Acknowledgements

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Pattern Discovery in Colored Strings

Authors Zsuzsanna Lipták , Simon J. Puglisi , Massimiliano Rossi

File

Document Identifiers

Author Details

Funding

Acknowledgements

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

Supplementary Materials

References