Output-Sensitive Pattern Extraction in Sequences

Grossi, Roberto; Menconi, Giulia; Pisanti, Nadia; Trani, Roberto; Vind, Soren

doi:10.4230/LIPIcs.FSTTCS.2014.303

File

LIPIcs.FSTTCS.2014.303.pdf

Filesize: 0.55 MB
12 pages

Document Identifiers

DOI: 10.4230/LIPIcs.FSTTCS.2014.303
URN: urn:nbn:de:0030-drops-48513

Author Details

Roberto Grossi

Giulia Menconi

Nadia Pisanti

Roberto Trani

Soren Vind

Cite AsGet BibTex

Roberto Grossi, Giulia Menconi, Nadia Pisanti, Roberto Trani, and Soren Vind. Output-Sensitive Pattern Extraction in Sequences. In 34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014). Leibniz International Proceedings in Informatics (LIPIcs), Volume 29, pp. 303-314, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2014)
https://doi.org/10.4230/LIPIcs.FSTTCS.2014.303

Abstract

Genomic Analysis, Plagiarism Detection, Data Mining, Intrusion Detection, Spam Fighting and Time Series Analysis are just some examples of applications where extraction of recurring patterns in sequences of objects is one of the main computational challenges. Several notions of patterns exist, and many share the common idea of strictly specifying some parts of the pattern and to don't care about the remaining parts. Since the number of patterns can be exponential in the length of the sequences, pattern extraction focuses on statistically relevant patterns, where any attempt to further refine or extend them causes a loss of significant information (where the number of occurrences changes). Output-sensitive algorithms have been proposed to enumerate and list these patterns, taking polynomial time O(n^c) per pattern for constant c > 1, which is impractical for massive sequences of very large length n. We address the problem of extracting maximal patterns with at most k don't care symbols and at least q occurrences. Our contribution is to give the first algorithm that attains a stronger notion of output-sensitivity, borrowed from the analysis of data structures: the cost is proportional to the actual number of occurrences of each pattern, which is at most n and practically much smaller than n in real applications, thus avoiding the aforementioned cost of O(n^c) per pattern.

Keywords

Pattern Extraction
Motif Detection
Pattern Discovery
Motif Trie

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Mohamed Ibrahim Abouelhoda and Moustafa Ghanem. String mining in bioinformatics. In Scientific Data Mining and Knowledge Discovery, pages 207-247. Springer, 2010.
Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. JDA, 2(1):53-86, 2004.
Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging lossy and lossless compression by motif pattern discovery. In General Theory of Information Transfer and Combinatorics, pages 793-813. Springer, 2006.
Hiroki Arimura and Takeaki Uno. An efficient polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. JCO, 2007.
Brenda S Baker. On finding duplication and near-duplication in large software systems. In Proc. 2nd WCRE, pages 86-95, 1995.
Sergey Brin, James Davis, and Héctor García-Molina. Copy detection mechanisms for digital documents. SIGMOD Rec., 24(2):398-409, May 1995.
Chia-Hui Chang, Chun-Nan Hsu, and Shao-Cheng Lui. Automatic information extraction from semi-structured web pages by pattern discovery. Decis Support Syst, 34(1):129-147, 2003.
Xin Chen, Brent Francia, Ming Li, Brian Mckinnon, and Amit Seker. Shared information and program plagiarism detection. IEEE Trans Inf Theory, 50(7):1545-1551, 2004.
Hervé Debar, Marc Dacier, and Andreas Wespi. Towards a taxonomy of intrusion-detection systems. Computer Networks, 31(8):805-822, 1999.
Maria Federico and Nadia Pisanti. Suffix tree characterization of maximal motifs in biological sequences. Theor. Comput. Sci., 410(43):4391-4401, 2009.
Roberto Grossi, Andrea Pietracaprina, Nadia Pisanti, Geppino Pucci, Eli Upfal, and Fabio Vandin. MADMX: A strategy for maximal dense motif extraction. J. Comp. Biol., 18(4):535-545, 2011.
D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338-355, 1984.
Nizar R. Mabroukeh and Christie I. Ezeife. A taxonomy of sequential pattern mining algorithms. ACM CSUR, 43(1):3, 2010.
Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, April 1976.
L. Parida, I. Rigoutsos, and D. E. Platt. An output-sensitive flexible pattern discovery algorithm. In Proc. 12th CPM, pages 131-142, 2001.
Laxmi Parida, Isidore Rigoutsos, Aris Floratos, Daniel E. Platt, and Yuan Gao. Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm. In Proc. 11th SODA, pages 297-308, 2000.
Lukáš Pichl, Takuya Yamano, and Taisei Kaizoji. On the symbolic analysis of market indicators with the dynamic programming approach. In Advances in Neural Networks-ISNN, pages 432-441. Springer, 2006.
Isidore Rigoutsos and Tien Huynh. Chung-Kwei: a Pattern-discovery-based System for the Automatic Identification of Unsolicited E-mail Messages. In CEAS, 2004.
Marie-France Sagot. Spelling approximate repeated or common motifs using a suffix tree. In Proc. 3rd LATIN, pages 374-390. Springer, 1998.
Reza Sherkat and Davood Rafiei. Efficiently evaluating order preserving similarity queries over historical market-basket data. In Proc. 22nd ICDE, pages 19-19, 2006.
Esko Ukkonen. Maximal and minimal representations of gapped and non-gapped motifs of a string. Theor. Comput. Sci., 410(43):4341-4349, 2009.