Output-Sensitive Pattern Extraction in Sequences

Authors Roberto Grossi, Giulia Menconi, Nadia Pisanti, Roberto Trani, Soren Vind



PDF
Thumbnail PDF

File

LIPIcs.FSTTCS.2014.303.pdf
  • Filesize: 0.55 MB
  • 12 pages

Document Identifiers

Author Details

Roberto Grossi
Giulia Menconi
Nadia Pisanti
Roberto Trani
Soren Vind

Cite As Get BibTex

Roberto Grossi, Giulia Menconi, Nadia Pisanti, Roberto Trani, and Soren Vind. Output-Sensitive Pattern Extraction in Sequences. In 34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014). Leibniz International Proceedings in Informatics (LIPIcs), Volume 29, pp. 303-314, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2014) https://doi.org/10.4230/LIPIcs.FSTTCS.2014.303

Abstract

Genomic Analysis, Plagiarism Detection, Data Mining, Intrusion Detection, Spam Fighting and Time Series Analysis are just some examples of applications where extraction of recurring patterns in sequences of objects is one of the main computational challenges. Several notions of patterns exist, and many share the common idea of strictly specifying some parts of the pattern and to don't care about the remaining parts. Since the number of patterns can be exponential in the length of the sequences, pattern extraction focuses on statistically relevant patterns, where any attempt to further refine or extend them causes a loss of significant information (where the number of occurrences changes). Output-sensitive algorithms have been proposed to enumerate and list these patterns, taking polynomial time O(n^c) per pattern for constant c > 1, which is impractical for massive sequences of very large length n.

We address the problem of extracting maximal patterns with at most k don't care symbols and at least q occurrences. Our contribution is to give the first algorithm that attains a stronger notion of output-sensitivity, borrowed from the analysis of data structures: the cost is proportional to the actual number of occurrences of each pattern, which is at most n and practically much smaller than n in real applications, thus avoiding the aforementioned cost of O(n^c) per pattern.

Subject Classification

Keywords
  • Pattern Extraction
  • Motif Detection
  • Pattern Discovery
  • Motif Trie

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Mohamed Ibrahim Abouelhoda and Moustafa Ghanem. String mining in bioinformatics. In Scientific Data Mining and Knowledge Discovery, pages 207-247. Springer, 2010. Google Scholar
  2. Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. JDA, 2(1):53-86, 2004. Google Scholar
  3. Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging lossy and lossless compression by motif pattern discovery. In General Theory of Information Transfer and Combinatorics, pages 793-813. Springer, 2006. Google Scholar
  4. Hiroki Arimura and Takeaki Uno. An efficient polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. JCO, 2007. Google Scholar
  5. Brenda S Baker. On finding duplication and near-duplication in large software systems. In Proc. 2nd WCRE, pages 86-95, 1995. Google Scholar
  6. Sergey Brin, James Davis, and Héctor García-Molina. Copy detection mechanisms for digital documents. SIGMOD Rec., 24(2):398-409, May 1995. Google Scholar
  7. Chia-Hui Chang, Chun-Nan Hsu, and Shao-Cheng Lui. Automatic information extraction from semi-structured web pages by pattern discovery. Decis Support Syst, 34(1):129-147, 2003. Google Scholar
  8. Xin Chen, Brent Francia, Ming Li, Brian Mckinnon, and Amit Seker. Shared information and program plagiarism detection. IEEE Trans Inf Theory, 50(7):1545-1551, 2004. Google Scholar
  9. Hervé Debar, Marc Dacier, and Andreas Wespi. Towards a taxonomy of intrusion-detection systems. Computer Networks, 31(8):805-822, 1999. Google Scholar
  10. Maria Federico and Nadia Pisanti. Suffix tree characterization of maximal motifs in biological sequences. Theor. Comput. Sci., 410(43):4391-4401, 2009. Google Scholar
  11. Roberto Grossi, Andrea Pietracaprina, Nadia Pisanti, Geppino Pucci, Eli Upfal, and Fabio Vandin. MADMX: A strategy for maximal dense motif extraction. J. Comp. Biol., 18(4):535-545, 2011. Google Scholar
  12. D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338-355, 1984. Google Scholar
  13. Nizar R. Mabroukeh and Christie I. Ezeife. A taxonomy of sequential pattern mining algorithms. ACM CSUR, 43(1):3, 2010. Google Scholar
  14. Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, April 1976. Google Scholar
  15. L. Parida, I. Rigoutsos, and D. E. Platt. An output-sensitive flexible pattern discovery algorithm. In Proc. 12th CPM, pages 131-142, 2001. Google Scholar
  16. Laxmi Parida, Isidore Rigoutsos, Aris Floratos, Daniel E. Platt, and Yuan Gao. Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm. In Proc. 11th SODA, pages 297-308, 2000. Google Scholar
  17. Lukáš Pichl, Takuya Yamano, and Taisei Kaizoji. On the symbolic analysis of market indicators with the dynamic programming approach. In Advances in Neural Networks-ISNN, pages 432-441. Springer, 2006. Google Scholar
  18. Isidore Rigoutsos and Tien Huynh. Chung-Kwei: a Pattern-discovery-based System for the Automatic Identification of Unsolicited E-mail Messages. In CEAS, 2004. Google Scholar
  19. Marie-France Sagot. Spelling approximate repeated or common motifs using a suffix tree. In Proc. 3rd LATIN, pages 374-390. Springer, 1998. Google Scholar
  20. Reza Sherkat and Davood Rafiei. Efficiently evaluating order preserving similarity queries over historical market-basket data. In Proc. 22nd ICDE, pages 19-19, 2006. Google Scholar
  21. Esko Ukkonen. Maximal and minimal representations of gapped and non-gapped motifs of a string. Theor. Comput. Sci., 410(43):4341-4349, 2009. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail