AWLCO: All-Window Length Co-Occurrence

Authors Joshua Sobel, Noah Bertram, Chen Ding, Fatemeh Nargesian, Daniel Gildea



PDF
Thumbnail PDF

File

LIPIcs.CPM.2021.24.pdf
  • Filesize: 1.23 MB
  • 21 pages

Document Identifiers

Author Details

Joshua Sobel
  • Department of Computer Science, University of Rochester, NY, USA
Noah Bertram
  • Department of Mathematics, University of Rochester, NY, USA
Chen Ding
  • Department of Computer Science, University of Rochester, NY, USA
Fatemeh Nargesian
  • Department of Computer Science, University of Rochester, NY, USA
Daniel Gildea
  • Department of Computer Science, University of Rochester, NY, USA

Acknowledgements

We would like to give special thanks to Lu Zhang and Katherine Seeman for their efforts for the implementation and experimental evaluation of our algorithms. We thank one of our reviewers for suggesting the Aho-Corasick algorithm.

Cite As Get BibTex

Joshua Sobel, Noah Bertram, Chen Ding, Fatemeh Nargesian, and Daniel Gildea. AWLCO: All-Window Length Co-Occurrence. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 24:1-24:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.CPM.2021.24

Abstract

Analyzing patterns in a sequence of events has applications in text analysis, computer programming, and genomics research. In this paper, we consider the all-window-length analysis model which analyzes a sequence of events with respect to windows of all lengths. We study the exact co-occurrence counting problem for the all-window-length analysis model. Our first algorithm is an offline algorithm that counts all-window-length co-occurrences by performing multiple passes over a sequence and computing single-window-length co-occurrences. This algorithm has the time complexity O(n) for each window length and thus a total complexity of O(n²) and the space complexity O(|I|) for a sequence of size n and an itemset of size |I|. We propose AWLCO, an online algorithm that computes all-window-length co-occurrences in a single pass with the time complexity of O(n) and space complexity of O(√{n|I|}), assuming perfect hashing. Following this, we generalize our use case to patterns in which we propose an algorithm that computes all-window-length co-occurrence with time complexity O(n|I|), assuming perfect hashing, with an additional pre-processing step and space complexity O(√{n|I|}+|I|), plus the overhead of the Aho-Corasick algorithm [Aho and Corasick, 1975].

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • Itemsets
  • Data Sequences
  • Co-occurrence

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 207-216, 1993. URL: https://doi.org/10.1145/170036.170072.
  2. Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 487-499, 1994. URL: https://doi.org/10.5555/645920.672836.
  3. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975. URL: https://doi.org/10.1145/360825.360855.
  4. Joong Hyuk Chang and Won Suk Lee. Finding recent frequent itemsets adaptively over online data streams. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 487-492, 2003. URL: https://doi.org/10.1145/956750.956807.
  5. Graham Cormode and Marios Hadjieleftheriou. Finding frequent items in data streams. Proceedings of the VLDB Endowment (PVLDB), 1(2):1530-1541, 2008. URL: https://doi.org/10.14778/1454159.1454225.
  6. Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794-1813, 2002. URL: https://doi.org/10.1137/S0097539701398363.
  7. Mayur Datar and Rajeev Motwani. The sliding-window computation model and results. In Data Stream Management - Processing High-Speed Data Streams, pages 149-165. Springer, 2016. URL: https://doi.org/10.1007/978-0-387-47534-9_8.
  8. Xiangjun Du, Zhuo Wang, Aiping Wu, Lin Song, Yang Cao, Haiying Hang, and Taijiao Jiang. Networks of genomic co-occurrence capture characteristics of human influenza A (H3N2) evolution. Genome Research, 18(1):178-187, January 2008. URL: https://doi.org/10.1101/gr.6969007.
  9. Philippe Flajolet and G. Nigel Martin. Probabilistic counting. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), pages 76-82, 1983. URL: https://doi.org/10.1109/SFCS.1983.46.
  10. Rahman Lavaee, John Criswell, and Chen Ding. Codestitcher: inter-procedural basic block layout optimization. In Proceedings of the International Conference on Compiler Construction, pages 65-75, 2019. URL: https://doi.org/10.1145/3302516.3307358.
  11. Carson Kai-Sang Leung and Quamrul I. Khan. DSTree: A tree structure for the mining of frequent sets from data streams. In Proceedings of the International Conference on Data Mining (ICDM), pages 928-932, 2006. URL: https://doi.org/10.1109/ICDM.2006.62.
  12. Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Proceedings of the 52nd Anuual Meeting of the Association for Computational Linguistics (ACL), pages 302-308, 2014. URL: https://doi.org/10.3115/v1/P14-2050.
  13. Yumeng (Lucinda) Liu, Daniel Busaba, Chen Ding, and Daniel Gildea. All timescale window co-occurrence: Efficient analysis and a possible use. In Proceedings of the 28th Annual International Conference on Computer Science and Software Engineering, CASCON '18, pages 289-292, Riverton, NJ, USA, 2018. IBM Corp. URL: https://doi.org/10.5555/3291291.3291322.
  14. Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 346-357, 2002. URL: https://doi.org/10.1016/B978-155860869-6/50038-X.
  15. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013. URL: https://doi.org/10.5555/2999792.2999959.
  16. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, Doha, Qatar, 2014. Google Scholar
  17. Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman. Mining Massive Datasets. Cambridge University Press, 2014. URL: https://doi.org/10.5555/2124405.
  18. Hinrich Schütze. Ambiguity Resolution in Language Learning - Computational and Cognitive Models. Number 10 in CSLI Lecture Notes Series. Center for the Study of Language and Information, Stanford, California, 1997. Google Scholar
  19. Jason W. Shapiro and Catherine Putonti. Gene co-occurrence networks reflect bacteriophage ecology and evolution. mBio, 9(2), 2018. URL: https://doi.org/10.1128/mBio.01870-17.
  20. Tarique Siddiqui, Paul Luh, Zesheng Wang, Karrie Karahalios, and Aditya G. Parameswaran. Shapesearch: A flexible and efficient system for shape-based exploration of trendlines. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 51-65, 2020. URL: https://doi.org/10.1145/3318464.3389722.
  21. Ziqiang Yu, Xiaohui Yu, Yang Liu, Wenzhu Li, and Jian Pei. Mining frequent co-occurrence patterns across multiple data streams. In International Conference on Extending Database Technology (EDBT), pages 73-84, 2015. URL: https://doi.org/10.5441/002/edbt.2015.08.
  22. Chengliang Zhang, Chen Ding, Mitsunori Ogihara, Yutao Zhong, and Youfeng Wu. A hierarchical model of data locality. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 16-29, 2006. URL: https://doi.org/10.1145/1111037.1111040.
  23. Chengliang Zhang, Yutao Zhong, Chen Ding, and Mitsunori Ogihara. Finding reference affinity groups in trace using sampling method. Technical report, Department of Computer Science, University of Rochester, 2004. Google Scholar
  24. Yutao Zhong, Maksim Orlovich, Xipeng Shen, and Chen Ding. Array regrouping and structure splitting using whole-program reference affinity. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 255-266, 2004. URL: https://doi.org/10.1145/996841.996872.
  25. Yuanqiang Zou, Zhiqiang Wu, Lizong Deng, Aiping Wu, Fan Wu, Kenli Li, Taijiao Jiang, and Yousong Peng. cooccurNet: an R package for co-occurrence network construction and analysis. Bioinformatics, 33(12):1881-1882, 2017. URL: https://doi.org/10.1093/bioinformatics/btx062.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail