L_p Pattern Matching in a Stream

Authors Tatiana Starikovskaya, Michal Svagerka, Przemysław Uznański



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2020.35.pdf
  • Filesize: 0.63 MB
  • 23 pages

Document Identifiers

Author Details

Tatiana Starikovskaya
  • DIENS, École normale supérieure, PSL Research University, Paris, France
Michal Svagerka
  • ETH Zürich, Switzerland
Przemysław Uznański
  • Institute of Computer Science, University of Wrocław, Poland

Cite AsGet BibTex

Tatiana Starikovskaya, Michal Svagerka, and Przemysław Uznański. L_p Pattern Matching in a Stream. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 35:1-35:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2020.35

Abstract

We consider the problem of computing distance between a pattern of length n and all n-length subwords of a text in the streaming model. In the streaming setting, only the Hamming distance (L₀) has been studied. It is known that computing the exact Hamming distance between a pattern and a streaming text requires Ω(n) space (folklore). Therefore, to develop sublinear-space solutions, one must relax their requirements. One possibility to do so is to compute only the distances bounded by a threshold k, see [SODA'19, Clifford, Kociumaka, Porat] and references therein. The motivation for this variant of this problem is that we are interested in subwords of the text that are similar to the pattern, i.e. in subwords such that the distance between them and the pattern is relatively small. On the other hand, the main application of the streaming setting is processing large-scale data, such as biological data. Recent advances in hardware technology allow generating such data at a very high speed, but unfortunately, the produced data may contain about 10% of noise [Biol. Direct.'07, Klebanov and Yakovlev]. To analyse such data, it is not sufficient to consider small distances only. A possible workaround for this issue is the (1±ε)-approximation. This line of research was initiated in [ICALP'16, Clifford and Starikovskaya] who gave a (1±ε)-approximation algorithm with space 𝒪~(ε^{-5}√n). In this work, we show a suite of new streaming algorithms for computing the Hamming, L₁, L₂ and general L_p (0 < p < 2) distances between the pattern and the text. Our results significantly extend over the previous result in this setting. In particular, for the Hamming distance and for the L_p distance when 0 < p ≤ 1 we show a streaming algorithm that uses 𝒪~(ε^{-2}√n) space for polynomial-size alphabets.

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • streaming algorithms
  • approximate pattern matching

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039-1051, 1987. Google Scholar
  2. Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671-687, 2003. URL: https://doi.org/10.1016/S0022-0000(03)00025-4.
  3. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In RANDOM 2002, pages 1-10, 2002. URL: https://doi.org/10.1007/3-540-45726-7_1.
  4. Vladimir Braverman, Ran Gelles, and Rafail Ostrovsky. How to catch L₂-heavy-hitters on sliding windows. Theor. Comput. Sci., 554:82-94, 2014. Google Scholar
  5. Vladimir Braverman and Rafail Ostrovsky. Smooth histograms for sliding windows. In FOCS 2007, pages 283-293, 2007. Google Scholar
  6. Vladimir Braverman and Rafail Ostrovsky. Effective computations on sliding windows. SIAM J. Comput., 39(6):2113-2131, 2010. Google Scholar
  7. Vladimir Braverman, Rafail Ostrovsky, and Alan Roytman. Zero-one laws for sliding windows and universal sketches. In APPROX-RANDOM 2015, pages 573-590, 2015. Google Scholar
  8. Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. Optimal sampling from sliding windows. J. Comput. Syst. Sci., 78(1):260-272, 2012. Google Scholar
  9. A. Robert Calderbank, Anna C. Gilbert, Kirill Levchenko, S. Muthukrishnan, and Martin Strauss. Improved range-summable random variable construction algorithms. In SODA 2005, pages 840-849, 2005. Google Scholar
  10. Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-Hamming-distance. In STOC 2011, pages 51-60, 2011. URL: https://doi.org/10.1145/1993636.1993644.
  11. J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. J. of the American Statistical Association, 71(354):340-344, 1976. Google Scholar
  12. Timothy M. Chan, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. Approximating text-to-pattern Hamming distances. In STOC 2020, pages 643-656, 2020. URL: https://doi.org/10.1145/3357713.3384266.
  13. Peter Clifford, Raphaël Clifford, and Costas S. Iliopoulos. Faster algorithms for delta, gamma-matching and related problems. In CPM 2005, pages 68-78, 2005. URL: https://doi.org/10.1007/11496656_7.
  14. Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild. From coding theory to efficient pattern matching. In SODA 2009, pages 778-784, 2009. Google Scholar
  15. Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. The k-mismatch problem revisited. In SODA 2016, pages 2039-2052, 2016. URL: https://doi.org/10.1137/1.9781611974331.ch142.
  16. Raphaël Clifford, Tomasz Kociumaka, and Ely Porat. The streaming k-mismatch problem. In SODA 2019, pages 1106-1125, 2019. URL: https://doi.org/10.1137/1.9781611975482.68.
  17. Raphaël Clifford and Tatiana Starikovskaya. Approximate Hamming distance in a stream. In ICALP 2016, pages 20:1-20:14, 2016. URL: https://doi.org/10.4230/LIPIcs.ICALP.2016.20.
  18. Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794-1813, 2002. URL: https://doi.org/10.1137/S0097539701398363.
  19. Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA 2003, pages 605-617, 2003. URL: https://doi.org/10.1007/978-3-540-39658-1_55.
  20. Joan Feigenbaum, Sampath Kannan, Martin Strauss, and Mahesh Viswanathan. An approximate L1-difference algorithm for massive data streams. SIAM J. Comput., 32(1):131-151, 2002. URL: https://doi.org/10.1137/S0097539799361701.
  21. Michael J. Fischer and Michael S. Paterson. String-matching and other products. Technical report, Massachusetts Institute of Technology, 1974. Google Scholar
  22. Sumit Ganguly. Counting distinct items over update streams. Theor. Comput. Sci., 378(3):211-222, 2007. URL: https://doi.org/10.1016/j.tcs.2007.02.031.
  23. Paweł Gawrychowski and Przemysław Uznański. Towards unified approximate pattern matching for Hamming and L₁ distance. In ICALP 2018, pages 62:1-62:13, 2018. URL: https://doi.org/10.4230/LIPIcs.ICALP.2018.62.
  24. Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In SPAA 2001, pages 281-291, 2001. URL: https://doi.org/10.1145/378580.378687.
  25. Shay Golan, Tsvi Kopelowitz, and Ely Porat. Towards optimal approximate streaming pattern matching by matching multiple patterns in multiple streams. In ICALP 2018, pages 65:1-65:16, 2018. Google Scholar
  26. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 53(3):307-323, 2006. URL: https://doi.org/10.1145/1147954.1147955.
  27. Thathachar S. Jayram and David P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algorithms, 9(3):26, 2013. Google Scholar
  28. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS 2010, pages 41-52, 2010. URL: https://doi.org/10.1145/1807085.1807094.
  29. Howard J. Karloff. Fast algorithms for approximately counting mismatches. Inf. Process. Lett., 48(2):53-60, 1993. URL: https://doi.org/10.1016/0020-0190(93)90177-B.
  30. Lev Klebanov and Andrei Yakovlev. How high is the level of technical noise in microarray data? Biol Direct., 2:9, April 2007. URL: https://doi.org/10.1186/1745-6150-2-9.
  31. Tsvi Kopelowitz and Ely Porat. Breaking the variance: Approximating the Hamming distance in 1/ε time per alignment. In FOCS 2015, pages 601-613, 2015. URL: https://doi.org/10.1109/FOCS.2015.43.
  32. Tsvi Kopelowitz and Ely Porat. A simple algorithm for approximating the text-to-pattern Hamming distance. In SOSA@SODA 2018, pages 10:1-10:5, 2018. URL: https://doi.org/10.4230/OASIcs.SOSA.2018.10.
  33. S. R. Kosaraju. Efficient string matching. Manuscript, 1987. Google Scholar
  34. Karim Labib, Przemysław Uznański, and Daniel Wolleb-Graf. Hamming distance completeness. In CPM 2019, pages 14:1-14:17, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2019.14.
  35. Ohad Lipsky and Ely Porat. L₁ pattern matching lower bound. Inf. Process. Lett., 105(4):141-143, 2008. URL: https://doi.org/10.1016/j.ipl.2007.08.011.
  36. Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12:4:449-461, 1992. URL: https://doi.org/10.1007/BF01305237.
  37. A. Pavan and Srikanta Tirthapura. Range-efficient counting of distinct elements in a massive data stream. SIAM J. Comput., 37(2):359-379, 2007. URL: https://doi.org/10.1137/050643672.
  38. Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In FOCS 2009, pages 315-323, 2009. URL: https://doi.org/10.1109/FOCS.2009.11.
  39. Jan Studený and Przemysław Uznański. Approximating approximate pattern matching. In CPM 2019, volume 128, pages 15:1-15:13, 2019. URL: https://doi.org/10.4230/LIPIcs.CPM.2019.15.
  40. Vladimir M. Zolotarev. One-dimensional stable distributions, volume 65 of Translations of Mathematical Monographs. American Mathematical Soc., 1986. Translated from Russian by H.H. McFaden, Translation edited by B. Silver. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail