Fast Sketch-based Recovery of Correlation Outliers

Authors: Graham Cormode, Jacques Dark




Cite As

Graham Cormode and Jacques Dark. Fast Sketch-based Recovery of Correlation Outliers. In 21st International Conference on Database Theory (ICDT 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 98, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.ICDT.2018.13

Abstract

Many data sources can be interpreted as time series, and a key problem is to identify which pairs out of a large collection of signals are highly correlated. We expect that there will be few large, interesting correlations, while most signal pairs have no strong correlation. We abstract this as the problem of identifying the highly correlated pairs in a collection of n mostly pairwise uncorrelated random variables, where observations of the variables arrive as a stream. Dimensionality reduction can remove the dependence on the number of observations, but further techniques are required to tame the quadratic (in n) cost of a search through all possible pairs. We develop a new algorithm for rapidly finding large correlations based on sketch techniques with an added twist: we quickly generate sketches of random combinations of signals, and use these in concert with ideas from coding theory to decode the identity of correlated pairs. We prove correctness and compare performance and effectiveness with the best LSH (locality-sensitive hashing) based approach.
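To make the sketching idea concrete, the following is a minimal, self-contained illustration (not the paper's algorithm, whose contribution is precisely to avoid the quadratic pair scan below via coding-theoretic decoding). It shows only the baseline step: each signal is compressed into a short AMS-style sketch using shared random +/-1 projections, and inner products of sketches approximate pairwise Pearson correlations. All function names and parameters are illustrative assumptions.

```python
import math
import random

def sketch(signal, k, seed=0):
    """AMS-style sketch: k inner products with random +/-1 vectors.

    The projection vectors are derived from (seed, j), so every signal
    is projected with the SAME random vectors -- this is what makes
    sketch inner products estimate signal inner products.
    """
    out = []
    for j in range(k):
        rng = random.Random(seed * 1_000_003 + j)  # shared across signals
        out.append(sum(x * (1.0 if rng.random() < 0.5 else -1.0)
                       for x in signal))
    return out

def normalise(signal):
    """Centre and scale to unit norm, so inner products are correlations."""
    mu = sum(signal) / len(signal)
    centred = [x - mu for x in signal]
    norm = math.sqrt(sum(c * c for c in centred)) or 1.0
    return [c / norm for c in centred]

def estimate_corr(sk_a, sk_b):
    """Unbiased estimate of <a, b> from two k-dimensional sketches."""
    return sum(a * b for a, b in zip(sk_a, sk_b)) / len(sk_a)

def correlated_pairs(signals, k=256, threshold=0.5):
    """Naive O(n^2) scan over all sketch pairs, flagging large correlations."""
    sks = [sketch(normalise(s), k) for s in signals]
    n = len(signals)
    return [(i, j)
            for i in range(n) for j in range(i + 1, n)
            if abs(estimate_corr(sks[i], sks[j])) >= threshold]

# Two near-identical signals plus one independent one: only the
# correlated pair should be reported.
gen = random.Random(42)
base = [gen.gauss(0, 1) for _ in range(200)]
near = [x + 0.1 * gen.gauss(0, 1) for x in base]
noise = [gen.gauss(0, 1) for _ in range(200)]
print(correlated_pairs([base, near, noise]))  # -> [(0, 1)]
```

The sketch dimension k trades space for accuracy (the estimation error shrinks roughly as 1/sqrt(k)); the paper's contribution replaces the all-pairs loop in `correlated_pairs` with sketches of random signal combinations that are decoded to recover the outlier pairs directly.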
Keywords
  • correlation
  • sketching
  • streaming
  • dimensionality reduction

