Fast Sketch-based Recovery of Correlation Outliers

Authors: Graham Cormode, Jacques Dark




Cite As

Graham Cormode and Jacques Dark. Fast Sketch-based Recovery of Correlation Outliers. In 21st International Conference on Database Theory (ICDT 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 98, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.ICDT.2018.13

Abstract

Many data sources can be interpreted as time series, and a key problem is to identify which pairs out of a large collection of signals are highly correlated. We expect that there will be few large, interesting correlations, while most signal pairs have no strong correlation. We abstract this as the problem of identifying the highly correlated pairs in a collection of n mostly pairwise uncorrelated random variables, where observations of the variables arrive as a stream. Dimensionality reduction can remove the dependence on the number of observations, but further techniques are required to tame the quadratic (in n) cost of a search through all possible pairs. We develop a new algorithm for rapidly finding large correlations based on sketch techniques with an added twist: we quickly generate sketches of random combinations of signals, and use these in concert with ideas from coding theory to decode the identity of correlated pairs. We prove correctness and compare performance and effectiveness with the best LSH (locality-sensitive hashing) based approach.
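To make the sketching idea concrete, the following is a minimal, self-contained illustration (not the paper's algorithm, whose contribution is precisely to avoid the quadratic pair scan below via coding-theoretic decoding). It shows only the baseline step: each signal is compressed into a short AMS-style sketch using shared random +/-1 projections, and inner products of sketches approximate pairwise Pearson correlations. All function names and parameters are illustrative assumptions.

```python
import math
import random

def sketch(signal, k, seed=0):
    """AMS-style sketch: k inner products with random +/-1 vectors.

    The projection vectors are derived from (seed, j), so every signal
    is projected with the SAME random vectors -- this is what makes
    sketch inner products estimate signal inner products.
    """
    out = []
    for j in range(k):
        rng = random.Random(seed * 1_000_003 + j)  # shared across signals
        out.append(sum(x * (1.0 if rng.random() < 0.5 else -1.0)
                       for x in signal))
    return out

def normalise(signal):
    """Centre and scale to unit norm, so inner products are correlations."""
    mu = sum(signal) / len(signal)
    centred = [x - mu for x in signal]
    norm = math.sqrt(sum(c * c for c in centred)) or 1.0
    return [c / norm for c in centred]

def estimate_corr(sk_a, sk_b):
    """Unbiased estimate of <a, b> from two k-dimensional sketches."""
    return sum(a * b for a, b in zip(sk_a, sk_b)) / len(sk_a)

def correlated_pairs(signals, k=256, threshold=0.5):
    """Naive O(n^2) scan over all sketch pairs, flagging large correlations."""
    sks = [sketch(normalise(s), k) for s in signals]
    n = len(signals)
    return [(i, j)
            for i in range(n) for j in range(i + 1, n)
            if abs(estimate_corr(sks[i], sks[j])) >= threshold]

# Two near-identical signals plus one independent one: only the
# correlated pair should be reported.
gen = random.Random(42)
base = [gen.gauss(0, 1) for _ in range(200)]
near = [x + 0.1 * gen.gauss(0, 1) for x in base]
noise = [gen.gauss(0, 1) for _ in range(200)]
print(correlated_pairs([base, near, noise]))  # -> [(0, 1)]
```

The sketch dimension k trades space for accuracy (the estimation error shrinks roughly as 1/sqrt(k)); the paper's contribution replaces the all-pairs loop in `correlated_pairs` with sketches of random signal combinations that are decoded to recover the outlier pairs directly.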
Keywords
  • correlation
  • sketching
  • streaming
  • dimensionality reduction

