Improved Algorithms for Time Decay Streams

Authors Vladimir Braverman, Harry Lang, Enayat Ullah, Samson Zhou



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2019.27.pdf
  • Filesize: 0.55 MB
  • 17 pages

Document Identifiers

Author Details

Vladimir Braverman
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Harry Lang
  • MIT CSAIL, Cambridge, MA, USA
Enayat Ullah
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Samson Zhou
  • School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA

Acknowledgements

This material is based upon work supported in part by the National Science Foundation under Grant No. 1447639, by the Google Faculty Award and by DARPA grant N660001-1-2-4014. Its contents are solely the responsibility of the authors and do not represent the official view of DARPA or the Department of Defense.

Cite AsGet BibTex

Vladimir Braverman, Harry Lang, Enayat Ullah, and Samson Zhou. Improved Algorithms for Time Decay Streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 27:1-27:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2019.27

Abstract

In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for k-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(h Delta)+h) points where h is the half-life of the decay function and Delta is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well.

Subject Classification

ACM Subject Classification
  • Theory of computation → Facility location and clustering
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • Streaming algorithms
  • approximation algorithms
  • facility location and clustering

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Charu C. Aggarwal, editor. Data Streams - Models and Algorithms, volume 31 of Advances in Database Systems. Springer, 2007. Google Scholar
  2. Arvind Arasu and Gurmeet Singh Manku. Approximate Counts and Quantiles over Sliding Windows. In Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 286-296, 2004. Google Scholar
  3. Sepehr Assadi and Sanjeev Khanna. Randomized Composable Coresets for Matching and Vertex Cover. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, (SPAA), pages 3-12, 2017. Google Scholar
  4. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and Issues in Data Stream Systems. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pages 1-16, 2002. Google Scholar
  5. Brian Babcock, Mayur Datar, Rajeev Motwani, and Liadan O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pages 234-243, 2003. Google Scholar
  6. Rafael da Ponte Barbosa, Alina Ene, Huy L. Nguyen, and Justin Ward. A New Framework for Distributed Submodular Maximization. In IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 645-654, 2016. Google Scholar
  7. Jon Louis Bentley and James B Saxe. Decomposable searching problems I. Static-to-dynamic transformation. Journal of Algorithms, 1(4):301-358, 1980. Google Scholar
  8. Vladimir Braverman, Petros Drineas, Jalaj Upadhyay, David P. Woodruff, and Samson Zhou. Numerical Linear Algebra in the Sliding Window Model. arXiv preprint, 2018. URL: http://arxiv.org/abs/1805.03765.
  9. Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint, 2016. URL: http://arxiv.org/abs/1612.00889.
  10. Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, and Samson Zhou. Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pages 7:1-7:22, 2018. Google Scholar
  11. Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering on Sliding Windows in Polylogarithmic Space. In 35th IARCS Annual Conference on Foundation of Software Technology and Theoretical Computer Science, FSTTCS, pages 350-364, 2015. Google Scholar
  12. Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering Problems on Sliding Windows. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1374-1390, 2016. Google Scholar
  13. Vladimir Braverman, Adam Meyerson, Rafail Ostrovsky, Alan Roytman, Michael Shindler, and Brian Tagiku. Streaming k-means on well-clusterable data. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 26-40. Society for Industrial and Applied Mathematics, 2011. Google Scholar
  14. Vladimir Braverman and Rafail Ostrovsky. Smooth Histograms for Sliding Windows. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007), October 20-23, 2007, Providence, RI, USA, Proceedings, pages 283-293, 2007. Google Scholar
  15. Timothy M. Chan and Bashir S. Sadjad. Geometric Optimization Problems over Sliding Windows. Int. J. Comput. Geometry Appl., 16(2-3):145-158, 2006. A preliminary version appeared in the Proceedings of Algorithms and Computation, 15th International Symposium (ISAAC), 2004. Google Scholar
  16. Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 30-39. ACM, 2003. Google Scholar
  17. Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923-947, 2009. Google Scholar
  18. Yun Chi, Haixun Wang, Philip S. Yu, and Richard R. Muntz. Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowl. Inf. Syst., 10(3):265-294, 2006. A preliminary version appeared in the Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), 2004. Google Scholar
  19. Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 205-214. ACM, 2009. Google Scholar
  20. Edith Cohen and Martin Strauss. Maintaining time-decaying stream aggregates. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 223-233, 2003. Google Scholar
  21. Graham Cormode, Flip Korn, and Srikanta Tirthapura. Time-decaying aggregates in out-of-order streams. In Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 89-98, 2008. Google Scholar
  22. Graham Cormode, Srikanta Tirthapura, and Bojian Xu. Time-decaying sketches for sensor data aggregation. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, PODC, pages 215-224, 2007. Google Scholar
  23. Graham Cormode, Srikanta Tirthapura, and Bojian Xu. Time-Decayed Correlated Aggregates over Data Streams. In Proceedings of the SIAM International Conference on Data Mining, SDM, pages 271-282, 2009. Google Scholar
  24. Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling Algorithms and Coresets for 𝓁_p Regression. SIAM J. Comput., 38(5):2060-2078, 2009. A preliminary version appeared in the Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, (SODA) 2008. Google Scholar
  25. Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM J. Comput., 31(6):1794-1813, 2002. A preliminary version appeared in the Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002. Google Scholar
  26. Mayur Datar and S. Muthukrishnan. Estimating Rarity and Similarity over Data Stream Windows. In Algorithms - ESA 2002, 10th Annual European Symposium, Proceedings, pages 323-334, 2002. Google Scholar
  27. Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1117-1126. Society for Industrial and Applied Mathematics, 2006. Google Scholar
  28. Joan Feigenbaum, Sampath Kannan, and Jian Zhang. Computing Diameter in the Streaming and Sliding-Window Models. Algorithmica, 41(1):25-41, 2005. Google Scholar
  29. Dan Feldman, Amos Fiat, and Micha Sharir. Coresets forWeighted Facilities and Their Applications. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 315-324, 2006. Google Scholar
  30. Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC, pages 569-578, 2011. Google Scholar
  31. Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry (SoCG), pages 11-18, 2007. Google Scholar
  32. Dan Feldman, Morteza Monemizadeh, Christian Sohler, and David P Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 630-649. Society for Industrial and Applied Mathematics, 2010. Google Scholar
  33. Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, (SODA), 2013, pages 1434-1453, 2013. Google Scholar
  34. Mina Ghashami, Edo Liberty, Jeff M Phillips, and David P Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762-1792, 2016. Google Scholar
  35. Phillip B. Gibbons and Srikanta Tirthapura. Distributed streams algorithms for sliding windows. In SPAA, pages 63-72, 2002. Google Scholar
  36. Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams. In Foundations of computer science, 2000. proceedings. 41st annual symposium on, pages 359-366. IEEE, 2000. Google Scholar
  37. Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pages 291-300, 2004. Google Scholar
  38. Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080-4088, 2016. Google Scholar
  39. Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 100-108, 2014. Google Scholar
  40. Tsvi Kopelowitz and Ely Porat. Improved Algorithms for Polynomial-Time Decay and Time-Decay with Additive Error. In Theoretical Computer Science, 9th Italian Conference, ICTCS Proceedings, pages 309-322, 2005. Google Scholar
  41. Harry Lang. Online Facility Location on Semi-Random Streams. arXiv preprint, 2017. URL: http://arxiv.org/abs/1711.09384.
  42. Lap-Kei Lee and H. F. Ting. A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. In Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 290-297, 2006. Google Scholar
  43. Adam Meyerson. Online facility location. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 426-431. IEEE, 2001. Google Scholar
  44. Vahab S. Mirrokni and Morteza Zadimoghaddam. Randomized Composable Core-sets for Distributed Submodular Maximization. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing (STOC), pages 153-162, 2015. Google Scholar
  45. S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005. Google Scholar
  46. Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 143-152. IEEE, 2006. Google Scholar
  47. Shang-Yu Su, Pei-Chieh Yuan, and Yun-Nung Chen. How time matters: Learning time-decay attention for contextual spoken language understanding in dialogues. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2133-2142, 2018. Google Scholar
  48. Yan Zheng and Jeff M Phillips. Coresets for Kernel Regression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 645-654. ACM, 2017. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail