The Space Complexity of Inner Product Filters

File

LIPIcs.ICDT.2020.22.pdf
• Filesize: 485 kB
• 14 pages

Acknowledgements

We would like to thank Thomas Dybdahl Ahle for suggesting a simple proof of an inequality in Section 4, and the anonymous reviewers for many constructive suggestions that improved the presentation. Part of this work was done while the first author was visiting Google Research.

Cite As

Rasmus Pagh and Johan Sivertsen. The Space Complexity of Inner Product Filters. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ICDT.2020.22

Abstract

Motivated by the problem of filtering candidate pairs in inner product similarity joins, we study the following inner product estimation problem: Given parameters d∈ℕ, α>β≥0 and unit vectors x,y∈ℝ^d, consider the task of distinguishing between the cases ⟨x,y⟩≤β and ⟨x,y⟩≥α, where ⟨x,y⟩ = ∑_{i=1}^d x_i y_i is the inner product of vectors x and y. The goal is to distinguish these cases based on information about each vector, encoded independently in a bit string of the shortest length possible. In contrast to much work on compressing vectors using randomized dimensionality reduction, we seek to solve the problem deterministically, with no probability of error. Inner product estimation can be solved in general by estimating ⟨x,y⟩ with an additive error bounded by ε = α - β. We show that d log₂(√(1-β)/ε) ± Θ(d) bits of information about each vector is necessary and sufficient. Our upper bound is constructive and improves a known upper bound of d log₂(1/ε) + O(d) by up to a factor of 2 when β is close to 1. The lower bound holds even in a stronger model where one of the vectors is known exactly, and an arbitrary estimation function is allowed.
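To make the reduction from the gap problem to additive estimation concrete, here is a minimal sketch (not the paper's construction) of the naive baseline behind the d log₂(1/ε) + O(d) bound: round each coordinate of a unit vector to a uniform grid, store only the rounded values, and estimate the inner product from the two quantized vectors. All names and parameter choices below are illustrative assumptions.

```python
import numpy as np

def quantize(x, bits_per_coord):
    # Round each coordinate of a unit vector onto a uniform grid on [-1, 1]
    # with 2**bits_per_coord levels; the rounded vector is all that is stored.
    levels = 2 ** bits_per_coord
    step = 2.0 / (levels - 1)
    return np.round((x + 1.0) / step) * step - 1.0

rng = np.random.default_rng(0)
d = 128

# Two random unit vectors, encoded independently of each other.
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)

xq = quantize(x, bits_per_coord=8)
yq = quantize(y, bits_per_coord=8)

est = float(xq @ yq)   # estimate computed from the encodings alone
true = float(x @ y)
err = abs(est - true)  # additive error; shrinks as bits_per_coord grows
```

With per-coordinate rounding error at most step/2, the Cauchy-Schwarz inequality bounds the additive error by roughly (step/2)·√d per vector, so picking bits_per_coord ≈ log₂(1/ε) + O(log d) suffices to decide ⟨x,y⟩≤β versus ⟨x,y⟩≥α; the paper's contribution is a tighter, β-dependent encoding length.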

Subject Classification

ACM Subject Classification
• Theory of computation → Data structures design and analysis
• Theory of computation → Streaming, sublinear and near linear time algorithms

Keywords
• Similarity
• estimation
• dot product
• filtering

