Beyond Logarithmic Bounds: Querying in Constant Expected Time with Learned Indexes

Croquevielle, Luis Alberto; Yang, Guang; Liang, Liang; Hadian, Ali; Heinis, Thomas

doi:10.4230/LIPIcs.ICDT.2025.19

Abstract

Learned indexes leverage machine learning models to accelerate query answering in databases, showing impressive practical performance. However, theoretical understanding of these methods remains incomplete. Existing research suggests that learned indexes have superior asymptotic complexity compared to their non-learned counterparts, but these findings have been established under restrictive probabilistic assumptions. Specifically, for a sorted array with n elements, it has been shown that learned indexes can find a key in O(log(log n)) expected time using at most linear space, compared with O(log n) for non-learned methods.
In this work, we prove that O(1) expected time can be achieved with at most linear space, thereby establishing the tightest upper bound so far for the time complexity of an asymptotically optimal learned index. Notably, we use weaker probabilistic assumptions than prior research, meaning our work generalizes previous results. Furthermore, we introduce a new measure of statistical complexity for data. This metric has an information-theoretic interpretation and can be estimated in practice. This characterization provides further theoretical understanding of learned indexes by helping to explain why some datasets seem to be particularly challenging for these methods.
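To make the setting concrete, the following is a minimal illustrative sketch of a learned-index style lookup over a sorted array. It is not taken from the paper and should not be read as the authors' construction: the class name `BucketIndex`, the equal-width bucketing of the key range, and the choice of one bucket per key are all assumptions made for illustration. The idea is that a simple model maps a key to a small candidate range of positions, and a local search finishes the lookup inside that range. When keys are spread fairly evenly over their range, each bucket tends to hold only a few keys, so the local search is short on average.

```python
import bisect


class BucketIndex:
    """Toy learned index (hypothetical): a piecewise-constant key -> position model."""

    def __init__(self, keys, num_buckets=None):
        # Assumption: `keys` is a non-empty list of numbers in ascending order.
        self.keys = keys
        self.lo, self.hi = keys[0], keys[-1]
        # One bucket per key by default, so extra space stays linear in n.
        self.k = num_buckets or len(keys)
        # Count how many keys land in each equal-width bucket.
        counts = [0] * self.k
        for x in keys:
            counts[self._bucket(x)] += 1
        # starts[b] is the array position of the first key in bucket b;
        # starts[b + 1] is one past the last key in bucket b.
        self.starts = [0] * (self.k + 1)
        for b in range(self.k):
            self.starts[b + 1] = self.starts[b] + counts[b]

    def _bucket(self, x):
        # Map a key to its bucket by linear scaling of the key range.
        if self.hi == self.lo:
            return 0
        b = int((x - self.lo) / (self.hi - self.lo) * self.k)
        return min(max(b, 0), self.k - 1)

    def find(self, x):
        """Return the first position of x in keys, or None if x is absent."""
        if x < self.lo or x > self.hi:
            return None
        b = self._bucket(x)
        left, right = self.starts[b], self.starts[b + 1]
        # Local search restricted to the positions predicted by the model.
        i = bisect.bisect_left(self.keys, x, left, right)
        if i < right and self.keys[i] == x:
            return i
        return None
```

For example, `BucketIndex([1, 3, 3, 7, 100]).find(3)` returns the position of the first `3`, while a query for a missing key such as `4` returns `None`. On heavily skewed data many keys can fall into one bucket, which lengthens the local search; this is the kind of data-dependent behaviour the abstract alludes to.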

