A Comparative Study of Compressed, Learned, and Traditional Indexing Methods for Integer Data

Bellomo, Lorenzo; Cianci, Giuseppe; de Rosa, Luca; Ferragina, Paolo; Odorisio, Mattia

doi:10.4230/LIPIcs.SEA.2025.5

Abstract

The rapid evolution of learned data structures has revolutionized database indexing, particularly for sorted integer datasets. While learned indexes excel in static scenarios thanks to their low memory footprint, reduced storage requirements, and fast lookup times, benchmarks such as SOSD and TLI have largely overlooked compressed indexes and SIMD-based implementations of traditional indexes. This paper addresses this gap by introducing a comprehensive benchmarking framework that (i) evaluates traditional, learned, and compressed indexes across 12 datasets (real and synthetic) of varying types and sizes; (ii) integrates state-of-the-art SIMD-enhanced B-Tree variants; and (iii) measures critical performance metrics such as memory usage, construction time, and lookup efficiency. Our findings reveal that, while learned indexes minimize memory usage, which is useful when internal memory is tightly constrained, SIMD-enhanced B-Trees consistently achieve superior lookup times with comparable extra space. Compressed indexes such as LA-vector and Elias-Fano, on the other hand, compress the indexed data very effectively, at the cost of accesses that are 2x-3x slower. A further contribution of this paper is a publicly available benchmarking framework (code and datasets) that makes our experiments reproducible and extensible to other indexes and datasets.
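To make the contrast between traditional and learned lookups on sorted integers concrete, the following is a minimal sketch, not the authors' framework and not any of the benchmarked indexes. It assumes a non-empty sorted list of integer keys and uses a single linear model with a recorded maximum error, whereas practical learned indexes use many models. The names `build_linear_model` and `learned_lower_bound` are hypothetical and chosen only for illustration.

```python
import bisect


def build_linear_model(keys):
    """Fit position ~ slope * key + intercept through the first and last key.

    Also records the maximum absolute prediction error over all stored keys,
    which bounds how far the final search has to look.
    Assumes `keys` is a non-empty sorted list of integers (illustrative only).
    """
    n = len(keys)
    first, last = keys[0], keys[-1]
    slope = (n - 1) / (last - first) if last > first else 0.0
    intercept = -slope * first
    max_err = max(abs(int(slope * k + intercept) - i) for i, k in enumerate(keys))
    return slope, intercept, max_err


def learned_lower_bound(keys, model, x):
    """Return the first position i with keys[i] >= x (len(keys) if none).

    The model predicts an approximate position; a binary search restricted
    to the error window around that prediction finds the exact answer.
    """
    slope, intercept, max_err = model
    n = len(keys)
    if x <= keys[0]:
        return 0
    if x > keys[-1]:
        return n
    pred = int(slope * x + intercept)
    left = max(0, pred - max_err - 1)
    right = min(n, pred + max_err + 2)
    return bisect.bisect_left(keys, x, left, right)


# Traditional baseline for comparison: binary search over the whole array.
def binary_lower_bound(keys, x):
    return bisect.bisect_left(keys, x)
```

The model here occupies three numbers regardless of the number of keys, which hints at why learned indexes can be small. The lookup cost, however, depends on the error window, so a poorly fitting model degrades towards a full binary search.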

