U-Index: A Universal Indexing Framework for Matching Long Patterns

Ayad, Lorraine A. K.; Fici, Gabriele; Groot Koerkamp, Ragnar; Loukides, Grigorios; Patro, Rob; Pibiri, Giulio Ermanno; Pissis, Solon P.

doi:10.4230/LIPIcs.SEA.2025.4

Abstract

Motivation. Text indexing is a fundamental and well-studied problem. Classic solutions to this problem either replace the original text with a compressed representation, e.g., the FM-index and its variants, or keep it uncompressed but attach some redundancy - an index - to accelerate matching, e.g., the suffix array. The former solutions thus retain excellent compressed space, but are practically slow to construct and query. The latter approaches, instead, sacrifice space efficiency but are typically faster; for example, the suffix array takes much more space than the text itself for commonly used alphabets, like ASCII or DNA, but it is very fast to construct and query.
Methods. In this paper, we show that efficient text indexing can be achieved using just a small extra space on top of the original text, provided that the query patterns are sufficiently long. More specifically, we develop a new indexing paradigm in which a sketch of a query pattern is first matched against a sketch of the text. Once candidate matches are retrieved, they are verified using the original text. This paradigm is thus universal in the sense that it allows us to use any solution to index the sketched text, like a suffix array, FM-index, or r-index.
Results. We explore both the theory and the practice of this universal framework. With an extensive experimental analysis, we show that, surprisingly, universal indexes can be constructed much faster than their unsketched counterparts and take a fraction of the space, as a direct consequence of (i) having a lower bound on the length of patterns and (ii) working in sketch space. Furthermore, these data structures have the potential of retaining or even improving query time, because matching against the sketched text is faster and verifying candidates can be theoretically done in constant time per occurrence (or, in practice, by short and cache-friendly scans of the text).
Finally, we discuss some important applications of this novel indexing paradigm to computational biology. We hypothesize that such indexes will be particularly effective when the queries are sufficiently long, and so we demonstrate applications in long-read mapping.

Tim Anderson and Travis J Wheeler. An optimized FM-index library for nucleotide and amino acid search. Algorithms for Molecular Biology, 16(1):25, 2021. URL: https://doi.org/10.1186/S13015-021-00204-6.
Lorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, and Solon P. Pissis. U-index: A universal indexing framework for matching long patterns. CoRR, abs/2502.14488, 2025. URL: https://doi.org/10.48550/arXiv.2502.14488.
Lorraine A. K. Ayad, Grigorios Loukides, and Solon P. Pissis. Text indexing for long patterns: Anchors are all you need. Proc. VLDB Endow., 16(9):2117-2131, 2023. URL: https://doi.org/10.14778/3598581.3598586.
Lorraine A. K. Ayad, Grigorios Loukides, and Solon P. Pissis. Text indexing for long patterns using locally consistent anchors. CoRR, abs/2407.11819, 2024. URL: https://doi.org/10.48550/arXiv.2407.11819.
Lorraine A. K. Ayad, Grigorios Loukides, Solon P. Pissis, and Hilde Verbeek. Sparse suffix and LCP array: Simple, direct, small, and fast. In José A. Soto and Andreas Wiese, editors, LATIN 2024: Theoretical Informatics - 16th Latin American Symposium, Puerto Varas, Chile, March 18-22, 2024, Proceedings, Part I, volume 14578 of Lecture Notes in Computer Science, pages 162-177. Springer, 2024. URL: https://doi.org/10.1007/978-3-031-55598-5_11.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1), May 2019. URL: https://doi.org/10.1186/s13015-019-0148-5.
Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. Technical report, Systems Research Center, 1994.
Panagiotis Charalampopoulos, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Property suffix array with applications in indexing weighted sequences. ACM J. Exp. Algorithmics, 25:1-16, 2020. URL: https://doi.org/10.1145/3385898.
Francisco Claude, Gonzalo Navarro, Hannu Peltola, Leena Salmela, and Jorma Tarhio. String matching with alphabet sampling. J. Discrete Algorithms, 11:37-50, 2012. URL: https://doi.org/10.1016/J.JDA.2010.09.004.
Martin Dietzfelbinger, Joseph Gil, Yossi Matias, and Nicholas Pippenger. Polynomial hash functions are reliable (extended abstract). In Werner Kuich, editor, Automata, Languages and Programming, 19th International Colloquium, ICALP92, Vienna, Austria, July 13-17, 1992, Proceedings, volume 623 of Lecture Notes in Computer Science, pages 235-246. Springer, 1992. URL: https://doi.org/10.1007/3-540-55719-9_77.
Robert Edgar. Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences. PeerJ, 9(e10805):1755-1771, 2021. URL: https://doi.org/10.7717/peerj.10805.
Peter Elias. Efficient storage and retrieval by content and address of static files. J. ACM, 21(2):246-260, 1974. URL: https://doi.org/10.1145/321812.321820.
Robert Mario Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, 1971.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005. URL: https://doi.org/10.1145/1082036.1082039.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020. URL: https://doi.org/10.1145/3375890.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Joachim Gudmundsson and Jyrki Katajainen, editors, Experimental Algorithms - 13th International Symposium, SEA 2014, Copenhagen, Denmark, June 29 - July 1, 2014. Proceedings, volume 8504 of Lecture Notes in Computer Science, pages 326-337. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
Szymon Grabowski and Marcin Raniszewski. Sampled suffix array with minimizers. Softw. Pract. Exp., 47(11):1755-1771, 2017. URL: https://doi.org/10.1002/SPE.2481.
Deborah L Grady, Robert L Ratliff, Donna L Robinson, Erin C McCanlies, Julianne Meyne, and Robert K Moyzis. Highly conserved repetitive DNA sequences are present at human centromeres. Proceedings of the National Academy of Sciences, 89(5):1695-1699, 1992.
Ragnar Groot Koerkamp and Giulio Ermanno Pibiri. The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers. In Solon P. Pissis and Wing-Kin Sung, editors, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024), volume 312 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1-11:23, Dagstuhl, Germany, 2024. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.WABI.2024.11.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378-407, 2005. URL: https://doi.org/10.1137/S0097539702402354.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/CBO9780511574931.
Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Acceleration of fm-index queries through prefix-free parsing. In Djamal Belazzougui and Aïda Ouangraoua, editors, 23rd International Workshop on Algorithms in Bioinformatics, WABI 2023, September 4-6, 2023, Houston, TX, USA, volume 273 of LIPIcs, pages 13:1-13:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPICS.WABI.2023.13.
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Gregory Kucherov and Esko Ukkonen, editors, Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22-24, 2009, Proceedings, volume 5577 of Lecture Notes in Computer Science, pages 181-192. Springer, 2009. URL: https://doi.org/10.1007/978-3-642-02441-2_17.
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918-936, 2006. URL: https://doi.org/10.1145/1217856.1217858.
Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987. URL: https://doi.org/10.1147/RD.312.0249.
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Amihood Amir and Gad M. Landau, editors, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001 Jerusalem, Israel, July 1-4, 2001 Proceedings, volume 2089 of Lecture Notes in Computer Science, pages 181-192. Springer, 2001. URL: https://doi.org/10.1007/3-540-48194-X_17.
Dominik Kempa and Tomasz Kociumaka. Breaking the 𝒪(n)-barrier in the construction of compressed suffix arrays and suffix trees. In Nikhil Bansal and Viswanath Nagarajan, editors, Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, Florence, Italy, January 22-25, 2023, pages 5122-5202. SIAM, 2023. URL: https://doi.org/10.1137/1.9781611977554.CH187.
Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1877-1886. IEEE, 2023. URL: https://doi.org/10.1109/FOCS57990.2023.00114.
Grigorios Loukides and Solon P. Pissis. Bidirectional string anchors: A new string sampling mechanism. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 64:1-64:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPICS.ESA.2021.64.
Grigorios Loukides, Solon P. Pissis, and Michelle Sweering. Bidirectional string anchors for improved text indexing and top-K similarity search. IEEE Trans. Knowl. Data Eng., 35(11):11093-11111, 2023. URL: https://doi.org/10.1109/TKDE.2022.3231780.
Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2, 2007. URL: https://doi.org/10.1145/1216370.1216372.
Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471-1484, 2011. URL: https://doi.org/10.1109/TC.2010.188.
Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinform., 20(18):3363-3369, 2004. URL: https://doi.org/10.1093/bioinformatics/bth408.
Saul Schleimer, Daniel Shawcross Wilkerson, and Alexander Aiken. Winnowing: Local algorithms for document fingerprinting. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 76-85. ACM, 2003. URL: https://doi.org/10.1145/872757.872770.
Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Improved design and analysis of practical minimizers. Bioinform., 36(Supplement-1):i119-i127, 2020. URL: https://doi.org/10.1093/BIOINFORMATICS/BTAA472.

U-Index: A Universal Indexing Framework for Matching Long Patterns

Authors Lorraine A. K. Ayad , Gabriele Fici , Ragnar Groot Koerkamp , Grigorios Loukides , Rob Patro , Giulio Ermanno Pibiri , Solon P. Pissis

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

U-Index: A Universal Indexing Framework for Matching Long Patterns

Authors Lorraine A. K. Ayad , Gabriele Fici , Ragnar Groot Koerkamp , Grigorios Loukides , Rob Patro , Giulio Ermanno Pibiri , Solon P. Pissis

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Supplementary Materials

References

Thanks for your feedback!

Could not send message