Repetition Aware Text Indexing for Matching Patterns with Wildcards

Gibney, Daniel; Huffstutler, Jackson; Parthasarathi, Mano Prakash; Thankachan, Sharma V.

doi:10.4230/LIPIcs.ICALP.2025.88

Abstract

We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters, and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that match P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (c log n)^k / k!) and query time O(m + 2^h log log n + occ), where c > 1 is a constant and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are as follows:
O(δ log(n/δ) ⋅ c^k (1 + (log^k(δ log n))/k!)) and O((m + 2^h + occ) log n).
The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log(n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using O(n + δ log(n/δ) ⋅ (c log δ)^{k+ε} / k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).
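
To make the two central notions of the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the index described in the paper: it only illustrates what a wildcard query returns and how substring complexity can be computed on a tiny string, assuming the standard definition δ = max over k of d_k / k, where d_k is the number of distinct length-k substrings of T. The function names and the choice of "?" as the wildcard symbol are assumptions made for illustration.

```python
def wildcard_occurrences(text, pattern, wildcard="?"):
    """Return the 1-based starting positions in `text` where `pattern` matches.

    A wildcard character (assumed here to be "?") matches any single character.
    This is a naive O(n * m) scan, not an index.
    """
    n, m = len(text), len(pattern)
    return [
        i + 1
        for i in range(n - m + 1)
        if all(p == wildcard or p == t for p, t in zip(pattern, text[i:i + m]))
    ]


def substring_complexity(text):
    """Compute delta = max_k (number of distinct length-k substrings) / k.

    Brute force over all lengths k, so only suitable for short strings.
    """
    n = len(text)
    best = 0.0
    for k in range(1, n + 1):
        distinct = len({text[i:i + k] for i in range(n - k + 1)})
        best = max(best, distinct / k)
    return best


if __name__ == "__main__":
    print(wildcard_occurrences("abracadabra", "a?a"))
    print(substring_complexity("abab"))
```

On a highly repetitive string such as a long run of a single character, the second function returns a small value regardless of length, which is the regime where space bounds expressed in δ rather than n pay off.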

References

Paniz Abedin, Oliver A. Chubet, Daniel Gibney, and Sharma V. Thankachan. Contextual pattern matching in less space. In Data Compression Conference, DCC 2023, Snowbird, UT, USA, March 21-24, 2023, pages 160-167. IEEE, 2023. URL: https://doi.org/10.1109/DCC55655.2023.00024.
Georgii M Adel’son-Vel’skii. An algorithm for the organization of information. Soviet Math., 3:1259-1263, 1962.
Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings, volume 9133 of Lecture Notes in Computer Science, pages 26-39. Springer, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_3.
Djamal Belazzougui and Gonzalo Navarro. Alphabet-independent compressed text indexing. ACM Trans. Algorithms, 10(4):23:1-23:19, 2014. URL: https://doi.org/10.1145/2635816.
Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, and Teresa Anna Steiner. Gapped string indexing in subquadratic space and sublinear query time. In 41st International Symposium on Theoretical Aspects of Computer Science, STACS 2024, volume 289 of LIPIcs, pages 16:1-16:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2024. URL: https://doi.org/10.4230/LIPICS.STACS.2024.16.
Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41-60, 2014. URL: https://doi.org/10.1007/S00224-013-9498-4.
Michael Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. 1994. URL: https://api.semanticscholar.org/CorpusID:2167441.
Timothy M. Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the 27th ACM Symposium on Computational Geometry, Paris, France, June 13-15, 2011, pages 1-10. ACM, 2011. URL: https://doi.org/10.1145/1998196.1998198.
Timothy M. Chan and Konstantinos Tsakalidis. Dynamic orthogonal range searching on the RAM, revisited. J. Comput. Geom., 9(2):45-66, 2018. URL: https://doi.org/10.20382/JOCG.V9I2A5.
Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91-100. ACM, 2004. URL: https://doi.org/10.1145/1007352.1007374.
Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143. IEEE Computer Society, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390-398. IEEE Computer Society, 2000. URL: https://doi.org/10.1109/SFCS.2000.892127.
Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. LZ77-based self-indexing with faster pattern matching. In LATIN 2014: Theoretical Informatics - 11th Latin American Symposium, Montevideo, Uruguay, March 31 - April 4, 2014. Proceedings, volume 8392 of Lecture Notes in Computer Science, pages 731-742. Springer, 2014. URL: https://doi.org/10.1007/978-3-642-54423-1_63.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020. URL: https://doi.org/10.1145/3375890.
Arnab Ganguly, Daniel Gibney, Paul Macnichol, and Sharma V. Thankachan. Bounded-ratio gapped string indexing. In String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 118-126. Springer, 2024. URL: https://doi.org/10.1007/978-3-031-72200-4_9.
Daniel Gibney, Paul Macnichol, and Sharma V. Thankachan. Non-overlapping indexing in BWT-runs bounded space. In String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 260-270. Springer, 2023. URL: https://doi.org/10.1007/978-3-031-43980-3_21.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378-407, 2005. URL: https://doi.org/10.1137/S0097539702402354.
Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338-355, 1984. URL: https://doi.org/10.1137/0213024.
Md Helal Hossen, Daniel Gibney, and Sharma V Thankachan. Text indexing for faster gapped pattern matching. Algorithms, 17(12), 2024.
Costas S. Iliopoulos and M. Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60-70, 2009. URL: https://doi.org/10.1007/S00453-007-9141-3.
Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP), pages 141-155, 1996.
Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756-767. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316368.
Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. Commun. ACM, 65(6):91-98, 2022. URL: https://doi.org/10.1145/3531445.
Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1877-1886. IEEE, 2023. URL: https://doi.org/10.1109/FOCS57990.2023.00114.
Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827-840. ACM, 2018. URL: https://doi.org/10.1145/3188745.3188814.
Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in δ-optimal space, and vice versa. Algorithmica, 86(4):1031-1056, 2024. URL: https://doi.org/10.1007/S00453-023-01186-0.
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In LATIN 2020: Theoretical Informatics - 14th Latin American Symposium, São Paulo, Brazil, January 5-8, 2021, Proceedings, volume 12118 of Lecture Notes in Computer Science, pages 207-219. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-61792-9_17.
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory, 69(4):2074-2092, 2023. URL: https://doi.org/10.1109/TIT.2022.3224382.
Sebastian Kreft and Gonzalo Navarro. Self-indexing based on LZ77. In Combinatorial Pattern Matching - 22nd Annual Symposium, CPM 2011, Palermo, Italy, June 27-29, 2011. Proceedings, volume 6661 of Lecture Notes in Computer Science, pages 41-54. Springer, 2011. URL: https://doi.org/10.1007/978-3-642-21458-5_6.
Moshe Lewenstein. Indexing with gaps. In String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings, volume 7024 of Lecture Notes in Computer Science, pages 135-143. Springer, 2011. URL: https://doi.org/10.1007/978-3-642-24583-1_14.
Moshe Lewenstein, J. Ian Munro, Yakov Nekrich, and Sharma V. Thankachan. Document retrieval with one wildcard. In Mathematical Foundations of Computer Science 2014 - 39th International Symposium, MFCS 2014, Budapest, Hungary, August 25-29, 2014. Proceedings, Part II, volume 8635 of Lecture Notes in Computer Science, pages 529-540. Springer, 2014. URL: https://doi.org/10.1007/978-3-662-44465-8_45.
Moshe Lewenstein, J. Ian Munro, Venkatesh Raman, and Sharma V. Thankachan. Less space: Indexing for queries with wildcards. Theor. Comput. Sci., 557:120-127, 2014. URL: https://doi.org/10.1016/J.TCS.2014.09.003.
Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern matching. In 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014), STACS 2014, March 5-8, 2014, Lyon, France, volume 25 of LIPIcs, pages 506-517. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2014. URL: https://doi.org/10.4230/LIPICS.STACS.2014.506.
Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinform., 25(14):1754-1760, 2009. URL: https://doi.org/10.1093/BIOINFORMATICS/BTP324.
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010. URL: https://doi.org/10.1089/CMB.2009.0169.
César Martínez-Guardiola, Nathaniel K. Brown, Fernando Silva-Coira, Dominik Köppl, Travis Gagie, and Susana Ladra. Augmented thresholds for MONI. In Data Compression Conference, DCC 2023, Snowbird, UT, USA, March 21-24, 2023, pages 268-277. IEEE, 2023. URL: https://doi.org/10.1109/DCC55655.2023.00035.
Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2022. URL: https://doi.org/10.1145/3434399.
Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1-26:32, 2022. URL: https://doi.org/10.1145/3432999.
Gonzalo Navarro. Computing MEMs and relatives on repetitive text collections. ACM Trans. Algorithms, 21(1):12:1-12:33, 2025. URL: https://doi.org/10.1145/3701561.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2, 2007. URL: https://doi.org/10.1145/1216370.1216372.
Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 101:1-101:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPICS.ICALP.2021.101.
Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80(7):1986-2011, 2018. URL: https://doi.org/10.1007/S00453-017-0327-Z.
Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam D. Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685-709, 2013. URL: https://doi.org/10.1007/S00453-012-9618-6.
Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol., 29(2):188-194, 2022. URL: https://doi.org/10.1089/CMB.2021.0445.
Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol., 29(2):169-187, 2022. URL: https://doi.org/10.1089/CMB.2021.0290.
Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. Fully compressed suffix trees. ACM Trans. Algorithms, 7(4):53:1-53:34, 2011. URL: https://doi.org/10.1145/2000807.2000821.
Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589-607, 2007. URL: https://doi.org/10.1007/S00224-006-1198-X.
Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362-391, 1983. URL: https://doi.org/10.1016/0022-0000(83)90006-5.
James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982. URL: https://doi.org/10.1145/322344.322346.
Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie. MONI can find k-MEMs. In 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 26:1-26:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPICS.CPM.2023.26.
Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11. IEEE Computer Society, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81-84, 1983. URL: https://doi.org/10.1016/0020-0190(83)90075-3.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337-343, 1977. URL: https://doi.org/10.1109/TIT.1977.1055714.
