,
Jackson Huffstutler
,
Mano Prakash Parthasarathi
,
Sharma V. Thankachan
Creative Commons Attribution 4.0 International license
We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that matches P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (clog n)^k/k!)) and query time O(m+2^h log log n+occ), where c > 1 is a constant, and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are as follows:
O(δ log (n/δ)⋅ c^k (1+(log^k (δ log n))/k!)) and O((m+2^h +occ)log n))
The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log (n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using O(n+δ log (n/δ) ⋅ (clogδ)^{k+ε}/k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).
@InProceedings{gibney_et_al:LIPIcs.ICALP.2025.88,
author = {Gibney, Daniel and Huffstutler, Jackson and Parthasarathi, Mano Prakash and Thankachan, Sharma V.},
title = {{Repetition Aware Text Indexing for Matching Patterns with Wildcards}},
booktitle = {52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)},
pages = {88:1--88:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-372-0},
ISSN = {1868-8969},
year = {2025},
volume = {334},
editor = {Censor-Hillel, Keren and Grandoni, Fabrizio and Ouaknine, Jo\"{e}l and Puppis, Gabriele},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2025.88},
URN = {urn:nbn:de:0030-drops-234656},
doi = {10.4230/LIPIcs.ICALP.2025.88},
annote = {Keywords: Pattern Matching, Text Indexing, Wildcard Matching}
}