Random Access in Persistent Strings

Bille, Philip; Gørtz, Inge Li

doi:10.4230/LIPIcs.ISAAC.2020.48

Abstract

We consider compact representations of collections of similar strings that support random access queries. The collection of strings is given by a rooted tree where edges are labeled by an edit operation (inserting, deleting, or replacing a character) and a node represents the string obtained by applying the sequence of edit operations on the path from the root to the node. The goal is to compactly represent the entire collection while supporting fast random access to any part of a string in the collection. This problem captures natural scenarios such as representing the past history of an edited document or representing highly-repetitive collections. Given a tree with n nodes, we show how to represent the corresponding collection in O(n) space and optimal O(log n/ log log n) query time. This improves the previous time-space trade-offs for the problem. To obtain our results, we introduce new techniques and ideas, including a reduction to a new geometric line segment selection together with an efficient solution.

Jérémy Barbay, Francisco Claude, Travis Gagie, Gonzalo Navarro, and Yakov Nekrich. Efficient fully-compressed sequence representations. Algorithmica, 69(1):232-268, 2014.
Jérémy Barbay, Meng He, J Ian Munro, and S Srinivasa Rao. Succinct indexes for strings, binary relations and multi-labeled trees. In Proc. 18th SODA, pages 680-689, 2007.
Djamal Belazzougui, Patrick Hagge Cording, Simon J. Puglisi, and Yasuo Tabei. Access, rank, and select in grammar-compressed strings. In Proc. 23rd ESA, pages 142-154, 2015.
Djamal Belazzougui and Gonzalo Navarro. Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms, 11(4):1-21, 2015.
Philip Bille, Anders Roy Christiansen, Patrick Hagge Cording, Inge Li Gørtz, Frederik Rye Skjoldjensen, Hjalte Wedel Vildhøj, and Søren Vind. Dynamic relative compression, dynamic partial sums, and substring concatenation. Algorithmica, 80(11):3207-3224, 2018. Announced at ISAAC 2016.
Philip Bille, Anders Roy Christiansen, Nicola Prezza, and Frederik Rye Skjoldjensen. Succinct partial sums and Fenwick trees. In Proc. 24th SPIRE, pages 91-96, 2017.
Philip Bille, Mikko Berggren Ettienne, Inge Li Gørtz, and Hjalte Wedel Vildhøj. Time-space trade-offs for lempel-ziv compressed indexing. Theoret. Comput. Sci., 713:66-77, 2018.
Philip Bille, Inge Li Gørtz, Gad M Landau, and Oren Weimann. Tree compression with top trees. Inform. and Comput., 243:166-177, 2015.
Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM J. Comput., 44(3):513-539, 2015. Announced at SODA 2011.
M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Trans. Inform. Theory, 51(7):2554-2576, 2005.
Bernard Chazelle. Filtering search: A new approach to query-answering. SIAM J. Comput., 15(3):703-724, 1986.
BG Chern, Idoia Ochoa, Alexandros Manolakos, Albert No, Kartik Venkat, and Tsachy Weissman. Reference based genome compression. In Proc. 12th ITW, pages 427-431, 2012.
P. F. Dietz. Fully persistent arrays (extended array). In Proc. 1st WADS, volume 382, pages 67-74, 1989.
Paul F Dietz. Optimal algorithms for list indexing and subset rank. In Proc. 1st WADS, pages 39-46, 1989.
Huy Hoang Do, Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Fast relative Lempel-Ziv self-index for similar sequences. Theoret. Comput. Sci., 532:14-30, 2014.
J. Driscoll, N. Sarnak, D. Sleator, and R. Tarjan. Making data structures persistent. J. Comput. System Sci., 38:86-124, 1989.
Peter M Fenwick. A new data structure for cumulative frequency tables. Software: Pract. Exper., 24(3):327-336, 1994.
Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms, 3(2):20, 2007.
Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theoret. Comput. Sci., 372(1):115-121, 2007.
Michael Fredman and Michael Saks. The cell probe complexity of dynamic data structures. In Proc. 21st STOC, pages 345-354, 1989.
Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. System Sci., 47(3):424-436, 1993.
Michael L. Fredman and Dan E. Willard. Trans-dichotomous algorithms for minimum spanning trees and shortest paths. J. Comput. System Sci., 48(3):533-551, 1994.
Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J Puglisi. A faster grammar-based self-index. In Proc. 6th LATA, pages 240-251, 2012.
Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J Puglisi. LZ77-based self-indexing with faster pattern matching. In Proc. 14th LATIN, pages 731-742, 2014.
Travis Gagie, Paweł Gawrychowski, and Simon J Puglisi. Approximate pattern matching in lz77-compressed texts. J. Discrete Algorithms, 32:64-68, 2015.
Travis Gagie, Kalle Karhu, Gonzalo Navarro, Simon J Puglisi, and Jouni Sirén. Document listing on repetitive collections. In Proc. 24th CPM, pages 107-119, 2013.
Moses Ganardi, Artur Jez, and Markus Lohrey. Balancing straight-line programs. In Proc. 60th FOCS, pages 1169-1183, 2019.
Alexander Golynski, J Ian Munro, and S Srinivasa Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th SODA, pages 368-373, 2006.
Alexander Golynski, Rajeev Raman, and S Srinivasa Rao. On the redundancy of succinct data structures. In Proc. 11th SWAT, pages 148-159, 2008.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841-850, 2003.
Roberto Grossi, Rajeev Raman, Satti Srinivasa Rao, and Rossano Venturini. Dynamic compressed strings with random access. In Proc. 40th ICALP, pages 504-515. Springer, 2013.
Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. Succinct data structures for searchable partial sums with optimal worst-case performance. Theoret. Comput. Sci., 412(39):5176-5186, 2011.
Christopher Hoobin, Simon J Puglisi, and Justin Zobel. Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endowment, 5(3):265-273, 2011.
Allan Grønlund Jørgensen and Kasper Green Larsen. Range selection and median: Tight cell probe lower bounds and adaptive data structures. In Proc. 22nd SODA, pages 805-813, 2011.
Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In Proc. 50th STOC, pages 827-840, 2018.
Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proc. 17th SPIRE, pages 201-206, 2010.
Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Optimized relative Lempel-Ziv compression of genomes. In Proc. 34th ACSC, pages 91-98, 2011.
Stan Y. Liao, Srinivas Devadas, and Kurt Keutzer. A text-compression-based method for code size minimization in embedded systems. Trans. Design Autom. Electr. Syst., 4(1):12-38, 1999.
Stan Y. Liao, Srinivas Devadas, Kurt Keutzer, Steven W. K. Tjiang, and Albert Wang. Code optimization techniques in embedded DSP microprocessors. Design Autom. Emb. Sys., 3(1):59-73, 1998.
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010.
J Ian Munro and Yakov Nekrich. Compressed data structures for dynamic sequences. In Proc. 23rd ESA, pages 891-902. Springer, 2015.
Gonzalo Navarro. Indexing highly repetitive collections. In Proc. 23rd IWOCA, pages 274-279, 2012.
Gonzalo Navarro. Document listing on repetitive collections with guaranteed performance. Theoret. Comput. Sci., 772:58-72, 2019.
Mihai Pătraşcu and Mikkel Thorup. Dynamic integer sets with optimal rank, select, and predecessor search. In Proc. 55th FOCS, pages 166-175, 2014.
Mihai Pǎtraşcu and Erik D. Demaine. Logarithmic lower bounds in the cell-probe model. SIAM J. Comput., 35(4):932-963, 2006. Announced at SODA 2004.
Rajeev Raman, Venkatesh Raman, and S Srinivasa Rao. Succinct dynamic data structures. In Proc. 7th WADS, pages 426-437, 2001.
Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci., 302(1-3):211-222, 2003.
Kunihiko Sadakane and Roberto Grossi. Squeezing succinct data structures into entropy bounds. In Proc. 17th SODA, pages 1230-1239, 2006.
Neil Sarnak and Robert Endre Tarjan. Planar point location using persistent search trees. Commun. ACM, 29(7):669-679, 1986.
James A. Storer and Thomas G. Szymanski. The macro model for data compression. In Proc. 10th STOC, pages 30-39, 1978.
James A Storer and Thomas G Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982.
Robert Endre Tarjan and Uzi Vishkin. Finding biconnected componemts and computing tree functions in logarithmic parallel time. In Proc. 25th FOCS, pages 12-20, 1984.
Elad Verbin and Wei Yu. Data structure lower bounds on random access to grammar-compressed strings. In Proc. 24th CPM, pages 247-258, 2013.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, 23(3):337-343, 1977.

Random Access in Persistent Strings

Authors Philip Bille , Inge Li Gørtz

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

References

Thanks for your feedback!

Could not send message

Random Access in Persistent Strings

Authors Philip Bille , Inge Li Gørtz

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Acknowledgements

References

Thanks for your feedback!

Could not send message