Practical Performance of Space Efficient Data Structures for Longest Common Extensions

eng Schloss Dagstuhl – Leibniz-Zentrum für Informatik Leibniz International Proceedings in Informatics 1868-8969 2020-08-26 39:1 39:20 10.4230/LIPIcs.ESA.2020.39 article Practical Performance of Space Efficient Data Structures for Longest Common Extensions Dinklage, Patrick 1 https://orcid.org/0000-0002-2004-6781 Fischer, Johannes 1 Herlez, Alexander 1 Kociumaka, Tomasz 2 https://orcid.org/0000-0002-2477-1702 Kurpicz, Florian 1 https://orcid.org/0000-0002-2379-9455 Department of Computer Science, Technical University of Dortmund, Germany Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel For a text T[1,n], a Longest Common Extension (LCE) query lce_T(i,j) asks for the length of the longest common prefix of the suffixes T[i,n] and T[j,n] identified by their starting positions 1 ≤ i,j ≤ n. A classic problem in stringology asks to preprocess a static text T[1,n] over an alphabet of size σ so that LCE queries can be efficiently answered on-line. Since its introduction in the 1980s, this problem has found numerous applications: in suffix sorting, edit distance computation, approximate pattern matching, regularities finding, string mining, and many more. Textbook solutions offer O(n) preprocessing time and O(1) query time, but they employ memory-heavy data structures, such as suffix arrays, in practice several times bigger than the text itself. Very recently, more space efficient solutions using O(n log σ) bits of total space or even only O(log n) bits of extra space have been proposed: string synchronizing sets [Kempa and Kociumaka, STOC'19, and Birenzwige et al., SODA'20] and in-place fingerprinting [Prezza, SODA'18]. The goal of this article is to present well-engineered implementations of these new solutions and study their practicality on a commonly agreed text corpus. 
We show that both perform extremely well in practice, with space consumption of only around 10% of the input size for string synchronizing sets (around 20% for highly repetitive texts), and essentially no extra space for fingerprinting. Interestingly, our experiments also show that both solutions become much faster than naive scanning even for finding common prefixes of moderate length, contradicting a common belief that sophisticated data structures for LCE queries are not competitive with naive approaches [Ilie and Tinta, SPIRE'09]. https://drops.dagstuhl.de/storage/00lipics/lipics-vol173-esa2020/LIPIcs.ESA.2020.39/LIPIcs.ESA.2020.39.pdf text indexing longest common prefix space efficient data structures
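The LCE query defined in the abstract can be illustrated with the naive scanning baseline that the article compares against. The following is only a minimal sketch under stated assumptions: the function name `naive_lce` is hypothetical, positions are 1-based to match the T[1,n] notation, and the code is not taken from the authors' implementation.

```python
def naive_lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes text[i..n] and text[j..n].

    Positions i and j are 1-based, following the abstract's T[1,n] notation.
    This is plain character-by-character scanning with no preprocessing;
    the name and interface are illustrative assumptions.
    """
    n = len(text)
    if not (1 <= i <= n and 1 <= j <= n):
        raise ValueError("positions must satisfy 1 <= i, j <= n")

    a, b = i - 1, j - 1  # convert to Python's 0-based indexing
    length = 0
    # Advance while both suffixes still have characters and they agree.
    while a + length < n and b + length < n and text[a + length] == text[b + length]:
        length += 1
    return length


if __name__ == "__main__":
    print(naive_lce("abracadabra", 1, 8))  # compares "abracadabra" with "abra"
```

A query costs time proportional to the answer and needs no extra space. For long common prefixes this is the cost that the index structures named in the abstract aim to avoid.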

<records>

<record>

<publisher>Schloss Dagstuhl – Leibniz-Zentrum für Informatik</publisher>

<journalTitle>Leibniz International Proceedings in Informatics</journalTitle>

<doi>10.4230/LIPIcs.ESA.2020.39</doi>

<documentType>article</documentType>

<title language="eng">Practical Performance of Space Efficient Data Structures for Longest Common Extensions</title>

<authors>

<author>

<name>Dinklage, Patrick</name>

<orcid_id>https://orcid.org/0000-0002-2004-6781</orcid_id>

</author>

<author>

<name>Fischer, Johannes</name>

</author>

<author>

<name>Herlez, Alexander</name>

</author>

<author>

<name>Kociumaka, Tomasz</name>

<orcid_id>https://orcid.org/0000-0002-2477-1702</orcid_id>

</author>

<author>

<name>Kurpicz, Florian</name>

<orcid_id>https://orcid.org/0000-0002-2379-9455</orcid_id>

</author>

</authors>

<affiliationsList>

<affiliationName affiliationId="1">Department of Computer Science, Technical University of Dortmund, Germany</affiliationName>

<affiliationName affiliationId="2">Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel</affiliationName>

</affiliationsList>

<abstract language="eng">For a text T[1,n], a Longest Common Extension (LCE) query lce_T(i,j) asks for the length of the longest common prefix of the suffixes T[i,n] and T[j,n] identified by their starting positions 1 ≤ i,j ≤ n. A classic problem in stringology asks to preprocess a static text T[1,n] over an alphabet of size σ so that LCE queries can be efficiently answered on-line. Since its introduction in the 1980s, this problem has found numerous applications: in suffix sorting, edit distance computation, approximate pattern matching, regularities finding, string mining, and many more. Textbook solutions offer O(n) preprocessing time and O(1) query time, but they employ memory-heavy data structures, such as suffix arrays, in practice several times bigger than the text itself. Very recently, more space efficient solutions using O(n log σ) bits of total space or even only O(log n) bits of extra space have been proposed: string synchronizing sets [Kempa and Kociumaka, STOC'19, and Birenzwige et al., SODA'20] and in-place fingerprinting [Prezza, SODA'18]. The goal of this article is to present well-engineered implementations of these new solutions and study their practicality on a commonly agreed text corpus. We show that both perform extremely well in practice, with space consumption of only around 10% of the input size for string synchronizing sets (around 20% for highly repetitive texts), and essentially no extra space for fingerprinting. Interestingly, our experiments also show that both solutions become much faster than naive scanning even for finding common prefixes of moderate length, contradicting a common belief that sophisticated data structures for LCE queries are not competitive with naive approaches [Ilie and Tinta, SPIRE'09].</abstract>

<fullTextUrl format="pdf">https://drops.dagstuhl.de/storage/00lipics/lipics-vol173-esa2020/LIPIcs.ESA.2020.39/LIPIcs.ESA.2020.39.pdf</fullTextUrl>

<keywords>

<keyword>text indexing</keyword>

<keyword>longest common prefix</keyword>

<keyword>space efficient data structures</keyword>

</keywords>

</record>

</records>