# Efficient Index for Weighted Sequences

## File

LIPIcs.CPM.2016.4.pdf
• Filesize: 0.5 MB
• 13 pages

## Cite As

Carl Barton, Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski. Efficient Index for Weighted Sequences. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 4:1-4:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)
https://doi.org/10.4230/LIPIcs.CPM.2016.4

## Abstract

The problem of finding factors of a text string that are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in building an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet, a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of the probabilities of its letters at subsequent positions. Given a probability threshold 1/z, we say that a pattern string P matches a weighted text at position i if the product of the probabilities of the letters of P at positions i,...,i+|P|-1 in the text is at least 1/z. In this article, we present an O(nz)-time construction of an O(nz)-sized index that can answer pattern matching queries in a weighted text in optimal time, improving upon the state of the art by a factor of z log z. Other applications of this data structure include an O(nz)-time construction of the weighted prefix table and an O(nz)-time computation of all covers of a weighted sequence, both of which improve upon the state of the art by the same factor.

## Keywords
• weighted sequence
• position weight matrix
• indexing
• weighted suffix tree
