Seed-driven Learning of Position Probability Matrices from Large Sequence Sets

Toivonen, Jarkko; Taipale, Jussi; Ukkonen, Esko

doi:10.4230/LIPIcs.WABI.2017.25

Document Identifiers

DOI: 10.4230/LIPIcs.WABI.2017.25
URN: urn:nbn:de:0030-drops-76470

Cite As

Jarkko Toivonen, Jussi Taipale, and Esko Ukkonen. Seed-driven Learning of Position Probability Matrices from Large Sequence Sets. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 25:1-25:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)
https://doi.org/10.4230/LIPIcs.WABI.2017.25

Abstract

We formulate and analyze SeedHam, a novel seed-driven algorithm for learning position probability matrices (PPMs). To learn a PPM of length l, the algorithm uses the most frequent l-mer of the training data as a seed and then restricts learning to a small Hamming neighbourhood of the seed. SeedHam is intended for PPM learning from large sequence sets (up to hundreds of Mbases) containing enriched motif instances. We introduce a robust variant of the method that decreases contamination from artefact instances of the motif and thereby allows larger Hamming neighbourhoods to be used. To solve the motif orientation problem in two-stranded DNA, we introduce a novel seed-finding rule based on an analysis of the palindromic structure of sequences. Test experiments illustrate the relative strengths of the different variants of our methods and show that our algorithms are fast and give stable and accurate results.

Availability and implementation: A C++ implementation of the method is available from https://github.com/jttoivon/seedham/

Contact: jarkko.toivonen@cs.helsinki.fi
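
The basic idea stated in the abstract (take the most frequent l-mer as the seed, then learn only from l-mers within a small Hamming distance of it) can be illustrated with a minimal sketch. This is not the authors' implementation, which is the C++ code linked above. The function names, the `radius` parameter, the alphabetical tie-breaking, and the plain per-column counting are all assumptions made for illustration. The exact counting rule of SeedHam, its robust variant, and the palindrome-based seed-finding rule are not reproduced here.

```python
from collections import Counter

ALPHABET = "ACGT"


def lmers(sequences, l):
    """Yield every length-l window that consists only of A, C, G, T."""
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - l + 1):
            window = seq[i:i + l]
            if all(base in ALPHABET for base in window):
                yield window


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def learn_ppm(sequences, l, radius=1):
    """Return (seed, ppm); ppm is a list of l dicts mapping base -> probability.

    Simplified illustration only: every l-mer within `radius` mismatches of
    the seed contributes to every column (an assumption, not the paper's rule).
    """
    counts = Counter(lmers(sequences, l))
    if not counts:
        raise ValueError("no valid l-mers of the requested length")

    # Seed: the most frequent l-mer; ties are broken alphabetically (assumption).
    seed = min(counts, key=lambda w: (-counts[w], w))

    # Accumulate base counts per column over the Hamming neighbourhood.
    columns = [dict.fromkeys(ALPHABET, 0) for _ in range(l)]
    for window, n in counts.items():
        if hamming(window, seed) <= radius:
            for j, base in enumerate(window):
                columns[j][base] += n

    # Normalise each column so that its entries sum to 1.
    ppm = []
    for col in columns:
        total = sum(col.values())  # > 0, since the seed itself is counted
        ppm.append({base: col[base] / total for base in ALPHABET})
    return seed, ppm
```

Calling `learn_ppm(sequences, l=8, radius=1)` on a list of DNA strings returns the seed and one probability column per motif position. Counting distinct l-mers once with `Counter` and then filtering by distance keeps the sketch short, but it does not scale to the hundreds of Mbases the paper targets.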

Keywords

motif finding
transcription factor binding site
sequence analysis
Hamming distance
seed
