All Fingers Are Not the Same: Handling Variable-Length Sequences in a Discriminative Setting Using Conformal Multi-Instance Kernels

Authors: Sarvesh Nikumbh, Peter Ebert, Nico Pfeifer





Cite As

Sarvesh Nikumbh, Peter Ebert, and Nico Pfeifer. All Fingers Are Not the Same: Handling Variable-Length Sequences in a Discriminative Setting Using Conformal Multi-Instance Kernels. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 16:1-16:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.WABI.2017.16

Abstract

Most string kernels for comparing genomic sequences are tied to the (absolute) positional information of the features in the individual sequences. This poses limitations when comparing variable-length sequences using such string kernels. For example, profiling chromatin interactions by 3C-based experiments results in variable-length genomic sequences (restriction fragments). Here, the exact position-wise occurrence of signals in sequences may not be as important as in the analysis of promoter sequences, which typically have a transcription start site as a reference. Existing position-aware string kernels have been shown to be useful for the latter scenario.

In this work, we propose a novel approach for sequence comparison that enables larger positional freedom than most existing approaches, can identify a possibly dispersed set of features when comparing variable-length sequences, and can handle both of the aforementioned scenarios. Our approach, CoMIK, identifies not just the features useful for classification but also their locations in the variable-length sequences, as evidenced by the results of three binary classification experiments, aided by recently introduced visualization techniques. Furthermore, we show that we are able to efficiently retrieve and interpret the weight vector for the complex setting of multiple multi-instance kernels.
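
As a rough illustration of how variable-length sequences can be compared without relying on absolute positions, the sketch below treats each sequence as a bag of overlapping segments and averages a segment-level k-mer kernel over all segment pairs. This is a generic multi-instance (set) kernel, not the CoMIK implementation: it omits the conformal weighting and the multiple-kernel setting mentioned in the abstract, and all function names and parameter values (segment length, stride, k-mer size) are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(segment, k=2):
    """Count all overlapping k-mers in a segment (spectrum-style features)."""
    return Counter(segment[i:i + k] for i in range(len(segment) - k + 1))


def split_into_segments(sequence, segment_len=4, stride=2):
    """Cut a sequence into overlapping segments; these are the 'instances'."""
    last_start = max(len(sequence) - segment_len, 0)
    return [sequence[s:s + segment_len] for s in range(0, last_start + 1, stride)]


def instance_kernel(a, b, k=2):
    """Cosine-normalized k-mer count kernel between two segments."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    dot = sum(ca[m] * cb[m] for m in ca)
    norm = sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def mi_kernel(seq_x, seq_y, segment_len=4, stride=2):
    """Average the instance kernel over all segment pairs of two sequences.

    Illustrative assumption: a plain (unweighted) set kernel; a conformal
    variant would additionally weight segment pairs.
    """
    bag_x = split_into_segments(seq_x, segment_len, stride)
    bag_y = split_into_segments(seq_y, segment_len, stride)
    total = sum(instance_kernel(a, b) for a in bag_x for b in bag_y)
    return total / (len(bag_x) * len(bag_y))


if __name__ == "__main__":
    short_seq = "ACGTACGT"
    long_seq = "GGGGGGACGTACGTGGGGGGGG"  # same motif, different length and offset
    unrelated = "CCCCCCCCCCCC"
    print(mi_kernel(short_seq, long_seq))
    print(mi_kernel(short_seq, unrelated))
```

Because the score averages over all segment pairs, a shared motif contributes to the similarity wherever it occurs in either sequence, which is the kind of positional freedom the abstract argues for.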

Subject Classification

Keywords
  • Multiple instance learning
  • conformal MI kernels
  • 5C
  • Hi-C

