Fast Lightweight Accurate Xenograft Sorting

Authors Jens Zentgraf, Sven Rahmann

Author Details

Jens Zentgraf
  • Bioinformatics, Computer Science XI, TU Dortmund University, Germany
Sven Rahmann
  • Genome Informatics, Institute of Human Genetics, University of Duisburg-Essen, Essen, Germany

Cite As

Jens Zentgraf and Sven Rahmann. Fast Lightweight Accurate Xenograft Sorting. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 4:1-4:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Abstract

Motivation: With an increasing number of patient-derived xenograft (PDX) models being created and subsequently sequenced to study tumor heterogeneity and to guide therapy decisions, there is a similarly increasing need for methods to separate reads originating from the graft (human) tumor and reads originating from the host species' (mouse) surrounding tissue. Two kinds of methods are in use: On the one hand, alignment-based tools require that reads are mapped and aligned (by an external mapper/aligner) to the host and graft genomes separately first; the tool itself then processes the resulting alignments and quality metrics (typically BAM files) to assign each read or read pair. On the other hand, alignment-free tools work directly on the raw read data (typically FASTQ files). Recent studies compare different approaches and tools, with varying results.

Results: We show that alignment-free methods for xenograft sorting are superior concerning CPU time usage and equivalent in accuracy. We improve upon the state of the art by presenting a fast lightweight approach based on three-way bucketed quotiented Cuckoo hashing. Our hash table requires memory comparable to an FM index typically used for read alignment and less than other alignment-free approaches. It allows extremely fast lookups and uses less CPU time than other alignment-free methods and alignment-based methods at similar accuracy.
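To make the idea of alignment-free sorting with a bucketed Cuckoo hash table more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a toy table with three candidate buckets per key and no quotienting, simple affine hash functions, canonical 2-bit encoded k-mers, and a plain majority vote over species-specific k-mers. The names `BucketedCuckooTable`, `build_index`, and `classify`, the label encoding, and the decision rule are all hypothetical choices made for illustration.

```python
import random

HOST, GRAFT = 1, 2            # hypothetical label bits; HOST | GRAFT means "in both"
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
PRIME = 2**61 - 1
# Three affine hash functions (assumed parameters, chosen only for illustration).
PARAMS = [(1000003, 17), (2654435761, 101), (40503, 7919)]


class BucketedCuckooTable:
    """Toy three-way bucketed Cuckoo hash table (no quotienting)."""

    def __init__(self, n_buckets, bucket_size=4, max_walk=500, seed=0):
        self.n_buckets = n_buckets
        self.bucket_size = bucket_size
        self.max_walk = max_walk
        self.rng = random.Random(seed)
        self.buckets = [[] for _ in range(n_buckets)]  # lists of (key, value)

    def _choices(self, key):
        # Each key may live in one of three candidate buckets.
        return [((a * key + c) % PRIME) % self.n_buckets for a, c in PARAMS]

    def get(self, key, default=None):
        # A lookup inspects at most three buckets.
        for b in self._choices(key):
            for k, v in self.buckets[b]:
                if k == key:
                    return v
        return default

    def insert(self, key, value):
        # Overwrite the value if the key is already stored.
        for b in self._choices(key):
            bucket = self.buckets[b]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
        # Otherwise do a random-walk insertion, evicting items when needed.
        item = (key, value)
        for _ in range(self.max_walk):
            choices = self._choices(item[0])
            for b in choices:
                if len(self.buckets[b]) < self.bucket_size:
                    self.buckets[b].append(item)
                    return
            b = self.rng.choice(choices)
            slot = self.rng.randrange(self.bucket_size)
            item, self.buckets[b][slot] = self.buckets[b][slot], item
        raise RuntimeError("table too full; rebuild with more buckets")


def canonical_kmers(seq, k):
    """Yield canonical integer codes of all k-mers without ambiguous bases."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(c not in CODE for c in kmer):
            continue
        fwd = rev = 0
        for c in kmer:
            fwd = fwd * 4 + CODE[c]
        for c in reversed(kmer):
            rev = rev * 4 + (3 - CODE[c])  # complement of each base, reversed order
        yield min(fwd, rev)


def build_index(host_seq, graft_seq, k, n_buckets=64):
    """Store every canonical k-mer with a label: host, graft, or both."""
    table = BucketedCuckooTable(n_buckets)
    for code in canonical_kmers(host_seq, k):
        table.insert(code, table.get(code, 0) | HOST)
    for code in canonical_kmers(graft_seq, k):
        table.insert(code, table.get(code, 0) | GRAFT)
    return table


def classify(read, table, k):
    """Majority vote over species-specific k-mers (hypothetical decision rule)."""
    host = graft = 0
    for code in canonical_kmers(read, k):
        label = table.get(code, 0)
        host += label == HOST
        graft += label == GRAFT
    if host > graft:
        return "host"
    if graft > host:
        return "graft"
    return "ambiguous"
```

In this sketch the table is built once from both reference sequences, and each read is then assigned by calling `classify(read, table, k)` on the raw sequence, without any alignment step. A real tool would work on genome-scale references and FASTQ input and would need a more careful decision rule for reads with mixed or weak evidence.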

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular sequence analysis
  • Applied computing → Bioinformatics
  • Theory of computation → Bloom filters and hashing
  • Theory of computation → Data structures design and analysis

Keywords
  • xenograft sorting
  • alignment-free method
  • Cuckoo hashing
  • k-mer



