Eulertigs: Minimum Plain Text Representation of k-mer Sets Without Repetitions in Linear Time
Sebastian Schmidt (https://orcid.org/0000-0003-4878-2809), University of Helsinki, Finland
Jarno N. Alanko (https://orcid.org/0000-0002-8003-9225), University of Helsinki, Finland
A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications, it is important to store the k-mers in small space while keeping the representation easy and efficient to use (i.e., in plain text and without k-mer repetitions). Recently, heuristics were presented that compute a near-minimum representation of this kind. We present an algorithm that computes a minimum representation in optimal (linear) time, and we use it to evaluate the existing heuristics. To this end, we present a formalisation of arc-centric bidirected de Bruijn graphs and carefully prove that it accurately models the k-mer spectrum of the input. Our algorithm first constructs the de Bruijn graph in time linear in the length of the input strings (for a fixed-size alphabet). It then uses an Eulerian-cycle-based algorithm to compute the minimum representation in time linear in the size of the output.
Spectrum preserving string sets
Eulerian cycle
Suffix tree
Bidirected arc-centric de Bruijn graph
k-mer based methods
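To make the abstract's notion of a plain-text representation without k-mer repetitions concrete, the following Python sketch checks whether a candidate string set has exactly the same k-mers as the input, with each k-mer occurring once. It is an illustration only, not the paper's algorithm: the function names are hypothetical, the alphabet is assumed to be DNA (A, C, G, T), and a k-mer is assumed to be equivalent to its reverse complement, in line with the bidirected setting.

```python
# Illustrative sketch only; all names here are hypothetical helpers.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(s):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[c] for c in reversed(s))


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(strings, k):
    """List the canonical k-mers of all strings, keeping repetitions."""
    return [
        canonical(s[i:i + k])
        for s in strings
        for i in range(len(s) - k + 1)
    ]


def is_repetition_free_representation(inputs, candidate, k):
    """True if candidate has the same k-mer set as inputs, each k-mer once."""
    spectrum = set(canonical_kmers(inputs, k))
    found = canonical_kmers(candidate, k)
    return set(found) == spectrum and len(found) == len(set(found))


def cumulative_length(strings):
    """Total number of characters, the quantity a minimum representation minimises."""
    return sum(len(s) for s in strings)


if __name__ == "__main__":
    inputs = ["ACGTT", "GTTC"]
    for candidate in (["CGTTC"], ["ACGTTC"]):
        ok = is_repetition_free_representation(inputs, candidate, 3)
        print(candidate, ok, cumulative_length(candidate))
```

In this toy input, ACG and CGT are reverse complements of each other, so a string containing both repeats a k-mer under the stated assumption. The check only validates a given candidate; finding a candidate of minimum cumulative length is the problem the paper addresses.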