Haplotype-Aware Long-Read Error Correction
Abstract
Error correction of long reads is an important initial step in genome assembly workflows. For organisms with ploidy greater than one, it is important to preserve haplotype-specific variation during read correction. This challenge has driven the development of several haplotype-aware correction methods. However, existing methods are based on either ad-hoc heuristics or deep learning approaches. In this paper, we introduce a rigorous formulation for this problem. Our approach builds on the minimum error correction framework used in reference-based haplotype phasing. We prove that the proposed formulation for error correction of reads in de novo context, i.e., without using a reference genome, is NP-hard. To make our exact algorithm scale to large datasets, we introduce practical heuristics. Experiments using PacBio HiFi sequencing datasets from human and plant genomes show that our approach achieves accuracy comparable to state-of-the-art methods. The software is freely available at https://github.com/at-cg/HALE.
Keywords and phrases: Genome assembly, phasing, clustering, overlap graph, consensus
2012 ACM Subject Classification: Applied computing → Computational genomics
Supplementary Material: Software (Source Code): https://github.com/at-cg/HALE, archived at swh:1:dir:c16db2b0e0697006fa3992e023285e8d75fce093
Acknowledgements: The authors thank Bikram Kumar Panda, Sudhanva Shyam Kamath, Mile Sikic, and Paul Medvedev for providing useful feedback during the project.
Funding: This research is supported in part by funding from the DBT/Wellcome Trust India Alliance (IA/I/23/2/506979). We used computing resources provided by the National Energy Research Scientific Computing Center (NERSC), USA.
Editors: Broňa Brejová and Rob Patro
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Recent improvements in long-read sequencing technologies in terms of read length and accuracy have led to a paradigm shift in the quality of genome assemblies [14]. Currently, Oxford Nanopore Technology (ONT) long reads have error rates of a few percent, while Pacific Biosciences (PacBio) HiFi reads have lower error rates, typically below one percent [8]. Error correction is important for distinguishing haplotypes and repeats during genome assembly [20]. The key challenge is that the sequencing error rates in long reads often exceed the sequence divergence between the repetitive genomic regions that we want to separate.
It is possible to correct sequencing errors by leveraging redundancy in sequencing data because each base in the genome is sampled by multiple reads during whole-genome sequencing [23]. One simple approach is to consider all-vs-all overlaps among reads, and subsequently correct each read by doing a majority vote among the bases of reads that overlap it. Unfortunately, this direct strategy leads to a loss of haplotype-specific and repeat-specific variation. As an example, if two genomic regions of a haploid genome differ at only a few positions, then their corresponding reads may overlap with each other. The majority vote strategy in such cases would eliminate the true biological differences. This issue becomes even more pronounced in diploid and polyploid genomes. In humans, the two haplotypes are approximately 99.9% identical. As a result, reads from different haplotypes overlap frequently. Complicating matters even further, both sequencing errors and biological variations are unevenly distributed across the genome. Certain regions of a genome, such as homopolymers, have high error rates [11].
Long-read assemblers that support diploid genome assembly [1, 2, 3, 13] incorporate haplotype-aware error correction as a mandatory step in their pipelines. More recently, dedicated tools for error correction [10, 18] have also been developed. Most algorithms begin by computing all-vs-all overlaps between reads. For instance, given a target read to be corrected, Hifiasm [2] considers the alignments of overlapping reads to that target read. It uses heuristics to identify heterozygous loci in the target read. Subsequently, it discards those overlapping reads that do not match the target read at all the heterozygous loci. Herro [18] is a dedicated error correction tool that uses a supervised learning model composed of convolutional blocks and a transformer encoder. PECAT [13] uses a heuristic scoring strategy based on partial-order alignment graph [6] for each read to be corrected. DeChat [10] also adopts a heuristic-driven method, performing an initial pre-correction using a de Bruijn graph, followed by a variant-aware multiple sequence alignment strategy adapted from Hifiasm.
In this work, we introduce a rigorous mathematical formulation of the haplotype-aware error correction problem. For each target read, we define a combinatorial optimization problem aimed at selecting reads that originate from the same haplotype and genomic region. An accurate selection of such reads enables a direct consensus-based correction using these reads. We prove that this optimization problem is NP-hard via a reduction from the Max-Cut problem. Despite its hardness, we show that it can be practically solved for human genome datasets using a combination of simple heuristics and brute-force search. We implemented our approach in a tool named HALE (Haplotype-aware Long-read Error correction) and evaluated its performance on PacBio HiFi datasets from human and plant genomes. HALE matches the accuracy of state-of-the-art deep learning and heuristic-based methods while offering a more straightforward and theoretically grounded approach.
2 Preliminaries
Our proposed formulation (defined later in Section 3) is partly inspired by the Hypercube 2-segmentation (H2S) problem that was originally introduced by Kleinberg, Papadimitriou and Raghavan in 1998 for bi-clustering [5]. The problem statement is as follows.
H2S Problem.
Given a set of $n$ vectors $v_1, v_2, \ldots, v_n$ in $\{0,1\}^d$, one needs to select two centers $c_1$ and $c_2$ in $\{0,1\}^d$ maximizing
$\sum_{i=1}^{n} \max\big(\mathrm{agree}(v_i, c_1),\, \mathrm{agree}(v_i, c_2)\big)$,
where the function $\mathrm{agree}(u, w)$ counts on how many coordinates vectors $u$ and $w$ agree (which is also $d$ minus the Hamming distance between $u$ and $w$).
Another way to interpret the above problem is that we seek to partition the input vectors into two clusters such that the number of bit flips needed to make all the vectors agree within each cluster is minimized. The H2S problem is known to be NP-hard [4].
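As a concrete illustration, the following short Python sketch evaluates the H2S objective on a hypothetical toy instance; the `agree` function is the agreement score defined above, and the brute-force enumeration over centers is only for illustration (it is exponential in the dimension).

```python
from itertools import product

def agree(u, v):
    """Number of coordinates on which two equal-length vectors agree."""
    return sum(1 for a, b in zip(u, v) if a == b)

def h2s_value(vectors, c1, c2):
    """H2S objective: each vector contributes its agreement with the closer center."""
    return sum(max(agree(v, c1), agree(v, c2)) for v in vectors)

# Toy instance: four binary vectors of dimension 4 (hypothetical data).
vectors = [(0, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 0), (1, 1, 0, 1)]

# Brute force over all pairs of candidate centers (illustration only).
best_pair = max(
    ((c1, c2) for c1 in product((0, 1), repeat=4) for c2 in product((0, 1), repeat=4)),
    key=lambda pair: h2s_value(vectors, *pair),
)
print(best_pair, h2s_value(vectors, *best_pair))
```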
A generalization of the above problem, where each vector can have one or more consecutive wildcard symbols at both ends, is commonly used in the context of reference-guided haplotype phasing [9, 15]. This generalized version of the problem is also referred to as the minimum error correction (MEC) problem. Solving the MEC problem is useful to determine the variants that are co-located on the same (maternal or paternal) haplotype. The MEC-based approach assumes the availability of a reference genome because each input vector is derived from an alignment of a read to the reference genome.
For de novo assembly, we cannot assume the availability of a reference genome. However, we assume the availability of a collection of overlaps between reads, which is a common first step in most long-read assemblers [12]. An overlap either implies (i) a sufficiently long approximate match between the suffix of a read and the prefix of another read or (ii) an approximate match of an entire read to a substring of another read. The overlaps identified using approximate string matching algorithms would comprise both true overlaps and false overlaps. The false overlaps arise from repetitive sequences within or across haplotypes. Accordingly, when considering a target read alongside its overlapping reads, we may have one, two, or even multiple “read clusters”, each corresponding to different haplotypes or genomic loci.
3 Problem Formulation
Our input is a multiple sequence alignment (MSA) between a substring of a target read and overlapping substrings from other reads (see Figure 1 for an illustration). We defer the details of how these MSAs are constructed to Section 5. Suppose that the MSA comprises $n + 1$ vectors in $\Sigma^{\ell}$, where $\Sigma = \{$A, C, G, T, -$\}$, $n$ is the number of overlapping substrings, and $\ell$ is the length of the MSA. We denote the vector in the MSA that represents the substring of the target read as $t$, and the set of all other vectors representing the overlapping substrings as $V$.
Some vectors in $V$ represent true overlaps; they may differ from vector $t$ only due to sequencing errors in the reads. The remaining vectors differ from $t$ due to both sequencing errors and haplotype-specific or repeat-specific variations. Accurately identifying the true overlaps is crucial for error correction, as it enables the use of a simple majority-vote strategy within the true overlaps. To achieve this, we introduce the following parsimony-based formulation aimed at selecting vectors from $V$ that, along with vector $t$, maximize a clustering score.
Problem 1.
Given a vector $t$ in $\Sigma^{\ell}$, a set $V$ of vectors in $\Sigma^{\ell}$, and a positive integer $k$, compute a subset $S \subseteq V$ of cardinality $k$ and a center $c$ in $\Sigma^{\ell}$ such that $\mathrm{agree}(t, c) + \sum_{v \in S} \mathrm{agree}(v, c)$ is maximum.
We make a few remarks on this problem formulation. First, note that the gap symbol “-” is not given any special treatment and is processed like any other character. Second, the vector $c$ in an optimal solution satisfies the property that $c[j]$ is a most frequent character at coordinate $j$ among the vectors in $\{t\} \cup S$, for all $1 \le j \le \ell$, where ties are broken arbitrarily. Accordingly, $c$ can be trivially computed once $S$ is known. After solving Problem 1, vector $c$ will be interpreted as the corrected version of vector $t$. Third, $k$ is a user-defined parameter in Problem 1. Its value should depend on the sequencing coverage. One can approximately set $k$ to the largest value such that every genomic interval of the window length is sequenced by at least $k$ reads with high probability.
Our problem formulation differs fundamentally from reference-guided haplotype phasing, e.g., WhatsHap [15]. Traditional reference-guided phasing aims to reconstruct two haplotypes by partitioning reads into two clusters. This clustering is disjoint and exhaustive, i.e., every read is assigned to exactly one of the two clusters. In contrast, we focus on identifying a subset of reads that corresponds to the haplotype and genomic region of the target read. The proposed formulation is more compatible with the de novo setting where the number of read clusters can vary.
4 Hardness Result
Theorem 2.
Problem 1 is NP-hard.
Our proof technique is inspired by the technique used by Feige [4] to prove the hardness of the H2S problem. However, several arguments differ in our proof because we seek only one cluster in Problem 1 whereas the H2S problem seeks two clusters. Rather than proving the hardness of Problem 1 directly, we consider a related problem that slightly generalizes it.
Problem 3.
Given a set $V$ of vectors in $\{+1, -1\}^d$ and a positive integer $k$, compute a subset $S \subseteq V$ of cardinality at least $k$ that maximizes the $\ell_1$ norm of the vector sum, i.e., $\big\lVert \sum_{v \in S} v \big\rVert_1$.
In the Appendix (Lemma 7), we argue that Theorem 2 follows if Problem 3 is NP-hard. Our reduction to prove the hardness of Problem 3 uses Hadamard codes. A Hadamard code $H_n$ of dimension $n$ is a collection of $n$ vectors in $\{+1, -1\}^n$ such that every two vectors are orthogonal. There are well-known polynomial-time algorithms to construct $H_n$ recursively when $n$ is a power of two [19]. Therefore, we assume that $n$ is a power of two. In the following lemma, we restate a useful property of Hadamard codes from [4].
Lemma 4.
Consider an arbitrary set of $q$ distinct vectors from the Hadamard code $H_n$. The $\ell_1$ norm of their sum is at most $n\sqrt{q}$.
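For intuition, this bound follows from pairwise orthogonality together with the Cauchy–Schwarz inequality; the short argument below restates the reasoning from [4] in the notation used here, with $q$ distinct codewords $u_1, \ldots, u_q \in \{+1,-1\}^n$.

```latex
\[
  s \;=\; \sum_{i=1}^{q} u_i, \qquad
  \lVert s \rVert_2^2 \;=\; \sum_{i=1}^{q} \lVert u_i \rVert_2^2 \;=\; q\,n
  \quad\text{(by pairwise orthogonality)},
\]
\[
  \lVert s \rVert_1 \;\le\; \sqrt{n}\,\lVert s \rVert_2
  \;=\; \sqrt{n}\cdot\sqrt{q\,n} \;=\; n\sqrt{q}
  \quad\text{(by Cauchy--Schwarz in dimension } n\text{)}.
\]
```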
Lemma 5.
Problem 3 is NP-hard.
Proof.
We reduce from the MAX-CUT problem. In the MAX-CUT problem, one is given an undirected graph, say $G = (U, E)$, and is asked to find a subset $A \subseteq U$ of vertices that maximises the number of edges between $A$ and its complement $\bar{A} = U \setminus A$.
We will convert $G$ into an instance of Problem 3. Let us assign an arbitrary orientation to each edge of $G$. We will assume an integer parameter $n$, a power of two, denoting the dimension of the Hadamard code used below (we will later find an appropriate value of $n$ which is polynomial in the size of the graph).
We build a set $V$ comprising $n \cdot |U|$ vectors of dimension $|E| \cdot n$. For every vertex $u$ of $G$, we introduce $n$ vectors, each of length $|E| \cdot n$ (see Figure 2 for an illustration). Each vector is divided into $|E|$ blocks of length $n$ each. In each of the $n$ vectors of vertex $u$, in every block associated with an edge $e$:
1. If $u$ is the head of edge $e$, then all entries of the block are $+1$.
2. If $u$ is the tail of edge $e$, then all entries of the block are $-1$.
3. If $u$ is not incident with edge $e$, then the entries of the block in the $j$-th vector of $u$ (for $1 \le j \le n$) are the $j$-th vector of the Hadamard code $H_n$.
Lastly, we set the cardinality parameter of Problem 3 to $k = n\,|U|/2$.
Yes instances.
Suppose that $G$ admits an optimal solution $(A, \bar{A})$ having $k_1$ edges between $A$ and $\bar{A}$. WLOG, we can assume that $|A| \ge |U|/2$, and hence $n\,|A| \ge k$. Consider one (possibly sub-optimal) solution to the instance of Problem 3 that selects all vectors from $V$ that correspond to vertices in $A$. Let us denote the set of these vectors as $S$.
A lower bound on the $\ell_1$ norm of the vector sum, i.e., $\lVert \sum_{v \in S} v \rVert_1$, can be obtained as follows. Recall that each vector in $V$ has $|E|$ blocks. Consider a block associated with an edge $e$ that is in the cut. There is exactly one endpoint of this edge in $A$, and all the $n$ vectors corresponding to this endpoint (vertex) have monochromatic blocks of length $n$ at the position of $e$, which together contribute a total of $n^2$ to the norm. This can be partially offset by the blocks of the other selected vertices. But the blocks of each vertex not incident with $e$ can offset at most $n\sqrt{n} = n^{3/2}$ of the norm (using Lemma 4). For an edge $e$, there are at most $|U| - 2$ vertices not incident with $e$. This makes the total contribution of the blocks corresponding to edge $e$ to the norm at least $n^2 - (|U|-2)\,n^{3/2}$. As there are $k_1$ edges in the cut, the value of this solution is at least $k_1 \big(n^2 - (|U|-2)\,n^{3/2}\big)$. (The value is actually higher because the blocks associated with the edges not in the cut also contribute to the norm, but we ignore this further tightening of the bound.)
No instances.
Suppose that $G$ admits an optimal solution with $k_2$ edges in the cut. Consider an arbitrary subset of vectors from $V$ with cardinality at least $k$. Call this subset of vectors $S$. This selection corresponds to a fractional partition of $U$ where the extent to which vertex $u$ is in partition $A$ is equal to the fraction of its vectors in $S$. Let $\alpha_u$ denote this extent for vertex $u$. Similarly, $1 - \alpha_u$ is the extent to which the vertex is in partition $\bar{A}$. For an edge $e = (u, w)$, the extent to which it is cut is $|\alpha_u - \alpha_w|$.
For an arbitrary edge $e = (u, w)$, the monochromatic blocks associated with its two endpoints together contribute at most $n^2 \, |\alpha_u - \alpha_w|$ to the norm of the block of $e$ in the vector sum. The blocks associated with vertices that are not endpoints of $e$ each contribute at most $n^{3/2}$ towards the norm of this block. Summing up over all edges and all blocks, the value of any solution is at most $n^2 \sum_{(u,w) \in E} |\alpha_u - \alpha_w| + |E|\,(|U|-2)\,n^{3/2}$. Further, $\sum_{(u,w) \in E} |\alpha_u - \alpha_w| \le k_2$. This is because it is possible to change a fractional cut into an integer cut which is at least as large (see Lemma 13 in Appendix).
Summary.
If we subtract the upper bound for no instances from the lower bound for yes instances, it follows that the yes instance leads to a higher value than the no instance if
$k_1 \big(n^2 - (|U|-2)\,n^{3/2}\big) \;>\; k_2\,n^2 + |E|\,(|U|-2)\,n^{3/2}$, or equivalently, $(k_1 - k_2)\,\sqrt{n} \;>\; (k_1 + |E|)\,(|U|-2)$.
Taking $n > 4\,|E|^2\,|U|^2$ (rounded up to a power of two) suffices, since $k_1 - k_2 \ge 1$ and $k_1 \le |E|$.
5 Algorithm and Implementation
This section outlines the details of our error-correction algorithm, HALE. Since computing an exact MSA and solving Problem 1 are both NP-hard, we incorporated heuristics in our implementation. At the preprocessing stage, we use Minimap2 [7] to compute overlaps among all reads. Thus, for each read, we have a set of its pairwise alignments with other reads. We perform error correction of each read independently. In the following, we refer to the read being corrected as the target read.
Step 1: Constructing multiple sequence alignments
To construct MSAs, we adapted code from the Herro software [18]. We briefly summarize the approach here. Herro employs the well-known star alignment heuristic, where the target read is assumed to be the “center of the star”. The MSA is constructed progressively, i.e., adding each overlapping read to the MSA one by one using its precomputed pairwise alignment with the target read. For further details on the star alignment heuristic, see [17].
Once the full MSA is constructed, it is divided into contiguous, non-overlapping windows of fixed length (Figure 1). Each window represents an MSA of a substring of the target read and its overlapping substrings. Henceforth, each windowed MSA is processed independently. Within each window, alignments from overlapping reads which do not fully span the window are ignored (Figure 1). We use the same window length as Herro.
If there are fewer than two alignments from overlapping reads in a window, which can happen occasionally in regions with low coverage, then Herro chops the target read and removes the corresponding substring instead of leaving it uncorrected. We do the same in HALE. Note that this procedure may split the target read. The rationale is that the uncorrected portions of a read may negatively affect the quality of genome assembly, hence they are removed.
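As an illustration of the windowing step, the sketch below splits an MSA, represented as equal-length aligned strings with spaces marking positions outside a read's overlap, into fixed-length windows. The representation, function name, and window length are our own simplifications, not HALE's actual interface.

```python
def split_into_windows(target_row, other_rows, window_len=4096):
    """Split an MSA into contiguous, non-overlapping windows of fixed length.
    Rows that do not fully span a window are dropped from that window."""
    windows = []
    for start in range(0, len(target_row), window_len):
        end = min(start + window_len, len(target_row))
        # Keep only overlapping rows that cover every column of this window.
        spanning = [row[start:end] for row in other_rows
                    if ' ' not in row[start:end]]
        # Windows with fewer than two spanning rows lead to chopping of the
        # corresponding target substring (see the discussion above).
        windows.append((target_row[start:end], spanning))
    return windows
```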
Step 2: Sampling overlaps based on sequence identity
The number of overlapping substrings within a window is influenced by the sequencing coverage of the loci linked to the target read and how repetitive those regions are within the genome. To maintain computational efficiency of the subsequent steps, we retain only the top-$N$ overlapping substrings in each MSA. Preference is given to those overlapping substrings that have higher sequence identity with the target read (given a pairwise alignment of two sequences, sequence identity is defined as the number of matching bases over the number of alignment columns). If fewer than $N$ overlaps are available, we retain all of them. In HALE, $N$ is set to a fixed default value.
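A minimal sketch of this sampling step follows, assuming each overlap carries precomputed counts of matching bases and alignment columns; the field names and the function names are placeholders, not HALE's actual data structures.

```python
def sequence_identity(num_matches, num_alignment_columns):
    """Sequence identity of a pairwise alignment:
    number of matching bases divided by the number of alignment columns."""
    return num_matches / num_alignment_columns

def sample_top_overlaps(overlaps, n_keep):
    """Retain the n_keep overlapping substrings with the highest identity to the
    target read; if fewer than n_keep are available, keep all of them."""
    ranked = sorted(
        overlaps,
        key=lambda ov: sequence_identity(ov["matches"], ov["columns"]),
        reverse=True,
    )
    return ranked[:n_keep]
```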
Step 3: Sampling informative alignment columns
Following the approach of Herro [18] and Hifiasm [2], we identify informative alignment columns in the MSA. A column is considered informative if at least two distinct characters from the set {A, C, G, T, -} occur in that column, each with at least a minimum frequency threshold. See Figure 3 for an illustration. Only a small fraction of alignment columns are informative in practice. Informative columns typically indicate either biological variation among the aligned substrings or systematic sequencing errors, e.g., homopolymer indel errors. In contrast, non-informative columns are those where the alignments largely agree. These columns have limited value for identifying true overlaps in Problem 1. Thus, we remove all non-informative columns from the MSA and use the modified MSA as input to Problem 1.
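The column classification can be sketched as follows; the count-based threshold `min_count` is a placeholder, not HALE's actual default.

```python
from collections import Counter

def informative_columns(msa_rows, min_count=3):
    """Return indices of columns in which at least two distinct characters
    from {A, C, G, T, -} each occur at least min_count times."""
    alphabet = set("ACGT-")
    num_cols = len(msa_rows[0])
    informative = []
    for j in range(num_cols):
        counts = Counter(row[j] for row in msa_rows if row[j] in alphabet)
        frequent = [ch for ch, cnt in counts.items() if cnt >= min_count]
        if len(frequent) >= 2:
            informative.append(j)
    return informative
```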
Step 4: Solving Problem 1
A polynomial-time algorithm is unlikely for Problem 1 because we have established its hardness in Section 4. We use a direct brute-force approach to find an exact solution. In the MSA input to Problem 1, we have a vector $t$ (representing a substring of the target read) along with $N$ other vectors (from the overlapping reads), each restricted to the informative columns of the window. We solve Problem 1 by exhaustively evaluating all possible ways of selecting a subset of $k$ vectors from the total of $N$. For a given selection of $k$ vectors, an optimal value of the cluster center $c$ at each coordinate is the character that occurs most frequently at that position among the $k$ selected vectors and the target vector $t$. Although the number of candidate subsets grows as $\binom{N}{k}$, this brute-force solution is still practical due to the reduction in the size of the MSA at Steps 2 and 3.
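A minimal sketch of this exhaustive search is shown below; the helper names are ours, and the real implementation operates on the reduced, informative-column MSA and uses the weighted objective described later in this section.

```python
from collections import Counter
from itertools import combinations

def majority_center(rows):
    """Column-wise most frequent character over the given rows (ties broken arbitrarily)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]

def agreement(u, center):
    """Number of positions where u and the center hold the same character."""
    return sum(1 for a, b in zip(u, center) if a == b)

def solve_problem1(target, others, k):
    """Exhaustive search for Problem 1: try every subset of k overlapping rows;
    for a fixed subset, the optimal center is the column-wise majority over the
    subset together with the target row."""
    best = (-1, None, None)  # (score, center, selected subset)
    for subset in combinations(others, k):
        rows = [target, *subset]
        center = majority_center(rows)
        score = sum(agreement(r, center) for r in rows)
        if score > best[0]:
            best = (score, center, subset)
    return best
```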
The default value of parameter $k$ in HALE is set to roughly one-third of the number of retained overlapping substrings (i.e., roughly $N/3$). This choice is based on our empirical observation that at least one-third of the overlapping substrings typically originate from the same haplotype as the target read (assuming a ploidy of at most two). Choosing an appropriate value for $k$ involves a trade-off: a significantly larger $k$ can lead to the selection of reads with false overlaps, while a much smaller $k$ leads to an insufficient number of reads to reliably compute a consensus.
We introduce an additional optimization. Note that Problem 1 treats all MSA columns with equal importance. To further improve the quality of results, we solve a weighted version of Problem 1, defined below. Suppose that a weight function assigns a non-negative weight $w_j$ to each column $j$, and a function $\mathrm{match}(a, b)$ returns $1$ if characters $a$ and $b$ match, and $0$ otherwise.
Problem 6.
Given a vector $t$ in $\Sigma^{\ell}$, a set $V$ of vectors in $\Sigma^{\ell}$, and a positive integer $k$, compute a subset $S \subseteq V$ of cardinality $k$ and a center $c$ in $\Sigma^{\ell}$ such that $\sum_{j=1}^{\ell} w_j \cdot \mathrm{match}(t[j], c[j]) + \sum_{v \in S} \sum_{j=1}^{\ell} w_j \cdot \mathrm{match}(v[j], c[j])$ is maximum.
Our motivation to introduce the weights is as follows. A majority of errors in long reads are insertions and deletions, especially in homopolymer regions [11]. Accordingly, informative columns where the most frequent and the second most frequent characters are both from {A, C, G, T} are likely to represent true biological variation. We assign a higher weight to such columns. In contrast, other informative columns where either the most frequent or the second most frequent character is the gap symbol “-” may correspond to either true indel variation or sequencing artifacts. We assign a lower weight to these columns. The choice of these weight parameters was based on our empirical observations. Developing a more principled approach for determining the weight parameters is left for future work.
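A sketch of the column weighting is shown below; the numeric weight values are placeholders, not HALE's actual defaults, and the function names are ours.

```python
from collections import Counter

def column_weight(column_chars, w_base=1.0, w_gap=0.5):
    """Weight for one informative column: higher when the two most frequent
    characters are both bases, lower when a gap symbol is among them."""
    top_two = [ch for ch, _ in Counter(column_chars).most_common(2)]
    return w_base if all(ch in "ACGT" for ch in top_two) else w_gap

def weighted_agreement(u, center, weights):
    """Weighted agreement score used in the objective of Problem 6."""
    return sum(w for a, b, w in zip(u, center, weights) if a == b)
```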
Step 5: Correcting the target read
Recall that before sampling alignment columns in the MSA (Step 3), we had a substring of the target read aligned with overlapping substrings from other reads, forming an MSA whose length is determined by the window length parameter. The alignment of the substring from the target read corresponds to a vector $t$ over the alphabet $\Sigma$. We update this vector as follows. For each non-informative column, where biological variation is not expected, we replace the character with the most frequent character in that column, which is equivalent to taking a majority vote. In all informative columns, we use the optimal cluster center $c$, as determined by solving Problem 1, to update the characters. Finally, we generate the corrected version of the target read substring by removing any gap symbols (“-”) from the updated vector.
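The per-window correction can be sketched as follows, assuming `center_at` maps each informative column index to the corresponding character of the optimal cluster center; the names and the MSA representation are illustrative simplifications.

```python
from collections import Counter

def correct_window(target_row, other_rows, center_at):
    """Correct the target substring within one MSA window: use the Problem 1
    center on informative columns, a plain majority vote elsewhere, and
    finally strip gap symbols."""
    corrected = []
    for j, ch in enumerate(target_row):
        if j in center_at:                       # informative column
            corrected.append(center_at[j])
        else:                                    # non-informative column
            column = [ch] + [row[j] for row in other_rows]
            corrected.append(Counter(column).most_common(1)[0][0])
    return "".join(c for c in corrected if c != "-")
```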
6 Results
We evaluated the performance of HALE using publicly-available PacBio HiFi sequencing datasets. The commands and software versions to reproduce the results below are available in Appendix Tables 3 and 4. The HALE source code is accessible at https://github.com/at-cg/HALE.
Datasets
We utilized two sequencing datasets, one from a diploid HG002 human sample (https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/scratch/HG002/sequencing/hifirevio/), and the other from a haploid Col-0 Arabidopsis thaliana sample (https://ngdc.cncb.ac.cn/gsa/browse/CRA004538/CRR302668). Following the benchmarking approach used in Herro, we restricted our evaluation to reads above a minimum length cutoff; shorter reads were removed.
To reduce the runtime of the experiments involving the human dataset, we sampled reads from chromosome 9 of the HG002 genome. Chromosome 9 contains a high abundance of satellite DNA sequences (long arrays of tandem repeats), which makes it one of the more challenging chromosomes to assemble [14]. To isolate reads from chromosome 9, we aligned all reads to the HG002v1.1 assembly (GCA_018852605.3, GCA_018852615.3; https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.1.fasta.gz) and retained those reads whose primary alignments mapped to either the paternal or maternal haplotype of chromosome 9. To evaluate the impact of sequencing coverage on error correction performance, we created two versions of the HG002 and A. thaliana datasets, one with 40× coverage and the other with 60× coverage, by randomly sampling reads.
Evaluated algorithms
We compared HALE against two state-of-the-art read correction tools, Herro [18] and Hifiasm [2]. Herro is a deep learning method that was primarily developed for correcting nanopore sequencing reads. Since a pretrained model for HiFi reads is not available in Herro, we used its existing model trained on Oxford Nanopore (R10) data. We assumed that the ONT-trained model generalizes to HiFi reads because HiFi reads are relatively more accurate and exhibit fewer systematic errors.
In addition to these baseline methods, we also compared HALE with two simplified correction methods. As a reminder, HALE performs a majority vote on non-informative alignment columns while using a more sophisticated technique based on solving Problem 1 in the informative alignment columns (Section 5). The naive methods employ the same heuristics described in Section 5, but bypass solving Problem 1 (Step 4).
In the first naive approach (Naive-1), a majority vote is performed across all alignment columns, including informative ones, during Step 5. In the second naive approach (Naive-2), the original values in the aligned target read are preserved at informative columns, while a majority vote is applied only to non-informative columns. This comparison allows us to isolate the benefit of solving Problem 1 as an optimization step in HALE.
Evaluation metrics
For real sequencing datasets, there is no ground truth available on the errors made by the sequencing instrument. Thus, for both the HG002 and A. thaliana datasets, we rely on high-quality references to which the reads can be mapped. For the HG002 datasets, we use the HG002v1.1 assembly (GCA_018852605.3, GCA_018852615.3) released by the Telomere-to-Telomere consortium. For the A. thaliana datasets, we use a high-quality assembly (https://ngdc.cncb.ac.cn/gwh/Assembly/21820/show) from [21].
To evaluate read correction performance, we aligned the raw reads as well as the reads corrected by the different methods to their respective references using Minimap2. If a read aligns to multiple regions, then only its primary alignment was considered. We used BamConcordance [22] to examine the alignments. BamConcordance generates an empirical log-scaled measure of base error probabilities (Q-concordance) for each read. The formula used for calculating Q-concordance is $Q = -10 \log_{10}(1 - \mathrm{identity})$, where identity refers to the sequence identity observed in the read alignment. As an example, a Q-concordance of 30 corresponds to a sequence identity of 99.9%, indicating that 0.1% of the bases are erroneous. In addition to the Q-concordance statistics, we also report the percentage of reads containing more than $x$ mismatch (similarly, indel) errors, for $x \in \{0, 1, 2\}$, before and after correction.
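In code, the conversion from alignment identity to Q-concordance looks as follows; the cap applied to perfect alignments is our own illustrative choice, not part of BamConcordance's specification.

```python
import math

def q_concordance(matches, alignment_columns, cap=60.0):
    """Phred-scaled concordance: Q = -10 * log10(1 - identity).
    Perfect alignments (identity == 1) are capped at `cap`."""
    identity = matches / alignment_columns
    if identity >= 1.0:
        return cap
    return -10.0 * math.log10(1.0 - identity)

# Example: 999 matches over 1000 alignment columns -> identity 0.999 -> Q = 30.
print(q_concordance(999, 1000))
```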
As noted earlier in Section 5, the approach used in HALE and Herro discards a segment of a read if it cannot be reliably corrected due to insufficient overlapping read support. For a fair comparison, we evaluated only those reads that were fully corrected (i.e., not chopped) by all methods. Without this, HALE and Herro would have an unfair advantage over Hifiasm.
Error correction performance
We show a comparison of the accuracy of raw reads and those corrected by HALE, Herro, and Hifiasm, respectively, in Figure 4a. HALE achieves accuracy comparable to Herro and Hifiasm. All three correction methods led to more than a 10-fold improvement in read accuracy relative to the raw, uncorrected reads.
Next, we separately assessed mismatch and indel errors in the corrected reads. In Figure 4b, we highlight that the fraction of HG002 chr9 reads with one or more mismatch errors drops significantly after error correction. This result suggests that all three methods are effective in preserving biological variation in the reads. At 40× sequencing depth, HALE exhibits a marginally higher percentage of reads containing at least one mismatch error compared to Herro and Hifiasm. However, this difference diminishes at 60× depth. We also tested HALE using reads from HG002 chromosomes 13 and 18, observing a consistent trend (Appendix Figure 6).
In the case of the A. thaliana dataset, which is from a haploid genome sample, HALE is again comparable to the best-performing method (Figures 4d, 4e). Herro lags slightly behind HALE and Hifiasm here, possibly because its model was trained using a diploid sample and nanopore reads [18].
We also manually inspected and visualized in IGV [16] a few reads where HALE failed to correct one or more errors. In some cases, we found that uneven sequencing coverage in some genomic regions resulted in MSA windows where fewer than $k$ true overlaps were present. As a result, our algorithm, constrained by the fixed cutoff of $k$ overlaps, selected substrings from false overlaps for correction.
Comparison with naive algorithms
Solving Problem 1 (Section 3) is an important component of HALE as it enables us to preserve the haplotype-specific variation in reads. We compared HALE with two simplified versions of our algorithm, Naive-1 and Naive-2, respectively, which omit this optimization step.
Comparison with Naive-1.
On both the HG002 and A. thaliana datasets, the Naive-1 algorithm achieves Q-concordance scores comparable to HALE (Figure 5a). These results, especially on the HG002 diploid datasets, surprised us initially. Recall that the Naive-1 algorithm applies the majority vote on all alignment columns, irrespective of whether they are informative or not. We further assessed the mismatch and indel error statistics. We show that HALE provides a significant improvement over the Naive-1 algorithm in correcting mismatch errors (Figure 5b). This confirms that a direct consensus approach eliminates haplotype variation, whereas our optimization based on solving Problem 1 helps to preserve it.
Despite this significant advantage in mismatch statistics, the Q-concordance scores in Figure 5a do not reflect a proportionate gain. This is because indel errors are the dominant error type in HiFi reads. As shown in Figure 5c, HALE performs comparably to Naive-1 when it comes to correcting indel errors.
Using the haploid A. thaliana dataset, the results are as expected. The Naive-1 algorithm performs on par with HALE in terms of correcting mismatch errors (Figure 5d), and even slightly surpasses HALE when it comes to correcting indel errors (Figure 5e). A direct consensus-based approach works well on haploid datasets.
Comparison with Naive-2.
We see a drop in the Q-concordance scores using the Naive-2 algorithm on both the HG002 and A. thaliana datasets. This is because the Naive-2 algorithm makes no modification to a target read at informative alignment columns. In practice, a small fraction of alignment columns are classified as informative due to multiple sequencing errors occurring at the same coordinate of the target read and its overlapping reads. In such cases, Naive-2 leaves the erroneous base of the target read uncorrected. As a result, we see a much larger number of indel errors with Naive-2 (Figures 5c, 5e).
Runtime comparison
Runtime comparisons between HALE, Herro, and Hifiasm are challenging for two main reasons. First, Hifiasm uses its own custom all-vs-all read overlapping algorithm that is tightly integrated inside its code, whereas HALE and Herro use an external tool, Minimap2 [7]. Second, Herro leverages GPU acceleration, whereas both HALE and Hifiasm run on CPUs. In the following, we report the wall-clock runtimes of the all-vs-all read overlapping and error correction steps for HALE and Herro. For Hifiasm, we report the total end-to-end assembly time. We report the results using the HG002 chr9 dataset.
Minimap2 computed the all-vs-all overlaps using 64 CPU threads, and HALE performed error correction using the same number of threads. In comparison, Herro was run on a server with four NVIDIA A100 GPUs, and Hifiasm completed the full assembly using 64 CPU threads, with a substantial fraction of that time spent on its read overlapping and error correction steps. Based on these results, we conclude that HALE is computationally more efficient than Herro. Hifiasm is the fastest overall due to its highly optimized read overlapping implementation. As all-vs-all overlap computation is the runtime bottleneck when using HALE, a faster read overlapper would reduce its overall runtime.
7 Discussion
Genome assembly tools often require a significant software engineering effort and rely heavily on heuristics that are often undocumented. It is important to investigate alternative algorithms that are grounded in solid theoretical principles. Error correction is one of the critical steps in genome assembly, especially in the context of diploid genomes, polyploid genomes, and metagenomes.
In this work, we proposed the first rigorous formulation for haplotype-aware error correction of long reads. The formulation is partly inspired by the bi-clustering formulation introduced in [5]. An extension of the same optimization framework is also used in reference-guided haplotype phasing tools such as WhatsHap [15]. Our proposed formulation is designed to identify true overlaps of a read from a given set of overlaps. Considering that the problem is NP-hard, we also discussed practical heuristics to make our algorithm scalable. We showcased the effectiveness of our algorithm using publicly available long-read datasets from a haploid A. thaliana genome and a diploid human genome. We showed that HALE achieves accuracy on par with state-of-the-art tools, including Herro (a deep learning-based method) and Hifiasm (which relies on intricate heuristics). Lastly, we isolated the benefit of using Problem 1 (Section 3) by comparing HALE with the other simplified versions of our algorithm.
A number of improvements are possible to the proposed optimization framework as well as our implementation. One limitation of our formulation is that we independently process windows of an MSA (Figure 1) rather than processing the entire MSA at once. This can be problematic when a window is particularly noisy or lacks sufficient haplotype-specific variation, making it difficult to select the correct subset of overlapping reads. Future work could explore alternative formulations that exploit the full length of long reads. Another limitation is our reliance on parameters such as $k$ (the number of overlapping reads to be selected) and the MSA window length. In this work, we primarily relied on our empirical observations to set these parameters. Developing either parameter-free approaches or a principled way to select these parameters which generalizes across haploid, diploid, and polyploid genomes would be an important next step.
Regarding the implementation of heuristics in HALE, the use of superior all-vs-all read aligners and MSA heuristics may further enhance read accuracy. Furthermore, it may be possible to improve the selection of informative alignment columns in an MSA such that a larger fraction of informative columns correspond to biological variation rather than sequencing errors. For example, indel variation in homopolymer regions is more likely to be a sequencing error than a biological variation [3].
While HALE is theoretically promising, it is not yet a practical replacement for existing state-of-the-art tools. In the future, we hope to continue improving HALE. We also hope to extend its applicability to nanopore sequencing reads, which have higher error rates than HiFi reads.
References
- [1] Dmitry Antipov, Mikko Rautiainen, Sergey Nurk, Brian P Walenz, Steven J Solar, Adam M Phillippy, and Sergey Koren. Verkko2 integrates proximity-ligation data with long-read de Bruijn graphs for efficient telomere-to-telomere genome assembly, phasing, and scaffolding. Genome Research, 2025.
- [2] Haoyu Cheng, Gregory T. Concepcion, Xiaowen Feng, et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods, 18:170–175, 2021.
- [3] Haoyu Cheng, Han Qu, Sean McKenzie, Katherine R. Lawrence, Rhydian Windsor, Mike Vella, Peter J. Park, and Heng Li. Efficient near telomere-to-telomere assembly of nanopore simplex reads. bioRxiv, 2025.
- [4] Uriel Feige. NP-hardness of hypercube 2-segmentation. arXiv preprint arXiv:1411.0821, 2014. arXiv:1411.0821.
- [5] Jon Kleinberg, Christos Papadimitriou, and Prabhakar Raghavan. Segmentation problems. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 473–482, 1998. doi:10.1145/276698.276860.
- [6] Christopher Lee, Catherine Grasso, and Mark F Sharlow. Multiple sequence alignment using partial order graphs. Bioinformatics, 18(3):452–464, 2002. doi:10.1093/BIOINFORMATICS/18.3.452.
- [7] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 2018. doi:10.1093/BIOINFORMATICS/BTY191.
- [8] Heng Li and Richard Durbin. Genome assembly in the telomere-to-telomere era. Nature Reviews Genetics, 25(9):658–670, 2024.
- [9] Ross Lippert, Russell Schwartz, Giuseppe Lancia, and Sorin Istrail. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in bioinformatics, 3(1):23–31, 2002. doi:10.1093/BIB/3.1.23.
- [10] Yuansheng Liu, Yichen Li, Enlian Chen, Jialu Xu, Wenhai Zhang, Xiangxiang Zeng, and Xiao Luo. Repeat and haplotype aware error correction in nanopore sequencing reads with dechat. Communications Biology, 7:1678, 2024.
- [11] Glennis A Logsdon, Mitchell R Vollger, and Evan E Eichler. Long-read human genome sequencing and its applications. Nature Reviews Genetics, 21(10):597–614, 2020.
- [12] Eugene W Myers. The fragment assembly string graph. Bioinformatics, 21(suppl_2):ii79–ii85, 2005.
- [13] Fan Nie, Peng Ni, Yadong Huang, Yujie Zhang, Yadong Wang, and Jun Wang. De novo diploid genome assembly using long noisy reads. Nature Communications, 15:2964, 2024.
- [14] Sergey Nurk, Adam M. Phillippy, et al. The complete sequence of a human genome. Science (New York, N.Y.), 376(6588):44–53, 2022.
- [15] Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo Van Iersel, Leen Stougie, Gunnar W Klau, and Alexander Schönhuth. Whatshap: weighted haplotype assembly for future-generation sequencing reads. Journal of Computational Biology, 22(6):498–509, 2015. doi:10.1089/CMB.2014.0157.
- [16] James T Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S Lander, Gad Getz, and Jill P Mesirov. Integrative genomics viewer. Nature biotechnology, 29(1):24–26, 2011.
- [17] Joao Carlos Setubal and Joao Meidanis. Introduction to computational molecular biology. PWS Pub. Boston, 1997.
- [18] Dominik Stanojević, Dehui Lin, Sergey Nurk, Paola Florez de Sessions, and Mile Šikić. Telomere-to-telomere phased genome assembly using herro-corrected simplex nanopore reads. bioRxiv, 2024.
- [19] James Joseph Sylvester. LX. Thoughts on inverse orthogonal matrices, simultaneous sign-successions, and tessellated pavements in two or more colours, with applications to Newton's rule, ornamental tile-work, and the theory of numbers. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 34(232):461–475, 1867.
- [20] German Tischler-Höhle. Haplotype and repeat separation in long reads. In International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, pages 103–114. Springer, 2017. doi:10.1007/978-3-030-14160-8_11.
- [21] Bo Wang, Xiaofei Yang, Yanyan Jia, Yu Xu, Peng Jia, Ningxin Dang, Songbo Wang, Tun Xu, Xixi Zhao, Shenghan Gao, et al. High-quality arabidopsis thaliana genome assembly with nanopore and hifi long reads. Genomics, proteomics & bioinformatics, 20(1):4–13, 2022. doi:10.1016/J.GPB.2021.08.003.
- [22] Aaron M Wenger, Paul Peluso, William J Rowell, Pi-Chuan Chang, Richard J Hall, Gregory T Concepcion, Jana Ebler, Arkarachai Fungtammasan, Alexey Kolesnikov, Nathan D Olson, and others. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nature biotechnology, 37(10):1155–1162, 2019.
- [23] H. Zhang, C. Jain, and S. Aluru. A comprehensive evaluation of long read error correction methods. BMC Genomics, 21:1–5, 2020.
Appendix
Appendix A Hardness Result: Supplementary
Lemma 7. If Problem 3 is NP-hard, then Problem 1 is NP-hard (Theorem 2).
Proof.
Our proof uses a chain of arguments. We begin by defining Problem 8, a simplified version of Problem 1 where we have a single term in the objective function and a binary alphabet. Next, we introduce Problem 9, a modified version of Problem 3. To prove the lemma, we first prove Claim 10 to show that Problem 9 is NP-hard, then Claim 11 to establish polynomial-time equivalence between Problem 9 and Problem 8. Finally, we complete the proof by proving Claim 12.
Problem 8.
Given a set $V$ of vectors in $\{0,1\}^d$ and a positive integer $k$, compute a subset $S \subseteq V$ of cardinality $k$ and a center $c$ in $\{0,1\}^d$ such that $\sum_{v \in S} \mathrm{agree}(v, c)$ is maximum.
Problem 9.
Given a set $V$ of vectors in $\{+1, -1\}^d$ and a positive integer $k$, compute a subset $S \subseteq V$ of cardinality exactly $k$ that maximizes the $\ell_1$ norm of the vector sum, i.e., $\big\lVert \sum_{v \in S} v \big\rVert_1$.
Claim 10. Problem 9 is NP-hard if Problem 3 is NP-hard.
Proof.
Problem 9 differs slightly from Problem 3 because we seek a subset of size exactly $k$ in Problem 9, whereas we seek a subset of size at least $k$ in Problem 3. We prove Claim 10 using a simple polynomial-time reduction from Problem 3 to Problem 9. Given an instance of Problem 3 with a set $V$ of $n$ vectors and parameter $k$, we construct $n - k + 1$ instances of Problem 9 using the same set of vectors and parameter $k'$, for all $k' \in \{k, k+1, \ldots, n\}$. The maximum of the solution values over all these instances of Problem 9 is the solution to Problem 3.
Claim 11. Problem 9 and Problem 8 are polynomial-time equivalent.
Proof.
Consider a solution $(S, c)$ for Problem 8, where $S$ is a set of $k$ vectors. Observe that the center $c$ must satisfy the property that $c[j]$ is a majority bit at coordinate $j$ among the vectors in $S$, where ties are broken arbitrarily. Let $\phi$ be the coordinate-wise transformation defined by $\phi(0) = -1$ and $\phi(1) = +1$. For any vector $v \in \{0,1\}^d$, let $\phi(v)$ denote its image in $\{+1,-1\}^d$. Let $\phi(S)$ be the set of vectors in $S$ with $\phi$ applied to each vector. The objective function of Problem 8 can be written as
$\sum_{v \in S} \mathrm{agree}(v, c) = \frac{1}{2}\Big(k\,d + \big\lVert \sum_{v \in S} \phi(v) \big\rVert_1\Big).$
Thus, maximizing the objective function of Problem 8 is equivalent to maximising $\big\lVert \sum_{u \in \phi(S)} u \big\rVert_1$.
Given a solution to Problem 8, i.e., a vector $c$ and a subset $S$ of $k$ vectors that maximizes $\sum_{v \in S} \mathrm{agree}(v, c)$, we obtain a solution for Problem 9 by applying $\phi$ component-wise to every vector in $S$, creating the set $\phi(S)$. By our earlier argument, $\big\lVert \sum_{u \in \phi(S)} u \big\rVert_1$ is maximised. Conversely, given a solution for Problem 9 of the form $\phi(S)$, we apply $\phi^{-1}$ to every vector in $\phi(S)$ to obtain a solution $S$ for Problem 8. The center vector $c$ is obtained by taking a majority over each coordinate, and the objective function of Problem 8 (sum of agreement with $c$) is maximized.
Claim 12. If Problem 8 is NP-hard, then Problem 1 is NP-hard.
Proof.
We prove this by a polynomial-time reduction from Problem 8 to Problem 1 defined on an alphabet of size two (since $\{0,1\}$ can be bijectively mapped to two characters of $\Sigma$). Given an instance of Problem 8 with set $V = \{v_1, v_2, \ldots, v_n\}$ and parameter $k$, construct $n$ instances of Problem 1, where the $i$-th instance is constructed by considering $v_i$ as the target vector $t$, set $V \setminus \{v_i\}$, and parameter $k - 1$. The maximum value solution over all $n$ instances of Problem 1 is the solution to Problem 8.
Lemma 13.
Given a fractional solution to the Max-Cut problem, there exists a rounding procedure that transforms it into an integral solution whose cut value is at least as large as that of the fractional solution.
Proof.
Let $\alpha_u \in [0, 1]$ be the given fractional assignment of vertex $u$, for each $u \in U$. The extent to which an edge $(u, w)$ is cut is given by $|\alpha_u - \alpha_w|$. Accordingly, the cut size of the given fractional partitioning is given by $\sum_{(u,w) \in E} |\alpha_u - \alpha_w|$. Next, consider the following procedure to convert the fractional assignment of vertices into an integral assignment. Let $\theta$ be a threshold chosen uniformly at random from the interval $[0, 1]$. Assign vertex $u$ to partition $A$ if $\alpha_u \ge \theta$ and assign it to partition $\bar{A}$ otherwise. For any edge $(u, w)$, the probability that it is cut is given by $|\alpha_u - \alpha_w|$ because this occurs precisely when $\min(\alpha_u, \alpha_w) < \theta \le \max(\alpha_u, \alpha_w)$. Hence, the expected value of the cut size of our integral solution is $\sum_{(u,w) \in E} |\alpha_u - \alpha_w|$, which is equal to the cut size of the fractional cut. This expectation guarantees the existence of at least one threshold $\theta$ for which the corresponding integral solution has value at least as large as the given fractional solution. Finally, observe that it suffices to check only a small number of candidate values for $\theta$: namely, each $\alpha_u$, as well as the endpoints $0$ and $1$. This is because the objective remains constant between any two consecutive values in the sorted sequence of the $\alpha_u$ values. Therefore, an optimal value of $\theta$ can be found in polynomial time by checking the solution at only these points.
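A small sketch of this derandomized rounding, checking every candidate threshold, is given below; the variable names are ours.

```python
def round_fractional_cut(alpha, edges):
    """Round a fractional cut (alpha[u] in [0, 1] per vertex) to an integral cut
    whose size is at least the fractional cut size, by trying every candidate
    threshold: each alpha value plus the endpoints 0 and 1."""
    def cut_size(in_A):
        return sum(1 for u, w in edges if in_A[u] != in_A[w])

    candidates = sorted(set(alpha.values()) | {0.0, 1.0})
    best_assignment, best_value = None, -1
    for theta in candidates:
        in_A = {u: alpha[u] >= theta for u in alpha}
        value = cut_size(in_A)
        if value > best_value:
            best_assignment, best_value = in_A, value
    return best_assignment, best_value
```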
Appendix B Additional Results
Table 1: Percentage of reads with more than 0, 1, or 2 mismatch or indel errors, before (Raw) and after correction by HALE, Herro, and Hifiasm.

| Dataset | Error type | # Errors | Raw | HALE | Herro | Hifiasm |
|---|---|---|---|---|---|---|
| HG002 chr9 40x | Mismatch | >0 | 31.3 | 3.7 | 1.6 | 1.3 |
| | | >1 | 22.7 | 1.9 | 0.7 | 0.6 |
| | | >2 | 17.6 | 1.4 | 0.6 | 0.6 |
| | Indel | >0 | 95.6 | 22.6 | 14.9 | 9.7 |
| | | >1 | 85.8 | 4.9 | 3.1 | 2.5 |
| | | >2 | 74.9 | 1.6 | 1.1 | 1.3 |
| HG002 chr9 60x | Mismatch | >0 | 33.4 | 1.9 | 1.2 | 1.0 |
| | | >1 | 24.3 | 0.5 | 0.2 | 0.1 |
| | | >2 | 19.1 | 0.3 | 0.0 | 0.0 |
| | Indel | >0 | 96.0 | 23.1 | 13.2 | 7.9 |
| | | >1 | 86.9 | 5.1 | 2.7 | 2.0 |
| | | >2 | 76.3 | 1.3 | 0.6 | 0.7 |
| A. thaliana 40x | Mismatch | >0 | 39.0 | 0.2 | 2.3 | 0.1 |
| | | >1 | 27.5 | 0.1 | 0.8 | 0.0 |
| | | >2 | 21.5 | 0.1 | 0.1 | 0.0 |
| | Indel | >0 | 97.4 | 11.5 | 18.3 | 5.4 |
| | | >1 | 91.5 | 1.7 | 3.0 | 0.5 |
| | | >2 | 84.3 | 0.4 | 0.6 | 0.1 |
| A. thaliana 60x | Mismatch | >0 | 39.0 | 0.2 | 1.9 | 0.2 |
| | | >1 | 27.5 | 0.1 | 0.6 | 0.1 |
| | | >2 | 21.6 | 0.1 | 0.1 | 0.0 |
| | Indel | >0 | 97.4 | 12.3 | 22.7 | 3.9 |
| | | >1 | 91.5 | 1.8 | 4.5 | 0.4 |
| | | >2 | 84.3 | 0.4 | 0.9 | 0.1 |
Table 2: Percentage of reads with more than 0, 1, or 2 mismatch or indel errors, before (Raw) and after correction by HALE and the two naive algorithms (Naive-1, Naive-2).

| Dataset | Error type | # Errors | Raw | HALE | Naive-1 | Naive-2 |
|---|---|---|---|---|---|---|
| HG002 chr9 40x | Mismatch | >0 | 31.3 | 3.8 | 22.3 | 7.8 |
| | | >1 | 22.7 | 2.0 | 13.9 | 4.0 |
| | | >2 | 17.6 | 1.5 | 10.2 | 1.9 |
| | Indel | >0 | 95.6 | 23.1 | 29.0 | 76.0 |
| | | >1 | 85.8 | 5.0 | 11.6 | 44.7 |
| | | >2 | 74.9 | 1.8 | 5.5 | 22.8 |
| HG002 chr9 60x | Mismatch | >0 | 33.4 | 1.8 | 8.9 | 5.7 |
| | | >1 | 24.3 | 0.7 | 3.5 | 2.7 |
| | | >2 | 19.1 | 0.4 | 2.0 | 0.9 |
| | Indel | >0 | 96.0 | 23.5 | 16.3 | 72.4 |
| | | >1 | 86.9 | 5.1 | 4.8 | 41.3 |
| | | >2 | 76.3 | 1.6 | 2.0 | 20.0 |
| A. thaliana 40x | Mismatch | >0 | 39.0 | 0.2 | 0.3 | 3.7 |
| | | >1 | 27.5 | 0.1 | 0.1 | 1.6 |
| | | >2 | 21.5 | 0.1 | 0.1 | 0.4 |
| | Indel | >0 | 97.4 | 11.4 | 6.7 | 68.8 |
| | | >1 | 91.4 | 1.8 | 1.0 | 40.2 |
| | | >2 | 84.3 | 0.5 | 0.4 | 21.9 |
| A. thaliana 60x | Mismatch | >0 | 39.0 | 0.2 | 0.2 | 3.3 |
| | | >1 | 27.5 | 0.1 | 0.1 | 1.4 |
| | | >2 | 21.6 | 0.1 | 0.1 | 0.3 |
| | Indel | >0 | 97.4 | 12.2 | 5.3 | 71.8 |
| | | >1 | 91.4 | 1.8 | 0.8 | 40.7 |
| | | >2 | 84.3 | 0.4 | 0.3 | 21.0 |
Appendix C Software Versions and Commands
Table 3: Software versions used in our experiments.

| Tool | Version |
|---|---|
| SeqKit | 2.5.1 |
| Minimap2 | 2.26-r1175 |
| Samtools | 1.20 |
| Hifiasm | 0.25.0-r726 |
| HALE | 1.0.0 |
| HERRO | Commit: 204d29e |
| Bamconcordance | Commit: dded267 |
Table 4: Commands used in our experiments.

| Step | Command |
|---|---|
| Samtools (extract chr9 reads) | samtools view -b hg002_hifi_reads.bam chr9_MATERNAL chr9_PATERNAL | samtools fastq - > chr9_hg002_hifi_reads.fastq |
| SeqKit (read sampling) | seqkit sample -p <proportion> <input_fastq> -o <output_fastq> |
| All-vs-all overlap | minimap2 -K8g -cx ava-ont -k25 -w17 -e200 -r150 -m2500 -z200 -f 0.005 -t 64 --dual=yes <reads> <reads> |
| HERRO inference | ./herro inference --read-alns <directory_alignment_batches> -t 8 -d 0,1,2,3 -m model_R10_v0.1.pt -b 64 <fastq.gz_input> <fasta_output> |
| Hifiasm | ./hifiasm -t 64 -o output_prefix.asm --write-paf --write-ec temp <fastq.gz_input> |
| HALE | ./hale inference --read-alns <directory_alignment_batches> -m "hale" -t 64 <fastq.gz_input> <fasta_output> |
| Naive-1 | ./hale inference --read-alns <directory_alignment_batches> -m "consensus" -t 64 <fastq.gz_input> <fasta_output> |
| Naive-2 | ./hale inference --read-alns <directory_alignment_batches> -m "pih" -t 64 <fastq.gz_input> <fasta_output> |
| Mapping reads to reference | minimap2 -ax map-hifi --eqx -t 64 --secondary=no <reference> <fasta_input> | samtools view -@ 64 -S -b | samtools sort -@ 64 -o <bam_output> && samtools index -@ 256 <bam_output> |
| BamConcordance (Using bam output from mapping step) | hg002-ccs/concordance/bamConcordance <fasta_input> <bam_output> <csv_output> |
