Safe Sequences via Dominators in DAGs for Path-Covering Problems
Abstract
A path-covering problem on a directed acyclic graph (DAG) requires finding a set of source-to-sink paths that cover all the nodes, all the arcs, or subsets thereof, and additionally they are optimal with respect to some function. In this paper we study safe sequences of nodes or arcs, namely sequences that appear in some path of every path cover of a DAG.
We show that safe sequences admit a simple characterization via cutnodes. Moreover, we establish a connection between maximal safe sequences and leaf-to-root paths in the source- and sink-dominator trees of the DAG, which may be of independent interest in the extensive literature on dominators. With dominator trees, safe sequences admit an -size representation and a linear-time output-sensitive enumeration algorithm running in time , where and are the number of nodes and arcs, respectively, and is the total length of the maximal safe sequences.
We then apply maximal safe sequences to simplify Integer Linear Programs (ILPs) for two path-covering problems, LeastSquares and MinPathError, which are at the core of RNA transcript assembly problems from bioinformatics. On various datasets, maximal safe sequences can be computed in under 0.1 seconds per graph, on average, and ILP solvers whose search space is reduced in this manner exhibit significant speed-ups. For example on graphs with a large width, average speed-ups are in the range for MinPathError and in the range for LeastSquares. Optimizing ILPs using safe sequences can thus become a fast building block of practical RNA transcript assembly tools, and more generally, of path-covering problems.
Keywords and phrases:
directed acyclic graph, path cover, dominator tree, integer linear programming, least squares, minimum path errorCopyright and License:
2012 ACM Subject Classification:
Theory of computation Graph algorithms analysis ; Theory of computation Integer programming ; Applied computing Sequencing and genotyping technologiesSupplementary Material:
Software (Source Code): https://github.com/algbio/safe-sequencesarchived at
swh:1:dir:0e6e4d5e013053c8ba853d335c750cff701e09fa
Acknowledgements:
We are extremely grateful to Andrey Prjibelski for producing the IsoQuant graphs and to Manuel Cáceres for fruitful discussions on safe sequences.Funding:
††margin:
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO), and from the Research Council of Finland grants No. 346968, 358744. Co-funded by the European Union (ERC, SCALEBIO, 101169716). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz HermanSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Data reduction, namely simplifying the input to a problem while maintaining equivalent solutions, has proved extremely successful in solving hard problems in both theory and practice [26]. While the most well-known approach for data reduction is kernelization [37, 18], there are also other ways to pre-process the input in order to obtain practical algorithms.
For example, for graph problems where a solution is a set of vertices with certain properties, Bumpus et al. [8] and Jansen and Verhaegh [29] considered the notion of -essential vertex as one belonging to all -approximate solutions to the problem. They showed that for some vertex-subset finding problems (e.g. Vertex Cover) and some values of , there exist algorithms with an exponential dependency only in the number of non--essential vertices.
In this paper we pursue a similar data reduction approach, via the notion of safe partial solution [52], defined as one “appearing” in all solutions to a problem. For solutions that are sets of paths satisfying certain properties, a safe path can be defined as one appearing as a subpath of some path of any solution. Intuitively, safety corresponds to -essentiality.
For an NP-hard problem, it is usually intractable also to compute all its safe partial solutions (see e.g. [15]). As such, in this paper we use the following property about safety. Let be the (unknown) set of solutions to a problem, and let be a superset of whose safe partial solutions can be computed efficiently. Then, if a partial solution appears in all solutions in , it also appears in all solutions in .
This property was used by Grigorjew et al. [25] when applying safety as a pre-processing step for the Minimum Flow Decomposition (MFD) problem. The MFD problem asks to decompose the flow in a directed acyclic graph (DAG) into a minimum number of weighted source-to-sink paths [4, 53]. As solution superset , they considered flow decompositions with any number of paths, whose safe paths can be computed in polynomial time [30, 31, 2]. Instead of simplifying the input graph, they used the safe paths to simplify an integer linear program (ILP) for the MFD problem from [17], thus reducing the search space of the ILP solver. This resulted in speed-ups of up to 200 times on the hardest instances.
In this paper we focus on problems whose input consists of an - DAG222By an - directed graph we mean one with a unique source and unique sink such that all nodes are reachable from and all nodes reach ; if the graph is acyclic, then it is an - DAG. Path-covering problems on DAGs typically require that paths either start in a source of the graph, or more generally, in a given set of starting nodes, and analogously end in a sink, or in a given set of terminal nodes. However, these start/end nodes can naturally be connected to and , and thus we can assume, without loss of generality, that our DAGs are - DAGs. where the solution superset contains all the sets of paths in from to that together cover all the nodes or arcs of (or given subsets thereof). We will refer to these objects simply as path covers. Note that a flow decomposition is also a path cover of the arcs carrying flow, but path covers generalize many other problems, especially those modeling practical problems from bioinformatics, see Section 1.2. Next, we present the contributions of the paper and further contextualization.
1.1 Optimal enumeration of maximal safe sequences via dominators
Safe sequences as a generalization of safe paths.
The notion of a safe path has been extensively used to capture contiguous safe partial solutions (see e.g. [52, 48, 15, 40, 45, 31, 11]). However, for our purposes contiguity is not necessary, but only the fact that the nodes/arcs in a safe partial solution appear in the same order in some path of every solution. For example, it may be that the only way to reach a node is via a node , with some complex subgraph between them. The sequence thus appears in some path of every path cover of the nodes.
We thus define a safe sequence with respect to the path covers of a DAG , either of the nodes or of the arcs, analogously to that of a safe path (see Figure 1 for an example). This notion is inspired by the work of Ma, Zheng and Kingsford [57], who proposed a problem (AND-Quant) and a max-flow-based algorithm deciding when a sequence of arcs appears in some path in every flow decomposition (with any number of paths) of a flow in a DAG. Here, we are particularly interested in characterizing but also in computing maximal safe sequences, i.e., safe sequences that are not a proper subsequence of another safe sequence.
Characterization of safe sequences.
Assume that the input DAG has nodes and arcs. We show that safe sequences admit a simple characterization in terms of cutnodes. A - cutnode is a node such that all paths from to (- paths) pass through it. We present our results for path covers of the nodes and refer to the extended version for the results for path covers of the arcs. Omitted proofs can be found in the extended version.
Theorem 1 (Characterization of safe sequences).
Let be an - DAG and be a sequence of nodes. The following statements are equivalent:
-
1.
is safe for path covers.
-
2.
There is a node such that every - path covering contains .
-
3.
There is a node such that is contained in the sequence obtained as the concatenation of the sequences of - and - cutnodes.
While outputting the sequences of and cutnodes for every node in the graph can be done in time per node, different nodes can thus be extended to the same maximal safe sequence, and some extensions may not even be maximal (see Figure 1). The challenge is to characterize those nodes identifying all and only maximal safe sequences, without duplicates.
Linear-time output-sensitive enumeration via dominators.
We show that we can identify such nodes in an elegant manner using dominators, which is a well-studied concept related to cutnodes (see e.g. the survey [23] for an in-depth contextualization of dominators). In particular, the -dominator tree of an - DAG has node set , and is the parent of iff all paths from to pass through , and moreover there is no node reachable from (except itself) whose removal makes unreachable from .
The key idea behind our results is to work also with the -dominator tree of , which is defined in an analogous manner, but considering reachabilities to . By analyzing the interplay between the two trees, we show that the nodes solving this problem (Theorem 9) are those that are leaves in both the - and -dominator trees, only after overcoming some technicalities arising from unitary paths (see Figure 1, Lemma 4 and Lemma 8). Since dominator tree leaves admit a characterization in terms of in- or out-neighborhoods on DAGs (Lemma 5), this leads to an -time algorithm outputting all and only maximal safe sequences without duplicates, independently of dominator trees:
Theorem 2.
Given an - DAG with nodes and arcs, we can compute all and only the maximal safe sequences of in -time, without constructing dominator trees.
To improve this time bound we resort to the fact that dominator trees admit an time construction [6, 7, 20], although so far some of these algorithms do not seem to be practical. See [24, 22, 14] for experimental studies on different dominator tree algorithms. Regardless, any dominator algorithm serves our purposes and its effect on the runtime of our algorithm is clear in the proof of the theorem below.
Theorem 3 (Enumeration of maximal safe sequences).
Let be an - DAG with nodes and arcs. The following hold:
-
1.
There is an time algorithm outputting the set of all maximal safe sequences with no duplicates, where denotes the total length of all the maximal safe sequences.
-
2.
There is an -size representation of all the maximal safe sequences of that can be built in time.
An analogous result to Theorem 3 can be obtained for path covers where only a given subset of nodes has to be covered. To achieve that, we have to analyze the dominator trees induced by the dominance relation restricted to . Moreover, analogous results can be obtained also for path covers of the arcs. The details can be found in the extended version.
1.2 Scaling path-covering ILPs for bioinformatics problems
The motivation of our work comes from bioinformatics, where many NP-hard graph problems related to sequencing data require finding a set of paths in a DAG satisfying certain properties. For example, in the RNA transcript assembly problem, we need to recover the RNA transcripts produced in various abundances by a gene, given only smaller fragments (i.e. RNA-seq reads) sequenced from the RNA transcripts. This problem is usually solved by first building an arc-weighted DAG from the fragments, i.e., a splice graph. The RNA transcripts correspond to a set of source-to-sink weighted paths in the DAG that “best explain” the arc-weights, under various definitions of optimality, see e.g. [36, 35, 56, 46, 55, 47, 50, 19, 17, 15].
In our work we consider two path-covering problems used for RNA transcript assembly, in which we use safe sequences to fix some variables of the corresponding ILPs. We assume that in every arc of the input graph there is an associated positive weight . We now briefly describe the ILPs.
We selected a classical least-squares model (which we call LeastSquares) that is at the core of several RNA assembly tools e.g. [34, 12, 21, 33, 38, 51]. This minimizes the sum of squared differences between the weight of an arc and the weights of the solution paths passing through the arc. The ILP formulation requires no additional constraints, and the ILP objective function is:333Note that the terms are not linear, but can be linearized using a simple technique, see [17, 16].
| (1) |
While this model is popular, it has quadratic terms in its objective function, which makes it harder to solve. Therefore, as second model (which we call MinPathError), we chose one that was recently introduced in [16]. Then, for paths to be a solution to the ILP, we require that the absolute difference between the weight of an arc and the weights of the solution paths passing through the arc must be below the sum of the errors (or slacks) of the paths passing through the arc. Formally, we add the constraint:444Note again that the terms and are not linear, but can be linearized using a simple technique, see [16].
| (2) |
with objective function: . By definition, the solution paths must cover every arc , since otherwise the above condition becomes , which contradicts the assumption that is positive everywhere. See [17] for further details.
On various datasets, maximal safe sequences can be computed in under 0.1 seconds per graph, on average, and such safety-optimized ILPs show significant speed-ups over the original ones. More specifically, on the harder instances of graphs with a large width, average speed-ups are in the range for MinPathError and in the range for LeastSquares. As such, our optimizations can become a scalable building block of practical RNA transcript assembly tools, but also of general path-covering problems.
2 Notation and preliminaries
Graphs.
In this paper we consider directed graphs, especially directed acyclic graphs (DAGs). For a DAG with node set and arc set , we let and . We denote an arc from to as and a path through nodes as . Two paths are considered distinct if they differ in at least one node. The set of in-neighbors (out-neighbors) of a node is denoted (), and its indegree (outdegree) as (). For nodes we say that there is a - path if there is a path whose first node is and last node is (we also say that reaches ). An antichain is a set of pairwise unreachable nodes. The size of the largest antichain in a DAG is called width. Analogously, for arcs we say that reaches if reaches . An arc-antichain is a set of pairwise unreachable arcs, and the arc-width of a DAG is the size of the largest arc-antichain of the graph.
For we denote a sequence through nodes as if there is a - path for all ; in particular, a path is a sequence. We say that covers or contains a node if for some , and we write . If is a sequence and each node of is contained in we say that is a subsequence of , or that is contained in . Since is a DAG, the nodes of appear in the same order as in .
Let be a sequence such that there is a path from the last node of to . The concatenation of and is simply the sequence obtained as followed by . If , then the repeated occurrence of is removed, i.e. . We adopt the convention that lowercase letters denote nodes and uppercase letters denote sequences of nodes.
Safety.
We define a path cover of to be a set of - paths such that ; we also say cover instead of “- path cover of ” when the graph is clear from the context. Two path covers are distinct if they differ in at least one path. We assume that every node of lies in some - path. We say that a sequence is safe for if in every path cover of it holds that is a subsequence of some path in . If is not safe we say that it is unsafe. A sequence is maximally safe if it is not a subsequence of any other safe sequence. We assume that there is always an - path containing a given sequence of nodes, otherwise that sequence is vacuously unsafe. Note that imposing the collection of - paths of a cover to be minimal with respect to inclusion results in an equivalent definition of safety: any sequence that is safe for path covers is also safe for minimal path covers since every minimal path cover is a path cover; but also, if is safe for minimal path covers then it is also safe for general path covers, as otherwise we could have a path cover avoiding from which arbitrary - paths can be removed until becomes minimal.
Graph compression.
We define a unitary path as a path where each node has indegree equal to one but the first one, and each node has outdegree equal to one but the last one. For example, in Figure 1, the paths and are unitary, the path is maximally unitary. With respect to , we define the compressed graph of as the graph where every maximal unitary path of is contracted into a single node. It is a folklore result that compressing a graph can be done in linear time (see e.g. [13, 52]). We state this fact in the next lemma together with an additional property stating that compressed graphs cannot have arcs with the unique out-neighbor of and the unique in-neighbor of , since would then be a unitary path.
Lemma 4.
Let be a directed graph with nodes and arcs. Then the following hold:
-
1.
We can compute the compressed graph of in time.
-
2.
In a compressed graph there is no arc such that and .
The compression of unitary paths preserves safety: every - path in traversing the unitary path traverses the corresponding node in the compressed graph of , and every - path traversing a node in the compressed graph of necessarily traverses the unitary path it corresponds to in . This idea appeared in the context of walk-finding problems in [11].
Cutnodes and dominators.
In the dominator literature, the directed graphs on which dominators are computed have a single start node from which every node is reachable. In some cases, the graphs have instead or additionally a unique sink which is reached by every node, in which case the dominance relation is referred to as post-dominance (see e.g., [5, 27]). A prominent application of dominators is in the context of program analysis, where corresponds to the entry point of the program and to its exit point. Our graphs always have unique abstract nodes and , hence, diverging from standard notation, we parameterize the domination relation by and . We borrow standard concepts and notation from the literature of dominators, which we describe next.
Let be an - directed graph. A node is said to -dominate a node if every - path contains , hence is an - cutnode if and only if -dominates . Every node - and -dominates itself. We also say that -dominates if every - path contains , which is equivalent to being a - cutnode. We say that strictly -dominates if -dominates and . For , the immediate -dominator of is the strict -dominator of that is dominated by all the nodes that strictly dominate , in which case we write . The domination relation on with respect to can be represented as a rooted tree at with the same nodes as , called the -dominator tree of . In this tree, the parent of every node is its immediate -dominator, every node is -dominated by all its ancestors and -dominates all its descendants and itself. A node is said to be an -dominator if it strictly -dominates some node. The same terminology will be used for the -domination relation. We also define a function on rooted trees as follows. Let when and when . If is larger than the number of strict dominators of a given node, then it defaults to (resp. ) in the -dominator tree (resp. -dominator tree). Essentially, gives the -th ancestor of a node in a rooted tree, if it is well defined, otherwise it returns the root of the tree. For example, in Figure 1, the immediate -dominator of is () and . Node is a strict -dominator of the nodes . See [23] for an in-depth presentation of dominators.
In DAGs, dominators admit a characterization, which we state in Lemma 5 below. This was first presented by Ochranová [41], but with an incomplete proof. This characterization was also discussed by Alstrupt et al. [6, Sec. 5.1]. To the best of our knowledge, we did not find a complete proof of this characterization in the literature, and hence one can be found in the extended version.
Lemma 5 (Characterization of node-dominators in DAGs).
Let be an - DAG. A node is an -dominator (resp. -dominator) if and only if there is a node such that (resp. ).
3 Safe sequences via dominators
3.1 Characterization of safe sequences
We start by proving Theorem 1.
Proof of Theorem 1.
If for every node there is an - path covering that avoids , then we can build a path cover by considering for every node of an - path avoiding that covers . This contradicts the safety of .
Since every - path covering contains the sequence of - cutnodes followed by the sequence of - cutnodes and is contained in every such path, the implication follows by taking .
Since is a subsequence of a sequence containing every - and - cutnode and must be covered in every path cover by some - path, any of which contains the - and -cutnodes of , it follows that is safe.
We will heavily use the type of sequence obtained as in point 3 of Theorem 1. As such, we define the extension of a node , , as the sequence obtained from the concatenation of the sequence of - cutnodes with the sequence of - cutnodes. By Theorem 1 every maximal safe sequence is the extension of some node, and so in a DAG with nodes and arcs there are at most maximal safe sequences. Moreover, for any node , can be computed in time by computing the - cutnodes and - cutnodes with e.g. the simple algorithm of Cairo et al. [9]. Thus, Theorem 1 implies that we can compute in time a set of safe sequences containing every maximal safe sequence of , by reporting for every node . However, the reported sequences may not be maximal and the same sequence may be reported multiple times.
Because of the correspondence between cutnodes and dominators explained in Section 2, the extension of a node can also be expressed as suitable paths in dominator trees. We will use this fact in the rest of the paper.
Remark 6.
In an - DAG , is the same as the path from to in the -dominator tree of concatenated with the path from to in the -dominator tree of .
3.2 Properties of s- and t-dominator trees
We start with a lemma showing an interplay between -domination and -domination on general - directed graphs.
Lemma 7.
Let be an - directed graph, let be nodes, and let . If is the -th ancestor of in the -dominator tree, then -dominates , and is not a -dominator of unless .
Proof.
Note that since is an -dominator. Let and let be the nodes between and in the -dominator tree. We can assume .
Observe that for all . Suppose otherwise. Then because is not an -dominator. Hence, by definition of , there are nodes between and in the -dominator tree. Some is not between and in the -dominator tree, and hence there is a - path avoiding . Further, since is strictly -dominated by there is a - path avoiding . Concatenating these paths at results in a - path avoiding , contradicting the fact that -dominates .
Suppose now for a contradiction that is not strictly -dominated by . From the point above, there is a - path avoiding . This path concatenated with a - path avoiding gives a - path also avoiding , contradicting the fact that -dominates .
It remains to show that cannot -dominate . Suppose otherwise. Then we have for some , since is strictly -dominated by and -dominates . Thus, some is not between and in the -dominator tree, implying that there is a - path avoiding . Since is -dominated by , there is an - path avoiding , which concatenated with the previous path gives an - path also avoiding , contradicting the fact that -dominates .
Lemma 7 can be interpreted for the special case of as follows: if immediately -dominates then the immediate -dominator of -dominates (see Figure 2). Another consequence of this result is that if is an -dominator, then at most one node immediately -dominated by is in the path from to in the dominator tree of , which must be the immediate -dominator of .
Observe now in Figure 1 that is the immediate -dominator only of and that is the immediate -dominator only of , and likewise for and , and so . This redundancy and asymmetry introduced by unitary paths is undesirable in the characterization of maximal safe sequences. This discussion motivates the following result.
Lemma 8.
Let be a compressed - DAG and let be distinct nodes. If immediately -dominates and immediately -dominates then immediately -dominates some node different from and immediately -dominates some node different from .
Proof.
Assume without loss of generality that immediately -dominates only node .
We claim that . Suppose otherwise. If is empty then , and so is not a -dominator, a contradiction. So there is an arc with . Since -dominates , it must also -dominate every in-neighbour of , namely . But immediately -dominates only , thus, must -dominate , implying that there is a - path. Since , contains a cycle, a contradiction.
Since is a -dominator, by Lemma 5 there is a node whose unique out-neighbour is . If this node is , we contradict the fact that the graph is compressed by Lemma 4. Therefore, must have at least two distinct in-neighbors, a contradiction.
3.3 Characterization and enumeration of maximal safe sequences
Theorem 9 (Characterization of maximal safe sequences).
Let be a compressed - DAG and let be a node. Then is a maximal safe sequence if and only if is a leaf in both the - and -dominator trees of G.
Proof.
If is a leaf in both dominator trees then is maximal: no other node can be added to this sequence by the equivalence of safe sequences and paths in the dominator trees given by Remarks 6 and 1.
Suppose that is a maximal safe sequence but is not a leaf in both dominator trees. Without loss of generality suppose that is not a leaf in the -dominator tree. Note that since is an -dominator. Let be a node immediately -dominated by and let denote the immediate -dominator of .
First we argue that there is a node that is immediately -dominated by and that is strictly -dominated by . We consider two cases. If then we can take , because by Lemma 7 for we can conclude that -dominates , and since , it strictly -dominates . If then we can take , since by Lemma 8 there is a node that is immediately -dominated by ; applying Lemma 7 to and with , we conclude that -dominates , and since , strictly -dominates .
Now we show that properly contains . Consider , which by definition consists of the sequence of - cutnodes (call it ) concatenated with the sequence of - cutnodes (call it ). In , by definition, the double occurrence of is removed, and thus let be the sequence of - cutnodes. That is, can be written as . Consider also , which can be written analogously as . Since is the parent of in the -dominator tree, we have that is strictly contained in , and since is a proper descendant of in the -dominator tree, we have that is contained in (possibly ). Therefore, is a proper subsequence of , contradicting our assumption that is a maximal safe sequence.
The idea in the above theorem of targeting special nodes or arcs in order to characterize every maximal safe object is a recurring theme in safety problems. For instance, [31, 2] proposed the notion of representative arc, [48] proposed that of core, and [10] found a special class of walks, coined macrotigs, from where every maximal safe walk can be built. Each of these constructs was used in a different safety problem.
While the above characterization uses dominator trees, we can derive an equivalent characterization of dominator-tree leaves just in terms of in- and out-neighborhoods via Lemma 5, since a node is a leaf if and only if it is not a dominator. Computing the extension of every node satisfying such condition leads to a simple time algorithm outputting all and only maximal safe sequences, without needing to compute the dominator trees, from which Theorem 2 follows.
Proof of Theorem 3.
Let be the compressed graph of . Build the dominator trees of and compute all the leaves common to both trees. These three steps take time.
1: For every node that is a leaf in both dominator trees, output . By Theorem 9 we find every maximal safe sequence during this process. Further, no duplicate safe sequence is computed since different leaves common to both trees have different extensions. Therefore the algorithm runs in time equal to the length of all the maximal safe sequences plus the time required to build two dominator trees over , so altogether.
4 Experiments
The experiments were performed on an AMD 32-core machine with 512GB RAM. The source code of our project is publicly available on Github555https://github.com/algbio/safe-sequences. All our algorithms are implemented in Python3, as are the scripts to produce the experimental results. To solve ILPs, we use Gurobi’s Python API (version 11) under an academic license. For every graph, we run Gurobi with 4 threads and set up a timeout of 300 seconds in Gurobi’s execution time.
Implementation.
As described in Section 1.2, we study experimentally the potential of applying safe sequences in two arc-path covering problems, LeastSquares and MinPathError.
We did not use an external algorithm to compute arc-dominator trees because these are mostly tailored for node-dominators. Instead, we compute the immediate -dominator and -dominator of all arcs with an time procedure via the algorithm of [9], appropriately handling the reversed graph of for the immediate -domination relation. The classical algorithm of Aho and Ulman [3] to build dominator trees has the same running time as our procedure. We remark that even with a suboptimal algorithm to compute the dominator trees, our entire safety-preprocessing step takes less than 0.1 seconds on average.
With the dominator trees, we can compute all the maximal safe sequences with respect to arbitrary subsets of arcs to be covered, all in time proportional to the total length of all the maximal safe sequences, completely analogously to the discussion in Section 3.
The ILPs were optimized as in [25] by fixing variables to 1 based on the maximal safe sequences of the input DAG, which reduces the search space of the linear program and hence speeds-up the solver. Let be sequences in the input DAG such that: (a) each , , must appear in some solution path of the ILP, and (b) there exists no path in the DAG containing two distinct and . In the case that every is a path, Grigorjew et al. [25] noticed that if the two mentioned properties hold, then must appear in distinct paths of among the solution paths of the ILP (see the three solid-colored sequences shown in Figure 1 which must appear in different paths of any path cover). As such, without loss of generality, one can set for each arc of , for all . When using sequences to set binary variables, we can proceed in a completely analogous manner. To have a large number of binary variables set to 1, we follow the approach of [25]. Namely, for every arc , we define its weight as the length of the longest safe sequence containing . Then we find a maximum-weight antichain of arcs in the DAG using the min-flow reduction of [39]. Then, for each in the maximum-weight antichain, we consider a longest safe sequence containing to fix variables to 1. Note that a path covering two such sequences and would witness that and cannot be in an antichain. Finally, we set to the arc-width of the graph and run Gurobi to solve the ILP.
Note that any solution to MinPathError is also a path cover, since each arc weight is positive, and thus the simplified ILP has equivalent optimal solutions. This may not be the case for LeastSquares, since some arcs may not be covered by any solution path. To handle such scenarios we propose using maximal safe sequences for path covers where only a subset of nodes or arcs can be trusted to appear in any optimal solution. As an illustration of this approach, we perform experiments by choosing the subset as those arcs whose weight is at least as large as the 25-th percentile of the weights of all arcs, since we can heuristically infer that the solution paths of LeastSquares cover the arcs of large enough weight.
Datasets.
We experiment with seven datasets. The first four contain splice graphs with erroneous weights created directly from gene annotation by [16].666Available at https://doi.org/10.5281/zenodo.10775004 These were created starting from the well-known dataset of Shao and Kingsford [49], which was also used in several other studies benchmarking the speed of minimum flow decomposition solvers [17, 32, 54]. The original splice graphs were created in [49] from gene annotation from Human, Mouse and Zebrafish. This dataset also contains a set of graphs (SRR020730) from human gene annotation, but with flow values coming from a real sequencing sample from the Sequence Read Archive (SRR020730), quantified using the tool Salmon [42]. To mimic splice graphs constructed from real read abundances, [16] added noise to the flow value of each arc, according to a Poisson distribution. The last three datasets contain graphs constructed by a state-of-the-art tool for RNA transcript assembly, IsoQuant [44]. Mouse annotated transcript expression was simulated according to a Gamma distribution that resembles real expression profile [44]. From these expressed transcripts, PacBio reads were simulated using IsoSeqSim [1] and fed to IsoQuant for graph creation (we call this Mouse PacBio), and ONT reads were simulated using Trans-Nanosim [28], which was modified to perform a more realistic read truncation (see [44]), and fed to IsoQuant for graph creation (we call this Mouse ONT). For the latter one, there is also a simpler version where no read truncation was introduced (Mouse ONT.tr). These are available at [43].
Discussion.
We group the graphs in each dataset by width (column “”). For each width-interval we show different metrics, namely the number of graphs “g”, the average number of arcs “avg ” (and the maximum number of arcs in parenthesis), the time taken in seconds by the safety preprocessing step in “prep”, the percentage of fixed arc-variables of the ILP out of all variables in “vars”, the number of solved instances by the ILP with and without safety in “solved” (and in parenthesis the average time taken in seconds to solve those instances), and in the column “” we show the average speedup obtained by advising the linear program with safety. The speed-ups are computed by considering for each graph the ratio between the time taken by the original ILP (or 300 sec. if this timed-out) divided by the time taken by the optimized ILP, and averaging these ratios. In the “prep” and “vars” column, a dash is written only if no instance was solved with safety information. In the remaining columns a dash means not applicable.
In Table 1 we show the experimental results of the safety-optimized ILPs for MinPathError.(As explained before, for this problem it is correct to select .) When the width of the graph is small, the speed-ups are also small. However, such graphs are fast to solve without safety in the first place. The challenging cases are the graphs of larger widths and more arcs (indeed, the solver with no safety information performs poorly and it becomes slower as the width increases). The speed-up of the solver significantly increases as the width of the graph increases, becoming more than 200 in some datasets. Moreover, these speed-up are obtained almost for free, in the sense that the time needed to compute safe sequences is negligible for all datasets, i.e. below 0.1 seconds, and often even much faster. We highlight the notable difference between the number of instances solved with no safety and with safety. In Table 2 for LeastSquares we observe the same phenomena, obtaining in some datasets speed-ups of 300 on graphs of larger widths, and solving always more instances with safety than without safety. In the extended version results for in the MinPathError and for in the LeastSquares can be found.
| #g | avg (max ) | prep (s) | vars (%) | #solved (avg time (s)) | ||||
|---|---|---|---|---|---|---|---|---|
| no safety | safety | |||||||
| Zebrafish | 1-3 | 15405 | 14 (210) | 0.003 | 87.5 | 15405 (0.011) | 15405 (0.011) | 1.2 |
| 4-6 | 239 | 35 (156) | 0.007 | 29.5 | 239 (1.927) | 239 (0.095) | 14.6 | |
| 7-9 | 6 | 70 (158) | 0.015 | 15.3 | 4 (71.070) | 6 (0.645) | 244.3 | |
| 10+ | 1 | 81 (81) | 0.016 | 13.1 | 0 (-) | 1 (3.421) | 88.1 | |
| Human | 1-3 | 10729 | 15 (331) | 0.003 | 78.9 | 10729 (0.020) | 10729 (0.014) | 1.4 |
| 4-6 | 947 | 35 (163) | 0.008 | 24.7 | 944 (4.031) | 947 (0.147) | 19.0 | |
| 7-9 | 92 | 54 (105) | 0.012 | 14.1 | 63 (63.346) | 92 (3.924) | 116.6 | |
| 10+ | 12 | 80 (104) | 0.019 | 11.7 | 2 (189.233) | 9 (32.264) | 35.0 | |
| Mouse | 1-3 | 12280 | 14 (128) | 0.003 | 84.4 | 12280 (0.014) | 12280 (0.011) | 1.3 |
| 4-6 | 749 | 35 (126) | 0.007 | 27.4 | 749 (2.295) | 749 (0.121) | 15.0 | |
| 7-9 | 60 | 58 (231) | 0.015 | 17.5 | 43 (52.717) | 60 (4.955) | 73.3 | |
| 10+ | 9 | 79 (121) | 0.014 | 15.2 | 1 (262.851) | 7 (34.163) | 69.2 | |
| SRR020730 | 1-3 | 35069 | 9 (186) | 0.002 | 84.7 | 35069 (0.015) | 35069 (0.010) | 1.4 |
| 4-6 | 4497 | 33 (171) | 0.007 | 20.4 | 4496 (2.043) | 4497 (0.124) | 12.7 | |
| 7-9 | 1008 | 52 (131) | 0.010 | 11.9 | 844 (42.962) | 1007 (1.692) | 117.1 | |
| 10+ | 296 | 75 (184) | 0.015 | 7.9 | 80 (63.971) | 289 (22.945) | 142.5 | |
| Mouse ONT | 1-3 | 18527 | 11 (129) | 0.003 | 81.4 | 18527 (0.013) | 18527 (0.011) | 1.3 |
| 4-6 | 3083 | 31 (387) | 0.007 | 30.2 | 3083 (0.203) | 3083 (0.041) | 4.8 | |
| 7-9 | 762 | 48 (157) | 0.010 | 18.1 | 755 (2.495) | 762 (0.180) | 38.7 | |
| 10+ | 442 | 87 (589) | 0.021 | 11.1 | 372 (18.194) | 437 (0.611) | 266.0 | |
| Mouse ONT.tr | 1-3 | 18029 | 12 (131) | 0.003 | 81.1 | 18029 (0.011) | 18029 (0.011) | 1.1 |
| 4-6 | 3028 | 32 (187) | 0.007 | 30.6 | 3028 (0.187) | 3028 (0.042) | 3.8 | |
| 7-9 | 803 | 50 (419) | 0.012 | 19.0 | 796 (2.344) | 803 (0.182) | 21.9 | |
| 10+ | 476 | 89 (549) | 0.022 | 11.5 | 403 (17.204) | 472 (0.982) | 162.1 | |
| Mouse PacBio | 1-3 | 14256 | 14 (382) | 0.003 | 83.5 | 14256 (0.010) | 14256 (0.011) | 1.2 |
| 4-6 | 1376 | 33 (187) | 0.006 | 33.3 | 1376 (0.196) | 1376 (0.040) | 5.2 | |
| 7-9 | 182 | 49 (159) | 0.009 | 20.5 | 181 (4.083) | 182 (0.144) | 48.1 | |
| 10+ | 63 | 109 (1178) | 0.026 | 13.3 | 53 (22.188) | 62 (2.557) | 210.3 | |
| #g | avg (max ) | prep (s) | vars (%) | #solved (avg time (s)) | ||||
|---|---|---|---|---|---|---|---|---|
| no safety | safety | |||||||
| Zebrafish | 1-3 | 15405 | 14 (210) | 0.003 | 83.1 | 15405 (0.031) | 15405 (0.023) | 1.5 |
| 4-6 | 239 | 35 (156) | 0.007 | 18.1 | 224 (13.295) | 238 (3.061) | 15.9 | |
| 7-9 | 6 | 70 (158) | 0.011 | 8.4 | 0 (-) | 2 (62.840) | 4.8 | |
| 10+ | 1 | 81 (81) | - | - | 0 (-) | 0 (-) | - | |
| Human | 1-3 | 10729 | 15 (331) | 0.003 | 74.6 | 10729 (0.046) | 10729 (0.031) | 1.8 |
| 4-6 | 947 | 35 (163) | 0.007 | 16.3 | 862 (17.153) | 930 (6.334) | 12.9 | |
| 7-9 | 92 | 54 (105) | 0.011 | 9.6 | 4 (69.195) | 53 (71.207) | 29.1 | |
| 10+ | 12 | 80 (104) | - | - | 0 (-) | 0 (-) | - | |
| Mouse | 1-3 | 12280 | 14 (128) | 0.003 | 80.0 | 12280 (0.035) | 12280 (0.026) | 1.6 |
| 4-6 | 749 | 35 (126) | 0.007 | 16.9 | 696 (14.851) | 744 (5.917) | 12.4 | |
| 7-9 | 60 | 58 (231) | 0.015 | 9.9 | 0 (-) | 16 (64.253) | 121.9 | |
| 10+ | 9 | 79 (121) | - | - | 0 (-) | 0 (-) | - | |
| SRR020730 | 1-3 | 35069 | 9 (186) | 0.002 | 83.2 | 35069 (0.026) | 35069 (0.017) | 1.7 |
| 4-6 | 4497 | 33 (171) | 0.007 | 15.5 | 4428 (13.082) | 4494 (1.275) | 16.7 | |
| 7-9 | 1008 | 52 (131) | 0.011 | 8.9 | 269 (127.595) | 899 (25.515) | 82.7 | |
| 10+ | 296 | 75 (184) | 0.015 | 6.8 | 1 (7.363) | 104 (61.880) | 68.0 | |
| Mouse ONT | 1-3 | 18527 | 11 (129) | 0.003 | 75.9 | 18527 (0.020) | 18527 (0.017) | 1.6 |
| 4-6 | 3083 | 31 (387) | 0.007 | 22.5 | 3077 (1.448) | 3082 (0.310) | 7.6 | |
| 7-9 | 762 | 48 (157) | 0.010 | 13.5 | 693 (15.761) | 754 (1.602) | 116.2 | |
| 10+ | 442 | 87 (589) | 0.018 | 8.5 | 222 (37.135) | 405 (11.143) | 301.0 | |
| Mouse ONT.tr | 1-3 | 18029 | 12 (131) | 0.003 | 74.7 | 18029 (0.020) | 18029 (0.017) | 1.5 |
| 4-6 | 3028 | 32 (187) | 0.007 | 21.7 | 3012 (1.693) | 3022 (0.426) | 7.1 | |
| 7-9 | 803 | 50 (419) | 0.011 | 13.5 | 727 (14.958) | 794 (2.087) | 76.2 | |
| 10+ | 476 | 89 (549) | 0.021 | 8.8 | 251 (32.631) | 423 (8.422) | 183.2 | |
| Mouse PacBio | 1-3 | 14256 | 14 (382) | 0.003 | 74.4 | 14256 (0.034) | 14256 (0.025) | 1.7 |
| 4-6 | 1376 | 33 (187) | 0.007 | 22.1 | 1356 (2.622) | 1372 (1.189) | 12.3 | |
| 7-9 | 182 | 49 (159) | 0.010 | 14.0 | 146 (37.428) | 174 (6.627) | 137.8 | |
| 10+ | 63 | 109 (1178) | 0.015 | 10.3 | 15 (110.413) | 46 (17.775) | 370.6 | |
References
- [1] Isoseqsim. URL: https://github.com/yunhaowang/IsoSeqSim [cited April 23, 2025].
- [2] Bashar Ahmed, Siddharth Singh Rana, Shahbaz Khan, et al. Evaluating Optimal Safe Flows Decomposition for RNA Assembly. arXiv preprint arXiv:2409.13279, 2024.
- [3] Alfred V Aho and Jeffrey D Ullman. The theory of parsing, translation, and compiling, volume 1. Prentice-Hall Englewood Cliffs, NJ, 1973.
- [4] Ravindra K Ahuja, Thomas L Magnanti, James B Orlin, et al. Network flows: theory, algorithms, and applications, volume 1. Prentice hall Englewood Cliffs, NJ, 1993.
- [5] Frances E Allen. Control flow analysis. ACM Sigplan Notices, 5(7):1–19, 1970. doi:10.1145/800028.808479.
- [6] Stephen Alstrup, Dov Harel, Peter W Lauridsen, and Mikkel Thorup. Dominators in linear time. SIAM Journal on Computing, 28(6):2117–2132, 1999. doi:10.1137/S0097539797317263.
- [7] Adam L Buchsbaum, Loukas Georgiadis, Haim Kaplan, Anne Rogers, Robert E Tarjan, and Jeffery R Westbrook. Linear-time algorithms for dominators and other path-evaluation problems. SIAM Journal on Computing, 38(4):1533–1573, 2008. doi:10.1137/070693217.
- [8] Benjamin Merlin Bumpus, Bart M. P. Jansen, and Jari J. H. de Kroon. Search-space reduction via essential vertices. In Shiri Chechik, Gonzalo Navarro, Eva Rotenberg, and Grzegorz Herman, editors, 30th Annual European Symposium on Algorithms, ESA 2022, September 5-9, 2022, Berlin/Potsdam, Germany, volume 244 of LIPIcs, pages 30:1–30:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ESA.2022.30.
- [9] Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt, Alexandru I. Tomescu, and Elia C. Zirondelli. A simplified algorithm computing all s-t bridges and articulation points. Discrete Applied Mathematics, 305:103–108, 2021. doi:10.1016/j.dam.2021.08.026.
- [10] Massimo Cairo, Romeo Rizzi, Alexandru I. Tomescu, and Elia C. Zirondelli. Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), volume 198 of Leibniz International Proceedings in Informatics (LIPIcs), pages 43:1–43:18, Dagstuhl, Germany, 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ICALP.2021.43.
- [11] Massimo Cairo, Romeo Rizzi, Alexandru I. Tomescu, and Elia C. Zirondelli. Genome assembly, from practice to theory: Safe, complete and Linear-Time. ACM Trans. Algorithms, 20(1):4:1–4:26, 2024. doi:10.1145/3632176.
- [12] Stefan Canzar, Sandro Andreotti, David Weese, Knut Reinert, and Gunnar W Klau. CIDANE: comprehensive isoform discovery and abundance estimation. Genome Biology, 17:1–18, 2016.
- [13] Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12):i201–i208, 2016.
- [14] Keith D Cooper, Timothy J Harvey, and Ken Kennedy. A simple, fast dominance algorithm. Software Practice & Experience, 4(1-10):1–8, 2001.
- [15] Fernando H C Dias, Manuel Cáceres, Lucia Williams, Brendan Mumey, and Alexandru I Tomescu. A safety framework for flow decomposition problems via integer linear programming. Bioinformatics, 39(11):btad640, October 2023. doi:10.1093/bioinformatics/btad640.
- [16] Fernando H. C. Dias and Alexandru I. Tomescu. Accurate flow decomposition via robust integer linear programming. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–14, 2024. doi:10.1109/TCBB.2024.3433523.
- [17] Fernando HC Dias, Lucia Williams, Brendan Mumey, and Alexandru I Tomescu. Fast, flexible, and exact minimum flow decompositions via ILP. In RECOMB 2022 – International Conference on Research in Computational Molecular Biology, pages 230–245. Springer, 2022.
- [18] Michael R. Fellows, Lars Jaffke, Aliz Izabella Király, Frances A. Rosamond, and Mathias Weller. What is known about vertex cover kernelization? CoRR, abs/1811.09429, 2018. arXiv:1811.09429.
- [19] Jianxing Feng, Wei Li, and Tao Jiang. Inference of isoforms from short sequence reads. Journal of Computational Biology, 18(3):305–321, 2011. doi:10.1089/CMB.2010.0243.
- [20] Wojciech Fraczak, Loukas Georgiadis, Andrew Miller, and Robert E Tarjan. Finding dominators via disjoint set union. Journal of Discrete Algorithms, 23:2–20, 2013. doi:10.1016/J.JDA.2013.10.003.
- [21] Thomas Gatter and Peter F Stadler. Ryūtō: network-flow based transcriptome reconstruction. BMC bioinformatics, 20:1–14, 2019. doi:10.1186/S12859-019-2786-5.
- [22] Loukas Georgiadis, Luigi Laura, Nikos Parotsidis, and Robert E Tarjan. Loop nesting forests, dominators, and applications. In Experimental Algorithms: 13th International Symposium, SEA 2014, Copenhagen, Denmark, June 29–July 1, 2014. Proceedings 13, pages 174–186. Springer, 2014.
- [23] Loukas Georgiadis and Nikos Parotsidis. Dominators in directed graphs: a survey of recent results, applications, and open problems. Proc. of ISCIM, 13:15–20, 2013.
- [24] Loukas Georgiadis, Robert Tarjan, and Renato Werneck. Finding dominators in practice. Journal of Graph Algorithms and Applications, 10(1):69–94, 2006. doi:10.7155/JGAA.00119.
- [25] Andreas Grigorjew, Fernando H. C. Dias, Andrea Cracco, Romeo Rizzi, and Alexandru I. Tomescu. Accelerating ILP solvers for minimum flow decompositions through search space and dimensionality reductions. In Leo Liberti, editor, 22nd International Symposium on Experimental Algorithms, SEA 2024, July 23-26, 2024, Vienna, Austria, volume 301 of LIPIcs, pages 14:1–14:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.SEA.2024.14.
- [26] Jiong Guo and Rolf Niedermeier. Invitation to data reduction and problem kernelization. SIGACT News, 38(1):31–45, 2007. doi:10.1145/1233481.1233493.
- [27] Rajiv Gupta. Generalized dominators and post-dominators. In Proceedings of the 19th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 246–257, 1992. doi:10.1145/143165.143216.
- [28] Saber Hafezqorani, Chen Yang, Theodora Lo, Ka Ming Nip, René L Warren, and Inanc Birol. Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data. GigaScience, 9(6):giaa061, June 2020. doi:10.1093/gigascience/giaa061.
- [29] Bart M. P. Jansen and Ruben F. A. Verhaegh. Search-space reduction via essential vertices revisited: Vertex multicut and cograph deletion. In Hans L. Bodlaender, editor, 19th Scandinavian Symposium and Workshops on Algorithm Theory, SWAT 2024, June 12-14, 2024, Helsinki, Finland, volume 294 of LIPIcs, pages 28:1–28:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.SWAT.2024.28.
- [30] Shahbaz Khan, Milla Kortelainen, Manuel Cáceres, Lucia Williams, and Alexandru I. Tomescu. Improving RNA assembly via safety and completeness in flow decompositions. J. Comput. Biol., 29(12):1270–1287, 2022. doi:10.1089/CMB.2022.0261.
- [31] Shahbaz Khan and Alexandru I. Tomescu. Optimizing Safe Flow Decompositions in DAGs. In Shiri Chechik, Gonzalo Navarro, Eva Rotenberg, and Grzegorz Herman, editors, 30th Annual European Symposium on Algorithms (ESA 2022), volume 244 of Leibniz International Proceedings in Informatics (LIPIcs), pages 72:1–72:17, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPICS.ESA.2022.72.
- [32] Kyle Kloster, Philipp Kuinke, Michael P O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D Sullivan, and Andrew van der Poel. A practical FPT algorithm for flow decomposition and transcript assembly. In ALENEX 2018 – Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments, pages 75–86. SIAM, 2018. doi:10.1137/1.9781611975055.7.
- [33] Jingyi Jessica Li, Ci-Ren Jiang, James B Brown, Haiyan Huang, and Peter J Bickel. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proceedings of the National Academy of Sciences, 108(50):19867–19872, 2011.
- [34] Wei Li, Jianxing Feng, and Tao Jiang. IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. In RECOMB 2011 – International Conference on Research in Computational Molecular Biology, pages 168–188. Springer, 2011.
- [35] Yen-Yi Lin, Phuong Dao, Faraz Hach, Marzieh Bakhshi, Fan Mo, Anna Lapuk, Colin Collins, and S Cenk Sahinalp. CLIIQ: Accurate comparative detection and quantification of expressed isoforms in a population. In WABI 2012 – 12th International Workshop on Algorithms in Bioinformatics, pages 178–189. Springer, 2012.
- [36] Juntao Liu, Ting Yu, Zengchao Mu, and Guojun Li. TransLiG: a de novo transcriptome assembler that uses line graph iteration. Genome Biology, 20:1–9, 2019.
- [37] Daniel Lokshtanov, Neeldhara Misra, and Saket Saurabh. Kernelization - preprocessing with a guarantee. In Hans L. Bodlaender, Rod Downey, Fedor V. Fomin, and Dániel Marx, editors, The Multivariate Algorithmic Revolution and Beyond - Essays Dedicated to Michael R. Fellows on the Occasion of His 60th Birthday, volume 7370 of Lecture Notes in Computer Science, pages 129–161. Springer, 2012. doi:10.1007/978-3-642-30891-8_10.
- [38] Aziz M Mezlini, Eric JM Smith, Marc Fiume, Orion Buske, Gleb L Savich, Sohrab Shah, Sam Aparicio, Derek Y Chiang, Anna Goldenberg, and Michael Brudno. iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Research, 23(3):519–529, 2013.
- [39] Ralf Möhring. Algorithmic Aspects of Comparability Graphs and Interval Graphs. In Ivan Rival, editor, Graphs and Order: the role of graphs in the theory of ordered sets and its applications. D. Reidel Publishing Company, 1984.
- [40] Nidia Obscura Acosta, Veli Mäkinen, and Alexandru I Tomescu. A safe and complete algorithm for metagenomic assembly. Algorithms for Molecular Biology, 13:1–12, 2018. doi:10.1186/S13015-018-0122-7.
- [41] Renata Ochranová. Finding dominators. In Foundations of Computation Theory: Proceedings of the 1983 International FCT-Conference Borgholm, Sweden, August 21–27, 1983 4, pages 328–334. Springer, 1983. doi:10.1007/3-540-12689-9_115.
- [42] Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4):417–419, 2017. doi:10.1038/nmeth.4197.
- [43] Andrey Prjibelski. IsoQuant graphs from PacBio and ONT Mouse sequencing data, October 2024. doi:10.5281/zenodo.13987688.
- [44] Andrey D. Prjibelski, Alla Mikheenko, Anoushka Joglekar, Alexander Smetanin, Julien Jarroux, Alla L. Lapidus, and Hagen U. Tilgner. Accurate isoform discovery with IsoQuant using long reads. Nature Biotechnology, 41(7):915–918, 2023. doi:10.1038/s41587-022-01565-y.
- [45] Amatur Rahman and Paul Medvedev. Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs. Genome Research, 32(9):1746–1753, 2022.
- [46] Zhaleh Safikhani, Mehdi Sadeghi, Hamid Pezeshk, and Changiz Eslahchi. SSP: An interval integer linear programming for de novo transcriptome assembly and isoform discovery of RNA-seq reads. Genomics, 102(5-6):507–514, 2013.
- [47] Palash Sashittal, Chuanyi Zhang, Jian Peng, and Mohammed El-Kebir. Jumper enables discontinuous transcript assembly in coronaviruses. Nature Communications, 12(1):6728, 2021.
- [48] Sebastian Schmidt, Santeri Toivonen, Paul Medvedev, and Alexandru I. Tomescu. Applying the Safe-And-Complete Framework to Practical Genome Assembly. In Solon P. Pissis and Wing-Kin Sung, editors, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024), volume 312 of Leibniz International Proceedings in Informatics (LIPIcs), pages 8:1–8:16, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPICS.WABI.2024.8.
- [49] Mingfu Shao and Carl Kingsford. Theory and a heuristic for the minimum path flow decomposition problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(2):658–670, 2017.
- [50] Li Song, Sarven Sabunciyan, and Liliana Florea. CLASS2: accurate and efficient splice variant annotation from RNA-seq reads. Nucleic Acids Research, 44(10):e98–e98, 2016.
- [51] Alexandru I. Tomescu, Anna Kuosmanen, Romeo Rizzi, and Veli Mäkinen. A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinformatics, 14(5):S15, 2013. doi:10.1186/1471-2105-14-S5-S15.
- [52] Alexandru I. Tomescu and Paul Medvedev. Safe and complete contig assembly via omnitigs. In Mona Singh, editor, Research in Computational Molecular Biology, pages 152–163, Cham, 2016. Springer International Publishing. doi:10.1007/978-3-319-31957-5_11.
- [53] Benedicte Vatinlen, Fabrice Chauvet, Philippe Chrétienne, and Philippe Mahey. Simple bounds and greedy algorithms for decomposing a flow into a minimal set of paths. European Journal of Operational Research, 185(3):1390–1401, 2008. doi:10.1016/J.EJOR.2006.05.043.
- [54] Lucia Williams, Alexandru I. Tomescu, and Brendan Mumey. Flow decomposition with subpath constraints. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(1):360–370, 2023. doi:10.1109/TCBB.2022.3147697.
- [55] Zheng Xia, Jianguo Wen, Chung-Che Chang, and Xiaobo Zhou. NSMAP: a method for spliced isoforms identification and quantification from RNA-Seq. BMC Bioinformatics, 12:1–13, 2011.
- [56] Jin Zhao, Haodi Feng, Daming Zhu, and Yu Lin. MultiTrans: an algorithm for path extraction through Mixed Integer Linear Programming for transcriptome assembly. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(1):48–56, 2021. doi:10.1109/TCBB.2021.3083277.
- [57] Hongyu Zheng, Cong Ma, and Carl Kingsford. Deriving ranges of optimal estimated transcript expression due to nonidentifiability. Journal of Computational Biology, 29(2):121–139, 2022. Proceedings paper of RECOMB 2021. doi:10.1089/CMB.2021.0444.
