Abstract 1 Introduction 2 Preliminaries 3 Hardness 4 A Polynomial Time Algorithm for 𝚺-Representing Path when Each Symbol Labels at most Two Nodes 5 Distance from Disjoint Paths 6 An Approximation Algorithm 7 Conclusion References

Representing Paths in Digraphs

Riccardo Dondi ORCID Università degli Studi di Bergamo, Italy Alexandru Popa ORCID Department of Computer Science, University of Bucharest, Romania
Abstract

In this contribution we consider two combinatorial problems related to graph string matching, motivated by recent approaches in computational genomics. Given a DAG where each node is labeled by a symbol, the problems aim to find a path in the DAG whose nodes contain all (or the maximum number of) symbols of the alphabet. We introduce a decision problem, Σ-Representing Path, that asks whether there exists a path that contains all the symbols of the alphabet, and an optimization problem, called Maximum Representing Path, that asks for a path that contains the maximum number of symbols. We analyze the complexity of the problems, showing the NP-completeness of Σ-Representing Path when each symbol labels at most three nodes in the DAG, and showing the APX-hardness of Maximum Representing Path when each symbol labels at most two nodes in the DAG. We complement the first result by giving a polynomial-time algorithm for Σ-Representing Path when each symbol labels at most two nodes in the DAG. Then we investigate the parameterized complexity of the two problems when the DAG has a limited distance from a set of disjoint paths and we show that both problems are W[1]-hard for this parameter. We consider the approximation of Maximum Representing Path, giving an approximation algorithm of factor OPT, where OPT is the value of an optimal solution of the problem. We also show that Maximum Representing Path cannot be approximated within factor ee1α, for any constant α>0, unless NPDTIME(|V|O(loglog|V|)) (V is the set of nodes of the DAG).

Keywords and phrases:
Graph String Matching, Computational Complexity, Parameterized Complexity, Algorithms
Copyright and License:
[Uncaptioned image] © Riccardo Dondi and Alexandru Popa; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Parameterized complexity and exact algorithms
; Theory of computation Design and analysis of algorithms ; Theory of computation Graph algorithms analysis
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

Graph string matching and approximate matching have been widely investigated due to their application in the context of pattern matching of hypertexts [1, 2, 20, 17] and panegenome analysis [19, 25, 5]. The problems consider a query string s and a directed graph D, whose nodes are labeled with symbols or strings. A path or a walk in D is associated with a string that is obtained by concatenating the symbols or the strings on the path (or walk) nodes. The problems ask whether there exists a path or a walk that matches, or approximately matches, the query string. In exact matching the two strings have to be identical, while in approximated matching edit operations may be applied on the query string or on the node labels, in order to obtain two identical strings, and such edit operations have to be minimized.

Both matching and approximate matching can be solved in polynomial time if the input graph is a Directed Acyclic Graph (DAG) [17]. When the input digraph admits cycles the exact matching is solvable in polynomial time  [1, 2, 20, 18, 13] and conditional lower bounds [8] show that improving the algorithms known in the literature is unlikely. The approximate matching is solvable in polynomial time when edit operations are applied only in the query string [13] and it is instead NP-hard when the edit operations may be applied on node labels [2], also for binary alphabet [13, 6].

A second research direction that has been recently investigated in sequence analysis, in the context of computational genomics, is the quest for subsequences of a given string that contain all symbols of the alphabet (and that have other specific properties)  [21, 7, 15, 3, 16]. The property that each symbol of the alphabet is contained in a subsequence of a given string s is applied in [21] looking for a run subsequence of maximum length, where a run subsequence contains at least one substring for each symbol of the alphabet. Another approach is considered in [15, 16] and it looks whether there exists a subsequence of s that consists of substrings of repeated symbols that have length at least two, with the constraint that the subsequence contains each symbol of the alphabet. Both problems are NP-hard [21, 16] and results on their tractability have been given [21, 7, 3, 15, 16].

In this contribution, we introduce new problems that aim to integrate the two aforementioned approaches. On the one hand we consider a DAG having nodes labeled by symbols over an alphabet Σ, since this allows us to represent variants of a sequence as in graph string matching. On the other hand we look for a path that contains all (or the maximum number of) symbols of Σ, in a similar way to the second approach. If there exists a path in the DAG that contains all the symbols of the alphabet, we call this a representing path. We introduce a decision problem, called Σ-Representing Path, where we ask whether there exists a representing path in a DAG D. Since in some cases a representing path may not exist, we consider an optimization variant, called Maximum Representing Path, where we look for a path in D that represents the maximum number of symbols in Σ, that is the string associated with the path contains the maximum number of symbols of Σ. Next, we summarize our results.

We start in Section 2 by presenting some concepts and by formally defining the two problems, Σ-Representing Path and Maximum Representing Path. Then we investigate the complexity of the problems and we prove in Section 3 that Σ-Representing Path is NP-complete even when each symbol labels at most three nodes of the input DAG, and Maximum Representing Path is APX-hard when each symbol labels at most two nodes of the DAG. Moreover, in both aforementioned cases, the input DAG has also the maximum degree bounded by three. In Section 3 we prove also a lower bound on the approximation: Maximum Representing Path cannot be approximated with factor ee1α, for any constant α>0, unless NPDTIME(|V|O(loglog|V|)), where V is the set of nodes of the input DAG. We complement the first hardness result by showing in Section 4 that Σ-Representing Path can be solved in polynomial time when each symbol labels at most two nodes of D.

Then we study in Section 5 how the complexity of the two problems is influenced by the structure of D. Observe that if D consists of a set of disjoint paths (except for the source and the target nodes of D), then the two problems are easy to solve by inspecting each path independently. We show that Σ-Representing Path and Maximum Representing Path are W[1]-hard for parameter distance to disjoint paths. Finally, in Section 6 we present an approximation algorithm for the Maximum Representing Path problem of factor OPT, where OPT is the value of an optimal solution of Maximum Representing Path. We conclude the paper in Section 7 pointing out some future directions. Some of the proofs are not included due to page limit (marked with ()).

2 Preliminaries

We introduce some notation. For a natural number n, we denote [n]={1,,n}. A Directed Acyclic Graph (DAG) D=(V,A) is a directed graph consisting of a set V of nodes, a set A={(u,v):u,vV} of arcs, such that there is no directed cycle in D. Given a finite alphabet Σ and a DAG D=(V,A), we define a labeling function λ that associates a symbol with each node of D, that is, λ:VΣ.

A path p in D is a sequence vp,1vp,z (vp,i represents the i-th node of path p) of adjacent distinct nodes of V, that is (vp,i,vp,i+1)A for i[z1] and vp,ivp,j, for each i,j[z] with ij. The path is called a vp,1vp,z path, since it starts from vp,1 and ends in vp,z. The set Σ(p) of symbols represented by p is defined as follows:

Σ(p)={cΣ:λ(vp,i)=c, for some i[z]}.

A path p in D (labeled by λ) is Σ-representing if Σ(p)=Σ (an example is given in Fig. 1); we denote the nodes of the path as V(p). When a path p contains a node labeled by symbol cΣ, we say that p covers c. We denote the length of p (the number of nodes in p) by |p|. A longest path in a graph is a path having maximum length. Finding a longest path in a graph is an NP-hard problem and even hard to approximate [14], but in DAGs it can be computed in linear time [22]. Now, we define the first problem we study in this paper (we assume that the labeling λ is surjective, so each symbol in Σ is associated with at least one node of the input graph D).

Problem 1.

(Σ-Representing Path)
Input: A DAG D=(V,A), with a source node sV, a target node tV, a labeling λ:VΣ.
Output: Is there an st-path in D that is Σ-representing?

Since there are cases where a DAG does not contain an st-path in D that is Σ-representing (see the example in Fig. 1), we consider a second problem, where we look for an st-path in D that contains the maximum number of distinct symbols of Σ.

Problem 2.

(Maximum Representing Path)
Input: A DAG D=(V,A), with a source node sV, a target node tV, a labeling λ:VΣ.
Output: An st-path in D that covers the maximum number of symbols in Σ.

Figure 1: A DAG having labeled nodes (labels are inside each node, node names outside). If all the arcs (including the dashed arc between v3 and v4) are in D there exists an st path that is Σ-representing: sv1v3v4t. If the dashed arc is not in the graph, there is no Σ-representing path in D.

Given a DAG D=(V,A) and a node vV, the degree of v is the number of arcs incoming to v or outgoing from v, that is

deg(v)=|{(u,v)A}{(v,u)A}|.

The transitive closure of D is a graph D=(V,A) where A is defined as follows:

A={(u,v):u,vV,uv and there is a path from u to v in D}.

Given two DAGs D1=(V1,A1) and D2=(V2,A2), with V1V2=, having source nodes s1, s2, respectively, and target nodes t1, t2, respectively, the concatenation of D1 and D2 is a DAG D=(V1V2,A1A2{(t1,s2)}). Informally D is obtained by adding an arc from t1 to s2. The definition can be easily extended to more than two DAGs, specifying the order of concatenation. Note that the definition of concatenation holds also for paths.

3 Hardness

In this section we present hardness results for the Σ-Representing Path and Maximum Representing Path problems. In Subsection 3.1 we show the NP-completeness of the Σ-Representing Path problem even if when each symbol labels at most three nodes of D, whose degree is bounded by three. In Subsection 3.2 we show two hardness of approximation results for the Maximum Representing Path problem: first we show that the problem is APX-hard even if each symbol labels at most three nodes of D, whose degree is bounded by three; then we show a stronger result on arbitrary instances, namely that Maximum Representing Path cannot be approximated within a factor of ee1α, for any constant α>0, unless NPDTIME(|V|O(loglog|V|)).

3.1 NP-completeness

In this subsection we prove that the Σ-Representing Path problem is NP-complete via a reduction from 3-SAT, a classical NP-complete problem (see, e.g., [11]). Our reduction holds even on restricted instances of the problems when each symbol labels at most three nodes of the input DAG D and each node has degree bounded by three.

We recall that 3-SAT, given a formula ϕ in conjunctive normal form over a set of variables X, where each clause is a disjunction of three literals (a variable or its negation), asks for an assignment of the variable set X so that each clause of ϕ is satisfied.

Theorem 1.

The Σ-Representing Path problem is NP-complete, even in the case when (1) each symbol labels at most three nodes of the DAG, that is, cΣ,|vV:λ(v)=c|3, and (2) the degree of each node is bounded by three.

Proof.

First, observe that the Σ-Representing Path problem is in the class NP since, given an st path, we can verify in polynomial time if the path contains each symbol of Σ.

Given an instance of the 3-SAT problem, that is a Boolean formula ϕ with n variables and m clauses, denote the n variables of the formula ϕ as x1,x2,,xn and the m clauses as C1,C2,,Cm. We construct an instance of Σ-Representing Path, that is, a labeled DAG (D=(V,A),λ), as follows (see an example in Fig. 2). First, assume without loss of generality that no variable xi appears only negated or nonnegated – otherwise said, every variable has at least one negated and one nonnegated occurrence in the formula ϕ. If there exists such a variable, we can simply assign it to True (if it is nonnegated) or to False (if it is negated) and remove all the clauses that contain the respective variable.

  • The alphabet. For each clause Ci, i[m], we add a corresponding symbol ci to the alphabet Σ. Moreover, for each variable xi, i[n], we add symbols xi,1, xi,2 to Σ.

  • The node set. For each variable xi, i[n], we add the following nodes in the node set. Assume that xi appears in k clauses, k11 times nonnegated and k21 times negated. Thus, k=k1+k2.

    We first add to the node set two nodes: si,ti. Then, we add nodes ui1,ui2,,uik1 and nodes vi1,vi2,,vik2. The source node node of D is s=s1 and the target node of D is t=tn. As for the labeling, λ(si)=xi,1, λ(ti)=xi,2; assuming xi, i[k1], has the j-th nonnegated appearance in clause Ch then λ(uij)=ch; assuming xi, i[k2], has the j-th negated appearance in clause Ch then λ(vij)=ch.

  • The arc set. For each i[n] we add the following arcs:

    • Arcs outgoing from si: (si,ui1), (si,vi1)

    • Arcs incoming in ti: (uik1,ti), (vik2,ti)

    • Arcs to connect two subgraphs associated with xi and xi+1, where i[n1]: (ti,si+1)

    • Arcs (uij,uij+1), with j[k11], and (vij,vij+1), with j[k21]

We show now that our reduction is correct. More precisely, we show that (1) each symbol labels at most three nodes of D, (2) the degree of D is bounded by three and (3) the formula ϕ has a satisfying assignment, if and only if, there is an st-path in D that contains every symbol of Σ.

Figure 2: A sketch of the DAG computed by the reduction from 3-SAT. We consider x1 to appear nonnegated in clauses C1, C2 and negated in clause C3; x2 to appear nonnegated in clause C1 and negated in clauses C3 and C4.

Consider (1) and note that each symbol xi,1, xi,2, i[n], labels exactly one node of D. Moreover, since each clause consists of three literals, then each of the symbols ci, i[m], labels exactly three nodes of D.

Consider (2). The nodes of D having degree larger than two, are possibly si and ti, i[n]. Each si has indegree at most one and outdegree two, since the arcs outgoing from si are (si,ui1) and (si,vi1), thus deg(si)3. Each ti has outdegree at most one and indegree two, since the arcs incoming to ti are (uik1,ti) and (vik2,vi1). Hence deg(ti)3.

Now, we prove (3). First, given a satisfying assignment of ϕ, we select the path in D as follows. For each i[n], if xi is True, then in the subgraph corresponding to the variable xi, we take the path siui1ui2uik1ti thus, covering the symbols xi,1, xi,2, and c1,c2,,ck1, associated with clauses where xi appears nonnegated; otherwise, if xi is assigned to False, we choose the path sivi1vi2vik2ti, thus covering the labels xi,1, xi,2, and c1,c2,,ck2, associated with clauses where xi appears negated. Between two subgraphs the path contains arc (ti,si+1), i[n1].

Observe that we take the path that covers precisely the symbols corresponding to the clauses satisfied by xi. Observe that the symbols xi,1, xi,2 that are associated with variables are always covered. Since the formula ϕ is satisfied by the assignment, the path obtained by concatenating the subpaths also covers all the symbols in the alphabet.

Conversely, assume that we are given an st-path in D that covers all the symbols in Σ. If the st-path contains the subpath siui1ui2uik1ti, i[n], then we set xi to True, otherwise we set xi to False. The crucial observation that was also mentioned in the first part of the reduction is that the symbols covered in the subgraph associated with variable xi correspond to the clauses satisfied by the assignment to xi. Since all the labels are covered, the assignment produced satisfies all the clauses.

3.2 Hardness of Approximation

The previous result (Theorem 1) can be extended to Maximum Representing Path. By reducing from Max 2-SAT, which is known to be APX-hard [12], we show that Maximum Representing Path is APX-hard even when each symbol labels at most two nodes of the input DAG and the degree of each node is bounded by three.

Corollary 2 ().

The Maximum Representing Path problem is APX-hard, even in the case when (1) each symbol labels at most two nodes of the input DAG, that is, cΣ,|vV:λ(v)=c|2, and (2) the degree of each node is bounded by three.

We now present the second inapproximability result, namely an approximation preserving reduction from the Max k-Cover problem (defined next), that cannot be approximated within factor ee1ε, for any constant ε>0, unless NPDTIME(|U|O(loglog|U|)).

Problem 3.

(Max k-Cover)
Input: A collection of n sets 𝒮={S1,S2,Sn} over a universe U and an integer k.
Output: k sets from 𝒮 whose union have the largest cardinality.

Now, we present our second inapproximability result.

Theorem 3 ().

The Maximum Representing Path problem cannot be approximated within a factor of ee1α, for any constant α>0, unless NPDTIME(|V|O(loglog|V|)).

Proof.

(Sketch) We present an approximation preserving reduction from Max k-Cover to Maximum Representing Path. Given an input instance (𝒮,k) of Max k-Cover, we construct the following instance of Maximum Representing Path. First, the set of nodes of the DAG D is defined as follows:

  • We add k+1 nodes v1,v2,vk+1, where s=v1 and t=vk+1, labeled with the same symbol a.

  • For each element xU, define a path p(x) of length |U|, and each node of p(x) is labeled with a distinct symbol xi, i[|U|].

  • Between two nodes vj, vj+1, j[k], add |𝒮| paths, each one associated with a set Si (denoted by p(Si)); each path associated with Si consists of the concatenation of paths p(xi,1),,p(xi,z), where xi,1,,xi,z are the elements in set Si.

Given a solution S1,,Sk of Max k-Cover on instance (U,𝒮) such that i=1k|Si|=h, we can compute in polynomial time a solution of Maximum Representing Path on instance D that covers |U|h+1 symbols, by defining, between each two nodes vj and vj+1, j[k], path p(Sj). For the other direction, given a solution p of Maximum Representing Path on instance D that covers |U|h+1 symbols, we can compute in polynomial time a solution S1,,Sk of Max k-Cover on instance (U,𝒮), by defining Sj as the set corresponding to the path between nodes vj and vj+1, j[k].

4 A Polynomial Time Algorithm for 𝚺-Representing Path when Each Symbol Labels at most Two Nodes

In this section we present a polynomial time exact algorithm (Algorithm 1) for the Σ-Representing Path problem when each symbol in Σ labels at most two nodes of the input DAG D. We first introduce the notion of compatible nodes.

Definition 4.

We say that two nodes u1,u2V are compatible if there exists an st-path that contains both nodes, that is either a path su1u2t or su2u1t. Otherwise, the nodes are called incompatible.

Algorithm 1 reduces the input instance of the Σ-Representing Path problem when each symbol labels at most two nodes to an instance of 2-SAT, which is know to be polynomial time solvable [24, 4].

Algorithm 1 A polynomial time exact algorithm for the Σ-Representing Path problem where each symbol labels at most two nodes of D.
Theorem 5 ().

The Σ-Representing Path problem is solvable in O(|V||A|) time in the case when each symbol labels at most two nodes of D, that is, cΣ,|uV:λ(u)=c|2.

5 Distance from Disjoint Paths

The Σ-Representing Path and the Maximum Representing Path problems are trivial if the DAG, after the removal of s and t, consists of a set of disjoint paths. Here we consider the two problems when parameterized by distance to disjoint paths and we show that Σ-Representing Path and Maximum Representing Path are W[1]-hard for this parameter. The distance to disjoint paths is defined as the minimum number of nodes to be removed from a graph (in this case D) such that the resulting graph consists of a set of node disjoint paths (except possibly for s and t). We prove the hardness results by giving a parameterized reduction from the Multicolored Clique problem, defined as follows.

Problem 4.

(Multicolored Clique)
Input: An undirected graph G=(W,E), whose nodes are partitioned into color classes W=W1W2Wk and each edge in E connects two nodes that are in a different color class.
Output: Is there a clique in G that contains exactly one node for each color class?

A clique in G that contains exactly one node for each color class is called a multicolored clique. Given an instance G=(W,E) of the Multicolored Clique problem (recall that the partition of W in color classes is given in input), we construct a corresponding instance (D,λ) of Σ-Representing Path. The DAG D consists of three subgraphs, that share some nodes (see Fig. 3 for a representation of the structure of D): (1) a DAG D1 with source node s1=s and target node t1, (2) A DAG D2 with source node s2=t1 and target node t2, and (3) A DAG D3 with source node s3=t2 and target node t3=t.

Figure 3: The structure of graph D, computed by the reduction from Multicolored Clique to Σ-Representing Path (dashed arcs represent DAGs between two nodes).

We first give an informal description of the reduction and then we present it formally. D1 and D2 encode a multicolored clique (symbols not covered by a path in D1 and D2 represent nodes and edges of a multicolored clique), D3 enables to cover all the symbols not selected in D1 and D2 (assuming a path selected in D1 and D2 contains all the symbols except those encoding a multicolored clique).

Figure 4: A sketch of the DAG D11 (we include also s, the souruce of D, and s12) computed by the reduction from Multicolored Clique to Σ-Representing Path.

Consider G=(W,E), where W={w1,1,,wk,|Wk|}, and wi,h, i[k] and h[|Wi|], represents the h-th node of color class Wi (we assume that the nodes in each color class have some ordering). We start by giving an informal description of D1, D2 and D3.

The DAG D1 consists of k concatenated DAGs D11, …, D1k (see Fig. 4 for a sketch of D11), where each D1i, i[k], is associated with a color class Wi. Each D1i, i[k], encodes the selection of exactly one node of Wi and it contains a path, denoted by D1(wi,h), for each node wi,hWi, with h[|Wi|]. D1i has a source node s1i and a target node t1i. Node s1i has arcs to the first node of each path D1(wi,h). Each D1(wi,h) contains the symbols associated with nodes in Wi, except for the symbols associated with wi,h.

The DAG D2 consists of k(k1) concatenated DAGs D21,2, …, D2k,k1 (see Fig. 5 for a sketch of D21,2), where each D2i,j, i,j[k], ij, is associated with edges connecting nodes of color class Wi and color class Wj. DAG D2i,j has a source node s2i,j and a target node t2i,j. For each edge {wi,h,wj,q}E, h[|Wi|] and q[|Wj|], D2i,j contains one path, denoted by D2(wi,h,wj,q) whose nodes are labeled by one symbol associated with wi,h, and set Li,j (encoding edges between nodes of Wi and of Wj) except for the symbol associated with edge {wi,h,wj,q}.

Figure 5: The first part of subgraph D21,2 computed by the reduction from Multicolored Clique. We assume that w1,1 is adjacent to w2,1, w2,j and w2,|W2|.

The DAG D3 (see Fig. 6) consists of k concatenated DAGs D31, …, D3k, each one associated with a color class Wi, i[k]. Each D3i, i[k], has a source node s3i and a target node t3i. Node s3i has arcs to |Wi| paths, one path D3(wi,h) for each node wi,hWi, h[|Wi|]. The nodes in D3(wi,h) have labels that encode wi,h and the edges incident in wi,h. The idea is that the symbols not covered by a path in D1 and D2 can be covered by a path in D3 only if the path in D1 and D2 contains all the symbols except those encoding edges of a multicolored clique in G and one symbol for each node in the multicolored clique.

Figure 6: The first subgraph of the DAG D3 computed by the reduction from Multicolored Clique, associated with the nodes in W1. We assume that w1,j is adjacent to w2,1, w3,2, thus D3(w1,j) is a path whose nodes have labels l2,1,1,j,l3,2,1,j (these labels are represented in the box of D3(w1,j)). We include also s3, the source of D3, and node s32.

Now, we present the details of the reduction. We start by defining the alphabet Σ and some subsets of Σ.

Σ= {ai:WiW,i[k]}{bi,h,q:wi,hWi,i[k],h[|Wi|],q[k]}
{li,h,j,q:i,j[k],ij,h[|Wi|],q[|Wj|]wi,hWiwj,qWj{wi,h,wj,q}E}.

We define sets B(wi,h), wi,hWi, and Bi, i[k], of symbols:

B(wi,h)=q[k]bi,h,q,Bi=h[|Wi|]B(wi,h).

Given i,j[k], with ij, we denote the subset Li,j of symbols as follows:

Li,j={li,h,j,q:h[|Wi|],q[|Wj|]wi,hWiwj,qWj{wi,h,wj,q}E}.

The set Li, i[k], is defined as follows:

Li=j[k]jiLi,j.

Note that Li,j and Lj,i are different subsets, in particular that li,h,j,qlj,q,i,h.

Now, we define the DAG D1. D1 has a source node s1=s and a target node t1, both labeled by a1. D1 is obtained by concatenating DAGs D1i, i[k], each one associated with a color class. Each D1i has a source node s1i and a target node t1i. For each i[k1], there exists an arc from t1i to s1i+1 (this defines the concatenation of subgraphs D1i). Now, we define each subgraph D1i, i[k]:

  • Each subgraph D1i has a source node s1i and a target node t1i, labeled by ai.

  • Node s1i is connected to |Wi| disjoint paths D1(wi,h), each one associated with a node wi,hWi.

  • Each path D1(wi,h) is labeled by symbols

    Biq[k]{bi,h,q}

Finally, there is an arc from node s1 to s11 and an arc from node t1k to node t1.

Next, we define the DAG D2. D2 has a source node s2=t1 and a target node t2, both labeled by symbol a2. D2 is obtained by concatenating DAGs D2i,j, with i,j[k] and ij, each one associated with Wi and Wj. Each D2i,j has a source node s2i,j and a target node t2i,j. We assume that DAGs D2i,j are concatenated as follows: for i,j[k] with ij, if j<k then there is an arc from the target t2i,j of D2i,j to the source s2i,j+1 of D2i,j+1, and if j=k (and i<k) there is an arc from the source s2i,j of D2i,j to the target t2i+1,1 of D2i+1,1.

Now, we define each subgraph D2i,j, i,j[k] and ij:

  • Each subgraph D2i,j has a source node s2i,j and a target node t2i,j, labeled by ai.

  • Node s2i,j is connected to paths D2(wi,h,wj,q), each one associated with a node wi,hWi and a node wj,qWj, such that {wi,h,wj,q}E.

  • Each path D2(wi,h,wj,q) is labeled by the set of symbols (recall that ij, hence symbol bi,h,i does not label any node on each path D2(wi,h,wj,q)):

    (Li,j{bi,h,j}){li,h,j,q}

Finally, there is an arc from node s2 to s21,2 and an arc from node t2k,k1 to node t2.

Now, we define the DAG D3. D3 has source s3 and target t3, both labeled by a3. D3 is obtained by concatenating DAGs D3i, i[k], each one associated with a color class. Each subgraph D3i, i[k], is defined as follows:

  • Each subgraph D3i has a source node s3i and a target node t3i, labeled by ai.

  • Between nodes s3i and t3i there are |Wi| disjoint paths, D3(wi,h), h[|Wi|], each one associated with a node wi,hWi.

  • The nodes in each path D3(wi,h) are labeled by the following set of symbols

    {wi,h,wj,q}E{lj,q,i,h}{bi,h,i}.

Note that the path D3(wi,h) contains nodes having labels {lj,q,i,h} that encode edges of E incident in wi,h. Note also that these nodes, for each edge {wi,h,wj,q}E, have labels {lj,q,i,h} not {li,h,j,q}.

Finally, there is an arc from t3i to s3i+1, with i[k1], an arc from s3 (the source of D3) to s31 and an arc from t3k to t3.

Having defined D and its labeling, we start to prove some properties of graph D1 and D2.

Lemma 6.

Consider an instance G of Multicolored Clique and a corresponding instance (D,λ) of Σ-Representing Path. Given a path p from s1 to t1 in D1, then p covers the following set of symbols:

  1. 1.

    Each symbol ai, i[k]

  2. 2.

    A set B defined as follows:

    B=i[k]Bi{bi,h,q:such that p traverses path D1(wi,h),q[k]}.

Proof.

Let p be a path from s1 to t1 in D1. Nodes s1i and t1i, i[k], must be traversed by any path in D1, hence also by p, and they are labeled by ai. Hence point 1 holds.

We prove now point 2. Consider in particular a DAG D1i, i[k], and the subpath pi of p in D1i. By construction, pi traverses exactly one of D1(wi,h), with h[|Wi|], between s1i and t1i. Also note that by construction a subgraph of D1(wi,h) contains the symbols Bi, except for the symbols bi,h,q, q[k], that by construction do not label any node of D1(wi,h), thus point 2 holds.

Lemma 7.

Consider an instance G of Multicolored Clique and a corresponding instance (D,λ) of Σ-Representing Path. A path p in D that covers all the symbols in Σ contains a path p2 from s2 to t2 in D2 such that:

  1. 1.

    For each D1(wi,h) traversed by p in D1, p2 traverses a subgraph D2(wi,h,wj,q), for some j[k], q[|Wj|] and covers a set B′′=i,q[k],iq,h[|Wi|]{bi,h,q: such that p traverses path D1(wi,h)}

  2. 2.

    For each i[k], p2 covers a set LiLi of symbols, where Li=LiNi and Ni is defined as follows:

    Ni=h,j,q such that p2 traverses subgraph D2(wi,h,wj,q){li,h,j,q}.

Proof.

Let p be a path from s to t that covers all the symbols in Σ. First, consider point 1. By Lemma 6, the path p in D1 does not cover the set of symbols bi,h,q, h[|Wi|], q[k], such that D1(wi,h) is traversed by p. Each symbol bi,h,q, with i,q[k], iq and h[|Wi|], labels only nodes of D1i and D2, thus if it is not covered by p in D1i it must be covered in D2. It follows that for each D1(wi,h) traversed by p in D1, p2 traverses a subgraph D2(wi,h,wj,q). This implies that p2 covers

B′′=i,h,q,i,q[k],iq,h[|Wi|]{bi,h,q: such that p traverses path D1(wi,h)}

and point 1 is proven.

Now, we consider point 2. In each D2i,j,i,j[k] and ij, path p traverses exactly one subpath D2(wi,h,wj,q), for some h[|Wi|] and q[|Wj|], whose nodes have labels Li,j{li,h,j,q}. Then p in D2i covers a set LiLi of symbols, where Li=LiNi and

Ni=h,q with D2(wi,h,wj,q) traversed by p in D2i,j{li,h,j,q}

hence point 2 is proven. Based on Lemma 6 and Lemma 7, we can prove the main result of this section.

Lemma 8 ().

Consider an instance G of Multicolored Clique and a corresponding instance (D,λ) of Σ-Representing Path. Then, G contains a multicolored clique if and only if there exists a path in D that covers all the symbols in Σ.

D has distance from a set of disjoint paths bounded by 2k(k1)+4k, since by removing the source and target node of each D1i, with i[k], of each D2i,j, with i,j[k] and ij, and of each D3i, with i[k], we obtain a set of disjoint paths. Since Multicolored Clique is W[1]-hard when parameterized by k [10], we can prove the following theorem.

Theorem 9 ().

Σ-Representing Path is W[1]-hard when parameterized by distance to disjoint paths.

We extend the result of Theorem 9 to the Maximum Representing Path problem.

Corollary 10 ().

Maximum Representing Path is W[1]-hard when parameterized by distance to disjoint paths.

6 An Approximation Algorithm

In this section we present a polynomial time approximation algorithm for the Maximum Representing Path problem that achieves an approximation factor of OPT, where OPT is the number of distinct symbols in an optimal solution. Notice that, since OPT|Σ|, our algorithm is also a |Σ|-approximation. Informally, the algorithm is as follows. First, we create a compatibility DAG D=(V,A), that is essentially the transitive closure of the input DAG D (see Section 2 for the definition of transitive closure). Then, we consider a total order on Σ, e.g., standard alphabetic order, such that we have λ(u)<λ(v), for two nodes u and v if the label of u precedes the label of v based on this order. We create two subgraphs of D (that are implicitly DAGs), D1 and D2 as follows:

  1. 1.

    D1=(V,A1) where (v1,v2)A1 if and only if (v1,v2)A and λ(v1)<λ(v2)

  2. 2.

    D2=(V,A2) where (v1,v2)A2 if and only if (v1,v2)A and λ(v1)>λ(v2).

Observe that A1,A2A and that the set of nodes of D1 and D2 is V. We then compute p1, a longest path between s and t in D1, and p2, a longest path between s and t in D2. As pointed out in Section 2, a longest path in a DAG can be computed in linear time. The algorithm (formally presented in Algorithm 2) outputs the path pi, i{1,2}, that has a largest number of distinct symbols. Theorem 11 proves that it is indeed a OPT-approximation for Maximum Representing Path.

Algorithm 2 A OPT-approximation algorithm for Maximum Representing Path.
Theorem 11.

Algorithm 2 is a OPT-approximation for the Maximum Representing Path problem and requires O(|V||A|) time.

Proof.

First, we prove that Algorithm 2 computes a feasible solution of Maximum Representing Path, that is, the path p returned by the Algorithm 2 is an st-path in D. Consider the DAG D=(V,A). Since D is the transitive closure of D, it follows that for any path in D there exists a corresponding path in D, since for any arc (u,v)A, there is a path p(u,v) from u to v in D. Thus, given a path p in D, we can compute a corresponding path p in D by concatenating the paths p(u,v) associated with arcs (u,v) in p. Since D1 and D2 are subgraphs of D, in particular they have the same set of nodes and a subset of the arcs of D, any path in D1 or D2 corresponds to a path in D, hence also in D. Moreover, we add s (t, respectively) to the returned path if s (or t, respectively) is not part of the path p, and possibly a path from s to the first node of p (a path from the last node of p to t, respectively), hence the algorithm returns an st path in D.

We show now the approximation factor of Algorithm 2. Let po be an optimal solution of the Maximum Representing Path problem, that is, a path po such that |Σ(po)|=OPT is maximized. Let Vo={v1,v2,vOPT} be a subset of the nodes that appear on the path po in this order and have pairwise distinct labels, that is λ(vi)λ(vj),1i<jOPT.

According to the Erdös-Szekeres theorem [9], every sequence of z2+1 distinct integers contains a monotonic (increasing or decreasing) sequence of length z+1. Thus, the string associated with po contains a a monotonic (increasing or decreasing) sequence of length at least OPT, hence the path po contains kOPT nodes vi1,vi2,,vik, such that vix appears before viy in p, for x<y and x,y[k], and either λ(vi1)<λ(vi2)<<λ(vik) or λ(vi1)>λ(vi2)>>λ(vik). Notice that since po is a path in D, for any two nodes vi,vjpo such that i<j, we have (vi,vj)A. Thus, the path vi1,vi2,,vik is either a path in D1 or in D2, thus it has length not larger than that of p1 or of p2, respectively. Since the path p returned by Algorithm 2 is obtained by taking the nodes in one of {p1,p2} that covers the maximum number of symbols (and possibly adding other nodes), and each pi, i{1,2}, contains nodes with distinct labels, then |Σ(p)|kOPT. Thus, Algorithm 2 is a OPT-approximation algorithm for Maximum Representing Path.

We consider now the time complexity of Algorithm 2. Step 1 can be computed in O(|V||A|) time [23] and |A| contains O(|V|2) arcs. Step 2 and Step 3, can be computed in O(|V|+|A|) time by traversing D. Step 4 and 5 can be computed in O(|V|+|A1|) and O(|V|+|A2|) time [26], respectively, where O(|V|+|A1|) and O(|V|+|A2|) are bounded by O(|V|+|A|). Thus the overall time complexity is O(|V||A|+|V|2) and, since we can assume that no node is isolated, then |A||V|1, thus the overall time complexity is O(|V||A|).

7 Conclusion

In this contribution we have introduced two combinatorial problems (Σ-Representing Path and Maximum Representing Path) that ask to identify a path in a node labeled DAG that contains all (or a subset of maximum size) of the alphabet symbols. We have proved results on the computational complexity and parameterized complexity of the two problems, and we have studied the approximation of Maximum Representing Path.

One of the most interesting future directions is to further investigate the approximate complexity of the Maximum Representing Path problem: does it admit constant factor approximation algorithms? It is interesting to design approximation algorithms for some restrictions on the DAG structure, for example when the degree is bounded. It is also interesting to study other structural properties of the DAG that may lead to polynomial-time algorithms for both problems.

References

  • [1] Tatsuya Akutsu. A linear time pattern matching algorithm between a string and a tree. In Combinatorial Pattern Matching, 4th Annual Symposium, CPM 93, Padova, Italy, June 2-4, 1993, Proceedings, pages 1–10, 1993. doi:10.1007/BFb0029792.
  • [2] Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. Journal of Algorithms, 35(1):82–99, 2000. doi:10.1006/jagm.1999.1063.
  • [3] Yuichi Asahiro, Hiroshi Eto, Mingyang Gong, Jesper Jansson, Guohui Lin, Eiji Miyano, Hirotaka Ono, and Shunichi Tanaka. Approximation algorithms for the longest run subsequence problem. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 2:1–2:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.2.
  • [4] Bengt Aspvall, Michael F. Plass, and Robert Endre Tarjan. A linear-time algorithm for testing the truth of certain quantified boolean formulas. Inf. Process. Lett., 8(3):121–123, 1979. doi:10.1016/0020-0190(79)90002-4.
  • [5] Jasmijn A. Baaijens, Paola Bonizzoni, Christina Boucher, Gianluca Della Vedova, Yuri Pirola, Raffaella Rizzi, and Jouni Sirén. Computational graph pangenomics: a tutorial on data structures and their applications. Nat. Comput., 21(1):81–108, 2022. doi:10.1007/S11047-022-09882-6.
  • [6] Riccardo Dondi, Giancarlo Mauri, and Italo Zoppis. On the complexity of approximately matching a string to a directed graph. Inf. Comput., 288:104748, 2022. doi:10.1016/J.IC.2021.104748.
  • [7] Riccardo Dondi and Florian Sikora. The longest run subsequence problem: Further complexity results. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 14:1–14:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.CPM.2021.14.
  • [8] Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Trans. Algorithms, 19(3):21:1–21:25, 2023. doi:10.1145/3588334.
  • [9] Paul Erdös and George Szekeres. A combinatorial problem in geometry. Compositio mathematica, 2:463–470, 1935.
  • [10] Michael R. Fellows, Danny Hermelin, Frances A. Rosamond, and Stéphane Vialette. On the parameterized complexity of multiple-interval graph problems. Theor. Comput. Sci., 410(1):53–61, 2009. doi:10.1016/J.TCS.2008.09.065.
  • [11] Michael R Garey and David S Johnson. Computers and intractability, volume 174. freeman San Francisco, 1979.
  • [12] Johan Håstad. Some optimal inapproximability results. J. ACM, 48(4):798–859, 2001. doi:10.1145/502090.502098.
  • [13] Chirag Jain, Haowen Zhang, Yu Gao, and Srinivas Aluru. On the complexity of sequence to graph alignment. In Lenore J. Cowen, editor, Research in Computational Molecular Biology - 23rd Annual International Conference, RECOMB 2019, Washington, DC, USA, May 5-8, 2019, Proceedings, volume 11467 of Lecture Notes in Computer Science, pages 85–100. Springer, 2019. doi:10.1007/978-3-030-17083-7_6.
  • [14] David R. Karger, Rajeev Motwani, and G. D. S. Ramkumar. On approximating the longest path in a graph. Algorithmica, 18(1):82–98, 1997. doi:10.1007/BF02523689.
  • [15] Manuel Lafond, Wenfeng Lai, Adiesha Liyanage, and Binhai Zhu. The longest subsequence-repeated subsequence problem. In Weili Wu and Jianxiong Guo, editors, Combinatorial Optimization and Applications - 17th International Conference, COCOA 2023, Hawaii, HI, USA, December 15-17, 2023, Proceedings, Part I, volume 14461 of Lecture Notes in Computer Science, pages 446–458. Springer, 2023. doi:10.1007/978-3-031-49611-0_32.
  • [16] Wenfeng Lai, Adiesha Liyanage, Binhai Zhu, and Peng Zou. The longest letter-duplicated subsequence and related problems. Acta Informatica, 61(3):315–329, 2024. doi:10.1007/S00236-024-00459-7.
  • [17] Udi Manber and Sun Wu. Approximate string matching with arbitrary cost for text and hypertext. In Advances in Structural and Syntactic Pattern Recognition, pages 22–33, 1992. doi:10.1142/9789812797919_0002.
  • [18] Gonzalo Navarro. Improved approximate pattern matching on hypertext. Theoretical Computer Science, 237(1-2):455–463, 2000. doi:10.1016/S0304-3975(99)00333-3.
  • [19] Ngan Nguyen, Glenn Hickey, Daniel R. Zerbino, Brian J. Raney, Dent Earl, Joel Armstrong, W. James Kent, David Haussler, and Benedict Paten. Building a pan-genome reference for a population. Journal of Computational Biology, 22(5):387–401, 2015. doi:10.1089/cmb.2014.0146.
  • [20] Kunsoo Park and Dong Kyue Kim. String matching in hypertext. In Zvi Galil and Esko Ukkonen, editors, Combinatorial Pattern Matching, 6th Annual Symposium, CPM 95, Espoo, Finland, July 5-7, 1995, Proceedings, volume 937 of Lecture Notes in Computer Science, pages 318–329. Springer, 1995. doi:10.1007/3-540-60044-2_51.
  • [21] Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, and Gunnar W. Klau. Using the longest run subsequence problem within homology-based scaffolding. Algorithms Mol. Biol., 16(1):11, 2021. doi:10.1186/S13015-021-00191-8.
  • [22] Robert Sedgewick and Kevin Wayne. Algorithms (Fourth edition deluxe). Addison-Wesley, 2016.
  • [23] Steven Skiena. The Algorithm Design Manual, Third Edition. Texts in Computer Science. Springer, 2020. doi:10.1007/978-3-030-54256-6.
  • [24] Robert Endre Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146–160, 1972. doi:10.1137/0201010.
  • [25] The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118–135, 2018. doi:10.1093/bib/bbw089.
  • [26] David P. Williamson and David B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, 2011. URL: http://www.cambridge.org/de/knowledge/isbn/item5759340/?site_locale=de_DE.