Representing Paths in Digraphs

Dondi, Riccardo; Popa, Alexandru

doi:10.4230/LIPIcs.CPM.2025.1

Riccardo Dondi

Università degli Studi di Bergamo, Italy

Alexandru Popa

Department of Computer Science, University of Bucharest, Romania

Abstract

In this contribution we consider two combinatorial problems related to graph string matching, motivated by recent approaches in computational genomics. Given a DAG where each node is labeled by a symbol, the problems aim to find a path in the DAG whose nodes contain all (or the maximum number of) symbols of the alphabet. We introduce a decision problem, $\Sigma$-Representing Path, that asks whether there exists a path that contains all the symbols of the alphabet, and an optimization problem, called Maximum Representing Path, that asks for a path that contains the maximum number of symbols. We analyze the complexity of the problems, showing the NP-completeness of $\Sigma$-Representing Path when each symbol labels at most three nodes in the DAG, and showing the APX-hardness of Maximum Representing Path when each symbol labels at most two nodes in the DAG. We complement the first result by giving a polynomial-time algorithm for $\Sigma$-Representing Path when each symbol labels at most two nodes in the DAG. Then we investigate the parameterized complexity of the two problems when the DAG has a limited distance from a set of disjoint paths and we show that both problems are W[1]-hard for this parameter. We consider the approximation of Maximum Representing Path, giving an approximation algorithm of factor $\sqrt{OPT}$, where $OPT$ is the value of an optimal solution of the problem. We also show that Maximum Representing Path cannot be approximated within factor $\frac{e}{e-1}-\alpha$, for any constant $\alpha>0$, unless $NP\subseteq DTIME(|V|^{O(\log\log|V|)})$ ($V$ is the set of nodes of the DAG).

Keywords and phrases:

Graph String Matching, Computational Complexity, Parameterized Complexity, Algorithms

2012 ACM Subject Classification:

Theory of computation $\rightarrow$ Parameterized complexity and exact algorithms; Theory of computation $\rightarrow$ Design and analysis of algorithms; Theory of computation $\rightarrow$ Graph algorithms analysis

DOI:

10.4230/LIPIcs.CPM.2025.1

Event:

36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Editors:

Paola Bonizzoni and Veli Mäkinen

Series and Publisher:

Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik

1 Introduction

Graph string matching and approximate matching have been widely investigated due to their application in the context of pattern matching of hypertexts [1, 2, 20, 17] and pangenome analysis [19, 25, 5]. The problems consider a query string $s$ and a directed graph $D$, whose nodes are labeled with symbols or strings. A path or a walk in $D$ is associated with a string that is obtained by concatenating the symbols or the strings on the path (or walk) nodes. The problems ask whether there exists a path or a walk that matches, or approximately matches, the query string. In exact matching the two strings have to be identical, while in approximate matching edit operations may be applied to the query string or to the node labels in order to obtain two identical strings, and the number of such edit operations has to be minimized.

Both matching and approximate matching can be solved in polynomial time if the input graph is a Directed Acyclic Graph (DAG) [17]. When the input digraph admits cycles the exact matching is solvable in polynomial time [1, 2, 20, 18, 13] and conditional lower bounds [8] show that improving the algorithms known in the literature is unlikely. The approximate matching is solvable in polynomial time when edit operations are applied only in the query string [13] and it is instead NP-hard when the edit operations may be applied on node labels [2], also for binary alphabet [13, 6].

A second research direction that has been recently investigated in sequence analysis, in the context of computational genomics, is the quest for subsequences of a given string that contain all symbols of the alphabet (and that have other specific properties) [21, 7, 15, 3, 16]. The property that each symbol of the alphabet is contained in a subsequence of a given string $s$ is applied in [21], which looks for a run subsequence of maximum length, where a run subsequence contains at least one substring for each symbol of the alphabet. Another approach is considered in [15, 16]: it asks whether there exists a subsequence of $s$ that consists of substrings of repeated symbols, each of length at least two, with the constraint that the subsequence contains each symbol of the alphabet. Both problems are NP-hard [21, 16] and results on their tractability have been given [21, 7, 3, 15, 16].

In this contribution, we introduce new problems that aim to integrate the two aforementioned approaches. On the one hand we consider a DAG having nodes labeled by symbols over an alphabet $\Sigma$ , since this allows us to represent variants of a sequence as in graph string matching. On the other hand we look for a path that contains all (or the maximum number of) symbols of $\Sigma$ , in a similar way to the second approach. If there exists a path in the DAG that contains all the symbols of the alphabet, we call this a representing path. We introduce a decision problem, called $\Sigma$ -Representing Path, where we ask whether there exists a representing path in a DAG $D$ . Since in some cases a representing path may not exist, we consider an optimization variant, called Maximum Representing Path, where we look for a path in $D$ that represents the maximum number of symbols in $\Sigma$ , that is the string associated with the path contains the maximum number of symbols of $\Sigma$ . Next, we summarize our results.

We start in Section 2 by presenting some concepts and by formally defining the two problems, $\Sigma$-Representing Path and Maximum Representing Path. Then we investigate the complexity of the problems and we prove in Section 3 that $\Sigma$-Representing Path is NP-complete even when each symbol labels at most three nodes of the input DAG, and Maximum Representing Path is APX-hard when each symbol labels at most two nodes of the DAG. Moreover, in both aforementioned cases, the input DAG also has maximum degree bounded by three. In Section 3 we also prove a lower bound on the approximation: Maximum Representing Path cannot be approximated within factor $\frac{e}{e-1}-\alpha$, for any constant $\alpha>0$, unless $NP\subseteq DTIME(|V|^{O(\log\log|V|)})$, where $V$ is the set of nodes of the input DAG. We complement the first hardness result by showing in Section 4 that $\Sigma$-Representing Path can be solved in polynomial time when each symbol labels at most two nodes of $D$.

Then we study in Section 5 how the complexity of the two problems is influenced by the structure of $D$. Observe that if $D$ consists of a set of disjoint paths (except for the source and the target nodes of $D$), then the two problems are easy to solve by inspecting each path independently. We show that $\Sigma$-Representing Path and Maximum Representing Path are W[1]-hard for the parameter distance to disjoint paths. Finally, in Section 6 we present an approximation algorithm for the Maximum Representing Path problem of factor $\sqrt{OPT}$, where $OPT$ is the value of an optimal solution of Maximum Representing Path. We conclude the paper in Section 7 by pointing out some future directions. Some of the proofs are not included due to the page limit (marked with $(^{*})$).

2 Preliminaries

We introduce some notation. For a natural number $n\in\mathbb{N}$, we denote $[n]=\{1,\dots,n\}$. A Directed Acyclic Graph (DAG) $D=(V,A)$ is a directed graph consisting of a set $V$ of nodes and a set $A\subseteq\{(u,v):u,v\in V\}$ of arcs, such that there is no directed cycle in $D$. Given a finite alphabet $\Sigma$ and a DAG $D=(V,A)$, we define a labeling function $\lambda$ that associates a symbol with each node of $D$, that is, $\lambda:V\rightarrow\Sigma$.

A path $p$ in $D$ is a sequence $v_{p,1}\dots v_{p,z}$ ( $v_{p,i}$ represents the $i$ -th node of path $p$ ) of adjacent distinct nodes of $V$ , that is $(v_{p,i},v_{p,i+1})\in A$ for $i\in[z-1]$ and $v_{p,i}\neq v_{p,j}$ , for each $i,j\in[z]$ with $i\neq j$ . The path is called a $v_{p,1}-v_{p,z}$ path, since it starts from $v_{p,1}$ and ends in $v_{p,z}$ . The set $\Sigma(p)$ of symbols represented by $p$ is defined as follows:

\Sigma(p)=\{c\in\Sigma:\lambda(v_{p,i})=c,\text{ for some }i\in[z]\}.

A path $p$ in $D$ (labeled by $\lambda$ ) is $\Sigma$ -representing if $\Sigma(p)=\Sigma$ (an example is given in Fig. 1); we denote the nodes of the path as $V(p)$ . When a path $p$ contains a node labeled by symbol $c\in\Sigma$ , we say that $p$ covers $c$ . We denote the length of $p$ (the number of nodes in $p$ ) by $|p|$ . A longest path in a graph is a path having maximum length. Finding a longest path in a graph is an NP-hard problem and even hard to approximate [14], but in DAGs it can be computed in linear time [22]. Now, we define the first problem we study in this paper (we assume that the labeling $\lambda$ is surjective, so each symbol in $\Sigma$ is associated with at least one node of the input graph $D$ ).

Problem 1.

( $\Sigma$ -Representing Path)
Input: A DAG $D=(V,A)$ , with a source node $s\in V$ , a target node $t\in V$ , a labeling $\lambda:V\rightarrow\Sigma$ .
Output: Is there an $s-t$ -path in $D$ that is $\Sigma$ -representing?

Since there are cases where a DAG does not contain an $s-t$ -path in $D$ that is $\Sigma$ -representing (see the example in Fig. 1), we consider a second problem, where we look for an $s-t$ -path in $D$ that contains the maximum number of distinct symbols of $\Sigma$ .

Problem 2.

(Maximum Representing Path)
Input: A DAG $D=(V,A)$ , with a source node $s\in V$ , a target node $t\in V$ , a labeling $\lambda:V\rightarrow\Sigma$ .
Output: An $s-t$ -path in $D$ that covers the maximum number of symbols in $\Sigma$ .
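The two problem definitions can be made concrete with a brute-force sketch in Python. It enumerates every $s-t$ path, so it runs in exponential time in general and is meant only to illustrate $\Sigma(p)$ and the two outputs on tiny inputs. The function names and the small instance below are assumptions introduced for illustration; the instance is loosely modeled on the situation described for Fig. 1 (a dashed arc from $v_{3}$ to $v_{4}$), but its labels are hypothetical and are not taken from the figure.

```python
from typing import Dict, Hashable, Iterator, List

Node = Hashable


def st_paths(succ: Dict[Node, List[Node]], s: Node, t: Node) -> Iterator[List[Node]]:
    """Yield every s-t path of a DAG given as successor lists (exponentially many in general)."""
    stack = [[s]]
    while stack:
        path = stack.pop()
        if path[-1] == t:
            yield path
            continue
        for nxt in succ.get(path[-1], []):
            stack.append(path + [nxt])


def symbols(path, label):
    """Sigma(p): the set of symbols labeling the nodes of the path."""
    return {label[v] for v in path}


def has_representing_path(succ, label, s, t) -> bool:
    """Decision problem: is there an s-t path covering every symbol?"""
    sigma = set(label.values())  # the labeling is assumed surjective
    return any(symbols(p, label) == sigma for p in st_paths(succ, s, t))


def max_representing_path(succ, label, s, t):
    """Optimization problem: an s-t path covering the most symbols (None if no s-t path)."""
    return max(st_paths(succ, s, t), key=lambda p: len(symbols(p, label)), default=None)


# Hypothetical instance: labels are illustrative, not those of Fig. 1.
label = {"s": "a", "v1": "b", "v2": "c", "v3": "c", "v4": "d", "t": "a"}
without_dashed = {"s": ["v1", "v2"], "v1": ["v3"], "v2": ["v4"], "v3": ["t"], "v4": ["t"]}
with_dashed = dict(without_dashed, v3=["t", "v4"])  # adds the arc (v3, v4)
```

With the arc $(v_{3},v_{4})$ the path $sv_{1}v_{3}v_{4}t$ covers all four symbols of this toy alphabet; without it, every $s-t$ path misses at least one symbol, which is the case Maximum Representing Path is designed for.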

Figure 1: A DAG having labeled nodes (labels are inside each node, node names outside). If all the arcs (including the dashed arc between $v_{3}$ and $v_{4}$) are in $D$, there exists an $s-t$ path that is $\Sigma$-representing: $sv_{1}v_{3}v_{4}t$. If the dashed arc is not in the graph, there is no $\Sigma$-representing path in $D$.

Given a DAG $D=(V,A)$ and a node $v\in V$ , the degree of $v$ is the number of arcs incoming to $v$ or outgoing from $v$ , that is

deg(v)=|\{(u,v)\in A\}\cup\{(v,u)\in A\}|.

The transitive closure of $D$ is a graph $D^{\prime}=(V,A^{\prime})$ where $A^{\prime}$ is defined as follows:

A^{\prime}=\{(u,v):u,v\in V,u\neq v\text{ and there is a path from $u$ to $v$ in $D$}\}.
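As an illustration of this definition, the following Python sketch computes $A^{\prime}$ by running one graph search from each node. The function name and the input format (a node list and a list of arc pairs) are assumptions made for the example; the running time of this simple approach is $O(|V|(|V|+|A|))$.

```python
def transitive_closure(nodes, arcs):
    """Return the arc set A' of the transitive closure of the DAG (nodes, arcs)."""
    succ = {v: [] for v in nodes}
    for u, v in arcs:
        succ[u].append(v)
    closure = set()
    for u in nodes:
        seen = set()              # nodes reachable from u by a path with at least one arc
        stack = list(succ[u])
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(succ[v])
        # In a DAG, u is never reachable from itself, so u != v holds for every pair added.
        closure.update((u, v) for v in seen)
    return closure
```

For instance, on the nodes `[1, 2, 3, 4]` with arcs `[(1, 2), (2, 3), (1, 4)]` the closure adds the single arc `(1, 3)` to the original three.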

Given two DAGs $D_{1}=(V_{1},A_{1})$ and $D_{2}=(V_{2},A_{2})$ , with $V_{1}\cap V_{2}=\emptyset$ , having source nodes $s_{1}$ , $s_{2}$ , respectively, and target nodes $t_{1}$ , $t_{2}$ , respectively, the concatenation of $D_{1}$ and $D_{2}$ is a DAG $D=(V_{1}\cup V_{2},A_{1}\cup A_{2}\cup\{(t_{1},s_{2})\})$ . Informally $D$ is obtained by adding an arc from $t_{1}$ to $s_{2}$ . The definition can be easily extended to more than two DAGs, specifying the order of concatenation. Note that the definition of concatenation holds also for paths.

3 Hardness

In this section we present hardness results for the $\Sigma$-Representing Path and Maximum Representing Path problems. In Subsection 3.1 we show the NP-completeness of the $\Sigma$-Representing Path problem even when each symbol labels at most three nodes of $D$ and the degree of each node is bounded by three. In Subsection 3.2 we show two hardness of approximation results for the Maximum Representing Path problem: first we show that the problem is APX-hard even if each symbol labels at most two nodes of $D$ and the degree of each node is bounded by three; then we show a stronger result on arbitrary instances, namely that Maximum Representing Path cannot be approximated within a factor of $\frac{e}{e-1}-\alpha$, for any constant $\alpha>0$, unless $NP\subseteq DTIME(|V|^{O(\log\log|V|)})$.

3.1 NP-completeness

In this subsection we prove that the $\Sigma$-Representing Path problem is NP-complete via a reduction from 3-SAT, a classical NP-complete problem (see, e.g., [11]). Our reduction holds even on restricted instances of the problem, where each symbol labels at most three nodes of the input DAG $D$ and each node has degree bounded by three.

We recall that 3-SAT, given a formula $\phi$ in conjunctive normal form over a set of variables $X$ , where each clause is a disjunction of three literals (a variable or its negation), asks for an assignment of the variable set $X$ so that each clause of $\phi$ is satisfied.

Theorem 1.

The $\Sigma$-Representing Path problem is NP-complete, even in the case when (1) each symbol labels at most three nodes of the DAG, that is, $\forall c\in\Sigma,|\{v\in V:\lambda(v)=c\}|\leq 3$, and (2) the degree of each node is bounded by three.

Proof.

First, observe that the $\Sigma$-Representing Path problem is in the class $NP$ since, given an $s-t$ path, we can verify in polynomial time if the path contains each symbol of $\Sigma$.

Given an instance of the 3-SAT problem, that is a Boolean formula $\phi$ with $n$ variables and $m$ clauses, denote the $n$ variables of the formula $\phi$ as $x_{1},x_{2},\dots,x_{n}$ and the $m$ clauses as $C_{1},C_{2},\dots,C_{m}$ . We construct an instance of $\Sigma$ -Representing Path, that is, a labeled DAG $(D=(V,A),\lambda)$ , as follows (see an example in Fig. 2). First, assume without loss of generality that no variable $x_{i}$ appears only negated or nonnegated – otherwise said, every variable has at least one negated and one nonnegated occurrence in the formula $\phi$ . If there exists such a variable, we can simply assign it to True (if it is nonnegated) or to False (if it is negated) and remove all the clauses that contain the respective variable.

$\blacksquare$ The alphabet. For each clause $C_{i}$, $i\in[m]$, we add a corresponding symbol $c_{i}$ to the alphabet $\Sigma$. Moreover, for each variable $x_{i}$, $i\in[n]$, we add symbols $x_{i,1}$, $x_{i,2}$ to $\Sigma$.

$\blacksquare$ The node set. For each variable $x_{i}$, $i\in[n]$, we add the following nodes to the node set. Assume that $x_{i}$ appears in $k$ clauses, $k_{1}\geq 1$ times nonnegated and $k_{2}\geq 1$ times negated. Thus, $k=k_{1}+k_{2}$.

We first add to the node set two nodes: $s_{i},t_{i}$. Then, we add nodes $u^{1}_{i},u^{2}_{i},\dots,u^{k_{1}}_{i}$ and nodes $v^{1}_{i},v^{2}_{i},\dots,v^{k_{2}}_{i}$. The source node of $D$ is $s=s_{1}$ and the target node of $D$ is $t=t_{n}$. As for the labeling, $\lambda(s_{i})=x_{i,1}$, $\lambda(t_{i})=x_{i,2}$; if the $j$-th nonnegated appearance of $x_{i}$, $j\in[k_{1}]$, is in clause $C_{h}$, then $\lambda(u_{i}^{j})=c_{h}$; if the $j$-th negated appearance of $x_{i}$, $j\in[k_{2}]$, is in clause $C_{h}$, then $\lambda(v_{i}^{j})=c_{h}$.

$\blacksquare$ The arc set. For each $i\in[n]$ we add the following arcs:
- Arcs outgoing from $s_{i}$: $(s_{i},u^{1}_{i})$, $(s_{i},v^{1}_{i})$
- Arcs incoming to $t_{i}$: $(u^{k_{1}}_{i},t_{i})$, $(v^{k_{2}}_{i},t_{i})$
- Arcs to connect two subgraphs associated with $x_{i}$ and $x_{i+1}$, where $i\in[n-1]$: $(t_{i},s_{i+1})$
- Arcs $(u^{j}_{i},u^{j+1}_{i})$, with $j\in[k_{1}-1]$, and $(v^{j}_{i},v^{j+1}_{i})$, with $j\in[k_{2}-1]$
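The construction above can be sketched in Python as follows. This is an illustrative sketch, not code from the paper: clauses are encoded in a DIMACS-like style assumed here (the integer `i` stands for $x_{i}$ and `-i` for its negation), node and symbol names are tuples chosen for the example, and the helper `has_representing_path` is a brute-force check that enumerates the $2^{n}$ source-target paths, so it is only suitable for very small formulas. As in the proof, every variable is assumed to occur both negated and nonnegated.

```python
def build_instance(n, clauses):
    """Build (label, arcs, s, t) from a 3-CNF formula over variables 1..n."""
    label, arcs = {}, []
    for i in range(1, n + 1):
        pos = [h for h, C in enumerate(clauses, 1) if i in C]    # clauses containing x_i
        neg = [h for h, C in enumerate(clauses, 1) if -i in C]   # clauses containing not x_i
        assert pos and neg, "each variable must occur negated and nonnegated"
        label[("s", i)] = ("x", i, 1)
        label[("t", i)] = ("x", i, 2)
        for name, occ in (("u", pos), ("v", neg)):
            chain = [(name, i, j) for j in range(1, len(occ) + 1)]
            for node, h in zip(chain, occ):
                label[node] = ("c", h)               # j-th occurrence lies in clause C_h
            arcs.append((("s", i), chain[0]))
            arcs.extend(zip(chain, chain[1:]))
            arcs.append((chain[-1], ("t", i)))
        if i < n:
            arcs.append((("t", i), ("s", i + 1)))    # concatenate consecutive gadgets
    return label, arcs, ("s", 1), ("t", n)


def has_representing_path(label, arcs, s, t):
    """Brute-force check (exponential): does some s-t path cover every symbol?"""
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, []).append(v)
    sigma = set(label.values())
    stack = [(s, frozenset([label[s]]))]
    while stack:
        node, covered = stack.pop()
        if node == t:
            if covered == sigma:
                return True
            continue
        for nxt in succ.get(node, []):
            stack.append((nxt, covered | {label[nxt]}))
    return False


# (x1 or x2 or x3) and (not x1 or not x2 or not x3) and (x1 or not x2 or x3)
satisfiable = [[1, 2, 3], [-1, -2, -3], [1, -2, 3]]
```

On the satisfiable formula above the check succeeds, while on the formula consisting of all eight sign patterns over three variables (which is unsatisfiable) no source-target path covers every clause symbol.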

We show now that our reduction is correct. More precisely, we show that (1) each symbol labels at most three nodes of $D$ , (2) the degree of $D$ is bounded by three and (3) the formula $\phi$ has a satisfying assignment, if and only if, there is an $s-t$ -path in $D$ that contains every symbol of $\Sigma$ .

Figure 2: A sketch of the DAG computed by the reduction from 3-SAT. We consider $x_{1}$ to appear nonnegated in clauses $C_{1}$, $C_{2}$ and negated in clause $C_{3}$; $x_{2}$ to appear nonnegated in clause $C_{1}$ and negated in clauses $C_{3}$ and $C_{4}$.

Consider (1) and note that each symbol $x_{i,1}$ , $x_{i,2}$ , $i\in[n]$ , labels exactly one node of $D$ . Moreover, since each clause consists of three literals, then each of the symbols $c_{i}$ , $i\in[m]$ , labels exactly three nodes of $D$ .

Consider (2). The only nodes of $D$ that may have degree larger than two are $s_{i}$ and $t_{i}$, $i\in[n]$. Each $s_{i}$ has indegree at most one and outdegree two, since the arcs outgoing from $s_{i}$ are $(s_{i},u_{i}^{1})$ and $(s_{i},v_{i}^{1})$, thus $deg(s_{i})\leq 3$. Each $t_{i}$ has outdegree at most one and indegree two, since the arcs incoming to $t_{i}$ are $(u_{i}^{k_{1}},t_{i})$ and $(v_{i}^{k_{2}},t_{i})$. Hence $deg(t_{i})\leq 3$.

Now, we prove (3). First, given a satisfying assignment of $\phi$, we select the path in $D$ as follows. For each $i\in[n]$, if $x_{i}$ is True, then in the subgraph corresponding to the variable $x_{i}$ we take the path $s_{i}u^{1}_{i}u^{2}_{i}\dots u^{k_{1}}_{i}t_{i}$, thus covering the symbols $x_{i,1}$, $x_{i,2}$, and the $k_{1}$ symbols associated with the clauses where $x_{i}$ appears nonnegated; otherwise, if $x_{i}$ is assigned to False, we choose the path $s_{i}v^{1}_{i}v^{2}_{i}\dots v^{k_{2}}_{i}t_{i}$, thus covering the symbols $x_{i,1}$, $x_{i,2}$, and the $k_{2}$ symbols associated with the clauses where $x_{i}$ appears negated. Between two subgraphs the path contains arc $(t_{i},s_{i+1})$, $i\in[n-1]$.

Observe that we take the path that covers precisely the symbols corresponding to the clauses satisfied by $x_{i}$ . Observe that the symbols $x_{i,1}$ , $x_{i,2}$ that are associated with variables are always covered. Since the formula $\phi$ is satisfied by the assignment, the path obtained by concatenating the subpaths also covers all the symbols in the alphabet.

Conversely, assume that we are given an $s-t$ -path in $D$ that covers all the symbols in $\Sigma$ . If the $s-t$ -path contains the subpath $s_{i}u^{1}_{i}u^{2}_{i}\dots u^{k_{1}}_{i}t_{i}$ , $i\in[n],$ then we set $x_{i}$ to True, otherwise we set $x_{i}$ to False. The crucial observation that was also mentioned in the first part of the reduction is that the symbols covered in the subgraph associated with variable $x_{i}$ correspond to the clauses satisfied by the assignment to $x_{i}$ . Since all the labels are covered, the assignment produced satisfies all the clauses. $\hfill\blacktriangleleft$

3.2 Hardness of Approximation

The previous result (Theorem 1) can be extended to Maximum Representing Path. By reducing from Max 2-SAT, which is known to be APX-hard [12], we show that Maximum Representing Path is APX-hard even when each symbol labels at most two nodes of the input DAG and the degree of each node is bounded by three.

Corollary 2 $(^{*})$.

The Maximum Representing Path problem is APX-hard, even in the case when (1) each symbol labels at most two nodes of the input DAG, that is, $\forall c\in\Sigma,|\{v\in V:\lambda(v)=c\}|\leq 2$, and (2) the degree of each node is bounded by three.

We now present the second inapproximability result, namely an approximation preserving reduction from the Max $k$ -Cover problem (defined next), that cannot be approximated within factor $\frac{e}{e-1}-\varepsilon$ , for any constant $\varepsilon>0$ , unless $NP\subseteq DTIME(|U|^{O(\log\log|U|)})$ .

Problem 3.

(Max $k$ -Cover)
Input: A collection of $n$ sets $\mathcal{S}=\{S_{1},S_{2},\dots S_{n}\}$ over a universe $U$ and an integer $k$ .
Output: $k$ sets from $\mathcal{S}$ whose union has the largest cardinality.

Now, we present our second inapproximability result.

Theorem 3 $(^{*})$.

The Maximum Representing Path problem cannot be approximated within a factor of $\frac{e}{e-1}-\alpha$ , for any constant $\alpha>0$ , unless $NP\subseteq DTIME(|V|^{O(\log\log|V|)})$ .

Proof.

(Sketch) We present an approximation preserving reduction from Max $k$ -Cover to Maximum Representing Path. Given an input instance $(\mathcal{S},k)$ of Max $k$ -Cover, we construct the following instance of Maximum Representing Path. First, the set of nodes of the DAG $D$ is defined as follows:

$\blacksquare$ We add $k+1$ nodes $v_{1},v_{2},\dots v_{k+1}$, where $s=v_{1}$ and $t=v_{k+1}$, labeled with the same symbol $a$.

$\blacksquare$ For each element $x\in U$, define a path $p(x)$ of length $|U|$, and each node of $p(x)$ is labeled with a distinct symbol $x_{i}$, $i\in[|U|]$.

$\blacksquare$ Between two nodes $v_{j}$, $v_{j+1}$, $j\in[k]$, add $|\mathcal{S}|$ paths, each one associated with a set $S_{i}$ (denoted by $p(S_{i})$); each path associated with $S_{i}$ consists of the concatenation of paths $p(x_{i,1}),\cdots,p(x_{i,z})$, where $x_{i,1},\dots,x_{i,z}$ are the elements in set $S_{i}$.

Given a solution $S^{*}_{1},\dots,S^{*}_{k}$ of Max $k$-Cover on instance $(U,\mathcal{S})$ such that $|\bigcup_{i=1}^{k}S^{*}_{i}|=h$, we can compute in polynomial time a solution of Maximum Representing Path on instance $D$ that covers $|U|h+1$ symbols, by selecting, between each two nodes $v_{j}$ and $v_{j+1}$, $j\in[k]$, the path $p(S^{*}_{j})$. For the other direction, given a solution $p$ of Maximum Representing Path on instance $D$ that covers $|U|h+1$ symbols, we can compute in polynomial time a solution $S^{*}_{1},\dots,S^{*}_{k}$ of Max $k$-Cover on instance $(U,\mathcal{S})$, by defining $S^{*}_{j}$ as the set corresponding to the path between nodes $v_{j}$ and $v_{j+1}$, $j\in[k]$. $\hfill\blacktriangleleft$
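The correspondence between a cover of size $h$ and a path covering $|U|h+1$ symbols can be checked on small inputs with the following Python sketch. The function names, the tuple encoding of nodes and symbols, and the toy instance are assumptions made for illustration; both `max_covered` and `best_cover` are brute-force and exponential, and the sets are assumed to be non-empty with $k\leq|\mathcal{S}|$.

```python
from itertools import combinations


def build_cover_instance(universe, sets, k):
    """Build (label, arcs, s, t) from a Max k-Cover instance, following the sketch above."""
    m = len(universe)
    label = {("v", j): "a" for j in range(1, k + 2)}           # v_1, ..., v_{k+1}
    arcs = []
    for j in range(1, k + 1):
        for i, S in enumerate(sets):
            # p(S_i): concatenation of p(x) for x in S_i; p(x) has |U| nodes labeled (x, r).
            inner = [("p", j, i, x, r) for x in sorted(S) for r in range(1, m + 1)]
            for node in inner:
                label[node] = (node[3], node[4])
            chain = [("v", j)] + inner + [("v", j + 1)]
            arcs.extend(zip(chain, chain[1:]))
    return label, arcs, ("v", 1), ("v", k + 1)


def max_covered(label, arcs, s, t):
    """Largest number of symbols covered by an s-t path (brute force)."""
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, []).append(v)
    best, stack = 0, [(s, frozenset([label[s]]))]
    while stack:
        node, covered = stack.pop()
        if node == t:
            best = max(best, len(covered))
            continue
        for nxt in succ.get(node, []):
            stack.append((nxt, covered | {label[nxt]}))
    return best


def best_cover(sets, k):
    """Optimal value h of Max k-Cover (brute force over all k-subsets)."""
    return max(len(set().union(*choice)) for choice in combinations(sets, k))


universe = {1, 2, 3, 4, 5}
sets = [{1, 2}, {2, 3}, {4}, {5}]
```

For this toy instance with $k=2$, the best cover has $h=3$ elements and the best path covers $5\cdot 3+1=16$ symbols, matching the relation $|U|h+1$ used in the proof sketch.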

4 A Polynomial Time Algorithm for $\Sigma$ -Representing Path when Each Symbol Labels at most Two Nodes

In this section we present a polynomial time exact algorithm (Algorithm 1) for the $\Sigma$ -Representing Path problem when each symbol in $\Sigma$ labels at most two nodes of the input DAG $D$ . We first introduce the notion of compatible nodes.

Definition 4.

We say that two nodes $u_{1},u_{2}\in V$ are compatible if there exists an $s-t$ -path that contains both nodes, that is either a path $s\dots u_{1}\dots u_{2}\dots t$ or $s\dots u_{2}\dots u_{1}\dots t$ . Otherwise, the nodes are called incompatible.

Algorithm 1 reduces the input instance of the $\Sigma$-Representing Path problem when each symbol labels at most two nodes to an instance of 2-SAT, which is known to be polynomial time solvable [24, 4].

Algorithm 1 A polynomial time exact algorithm for the $\Sigma$-Representing Path problem where each symbol labels at most two nodes of $D$.

Theorem 5 $(^{*})$.

The $\Sigma$-Representing Path problem is solvable in $O(|V||A|)$ time in the case when each symbol labels at most two nodes of $D$, that is, $\forall c\in\Sigma,|\{u\in V:\lambda(u)=c\}|\leq 2$.
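One possible way to realize a reduction to 2-SAT based on compatible nodes is sketched below in Python. The encoding is an assumption made for this example and is not claimed to be the authors' Algorithm 1: there is one Boolean variable per node lying on some $s-t$ path ("the node is on the chosen path"), one clause per symbol requiring one of its (at most two) nodes, and one clause per pair of incompatible nodes forbidding both. Pairwise compatible nodes form a chain under reachability, so they lie on a common $s-t$ path. The 2-SAT solver uses the standard implication graph and strongly connected components. The sketch favors clarity and does not attempt to match the $O(|V||A|)$ bound.

```python
def two_sat(num_vars, clauses):
    """Decide 2-SAT. Literal 2*i is variable i, literal 2*i+1 is its negation."""
    n = 2 * num_vars
    g = [[] for _ in range(n)]
    rg = [[] for _ in range(n)]
    for a, b in clauses:                       # (a or b): not a -> b, not b -> a
        for x, y in ((a ^ 1, b), (b ^ 1, a)):
            g[x].append(y)
            rg[y].append(x)
    order, seen = [], [False] * n              # Kosaraju, first pass: finishing order
    for root in range(n):
        if seen[root]:
            continue
        seen[root] = True
        stack = [(root, iter(g[root]))]
        while stack:
            v, it = stack[-1]
            w = next(it, None)
            if w is None:
                order.append(v)
                stack.pop()
            elif not seen[w]:
                seen[w] = True
                stack.append((w, iter(g[w])))
    comp, c = [-1] * n, 0                      # second pass on the reversed graph
    for root in reversed(order):
        if comp[root] != -1:
            continue
        comp[root] = c
        stack = [root]
        while stack:
            v = stack.pop()
            for w in rg[v]:
                if comp[w] == -1:
                    comp[w] = c
                    stack.append(w)
        c += 1
    return all(comp[2 * i] != comp[2 * i + 1] for i in range(num_vars))


def representing_path_exists(nodes, arcs, label, s, t):
    """Decide Sigma-Representing Path when each symbol labels at most two nodes."""
    succ = {v: [] for v in nodes}
    for u, v in arcs:
        succ[u].append(v)
    reach = {}                                 # reach[u]: nodes reachable from u, including u
    for u in nodes:
        seen, stack = {u}, [u]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        reach[u] = seen
    usable = [v for v in nodes if v in reach[s] and t in reach[v]]   # on some s-t path
    idx = {v: i for i, v in enumerate(usable)}
    by_symbol = {label[v]: [] for v in nodes}
    for v in usable:
        by_symbol[label[v]].append(v)
    clauses = []
    for cands in by_symbol.values():
        if not cands:
            return False                       # this symbol cannot be covered at all
        assert len(cands) <= 2, "each symbol may label at most two nodes"
        clauses.append((2 * idx[cands[0]], 2 * idx[cands[-1]]))
    for i, u in enumerate(usable):
        for v in usable[i + 1:]:
            if v not in reach[u] and u not in reach[v]:              # incompatible pair
                clauses.append((2 * idx[u] + 1, 2 * idx[v] + 1))
    return two_sat(len(usable), clauses)
```

As a usage example under the same assumptions, take nodes `s, v1, v2, v3, v4, t` labeled `a, b, c, c, d, a` with arcs `s→v1, s→v2, v1→v3, v2→v4, v3→t, v4→t`: the function reports that no $\Sigma$-representing path exists, and adding the arc `v3→v4` makes it report that one exists.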

5 Distance from Disjoint Paths

The $\Sigma$ -Representing Path and the Maximum Representing Path problems are trivial if the DAG, after the removal of $s$ and $t$ , consists of a set of disjoint paths. Here we consider the two problems when parameterized by distance to disjoint paths and we show that $\Sigma$ -Representing Path and Maximum Representing Path are W[1]-hard for this parameter. The distance to disjoint paths is defined as the minimum number of nodes to be removed from a graph (in this case $D$ ) such that the resulting graph consists of a set of node disjoint paths (except possibly for $s$ and $t)$ . We prove the hardness results by giving a parameterized reduction from the Multicolored Clique problem, defined as follows.

Problem 4.

(Multicolored Clique)
Input: An undirected graph $G=(W,E)$ , whose nodes are partitioned into color classes $W=W_{1}\uplus W_{2}\dots\uplus W_{k}$ and each edge in $E$ connects two nodes that are in a different color class.
Output: Is there a clique in $G$ that contains exactly one node for each color class?

A clique in $G$ that contains exactly one node for each color class is called a multicolored clique. Given an instance $G=(W,E)$ of the Multicolored Clique problem (recall that the partition of $W$ in color classes is given in input), we construct a corresponding instance $(D,\lambda)$ of $\Sigma$-Representing Path. The DAG $D$ consists of three subgraphs that share some nodes (see Fig. 3 for a representation of the structure of $D$): (1) a DAG $D_{1}$ with source node $s_{1}=s$ and target node $t_{1}$, (2) a DAG $D_{2}$ with source node $s_{2}=t_{1}$ and target node $t_{2}$, and (3) a DAG $D_{3}$ with source node $s_{3}=t_{2}$ and target node $t_{3}=t$.

Figure 3: The structure of graph $D$, computed by the reduction from Multicolored Clique to $\Sigma$-Representing Path (dashed arcs represent DAGs between two nodes).

We first give an informal description of the reduction and then we present it formally. $D_{1}$ and $D_{2}$ encode a multicolored clique (symbols not covered by a path in $D_{1}$ and $D_{2}$ represent nodes and edges of a multicolored clique), while $D_{3}$ makes it possible to cover all the symbols not selected in $D_{1}$ and $D_{2}$ (assuming a path selected in $D_{1}$ and $D_{2}$ contains all the symbols except those encoding a multicolored clique).

Figure 4: A sketch of the DAG $D_{1}^{1}$ (we include also $s$, the source of $D$, and $s^{2}_{1}$) computed by the reduction from Multicolored Clique to $\Sigma$-Representing Path.

Consider $G=(W,E)$ , where $W=\{w_{1,1},\dots,w_{k,|W_{k}|}\}$ , and $w_{i,h}$ , $i\in[k]$ and $h\in[|W_{i}|]$ , represents the $h$ -th node of color class $W_{i}$ (we assume that the nodes in each color class have some ordering). We start by giving an informal description of $D_{1}$ , $D_{2}$ and $D_{3}$ .

The DAG $D_{1}$ consists of $k$ concatenated DAGs $D_{1}^{1}$ , …, $D_{1}^{k}$ (see Fig. 4 for a sketch of $D_{1}^{1}$ ), where each $D_{1}^{i}$ , $i\in[k]$ , is associated with a color class $W_{i}$ . Each $D_{1}^{i}$ , $i\in[k]$ , encodes the selection of exactly one node of $W_{i}$ and it contains a path, denoted by $D_{1}(w_{i,h})$ , for each node $w_{i,h}\in W_{i}$ , with $h\in[|W_{i}|]$ . $D_{1}^{i}$ has a source node $s_{1}^{i}$ and a target node $t_{1}^{i}$ . Node $s_{1}^{i}$ has arcs to the first node of each path $D_{1}(w_{i,h})$ . Each $D_{1}(w_{i,h})$ contains the symbols associated with nodes in $W_{i}$ , except for the symbols associated with $w_{i,h}$ .

The DAG $D_{2}$ consists of $k(k-1)$ concatenated DAGs $D_{2}^{1,2}$ , …, $D_{2}^{k,k-1}$ (see Fig. 5 for a sketch of $D_{2}^{1,2}$ ), where each $D_{2}^{i,j}$ , $i,j\in[k]$ , $i\neq j$ , is associated with edges connecting nodes of color class $W_{i}$ and color class $W_{j}$ . DAG $D_{2}^{i,j}$ has a source node $s_{2}^{i,j}$ and a target node $t_{2}^{i,j}$ . For each edge $\{w_{i,h},w_{j,q}\}\in E$ , $h\in[|W_{i}|]$ and $q\in[|W_{j}|]$ , $D_{2}^{i,j}$ contains one path, denoted by $D_{2}(w_{i,h},w_{j,q})$ whose nodes are labeled by one symbol associated with $w_{i,h}$ , and set $L_{i,j}$ (encoding edges between nodes of $W_{i}$ and of $W_{j}$ ) except for the symbol associated with edge $\{w_{i,h},w_{j,q}\}$ .

Figure 5: The first part of subgraph $D_{2}^{1,2}$ computed by the reduction from Multicolored Clique. We assume that $w_{1,1}$ is adjacent to $w_{2,1}$, $w_{2,j}$ and $w_{2,|W_{2}|}$.

The DAG $D_{3}$ (see Fig. 6) consists of $k$ concatenated DAGs $D_{3}^{1}$ , …, $D_{3}^{k}$ , each one associated with a color class $W_{i}$ , $i\in[k]$ . Each $D_{3}^{i}$ , $i\in[k]$ , has a source node $s_{3}^{i}$ and a target node $t_{3}^{i}$ . Node $s_{3}^{i}$ has arcs to $|W_{i}|$ paths, one path $D_{3}(w_{i,h})$ for each node $w_{i,h}\in W_{i}$ , $h\in[|W_{i}|]$ . The nodes in $D_{3}(w_{i,h})$ have labels that encode $w_{i,h}$ and the edges incident in $w_{i,h}$ . The idea is that the symbols not covered by a path in $D_{1}$ and $D_{2}$ can be covered by a path in $D_{3}$ only if the path in $D_{1}$ and $D_{2}$ contains all the symbols except those encoding edges of a multicolored clique in $G$ and one symbol for each node in the multicolored clique.

Figure 6: The first subgraph of the DAG $D_{3}$ computed by the reduction from Multicolored Clique, associated with the nodes in $W_{1}$. We assume that $w_{1,j}$ is adjacent to $w_{2,1}$, $w_{3,2}$, thus $D_{3}(w_{1,j})$ is a path whose nodes have labels $l_{2,1,1,j},l_{3,2,1,j}$ (these labels are represented in the box of $D_{3}(w_{1,j})$). We include also $s_{3}$, the source of $D_{3}$, and node $s_{3}^{2}$.

Now, we present the details of the reduction. We start by defining the alphabet $\Sigma$ and some subsets of $\Sigma$ .

$\Sigma=\{a_{i}:W_{i}\subseteq W,i\in[k]\}\ \cup\ \{b_{i,h,q}:w_{i,h}\in W_{i},i\in[k],h\in[|W_{i}|],q\in[k]\}\ \cup\ \{l_{i,h,j,q}:i,j\in[k],i\neq j,h\in[|W_{i}|],q\in[|W_{j}|]\wedge w_{i,h}\in W_{i}\wedge w_{j,q}\in W_{j}\wedge\{w_{i,h},w_{j,q}\}\in E\}.$

We define sets $B(w_{i,h})$ , $w_{i,h}\in W_{i}$ , and $B_{i}$ , $i\in[k]$ , of symbols:

B(w_{i,h})=\bigcup_{q\in[k]}\{b_{i,h,q}\},\quad B_{i}=\bigcup_{h\in[|W_{i}|]}B(w_{i,h}).

Given $i,j\in[k]$ , with $i\neq j$ , we denote the subset $L_{i,j}$ of symbols as follows:

L_{i,j}=\{l_{i,h,j,q}:h\in[|W_{i}|],q\in[|W_{j}|]\wedge w_{i,h}\in W_{i}\wedge w_{j,q}\in W_{j}\wedge\{w_{i,h},w_{j,q}\}\in E\}.

The set $L_{i}$ , $i\in[k]$ , is defined as follows:

L_{i}=\bigcup_{j\in[k]\wedge j\neq i}L_{i,j}.

Note that $L_{i,j}$ and $L_{j,i}$ are different subsets, in particular that $l_{i,h,j,q}\neq l_{j,q,i,h}$ .
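The alphabet and its subsets can be generated mechanically from an instance of Multicolored Clique, as in the following Python sketch. The input format is an assumption made for the example: `sizes[i-1]` is $|W_{i}|$, and each edge is a pair `((i, h), (j, q))` standing for $\{w_{i,h},w_{j,q}\}$ with $i\neq j$. Symbols are encoded as tuples such as `("b", i, h, q)`.

```python
def build_alphabet(sizes, edges):
    """Return (Sigma, B(w), B_i, L_{i,j}, L_i) for a Multicolored Clique instance."""
    k = len(sizes)
    classes = range(1, k + 1)
    a = {("a", i) for i in classes}
    # B(w_{i,h}) = { b_{i,h,q} : q in [k] }
    B_node = {(i, h): {("b", i, h, q) for q in classes}
              for i in classes for h in range(1, sizes[i - 1] + 1)}
    # B_i = union of B(w_{i,h}) over the nodes of color class W_i
    B = {i: set().union(*(B_node[i, h] for h in range(1, sizes[i - 1] + 1)))
         for i in classes}
    # Each edge yields two distinct symbols: l_{i,h,j,q} in L_{i,j} and l_{j,q,i,h} in L_{j,i}.
    L_pair = {(i, j): set() for i in classes for j in classes if i != j}
    for (i, h), (j, q) in edges:
        L_pair[i, j].add(("l", i, h, j, q))
        L_pair[j, i].add(("l", j, q, i, h))
    # L_i = union of L_{i,j} over j != i
    L = {i: set().union(*(L_pair[i, j] for j in classes if j != i)) for i in classes}
    sigma = a.union(*B.values(), *L.values())
    return sigma, B_node, B, L_pair, L
```

For example, with three color classes of sizes 2, 1, 1 and the two edges $\{w_{1,1},w_{2,1}\}$ and $\{w_{2,1},w_{3,1}\}$, the alphabet has $3+12+4=19$ symbols: three $a_{i}$, twelve $b_{i,h,q}$, and two $l$ symbols per edge.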

Now, we define the DAG $D_{1}$ . $D_{1}$ has a source node $s_{1}=s$ and a target node $t_{1}$ , both labeled by $a_{1}$ . $D_{1}$ is obtained by concatenating DAGs $D_{1}^{i}$ , $i\in[k]$ , each one associated with a color class. Each $D_{1}^{i}$ has a source node $s_{1}^{i}$ and a target node $t_{1}^{i}$ . For each $i\in[k-1]$ , there exists an arc from $t_{1}^{i}$ to $s_{1}^{i+1}$ (this defines the concatenation of subgraphs $D_{1}^{i}$ ). Now, we define each subgraph $D_{1}^{i}$ , $i\in[k]$ :

$\blacksquare$ Each subgraph $D_{1}^{i}$ has a source node $s_{1}^{i}$ and a target node $t_{1}^{i}$, labeled by $a_{i}$.

$\blacksquare$ Node $s_{1}^{i}$ is connected to $|W_{i}|$ disjoint paths $D_{1}(w_{i,h})$, each one associated with a node $w_{i,h}\in W_{i}$.

$\blacksquare$ Each path $D_{1}(w_{i,h})$ is labeled by symbols $B_{i}\setminus\bigcup_{q\in[k]}\{b_{i,h,q}\}$.

Finally, there is an arc from node $s_{1}$ to $s_{1}^{1}$ and an arc from node $t_{1}^{k}$ to node $t_{1}$ .

Next, we define the DAG $D_{2}$. $D_{2}$ has a source node $s_{2}=t_{1}$ and a target node $t_{2}$, both labeled by symbol $a_{2}$. $D_{2}$ is obtained by concatenating DAGs $D_{2}^{i,j}$, with $i,j\in[k]$ and $i\neq j$, each one associated with $W_{i}$ and $W_{j}$. Each $D_{2}^{i,j}$ has a source node $s_{2}^{i,j}$ and a target node $t_{2}^{i,j}$. We assume that DAGs $D_{2}^{i,j}$ are concatenated as follows: for $i,j\in[k]$ with $i\neq j$, if $j<k$ then there is an arc from the target $t_{2}^{i,j}$ of $D_{2}^{i,j}$ to the source $s_{2}^{i,j+1}$ of $D_{2}^{i,j+1}$, and if $j=k$ (and $i<k$) there is an arc from the target $t_{2}^{i,j}$ of $D_{2}^{i,j}$ to the source $s_{2}^{i+1,1}$ of $D_{2}^{i+1,1}$.

Now, we define each subgraph $D_{2}^{i,j}$ , $i,j\in[k]$ and $i\neq j$ :

$\blacksquare$ Each subgraph $D_{2}^{i,j}$ has a source node $s_{2}^{i,j}$ and a target node $t_{2}^{i,j}$, labeled by $a_{i}$.

$\blacksquare$ Node $s_{2}^{i,j}$ is connected to paths $D_{2}(w_{i,h},w_{j,q})$, each one associated with a node $w_{i,h}\in W_{i}$ and a node $w_{j,q}\in W_{j}$, such that $\{w_{i,h},w_{j,q}\}\in E$.

$\blacksquare$ Each path $D_{2}(w_{i,h},w_{j,q})$ is labeled by the set of symbols $(L_{i,j}\cup\{b_{i,h,j}\})\setminus\{l_{i,h,j,q}\}$ (recall that $i\neq j$, hence symbol $b_{i,h,i}$ does not label any node on each path $D_{2}(w_{i,h},w_{j,q})$).

Finally, there is an arc from node $s_{2}$ to $s_{2}^{1,2}$ and an arc from node $t_{2}^{k,k-1}$ to node $t_{2}$ .

Now, we define the DAG $D_{3}$ . $D_{3}$ has source $s_{3}$ and target $t_{3}$ , both labeled by $a_{3}$ . $D_{3}$ is obtained by concatenating DAGs $D_{3}^{i}$ , $i\in[k]$ , each one associated with a color class. Each subgraph $D_{3}^{i}$ , $i\in[k]$ , is defined as follows:

$\blacksquare$

Each subgraph $D_{3}^{i}$ has a source node $s_{3}^{i}$ and a target node $t_{3}^{i}$ , labeled by $a_{i}$ .
$\blacksquare$

Between nodes $s_{3}^{i}$ and $t_{3}^{i}$ there are $|W_{i}|$ disjoint paths, $D_{3}(w_{i,h})$ , $h\in[|W_{i}|]$ , each one associated with a node $w_{i,h}\in W_{i}$ .
$\blacksquare$

The nodes in each path $D_{3}(w_{i,h})$ are labeled by the following set of symbols

$\bigcup_{\{w_{i,h},w_{j,q}\}\in E}\{l_{j,q,i,h}\}\cup\{b_{i,h,i}\}.$

Note that the path $D_{3}(w_{i,h})$ contains nodes having labels $\{l_{j,q,i,h}\}$ that encode edges of $E$ incident in $w_{i,h}$ . Note also that these nodes, for each edge $\{w_{i,h},w_{j,q}\}\in E$ , have labels $\{l_{j,q,i,h}\}$ not $\{l_{i,h,j,q}\}$ .

Finally, there is an arc from $t_{3}^{i}$ to $s_{3}^{i+1}$ , with $i\in[k-1]$ , an arc from $s_{3}$ (the source of $D_{3}$ ) to $s_{3}^{1}$ and an arc from $t_{3}^{k}$ to $t_{3}$ .

Having defined $D$ and its labeling, we now prove some properties of the graphs $D_{1}$ and $D_{2}$ .

Lemma 6.

Consider an instance $G$ of Multicolored Clique and a corresponding instance $(D,\lambda)$ of $\Sigma$ -Representing Path. Given a path $p$ from $s_{1}$ to $t_{1}$ in $D_{1}$ , then $p$ covers the following set of symbols:

1.

Each symbol $a_{i}$ , $i\in[k]$
2.

A set $B^{\prime}$ defined as follows:

$B^{\prime}=\bigcup_{i\in[k]}B_{i}\setminus\{b_{i,h,q}:\text{such that $p$ traverses path }D_{1}(w_{i,h}),q\in[k]\}.$

Proof.

Let $p$ be a path from $s_{1}$ to $t_{1}$ in $D_{1}$ . Nodes $s^{i}_{1}$ and $t^{i}_{1}$ , $i\in[k]$ , must be traversed by any path in $D_{1}$ , hence also by $p$ , and they are labeled by $a_{i}$ . Hence point 1 holds.

We now prove point 2. Consider a DAG $D^{i}_{1}$ , $i\in[k]$ , and the subpath $p_{i}$ of $p$ in $D^{i}_{1}$ . By construction, $p_{i}$ traverses exactly one of the paths $D_{1}(w_{i,h})$ , with $h\in[|W_{i}|]$ , between $s^{i}_{1}$ and $t^{i}_{1}$ . Also by construction, the nodes of $D_{1}(w_{i,h})$ are labeled by all the symbols of $B_{i}$ except for the symbols $b_{i,h,q}$ , $q\in[k]$ , which do not label any node of $D_{1}(w_{i,h})$ . Thus point 2 holds. $\hfill\blacktriangleleft$
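As an illustration of Lemma 6, the following sketch computes the symbols covered inside $D_{1}$ once a path $D_{1}(w_{i,h})$ has been chosen in each $D_{1}^{i}$ . It assumes that $B_{i}=\{b_{i,h,q}:h\in[|W_{i}|],q\in[k]\}$ ; this set is defined outside this section, so that form is an assumption here. The function name and the tuple encoding of symbols are hypothetical.

```python
def symbols_covered_in_d1(class_sizes, chosen):
    """Symbols covered by an s_1-t_1 path in D_1.

    class_sizes[i-1] = |W_i|; chosen[i-1] = index h of the path D_1(w_{i,h})
    traversed inside D_1^i. Symbols are encoded as tuples ("a", i) and
    ("b", i, h, q). Assumption: B_i = {b_{i,h,q} : h in [|W_i|], q in [k]}.
    """
    k = len(class_sizes)
    covered = set()
    for i in range(1, k + 1):
        covered.add(("a", i))  # s_1^i and t_1^i are labeled a_i
        b_i = {("b", i, h, q)
               for h in range(1, class_sizes[i - 1] + 1)
               for q in range(1, k + 1)}
        # The traversed path D_1(w_{i,h}) misses every b_{i,h,q}, q in [k].
        missing = {("b", i, chosen[i - 1], q) for q in range(1, k + 1)}
        covered |= b_i - missing
    return covered


if __name__ == "__main__":
    # Two color classes with two nodes each; pick w_{1,1} and w_{2,2}.
    for symbol in sorted(symbols_covered_in_d1([2, 2], [1, 2])):
        print(symbol)
```

The symbols left uncovered are exactly those $b_{i,h,q}$ whose node $w_{i,h}$ was selected, which is what forces the path to pick them up later in $D_{2}$ and $D_{3}$ .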

Lemma 7.

Consider an instance $G$ of Multicolored Clique and a corresponding instance $(D,\lambda)$ of $\Sigma$ -Representing Path. A path $p$ in $D$ that covers all the symbols in $\Sigma$ contains a path $p_{2}$ from $s_{2}$ to $t_{2}$ in $D_{2}$ such that:

1.

For each $D_{1}(w_{i,h})$ traversed by $p$ in $D_{1}$ , $p_{2}$ traverses a subgraph $D_{2}(w_{i,h},w_{j,q})$ , for some $j\in[k]$ , $q\in[|W_{j}|]$ and covers a set $B^{\prime\prime}=\bigcup_{i,q\in[k],i\neq q,h\in[|W_{i}|]}\{b_{i,h,q}:$ such that $p$ traverses path $D_{1}(w_{i,h})\}$
2.

For each $i\in[k]$ , $p_{2}$ covers a set $L^{\prime}_{i}\subseteq L_{i}$ of symbols, where $L^{\prime}_{i}=L_{i}\setminus N_{i}$ and $N_{i}$ is defined as follows:

$N_{i}=\bigcup_{h,j,q\text{ such that $p_{2}$ traverses subgraph $D_{2}(w_{i,h},w_{j,q})$}}\{l_{i,h,j,q}\}.$

Proof.

Let $p$ be a path from $s$ to $t$ that covers all the symbols in $\Sigma$ . First, consider point 1. By Lemma 6, the path $p$ in $D_{1}$ does not cover the set of symbols $b_{i,h,q}$ , $h\in[|W_{i}|]$ , $q\in[k]$ , such that $D_{1}(w_{i,h})$ is traversed by $p$ . Each symbol $b_{i,h,q}$ , with $i,q\in[k]$ , $i\neq q$ and $h\in[|W_{i}|]$ , labels only nodes of $D_{1}^{i}$ and $D_{2}$ , thus if it is not covered by $p$ in $D_{1}^{i}$ it must be covered in $D_{2}$ . It follows that for each $D_{1}(w_{i,h})$ traversed by $p$ in $D_{1}$ , $p_{2}$ traverses a subgraph $D_{2}(w_{i,h},w_{j,q})$ . This implies that $p_{2}$ covers

$B^{\prime\prime}=\bigcup_{i,q\in[k],i\neq q,h\in[|W_{i}|]}\{b_{i,h,q}:\text{ such that $p$ traverses path }D_{1}(w_{i,h})\}$

and point 1 is proven.

Now, we consider point 2. In each $D^{i,j}_{2}$ , $i,j\in[k]$ and $i\neq j$ , path $p$ traverses exactly one subpath $D_{2}(w_{i,h},w_{j,q})$ , for some $h\in[|W_{i}|]$ and $q\in[|W_{j}|]$ , whose nodes have labels $L_{i,j}\setminus\{l_{i,h,j,q}\}$ . Then $p$ in $D^{i}_{2}$ covers a set $L^{\prime}_{i}\subseteq L_{i}$ of symbols, where $L^{\prime}_{i}=L_{i}\setminus N_{i}$ and

$N_{i}=\bigcup_{h,q\text{ with $D_{2}(w_{i,h},w_{j,q})$ traversed by $p$ in }D_{2}^{i,j}}\{l_{i,h,j,q}\}$

hence point 2 is proven. $\hfill\blacktriangleleft$

Based on Lemma 6 and Lemma 7, we can prove the main result of this section.

Lemma 8 (∗).

Consider an instance $G$ of Multicolored Clique and a corresponding instance $(D,\lambda)$ of $\Sigma$ -Representing Path. Then, $G$ contains a multicolored clique if and only if there exists a path in $D$ that covers all the symbols in $\Sigma$ .

$D$ has distance from a set of disjoint paths bounded by $2k(k-1)+4k$ , since by removing the source and target node of each $D_{1}^{i}$ , with $i\in[k]$ , of each $D_{2}^{i,j}$ , with $i,j\in[k]$ and $i\neq j$ , and of each $D_{3}^{i}$ , with $i\in[k]$ , we obtain a set of disjoint paths. Since Multicolored Clique is W[1]-hard when parameterized by $k$ [10], we can prove the following theorem.
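The bound can be checked by counting the removed nodes gadget by gadget. The derivation below only restates the count from the paragraph above; there are $k(k-1)$ ordered pairs $(i,j)$ with $i\neq j$ , and each gadget contributes its source and its target.

```latex
\[
\underbrace{2k}_{s_{1}^{i},\,t_{1}^{i}}
\;+\;
\underbrace{2k(k-1)}_{s_{2}^{i,j},\,t_{2}^{i,j},\ i\neq j}
\;+\;
\underbrace{2k}_{s_{3}^{i},\,t_{3}^{i}}
\;=\; 2k(k-1)+4k \;=\; 2k(k+1).
\]
```

The number of removed nodes therefore depends only on $k$ , which is what a parameterized reduction from Multicolored Clique requires.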

Theorem 9 (∗).

$\Sigma$ -Representing Path is W[1]-hard when parameterized by distance to disjoint paths.

We extend the result of Theorem 9 to the Maximum Representing Path problem.

Corollary 10 (∗).

Maximum Representing Path is W[1]-hard when parameterized by distance to disjoint paths.

6 An Approximation Algorithm

In this section we present a polynomial-time approximation algorithm for the Maximum Representing Path problem that achieves an approximation factor of $\sqrt{OPT}$ , where $OPT$ is the number of distinct symbols in an optimal solution. Notice that, since $OPT\leq|\Sigma|$ , our algorithm is also a $\sqrt{|\Sigma|}$ -approximation. Informally, the algorithm is as follows. First, we create a compatibility DAG $D^{\prime}=(V^{\prime},A^{\prime})$ , with $V^{\prime}=V$ , that is essentially the transitive closure of the input DAG $D$ (see Section 2 for the definition of transitive closure). Then, we consider a total order on $\Sigma$ , e.g., the standard alphabetic order, and we write $\lambda(u)<\lambda(v)$ for two nodes $u$ and $v$ if the label of $u$ precedes the label of $v$ in this order. We create two subgraphs of $D^{\prime}$ (which are therefore DAGs), $D^{1}$ and $D^{2}$ , as follows:

1.

$D^{1}=(V,A^{1})$ where $(v_{1},v_{2})\in A^{1}$ if and only if $(v_{1},v_{2})\in A^{\prime}$ and $\lambda(v_{1})<\lambda(v_{2})$
2.

$D^{2}=(V,A^{2})$ where $(v_{1},v_{2})\in A^{2}$ if and only if $(v_{1},v_{2})\in A^{\prime}$ and $\lambda(v_{1})>\lambda(v_{2})$ .

Observe that $A^{1},A^{2}\subseteq A^{\prime}$ and that the set of nodes of $D^{1}$ and $D^{2}$ is $V$ . We then compute $p_{1}$ , a longest path between $s$ and $t$ in $D^{1}$ , and $p_{2}$ , a longest path between $s$ and $t$ in $D^{2}$ . As pointed out in Section 2, a longest path in a DAG can be computed in linear time. The algorithm (formally presented in Algorithm 2) outputs the path $p_{i}$ , $i\in\{1,2\}$ , that covers the largest number of distinct symbols. Theorem 11 proves that it is indeed a $\sqrt{OPT}$ -approximation for Maximum Representing Path.

Algorithm 2 A $\sqrt{OPT}$ -approximation algorithm for Maximum Representing Path.
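The informal description above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' pseudocode: all function names are hypothetical, labels are compared with Python's built-in ordering, $s$ is assumed to be the unique source and $t$ the unique sink of $D$ , and the longest chain is computed over all nodes and then extended to an $s$-$t$ path, as in the feasibility argument of the proof of Theorem 11.

```python
from collections import defaultdict, deque


def topological_order(nodes, arcs):
    """Kahn's algorithm on the DAG D; also returns successor lists."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in arcs:
        succ[u].append(v)
        indeg[v] += 1
    order, ready = [], [v for v in nodes if indeg[v] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order, succ


def transitive_closure(order, succ):
    """reach[u] = nodes reachable from u, i.e. the arcs (u, v) of D'."""
    reach = {u: set() for u in order}
    for u in reversed(order):  # successors of u are handled before u
        for v in succ[u]:
            reach[u].add(v)
            reach[u] |= reach[v]
    return reach


def longest_chain(order, reach, label, increasing):
    """Longest path in D^1 (increasing labels) or D^2 (decreasing labels)."""
    best = {v: 1 for v in order}  # nodes on the best chain starting at v
    nxt = {v: None for v in order}
    for u in reversed(order):
        for v in reach[u]:
            ok = label[u] < label[v] if increasing else label[u] > label[v]
            if ok and best[v] + 1 > best[u]:
                best[u], nxt[u] = best[v] + 1, v
    node = max(order, key=lambda v: best[v])
    chain = []
    while node is not None:
        chain.append(node)
        node = nxt[node]
    return chain


def find_path(succ, u, v):
    """Some u-v path in D found by BFS; assumes v is reachable from u."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in succ[x]:
            if y not in parent:
                parent[y] = x
                queue.append(y)
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def approx_representing_path(nodes, arcs, label, s, t):
    """Return an s-t path of D built from the better of the two monotone chains."""
    order, succ = topological_order(nodes, arcs)
    reach = transitive_closure(order, succ)
    best_path, best_count = None, -1
    for increasing in (True, False):
        stops = longest_chain(order, reach, label, increasing)
        if stops[0] != s:
            stops.insert(0, s)
        if stops[-1] != t:
            stops.append(t)
        path = [stops[0]]
        for u, v in zip(stops, stops[1:]):  # expand each arc of D' into a path of D
            path += find_path(succ, u, v)[1:]
        count = len({label[x] for x in path})
        if count > best_count:
            best_path, best_count = path, count
    return best_path
```

For instance, on the DAG with arcs `s→x→y→t` and `s→z→t` and labels `s:a, x:b, y:c, z:a, t:d`, the increasing chain `s, x, y, t` covers four symbols and is returned. The set-based closure used here is chosen for brevity; it is not tuned to match the time bound stated in Theorem 11.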

Theorem 11.

Algorithm 2 is a $\sqrt{OPT}$ -approximation for the Maximum Representing Path problem and requires $O(|V||A|)$ time.

Proof.

First, we prove that Algorithm 2 computes a feasible solution of Maximum Representing Path, that is, the path $p^{\prime}$ returned by Algorithm 2 is an $s$-$t$ path in $D$ . Consider the DAG $D^{\prime}=(V,A^{\prime})$ . Since $D^{\prime}$ is the transitive closure of $D$ , for any arc $(u,v)\in A^{\prime}$ there is a path $p(u,v)$ from $u$ to $v$ in $D$ , hence for any path in $D^{\prime}$ there exists a corresponding path in $D$ . Thus, given a path $p^{*}$ in $D^{\prime}$ , we can compute a corresponding path $p$ in $D$ by concatenating the paths $p(u,v)$ associated with the arcs $(u,v)$ in $p^{*}$ . Since $D^{1}$ and $D^{2}$ are subgraphs of $D^{\prime}$ (they have the same set of nodes and a subset of the arcs of $D^{\prime}$ ), any path in $D^{1}$ or $D^{2}$ corresponds to a path in $D^{\prime}$ , hence also to a path in $D$ . Moreover, if $s$ ( $t$ , respectively) is not part of the path $p$ , we add $s$ ( $t$ , respectively) to the returned path, together with a path from $s$ to the first node of $p$ (a path from the last node of $p$ to $t$ , respectively). Hence the algorithm returns an $s$-$t$ path in $D$ .

We now show the approximation factor of Algorithm 2. Let $p^{o}$ be an optimal solution of the Maximum Representing Path problem, that is, a path $p^{o}$ such that $|\Sigma(p^{o})|=OPT$ is maximum. Let $V^{o}=\{v_{1},v_{2},\dots,v_{OPT}\}$ be a subset of the nodes that appear on the path $p^{o}$ in this order and have pairwise distinct labels, that is $\lambda(v_{i})\neq\lambda(v_{j}),\forall 1\leq i<j\leq OPT.$

According to the Erdős-Szekeres theorem [9], every sequence of $z^{2}+1$ distinct integers contains a monotonic (increasing or decreasing) subsequence of length $z+1$ . Thus, the string associated with $p^{o}$ contains a monotonic (increasing or decreasing) subsequence of length at least $\sqrt{OPT}$ , hence the path $p^{o}$ contains $k\geq\sqrt{OPT}$ nodes $v_{i_{1}},v_{i_{2}},\dots,v_{i_{k}}$ , such that $v_{i_{x}}$ appears before $v_{i_{y}}$ in $p^{o}$ , for $x<y$ and $x,y\in[k]$ , and either $\lambda(v_{i_{1}})<\lambda(v_{i_{2}})<\dots<\lambda(v_{i_{k}})$ or $\lambda(v_{i_{1}})>\lambda(v_{i_{2}})>\dots>\lambda(v_{i_{k}})$ . Notice that since $p^{o}$ is a path in $D$ , for any two nodes $v_{i},v_{j}\in p^{o}$ such that $i<j$ , we have $(v_{i},v_{j})\in A^{\prime}$ . Thus, $v_{i_{1}},v_{i_{2}},\dots,v_{i_{k}}$ is a path either in $D^{1}$ or in $D^{2}$ , so its length is not larger than that of $p_{1}$ or of $p_{2}$ , respectively. Since the path $p^{\prime}$ returned by Algorithm 2 is obtained by taking the nodes of the path in $\{p_{1},p_{2}\}$ that covers the maximum number of symbols (and possibly adding other nodes), and each $p_{i}$ , $i\in\{1,2\}$ , contains nodes with distinct labels, it follows that $|\Sigma(p^{\prime})|\geq k\geq\sqrt{OPT}$ . Thus, Algorithm 2 is a $\sqrt{OPT}$ -approximation algorithm for Maximum Representing Path.
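The step from the Erdős-Szekeres theorem to the bound $\sqrt{OPT}$ can be made explicit: taking $z=\lceil\sqrt{n}\,\rceil-1$ gives $z^{2}+1\leq n$ , so any sequence of $n$ distinct values has a monotonic subsequence of length at least $z+1=\lceil\sqrt{n}\,\rceil\geq\sqrt{n}$ . The sketch below checks this exhaustively for small $n$ ; the function names are hypothetical and the quadratic dynamic program is chosen only for brevity.

```python
from itertools import permutations


def longest_monotone(seq, increasing=True):
    """Length of a longest strictly increasing (or decreasing) subsequence of seq."""
    best = [1] * len(seq)  # best[j] = longest monotone subsequence ending at seq[j]
    for j in range(len(seq)):
        for i in range(j):
            ok = seq[i] < seq[j] if increasing else seq[i] > seq[j]
            if ok:
                best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)


def erdos_szekeres_holds(n):
    """Check m * m >= n for every permutation of n distinct values, m = longest monotone length."""
    for perm in permutations(range(n)):
        m = max(longest_monotone(perm, True), longest_monotone(perm, False))
        if m * m < n:
            return False
    return True


if __name__ == "__main__":
    print(all(erdos_szekeres_holds(n) for n in range(1, 8)))
```

In the proof, the sequence is the list of labels $\lambda(v_{1}),\dots,\lambda(v_{OPT})$ of the nodes in $V^{o}$ , ranked according to the chosen total order on $\Sigma$ .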

We now consider the time complexity of Algorithm 2. Step 1 can be computed in $O(|V||A|)$ time [23] and $A^{\prime}$ contains $O(|V|^{2})$ arcs. Steps 2 and 3 can be computed in $O(|V|+|A^{\prime}|)$ time by traversing $D^{\prime}$ . Steps 4 and 5 can be computed in $O(|V|+|A^{1}|)$ and $O(|V|+|A^{2}|)$ time [26], respectively, where $O(|V|+|A^{1}|)$ and $O(|V|+|A^{2}|)$ are bounded by $O(|V|+|A^{\prime}|)$ . Thus the overall time complexity is $O(|V||A|+|V|^{2})$ and, since we can assume that no node is isolated and hence $|A|\geq|V|-1$ , the overall time complexity is $O(|V||A|)$ . $\hfill\blacktriangleleft$
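The last simplification can be spelled out as follows. The derivation only uses the bounds stated in the proof, namely $|A^{\prime}|\leq|V|^{2}$ and $|A|\geq|V|-1$ , and additionally assumes $|A|\geq 1$ ; the symbol $T$ for the total running time is introduced here for illustration.

```latex
\begin{align*}
T &= \underbrace{O(|V||A|)}_{\text{Step 1}}
   + \underbrace{O(|V|+|A'|)}_{\text{Steps 2--5}}
   \subseteq O(|V||A| + |V|^{2}), \\
|V|^{2} &\le |V|\,(|A|+1) = |V||A| + |V| \le 2\,|V||A|, \\
T &\in O(|V||A|).
\end{align*}
```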

7 Conclusion

In this contribution we have introduced two combinatorial problems ( $\Sigma$ -Representing Path and Maximum Representing Path) that ask to identify a path in a node labeled DAG that contains all (or a subset of maximum size) of the alphabet symbols. We have proved results on the computational complexity and parameterized complexity of the two problems, and we have studied the approximation of Maximum Representing Path.

One of the most interesting future directions is to further investigate the approximate complexity of the Maximum Representing Path problem: does it admit constant factor approximation algorithms? It is interesting to design approximation algorithms for some restrictions on the DAG structure, for example when the degree is bounded. It is also interesting to study other structural properties of the DAG that may lead to polynomial-time algorithms for both problems.

References

[1] Tatsuya Akutsu. A linear time pattern matching algorithm between a string and a tree. In Combinatorial Pattern Matching, 4th Annual Symposium, CPM 93, Padova, Italy, June 2-4, 1993, Proceedings, pages 1–10, 1993. doi:10.1007/BFb0029792.
[2] Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. Journal of Algorithms, 35(1):82–99, 2000. doi:10.1006/jagm.1999.1063.
[3] Yuichi Asahiro, Hiroshi Eto, Mingyang Gong, Jesper Jansson, Guohui Lin, Eiji Miyano, Hirotaka Ono, and Shunichi Tanaka. Approximation algorithms for the longest run subsequence problem. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 2:1–2:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.2.
[4] Bengt Aspvall, Michael F. Plass, and Robert Endre Tarjan. A linear-time algorithm for testing the truth of certain quantified boolean formulas. Inf. Process. Lett., 8(3):121–123, 1979. doi:10.1016/0020-0190(79)90002-4.
[5] Jasmijn A. Baaijens, Paola Bonizzoni, Christina Boucher, Gianluca Della Vedova, Yuri Pirola, Raffaella Rizzi, and Jouni Sirén. Computational graph pangenomics: a tutorial on data structures and their applications. Nat. Comput., 21(1):81–108, 2022. doi:10.1007/S11047-022-09882-6.
[6] Riccardo Dondi, Giancarlo Mauri, and Italo Zoppis. On the complexity of approximately matching a string to a directed graph. Inf. Comput., 288:104748, 2022. doi:10.1016/J.IC.2021.104748.
[7] Riccardo Dondi and Florian Sikora. The longest run subsequence problem: Further complexity results. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 14:1–14:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.CPM.2021.14.
[8] Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Trans. Algorithms, 19(3):21:1–21:25, 2023. doi:10.1145/3588334.
[9] Paul Erdős and George Szekeres. A combinatorial problem in geometry. Compositio Mathematica, 2:463–470, 1935.
[10] Michael R. Fellows, Danny Hermelin, Frances A. Rosamond, and Stéphane Vialette. On the parameterized complexity of multiple-interval graph problems. Theor. Comput. Sci., 410(1):53–61, 2009. doi:10.1016/J.TCS.2008.09.065.
[11] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco, 1979.
[12] Johan Håstad. Some optimal inapproximability results. J. ACM, 48(4):798–859, 2001. doi:10.1145/502090.502098.
[13] Chirag Jain, Haowen Zhang, Yu Gao, and Srinivas Aluru. On the complexity of sequence to graph alignment. In Lenore J. Cowen, editor, Research in Computational Molecular Biology - 23rd Annual International Conference, RECOMB 2019, Washington, DC, USA, May 5-8, 2019, Proceedings, volume 11467 of Lecture Notes in Computer Science, pages 85–100. Springer, 2019. doi:10.1007/978-3-030-17083-7_6.
[14] David R. Karger, Rajeev Motwani, and G. D. S. Ramkumar. On approximating the longest path in a graph. Algorithmica, 18(1):82–98, 1997. doi:10.1007/BF02523689.
[15] Manuel Lafond, Wenfeng Lai, Adiesha Liyanage, and Binhai Zhu. The longest subsequence-repeated subsequence problem. In Weili Wu and Jianxiong Guo, editors, Combinatorial Optimization and Applications - 17th International Conference, COCOA 2023, Hawaii, HI, USA, December 15-17, 2023, Proceedings, Part I, volume 14461 of Lecture Notes in Computer Science, pages 446–458. Springer, 2023. doi:10.1007/978-3-031-49611-0_32.
[16] Wenfeng Lai, Adiesha Liyanage, Binhai Zhu, and Peng Zou. The longest letter-duplicated subsequence and related problems. Acta Informatica, 61(3):315–329, 2024. doi:10.1007/S00236-024-00459-7.
[17] Udi Manber and Sun Wu. Approximate string matching with arbitrary cost for text and hypertext. In Advances in Structural and Syntactic Pattern Recognition, pages 22–33, 1992. doi:10.1142/9789812797919_0002.
[18] Gonzalo Navarro. Improved approximate pattern matching on hypertext. Theoretical Computer Science, 237(1-2):455–463, 2000. doi:10.1016/S0304-3975(99)00333-3.
[19] Ngan Nguyen, Glenn Hickey, Daniel R. Zerbino, Brian J. Raney, Dent Earl, Joel Armstrong, W. James Kent, David Haussler, and Benedict Paten. Building a pan-genome reference for a population. Journal of Computational Biology, 22(5):387–401, 2015. doi:10.1089/cmb.2014.0146.
[20] Kunsoo Park and Dong Kyue Kim. String matching in hypertext. In Zvi Galil and Esko Ukkonen, editors, Combinatorial Pattern Matching, 6th Annual Symposium, CPM 95, Espoo, Finland, July 5-7, 1995, Proceedings, volume 937 of Lecture Notes in Computer Science, pages 318–329. Springer, 1995. doi:10.1007/3-540-60044-2_51.
[21] Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, and Gunnar W. Klau. Using the longest run subsequence problem within homology-based scaffolding. Algorithms Mol. Biol., 16(1):11, 2021. doi:10.1186/S13015-021-00191-8.
[22] Robert Sedgewick and Kevin Wayne. Algorithms (Fourth edition deluxe). Addison-Wesley, 2016.
[23] Steven Skiena. The Algorithm Design Manual, Third Edition. Texts in Computer Science. Springer, 2020. doi:10.1007/978-3-030-54256-6.
[24] Robert Endre Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146–160, 1972. doi:10.1137/0201010.
[25] The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118–135, 2018. doi:10.1093/bib/bbw089.
[26] David P. Williamson and David B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, 2011. URL: http://www.cambridge.org/de/knowledge/isbn/item5759340/?site_locale=de_DE.

