Document

**Published in:** LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets. The main contribution of this paper is the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. This is one of the first applications of machine learning to phylogenetic studies, and we show its promise with a proof-of-concept experimental study conducted on both simulated and real data consisting of binary trees with no missing taxa.

Giulia Bernardini, Leo van Iersel, Esther Julien, and Leen Stougie. Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 16:1-16:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{bernardini_et_al:LIPIcs.WABI.2022.16, author = {Bernardini, Giulia and van Iersel, Leo and Julien, Esther and Stougie, Leen}, title = {{Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning}}, booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)}, pages = {16:1--16:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-243-3}, ISSN = {1868-8969}, year = {2022}, volume = {242}, editor = {Boucher, Christina and Rahmann, Sven}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.16}, URN = {urn:nbn:de:0030-drops-170507}, doi = {10.4230/LIPIcs.WABI.2022.16}, annote = {Keywords: Phylogenetics, Hybridization, Cherry Picking, Machine Learning, Heuristic} }

Document

**Published in:** LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler’s theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph G_{S,k}(V,E), where V is the set of length-(k-1) substrings of the strings in S, and G_{S,k} contains an edge (u,v) with multiplicity m_{u,v}, if and only if the string u[0]⋅ v is equal to the string u⋅ v[k-2] and this string occurs exactly m_{u,v} times in total in strings in S. Let G_{Σ,k}(V_{Σ,k},E_{Σ,k}) be the complete dBG of Σ^k. The Eulerian Extension (EE) problem on G_{S,k} asks to extend G_{S,k} with a set ℬ of nodes from V_{Σ,k} and a smallest multiset 𝒜 of edges from E_{Σ,k} to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of G_{S,k} but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on G_{S,k} is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs:
1) When G_{S,k} is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying G_{Σ,k} and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in 𝒪(|V|klog d+|E|) time, which is nearly optimal, since the size of G_{S,k} is Θ(|V|k+|E|).
2) When G_{S,k} is not balanced, we are asked to extend G_{S,k} to H_{S,k}(V∪ℬ,E∪𝒜) such that every node of H_{S,k} is balanced and the total number |𝒜| of added edges is minimized. We show that this problem can be solved in the optimal 𝒪(k|V| + |E|+ |𝒜|) time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.

Giulia Bernardini, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Making de Bruijn Graphs Eulerian. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 12:1-12:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{bernardini_et_al:LIPIcs.CPM.2022.12, author = {Bernardini, Giulia and Chen, Huiping and Loukides, Grigorios and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle}, title = {{Making de Bruijn Graphs Eulerian}}, booktitle = {33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)}, pages = {12:1--12:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-234-1}, ISSN = {1868-8969}, year = {2022}, volume = {223}, editor = {Bannai, Hideo and Holub, Jan}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.12}, URN = {urn:nbn:de:0030-drops-161391}, doi = {10.4230/LIPIcs.CPM.2022.12}, annote = {Keywords: string algorithms, graph algorithms, Eulerian graph, de Bruijn graph} }

Document

**Published in:** LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

We consider the problem of constructing strings over an alphabet Σ that start with a given prefix u, end with a given suffix v, and avoid occurrences of a given set of forbidden substrings. In the decision version of the problem, given a set S_k of forbidden substrings, each of length k, over Σ, we are asked to decide whether there exists a string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S_k occurs in x. Our first result is an 𝒪(|u|+|v|+k|S_k|)-time algorithm to decide this problem. In the more general optimization version of the problem, given a set S of forbidden arbitrary-length substrings over Σ, we are asked to construct a shortest string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S occurs in x. Our second result is an 𝒪(|u|+|v|+||S||⋅|Σ|)-time algorithm to solve this problem, where ||S|| denotes the total length of the elements of S.
Interestingly, our results can be directly applied to solve the reachability and shortest path problems in complete de Bruijn graphs in the presence of forbidden edges or of forbidden paths.
Our algorithms are motivated by data privacy, and in particular, by the data sanitization process. In the context of strings, sanitization consists in hiding forbidden substrings from a given string by introducing the least amount of spurious information. We consider the following problem. Given a string w of length n over Σ, an integer k, and a set S_k of forbidden substrings, each of length k, over Σ, construct a shortest string y over Σ such that no s ∈ S_k occurs in y and the sequence of all other length-k fragments occurring in w is a subsequence of the sequence of the length-k fragments occurring in y. Our third result is an 𝒪(nk|S_k|⋅|Σ|)-time algorithm to solve this problem.

Giulia Bernardini, Alberto Marchetti-Spaccamela, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Constructing Strings Avoiding Forbidden Substrings. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 9:1-9:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{bernardini_et_al:LIPIcs.CPM.2021.9, author = {Bernardini, Giulia and Marchetti-Spaccamela, Alberto and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle}, title = {{Constructing Strings Avoiding Forbidden Substrings}}, booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)}, pages = {9:1--9:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-186-3}, ISSN = {1868-8969}, year = {2021}, volume = {191}, editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.9}, URN = {urn:nbn:de:0030-drops-139604}, doi = {10.4230/LIPIcs.CPM.2021.9}, annote = {Keywords: string algorithms, forbidden strings, de Bruijn graphs, data sanitization} }

Document

**Published in:** LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem (Edit distance, Total order, Frequency, Sanitization) asks us to construct a string X_ED such that: (i) no string of 𝒮 occurs in X_ED; (ii) the order of all other length-k substrings over Σ (and thus the frequency) is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].
ETFS can be solved in 𝒪(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows:
- An 𝒪(n²log²k)-time algorithm to solve ETFS.
- An 𝒪(n²log²n)-time algorithm to solve AETFS (Arbitrary lengths, Edit distance, Total order, Frequency, Sanitization), a generalization of ETFS in which the elements of 𝒮 can have arbitrary lengths. Our algorithms are thus optimal up to subpolynomial factors, unless SETH fails.
In order to arrive at these results, we develop new techniques for computing a variant of the standard dynamic programming (DP) table for edit distance. In particular, we simulate the DP table computation using a directed acyclic graph in which every node is assigned to a smaller DP table. We then focus on redundancy in these DP tables and exploit a tabulation technique according to dyadic intervals to obtain an optimal alignment in 𝒪̃(n²) total time. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.

Takuya Mieno, Solon P. Pissis, Leen Stougie, and Michelle Sweering. String Sanitization Under Edit Distance: Improved and Generalized. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 19:1-19:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{mieno_et_al:LIPIcs.CPM.2021.19, author = {Mieno, Takuya and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle}, title = {{String Sanitization Under Edit Distance: Improved and Generalized}}, booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)}, pages = {19:1--19:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-186-3}, ISSN = {1868-8969}, year = {2021}, volume = {191}, editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.19}, URN = {urn:nbn:de:0030-drops-139709}, doi = {10.4230/LIPIcs.CPM.2021.19}, annote = {Keywords: string algorithms, data sanitization, edit distance, dynamic programming} }

Document

**Published in:** LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of 𝒮 occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in 𝒪(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.

Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, and Michelle Sweering. String Sanitization Under Edit Distance. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bernardini_et_al:LIPIcs.CPM.2020.7, author = {Bernardini, Giulia and Chen, Huiping and Loukides, Grigorios and Pisanti, Nadia and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle}, title = {{String Sanitization Under Edit Distance}}, booktitle = {31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)}, pages = {7:1--7:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-149-8}, ISSN = {1868-8969}, year = {2020}, volume = {161}, editor = {G{\o}rtz, Inge Li and Weimann, Oren}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.7}, URN = {urn:nbn:de:0030-drops-121324}, doi = {10.4230/LIPIcs.CPM.2020.7}, annote = {Keywords: String algorithms, data sanitization, edit distance, dynamic programming, conditional lower bound} }

Document

**Published in:** LIPIcs, Volume 40, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015)

Users of cloud computing services are offered rapid access to computing resources via the Internet. Cloud providers use different pricing options such as (i) time slot reservation in advance at a fixed price and (ii) on-demand service at a (hourly) pay-as-used basis. Choosing the best combination of pricing options is a challenging task for users, in particular, when the instantiation of computing jobs underlies uncertainty.
We propose a natural model for two-stage scheduling under uncertainty that captures such resource provisioning and scheduling problem in the cloud. Reserving a time unit for processing jobs incurs some cost, which depends on when the reservation is made: a priori decisions, based only on distributional information, are much cheaper than on-demand decisions when the actual scenario is known. We consider both stochastic and robust versions of scheduling unrelated machines with objectives of minimizing the sum of weighted completion times and the makespan. Our main contribution is an (8+eps)-approximation algorithm for the min-sum objective for the stochastic polynomial-scenario model. The same technique gives a (7.11+eps)-approximation for minimizing the makespan. The key ingredient is an LP-based separation of jobs and time slots to be considered in either the first or the second stage only, and then approximately solving the separated problems. At the expense of another epsilon our results hold for any arbitrary scenario distribution given by means of a black-box. Our techniques also yield approximation algorithms for robust two-stage scheduling.

Lin Chen, Nicole Megow, Roman Rischke, and Leen Stougie. Stochastic and Robust Scheduling in the Cloud. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 40, pp. 175-186, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)

Copy BibTex To Clipboard

@InProceedings{chen_et_al:LIPIcs.APPROX-RANDOM.2015.175, author = {Chen, Lin and Megow, Nicole and Rischke, Roman and Stougie, Leen}, title = {{Stochastic and Robust Scheduling in the Cloud}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015)}, pages = {175--186}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-939897-89-7}, ISSN = {1868-8969}, year = {2015}, volume = {40}, editor = {Garg, Naveen and Jansen, Klaus and Rao, Anup and Rolim, Jos\'{e} D. P.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2015.175}, URN = {urn:nbn:de:0030-drops-53028}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2015.175}, annote = {Keywords: Approximation Algorithms, Robust Optimization, Stochastic Optimization, Unrelated Machine Scheduling, Cloud Computing} }

Document

**Published in:** LIPIcs, Volume 1, 25th International Symposium on Theoretical Aspects of Computer Science (2008)

We address the problem of efficient data gathering in a wireless
network through multi-hop communication. We focus on the objective
of minimizing the maximum flow time of a data packet. We prove
that no polynomial time algorithm for this problem can have
approximation ratio less than $Omega(m^{1/3)$ when $m$ packets
have to be transmitted, unless $P = NP$. We then use resource
augmentation to assess the performance of a FIFO-like strategy. We
prove that this strategy is 5-speed optimal, i.e., its cost remains
within the optimal cost if we allow the algorithm to transmit data
at a speed 5 times higher than that of the optimal solution we
compare to.

Vincenzo Bonifaci, Peter Korteweg, Alberto Marchetti-Spaccamela, and Leen Stougie. Minimizing Flow Time in the Wireless Gathering Problem. In 25th International Symposium on Theoretical Aspects of Computer Science. Leibniz International Proceedings in Informatics (LIPIcs), Volume 1, pp. 109-120, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

Copy BibTex To Clipboard

@InProceedings{bonifaci_et_al:LIPIcs.STACS.2008.1338, author = {Bonifaci, Vincenzo and Korteweg, Peter and Marchetti-Spaccamela, Alberto and Stougie, Leen}, title = {{Minimizing Flow Time in the Wireless Gathering Problem}}, booktitle = {25th International Symposium on Theoretical Aspects of Computer Science}, pages = {109--120}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-939897-06-4}, ISSN = {1868-8969}, year = {2008}, volume = {1}, editor = {Albers, Susanne and Weil, Pascal}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2008.1338}, URN = {urn:nbn:de:0030-drops-13381}, doi = {10.4230/LIPIcs.STACS.2008.1338}, annote = {Keywords: Wireless networks, data gathering, approximation algorithms, distributed algorithms} }

Document

**Published in:** Dagstuhl Seminar Proceedings, Volume 5031, Algorithms for Optimization with Incomplete Information (2005)

This paper adresses facility location under uncertain demand. The problem is to determine the optimal location of facilities and allocation of uncertain customer demand to these facilities. The costs of operating the facilities are subject to economies of scale. The objective is to minimize the total expected costs. These costs can be split into two parts: firstly the costs of investing in a facility as well as maintaining and operating it with strictly diminishing average costs, and secondly linear transportation cost. We formulate the problem as a two-stage stochastic programming model and present a solution method based on Lagrangian Relaxation. We also show some computional results based on data from the Norwegian meat industry regarding the location of slaughterhouses.

Peter Schütz, Leen Stougie, and Asgeir Tomasgard. Facility location with uncertain demand and economies of scale. In Algorithms for Optimization with Incomplete Information. Dagstuhl Seminar Proceedings, Volume 5031, pp. 1-11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)

Copy BibTex To Clipboard

@InProceedings{schutz_et_al:DagSemProc.05031.11, author = {Sch\"{u}tz, Peter and Stougie, Leen and Tomasgard, Asgeir}, title = {{Facility location with uncertain demand and economies of scale}}, booktitle = {Algorithms for Optimization with Incomplete Information}, pages = {1--11}, series = {Dagstuhl Seminar Proceedings (DagSemProc)}, ISSN = {1862-4405}, year = {2005}, volume = {5031}, editor = {Susanne Albers and Rolf H. M\"{o}hring and Georg Ch. Pflug and R\"{u}diger Schultz}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.05031.11}, URN = {urn:nbn:de:0030-drops-1114}, doi = {10.4230/DagSemProc.05031.11}, annote = {Keywords: facility location , stochastic , economies of scale} }

X

Feedback for Dagstuhl Publishing

Feedback submitted

Please try again later or send an E-mail