License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2022.24
URN: urn:nbn:de:0030-drops-170586
Go to the corresponding LIPIcs Volume Portal

Rubert, Diego P. ; Braga, Marília D. V.

Gene Orthology Inference via Large-Scale Rearrangements for Partially Assembled Genomes

LIPIcs-WABI-2022-24.pdf (1 MB)


Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes our method first computes all pairwise gene similarities. Then it runs pairwise ILP comparisons to compute optimal gene matchings, which minimize, by taking the similarities into account, the weighted rearrangement distance between the analyzed genomes (a problem that is NP-hard). The gene matchings are then integrated into gene families in the final step. Although the ILP is quite efficient and could conceptually analyze genomes that are not completely assembled but split in several contigs, our tool failed in completing that task. The main reason is that each ILP pairwise comparison includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment in the other genome, producing an exponential increase of the search space.
In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering (based on their gene content intersections) the linear segments into m ≥ 1 subsets, whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much bigger than one, and with the possible side effect of resulting in sub-optimal instead of optimal gene matchings, the heuristic works very well in practice, from both the speed performance and the quality of computed solutions. Our experiments on real data show that we can now efficiently analyze fruit fly genomes with unfinished assemblies distributed in hundreds or even thousands of contigs, obtaining orthologies that are more similar to FlyBase orthologies when compared to orthologies computed by other inference tools. Moreover, for complete assemblies the version with heuristic capping reports orthologies that are very similar to the orthologies computed by the optimal version of our tool. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities.

BibTeX - Entry

  author =	{Rubert, Diego P. and Braga, Mar{\'\i}lia D. V.},
  title =	{{Gene Orthology Inference via Large-Scale Rearrangements for Partially Assembled Genomes}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{24:1--24:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-170586},
  doi =		{10.4230/LIPIcs.WABI.2022.24},
  annote =	{Keywords: Comparative genomics, double-cut-and-join, indels, gene orthology}

Keywords: Comparative genomics, double-cut-and-join, indels, gene orthology
Collection: 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Issue Date: 2022
Date of publication: 26.08.2022
Supplementary Material: Both the original version with optimal capping and the new modified version with heuristic capping can be downloaded from our GitLab server at:
Software (Source Code):

DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI