
Programmable Co-Transcriptional Splicing: Realizing Regular Languages via Hairpin Deletion

Da-Jung Cho, Department of Software and Computer Engineering, Ajou University, Suwon, Republic of Korea · Szilárd Zsolt Fazekas, Graduate School of Engineering Science, Akita University, Japan · Shinnosuke Seki, University of Electro-Communications, Tokyo, Japan · Max Wiedenhöft, Department of Computer Science, Kiel University, Germany
Abstract

RNA co-transcriptionality, where RNA is spliced or folded during transcription from DNA templates, offers promising potential for molecular programming. It enables programmable folding of nanoscale RNA structures and has recently been shown to be Turing universal. While post-transcriptional splicing is well studied, co-transcriptional splicing is gaining attention for its efficiency, though its unpredictability remains a challenge. In this paper, we focus on engineering co-transcriptional splicing, not only as a natural phenomenon but as a programmable mechanism for generating specific RNA target sequences from DNA templates. The problem we address is whether we can encode a set of RNA sequences for a given system onto a DNA template word, ensuring that all the sequences are generated through co-transcriptional splicing. Given that finding the optimal encoding has been shown to be NP-complete under the various energy models considered [4], we propose a practical alternative approach under the logarithmic energy model. More specifically, we provide a construction that encodes an arbitrary nondeterministic finite automaton (NFA) into a circular DNA template from which co-transcriptional splicing produces all sequences accepted by the NFA. As all finite languages can be efficiently encoded as NFAs, this framework solves the problem of finding small DNA templates for arbitrary target sets of RNA sequences. The quest to obtain the smallest possible such templates naturally leads us to consider the problem of minimizing NFAs and certain practically motivated variants of it, but as we show, those minimization problems are computationally intractable.

Keywords and phrases:
RNA Transcription, Co-Transcriptional Splicing, Finite Automata Simulation, NFA Minimization
Copyright and License:
© Da-Jung Cho, Szilárd Zsolt Fazekas, Shinnosuke Seki, and Max Wiedenhöft; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Applied computing → Bioinformatics
Related Version:
Full Version: https://arxiv.org/abs/2506.23384
Funding:
Da-Jung Cho was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2025-RS-2023-00255968) grant funded by the Korea government (MSIT). Szilárd Zsolt Fazekas was supported by JSPS Kakenhi Grant No. 23K10976. Max Wiedenhöft’s work was partially supported by the DFG project number 437493335.
Editors:
Josie Schaeffer and Fei Zhang

1 Introduction

RNA splicing is a fundamental process in eukaryotic gene expression, where intronic sequences are removed from precursor messenger RNA (pre-mRNA) transcripts, and the remaining exonic sequences are ligated together to form a mature messenger RNA (mRNA). This essential process is mediated by the spliceosome, a dynamic and large macromolecular complex consisting of small nuclear RNAs (snRNAs) and associated proteins. The spliceosome assembles on the pre-mRNA in a highly conserved and sequential manner. The assembly involves the sequential recruitment of specific snRNPs (small nuclear ribonucleoproteins) and various factors that play critical roles in splice site recognition and catalysis. The process begins when the 5’-splice site (5’-SS) is identified by the polymerase-spliceosome complex. This site is then kept in close proximity to the transcribed region of the RNA, awaiting the hybridization event with the complementary sequence at the 3’-splice site (3’-SS) after subsequent transcription of the RNA. Once the 3’-SS is transcribed, the intronic sequence is excised, and the two exons are ligated together to form the mature RNA. Splicing takes place while transcription is still ongoing, meaning that the spliceosome assembles and acts on the RNA as it is being made. This close coordination between transcription and splicing is known as co-transcriptional splicing. Increasing biochemical and genomic evidence has established that spliceosome assembly and even catalytic steps of splicing frequently occur co-transcriptionally [16]. As shown in Figure 1, co-transcriptional splicing involves the spliceosome’s early recognition and action on splice sites while the RNA is still being transcribed. This coupling imposes temporal and structural constraints on splice site selection and nascent RNA folding, since spliceosome assembly and splicing catalysis occur during ongoing transcription.

Figure 1: (a) An illustration of co-transcriptional splicing: The process begins when the 5’-splice site (5’-SS) is identified by the polymerase-spliceosome complex. This site is then kept in close proximity to the transcribed region of the RNA, awaiting the hybridization event with the complementary sequence at the 3’-splice site (3’-SS) after subsequent transcription of the RNA. The transcript forms a hairpin where the 5’-SS binds the branch point. After the 3’-SS is transcribed the spliceosome removes the intron and joins the exons. (b) A string representation of the DNA sequence in (a), factorized according to the scheme of co-transcriptional splicing, as used in Definition 1.

While co-transcriptional splicing has been studied as a natural phenomenon, it is increasingly being recognized as a programmable and engineerable process. Indeed, Geary, Rothemund, and Andersen proved that one such process, called co-transcriptional folding, is programmable to assemble nanoscale single-stranded RNA structures in vitro [9], and Seki, one of the authors, then proved in collaboration with Geary and others that it is Turing universal by using an oritatami model [8]. Recently, we became curious about the potential of using RNA sequences as programmable components in molecular systems. Motivated by advances in co-transcriptional RNA folding and splicing, this raises the question: Can we encode a set of RNA sequences for a given system onto a single DNA template such that all the sequences are generated through co-transcriptional splicing, while avoiding sequences that may disrupt the system? As transcription proceeds and the 3’-SS is transcribed, RNA polymerase often pauses just downstream, allowing time for accurate splice site recognition and splicing to take place. The folding of the emerging RNA strand can influence this process. Motivated by this perspective, we introduced a formal model of co-transcriptional splicing [4], defined as contextual lariat deletion under various biologically plausible conditions (energy models). For the sake of simplicity, and to avoid confusion with the rather different but similarly named contextual deletion [14], we refer to the operation here as hairpin deletion. We showed that the template construction problem is NP-complete when |Σ| ≥ 4. The hardness of this problem led us to consider an alternative approach to the template construction problem. A natural observation is that a DNA template word can be obtained by finding a shortest common supersequence of the RNA target sequences. However, this problem is also known to be NP-complete [7, 15]. The contributions of this paper are as follows:

  • (Parallel) logarithmically-bounded hairpin deletion operation: In natural RNA folding, the destabilizing energy contributed by loop regions in hairpin structures increases logarithmically with loop size. While small loops incur significant energy penalties, this effect diminishes as loop length increases [6]. This behavior supports the logarithmically-bounded hairpin model, where the destabilization caused by a loop of length ℓ is proportional to log(ℓ). Our model exploits this property by favoring hairpin structures with long loops and short stems, which remain energetically viable under the logarithmic cost function. Furthermore, we introduce a parallel deletion model to represent consecutive splicing behavior, since co-transcriptional splicing tends to proceed consecutively rather than recursively due to kinetic and structural constraints during transcription.

  • Template word construction by encoding RNA target sequences as a finite automaton on a circular DNA template word: In Section 3, our main result shows that, if we consider a set of target RNA sequences as a regular language represented by some (non)deterministic finite automaton, we can construct a DNA template from which we can obtain exactly that language using the logarithmic hairpin deletion model. The construction operates on a circular DNA template, which is widely used in molecular biology, for instance in plasmids and viral genomes [5, 9], and supports continuous transcription. We first demonstrate how to construct such a circular DNA word that encodes an arbitrary NFA in accordance with the behavior of logarithmic hairpin deletion. Then, a detailed construction for the RNA alphabet Σ𝚁𝙽𝙰={𝙰,𝚄,𝙲,𝙶} is also provided, illustrating how the states and transitions of the automaton are encoded. It is shown that this construction actually works for an appropriately designed DNA template word under the logarithmic hairpin deletion model.

  • Minimizing NFA-like models: As the size of the constructed template depends on the number of states and transitions in the automaton, minimizing the NFA becomes an important consideration. As the minimization problem for NFAs is NP-hard in general [12], in Section 4, we consider restricted NFA-like models that fit our setting. We show that minimization in these restricted cases is still computationally hard. This highlights a fundamental complexity barrier in optimizing DNA template designs for generating arbitrary regular languages of RNA sequences via co-transcriptional splicing. Therefore, it may be more effective to focus on designing specific classes of target sequences and their corresponding automata representations.

2 Preliminaries

Let ℕ denote the set of positive integers and let ℕ₀ = ℕ ∪ {0}. Let ℤ denote the set of integers. For some m ∈ ℕ, denote by [m] the set {1,2,…,m} and let [m]₀ = [m] ∪ {0}. With Σ, we denote a finite set of symbols, called an alphabet. The elements of Σ are called letters. A word over Σ is a finite sequence of letters from Σ. With ε, we denote the empty word. The set of all words over Σ is denoted by Σ*. Additionally, set Σ⁺ = Σ* ∖ {ε}. Given some word w ∈ Σ*, the word wω represents an infinite repetition of w, called a circular word. The length of a word w ∈ Σ*, i.e., the number of letters in w, is denoted by |w|; hence, |ε| = 0. For some w ∈ Σ*, if we have w = xyz for some x,y,z ∈ Σ*, then we say that x is a prefix, y is a factor, and z is a suffix of w. We denote the set of all factors, prefixes, and suffixes of w by Fact(w), Pref(w), and Suff(w), respectively. If x ≠ w, y ≠ w, or z ≠ w respectively, we call the respective prefix, factor, or suffix proper. For some word w = xy, with x,y ∈ Σ*, we define wy⁻¹ = x as well as x⁻¹w = y. A function θ : Σ* → Σ* is called an antimorphic involution if θ(θ(𝚊)) = 𝚊 for all 𝚊 ∈ Σ and θ(w𝚋) = θ(𝚋)θ(w) for all w ∈ Σ* and 𝚋 ∈ Σ. Watson-Crick complementarity, which primarily governs hybridization among DNA and RNA sequences, has been modeled as an antimorphic involution that maps A to T (U), C to G, G to C, and T (U) to A. A nondeterministic finite automaton (NFA) is a tuple A = (Σ, Q, q₀, δ, F) where Σ is the input alphabet, Q is the finite set of states, δ : Q × Σ → 2^Q is the multivalued transition function, q₀ ∈ Q is the initial state, and F ⊆ Q is the set of final states. In the usual way, δ is extended to a function Q × Σ* → 2^Q, and the language accepted by A is L(A) = {w ∈ Σ* ∣ δ(q₀,w) ∩ F ≠ ∅}. The automaton A is a deterministic finite automaton (DFA) if δ is a single-valued partial function. It is well known that deterministic and nondeterministic finite automata recognize the class of regular languages [11].
Finally, for a regular expression r, writing it down immediately refers to its language; e.g., writing 𝚊(𝚊|𝚋)*𝚋 refers to the set {w ∈ {𝚊,𝚋}* ∣ w[1] = 𝚊, w[|w|] = 𝚋}.
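The antimorphic involution θ modeling Watson-Crick complementarity can be made concrete in a few lines. The following Python sketch (the function name `theta` and the choice of the RNA alphabet are ours, purely for illustration) complements every base and reverses the word:

```python
# Watson-Crick complementarity over the RNA alphabet {A, U, C, G}.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(w: str) -> str:
    """Antimorphic involution: complement each letter and reverse the word."""
    return "".join(COMPLEMENT[b] for b in reversed(w))
```

For instance, `theta("AUA")` yields `"UAU"`, and applying `theta` twice returns the original word, as an involution must.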

2.1 Hairpin deletion

Hairpin deletion is a formal operation introduced in [4] with the purpose of modeling co-transcriptional splicing in RNA transcription. The computational power of the operation in various energy models (length restrictions between the stem and the loop) has been thoroughly investigated in [4]. Here we introduce only the elements relevant to our results (in the logarithmic energy model). A pair of words (u,v) ∈ Σ* × Σ* serves as a context; u and v are called the left and right contexts, respectively. A set C of such pairs is called a context set. Hairpins are an atomic element of DNA and RNA structures. They can be modeled as xℓθ(x) for words x, ℓ ∈ Σ*, where x = b₁b₂⋯bₙ and its Watson-Crick complement θ(x) = θ(bₙ)⋯θ(b₂)θ(b₁) hybridize with each other via hydrogen bonds between the bases bᵢ and θ(bᵢ) into a stem, and ℓ serves as a loop (in reality |ℓ| ≥ 3 is required) [13]. Assume Σ to be some alphabet, w ∈ Σ* to be some word over that alphabet, and θ to be some antimorphic involution on Σ*. We call w a hairpin if w = xℓθ(x) for some x, ℓ ∈ Σ*. Furthermore, let c be some positive number. We call w a logarithmically-bounded hairpin (log-hairpin) if |x| ≥ c·log(|ℓ|). We denote the set of all hairpins regarding Σ and θ by H(Σ,θ) and the set of all log-hairpins regarding Σ, θ, and c by Hlog(Σ,θ,c). For readability purposes, we often write just H or Hlog and infer Σ, θ, and c from the context. Notice that the logarithmic penalty of ℓ in log-hairpins allows for an exponential size of the loop with respect to the length of the stem.
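The log-hairpin condition is easy to test mechanically. The sketch below (our own helper; the text does not fix the base of the logarithm, so base 2 is an assumption here, and loops of length at most 1 are treated as cost 0) tries every candidate stem length:

```python
import math

COMP = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(w: str) -> str:
    """Watson-Crick complement as an antimorphic involution (inlined helper)."""
    return "".join(COMP[b] for b in reversed(w))

def is_log_hairpin(w: str, c: float = 1.0) -> bool:
    """True iff w = x + loop + theta(x) with |x| >= c * log2(|loop|).
    Base-2 log is an assumption of this sketch; the definition leaves it open."""
    for k in range(len(w) // 2, -1, -1):      # candidate stem lengths, longest first
        x, loop = w[:k], w[k:len(w) - k]
        if w[len(w) - k:] != theta(x):
            continue                           # suffix is not the complement of x
        if k >= c * math.log2(max(len(loop), 1)):
            return True
    return False
```

A stem of length 7 suffices for a loop of length 100 (7 ≥ log₂(100) ≈ 6.64), illustrating the exponential loop size permitted by the model.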

Using the notion of hairpins and log-hairpins, we can continue with the definition of bounded hairpin deletion. Splicing relies on the occurrence of left and right contexts, the formation of a stable hairpin containing the left context, and a potential margin between the hairpin and the occurrence of the right context. As only the logarithmic energy model is relevant in this paper, we focus only on log-hairpins. Figure 1(b) shows a string factorization where the logarithmic bounded hairpin deletion operation can be applied, reflecting the co-transcriptional splicing process.

Definition 1.

Let S = (Σ, θ, c, C, n) be a tuple of parameters, with Σ an alphabet, θ an antimorphic involution, C a context set, n ∈ ℕ₀ a constant called the margin, and c the multiplicative factor used in the logarithmic energy model. If

w = wp α x ℓ θ(x) θ(α) z β ws

for some (α,β) ∈ C and wp, ws, x, ℓ, z ∈ Σ*, then we say that wpws is obtainable by logarithmically-bounded hairpin deletion (or just log-hairpin deletion) from w over S (denoted w ↪log_S wpws) if and only if αxℓθ(αx) ∈ Hlog and |z| ≤ n.

For example, if θ is defined to represent the Watson-Crick complement, the context-set is set to C={(𝙰𝚄𝙰,𝙲𝙲𝙲)}, the margin is set to n=1, the log-factor is set to c=1, and w=𝙰𝙰𝚄𝙰𝙰𝙲𝙲𝚄𝚄𝙰𝚄𝙶𝙲𝙲𝙲𝙶𝙰, then we have

𝙰𝙰𝚄𝙰𝙰𝙲𝙲𝚄𝚄𝙰𝚄𝙶𝙲𝙲𝙲𝙶𝙰 ↪log_S 𝙰𝙶𝙰

by wp=𝙰, α=𝙰𝚄𝙰, x=𝙰, ℓ=𝙲𝙲, θ(x)θ(α)=𝚄𝚄𝙰𝚄, z=𝙶, β=𝙲𝙲𝙲, and ws=𝙶𝙰.
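Definition 1 can be checked exhaustively on toy inputs. The following brute-force Python sketch (our own function names; exponential in the worst case, so intended only for small examples) enumerates all factorizations w = wp·α·x·ℓ·θ(x)·θ(α)·z·β·ws and collects the results wp·ws. The Watson-Crick θ and a base-2 logarithm are assumptions of the sketch:

```python
import math

COMP = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(w):
    return "".join(COMP[b] for b in reversed(w))

def log_hairpin_deletions(w, C, c=1.0, n=0):
    """All words obtainable from w by one log-hairpin deletion (Definition 1)."""
    results = set()
    for alpha, beta in C:
        ta = theta(alpha)
        for i in range(len(w) - len(alpha) + 1):          # start of alpha
            if not w.startswith(alpha, i):
                continue
            for j in range(i + len(alpha), len(w) - len(beta) + 1):  # start of beta
                if not w.startswith(beta, j):
                    continue
                mid = w[i + len(alpha):j]   # = x + loop + theta(alpha x) + z
                for zlen in range(min(n, len(mid)) + 1):
                    core = mid[:len(mid) - zlen]
                    if not core.endswith(ta):
                        continue
                    t = core[:len(core) - len(ta)]        # = x + loop + theta(x)
                    for xlen in range(len(t) // 2 + 1):
                        x, loop = t[:xlen], t[xlen:len(t) - xlen]
                        if t[len(t) - xlen:] != theta(x):
                            continue
                        if len(alpha) + xlen >= c * math.log2(max(len(loop), 1)):
                            results.add(w[:i] + w[j + len(beta):])
    return results
```

Run on the example above, the sketch reproduces exactly the word 𝙰𝙶𝙰.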

Given some input word or some input language, we can consider the language of all words obtained if hairpin deletion is allowed to be applied multiple times. In practice, it is observed that co-transcriptional splicing occurs in a consecutive manner on the processed DNA sequence, meaning that newly created splicing sites resulting from earlier splicing operations do not affect the outcome. Hence, we introduce a parallel deletion model in which only non-overlapping hairpins present in the sequence at the beginning can be deleted, representing consecutive co-transcriptional splicing.

Definition 2.

Let w ∈ Σ* be a word and let S be a tuple of parameters for log-hairpin deletion. For some w′ ∈ Σ*, we call it obtainable by parallel log-hairpin deletion from w over S, denoted w ↪plog_S w′, if there exist m ∈ ℕ and u1, α1, …, um, αm, um+1 ∈ Σ* such that

w = u1 α1 u2 α2 ⋯ um αm um+1,
w′ = u1 u2 ⋯ um um+1,

and, for all i ∈ [m], we have αi ↪log_S ε, i.e., each αi is removed by log-hairpin deletion. For readability purposes, we may just write ↪ and infer S or log from the context.

For example, given the context set C = {(𝙰𝙰,𝙲𝙲),(𝙲𝙲,𝙲𝙲)}, θ defined by θ(𝙰)=𝚄 and θ(𝙲)=𝙶, and the word w = 𝙲𝙰𝙰𝙰𝙶𝚄𝚄𝙲𝙲𝚄𝙶𝙲𝙲𝙰𝙰𝙶𝙶𝙲𝙲𝙶, we can delete two factors at once:

𝙲𝙰𝙰𝙰𝙶𝚄𝚄𝙲𝙲𝚄𝙶𝙲𝙲𝙰𝙰𝙶𝙶𝙲𝙲𝙶 ↪plog_S 𝙲𝙰𝚄𝙶𝙶
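One round of parallel deletion can be prototyped by first deciding which factors delete to ε by a single log-hairpin deletion, and then combining non-overlapping deletable factors recursively. The sketch below (names are ours; for simplicity it also keeps the zero-deletion outcome, i.e., w itself, and assumes Watson-Crick θ and base-2 log) does exactly that:

```python
import math
from functools import lru_cache

COMP = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(w):
    return "".join(COMP[b] for b in reversed(w))

def parallel_deletions(w, C, c=1.0, n=0):
    """One round of parallel log-hairpin deletion (Definition 2), brute force."""

    def deletes_to_empty(a):
        # a must factor as alpha + x + loop + theta(alpha x) + z + beta, (alpha, beta) in C
        for alpha, beta in C:
            if len(a) < len(alpha) + len(beta):
                continue
            if not (a.startswith(alpha) and a.endswith(beta)):
                continue
            mid = a[len(alpha):len(a) - len(beta)]
            ta = theta(alpha)
            for zlen in range(min(n, len(mid)) + 1):
                core = mid[:len(mid) - zlen]
                if not core.endswith(ta):
                    continue
                t = core[:len(core) - len(ta)]
                for xlen in range(len(t) // 2 + 1):
                    x, loop = t[:xlen], t[xlen:len(t) - xlen]
                    if t[len(t) - xlen:] != theta(x):
                        continue
                    if len(alpha) + xlen >= c * math.log2(max(len(loop), 1)):
                        return True
        return False

    @lru_cache(maxsize=None)
    def suffix_results(i):
        if i == len(w):
            return frozenset({""})
        out = {w[i] + r for r in suffix_results(i + 1)}   # keep the letter w[i]
        for j in range(i + 1, len(w) + 1):                # or delete the factor w[i:j]
            if deletes_to_empty(w[i:j]):
                out |= suffix_results(j)
        return frozenset(out)

    return set(suffix_results(0))
```

On the example above, the result set contains 𝙲𝙰𝚄𝙶𝙶 (via u1 = 𝙲𝙰) as well as alternatives such as 𝙲𝚄𝙶𝙶 (via u1 = 𝙲), reflecting the nondeterminism of context placement.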

Applying parallel log-hairpin deletion to a word or a given language can produce another language. Hence, given some w ∈ Σ* or L ⊆ Σ*, the parallel log-hairpin deletion sets over S of w (or L) are defined by

[w]↪plog_S = {w′ ∈ Σ* ∣ w ↪plog_S w′} (or [L]↪plog = ⋃_{w∈L} [w]↪plog).

In practice and for the purpose of the following constructions in the main body of the paper, we define a variant of the parallel hairpin deletion that uses certain necessary assumptions about the words contained in the language. First of all, if there exist isolated 5’SS (left contexts) in a word that is read from left to right, there is a high chance that it is used in the formation of the hairpin if a corresponding 3’SS exists [2, 17]. In that sense, we assume some greediness when it comes to selecting left contexts. Similarly, we could assume the same about the choice of the right context. For the construction that follows, only the first assumption is necessary. If there exist overlapping 5’SS in the word, thermodynamics might result in some nondeterministic choice of the actual 5’SS that is used [18, 19]. To obtain a model that follows these constraints, we introduce a variant of parallel deletion, called maximally parallel deletion, in which it is assumed that no left contexts survive in each part ui.

Definition 3.

Let w ∈ Σ* be a word and S be some tuple of parameters for log-hairpin deletion. For some w′ ∈ Σ*, we call it obtainable by maximally parallel log-hairpin deletion over S of w, denoted w ↪p-log_S w′, if there exist some m ∈ ℕ and ui, αi, um+1 ∈ Σ*, i ∈ [m], such that w = u1 α1 ⋯ um αm um+1, w′ = u1 ⋯ um um+1, log-hairpin deletion cannot be applied to um+1, and, for all i ∈ [m], we have αi ↪log_S ε as well as, for all (x,y) ∈ C, x ∉ Fact(ui).

Again, we denote the set of all words obtainable by maximally parallel log-hairpin deletion by [w]↪p-log_S (analogously for input languages as before). Notice that [w]↪p-log_S ⊆ [w]↪plog_S, as it is a more restricted variant of the parallel log-hairpin deletion set. This concludes all necessary introductory terminology regarding the formal model of hairpin deletion.
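The extra constraint of Definition 3 can be approximated by a simple post-filter on the parallel deletion set. The sketch below is deliberately coarse: it discards every outcome in which some left context of C survives as a factor, which is slightly stricter than the definition (the final piece u_{m+1} may legally contain a left context as long as no deletion is applicable to it):

```python
def maximal_filter(results, C):
    """Coarse sketch of the 'maximally parallel' restriction: keep only
    outcomes in which no left context of C survives as a factor."""
    return {w for w in results if not any(left in w for left, _ in C)}
```

Applied to the earlier example, it keeps 𝙲𝙰𝚄𝙶𝙶 (which contains neither 𝙰𝙰 nor 𝙲𝙲) and removes outcomes with surviving left contexts.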

3 Simulating Finite Automata with Maximal Parallel Log-Hairpin Deletion

This section investigates the possibility of obtaining arbitrary regular languages of RNA sequences from a circular DNA template sequence using bounded hairpin deletion. We provide a construction that allows for the simulation of arbitrary (non)deterministic finite automata (DFAs/NFAs) using maximally parallel bounded hairpin deletion in the logarithmic energy model on circular DNA. In particular, it is shown that, given some NFA A, we can construct a word w for which [wω]↪p-log = L(A).
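Since the claim is an equality [wω]↪p-log = L(A), a ground-truth oracle for L(A) is handy when experimenting with candidate templates. A minimal NFA simulator in Python (the dict-based representation of δ is our choice, not notation from the paper):

```python
from itertools import product

def nfa_accepts(delta, q0, F, w):
    """Standard subset simulation; delta maps (state, letter) -> set of states."""
    states = {q0}
    for a in w:
        states = {q for s in states for q in delta.get((s, a), set())}
    return bool(states & F)

def language_up_to(delta, q0, F, sigma, k):
    """All accepted words of length <= k: brute-force ground truth for L(A)."""
    return {
        "".join(t)
        for length in range(k + 1)
        for t in product(sigma, repeat=length)
        if nfa_accepts(delta, q0, F, "".join(t))
    }
```

For the expression 𝚊(𝚊|𝚋)*𝚋 from the preliminaries, an NFA with states q0, q1, q2 and transitions (q0,𝚊,q1), (q1,𝚊,q1), (q1,𝚋,q1), (q1,𝚋,q2) yields exactly the words starting with 𝚊 and ending with 𝚋.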

As an initial idea, each transition defined in some NFA A was encoded on a circular word in a consecutive manner, i.e., if for example A had the transitions (qi,𝚊,qj) and (qj,𝚋,qk), then w would contain them directly as factors, resulting in a word w = …(qi,𝚊,qj)(qj,𝚋,qk)…. A context set C can be defined to contain some context (α,β) ∈ C with α = “,qj)” and β = “(qj,” that would allow for the potential deletion of the factor “,qj)(qj,”, resulting in a word w′ = …(qi,𝚊𝚋,qk)… and allowing for further parallel deletions. This model works quite intuitively, but leads to various problems. For example, the controlled formation of hairpins with a stem and a loop posed a serious challenge; also, a controlled notion of termination was missing. Due to the above, we decided on a different approach. Now, instead of encoding all transitions in a parallel manner, we use a single factor si per state qi that allows a nondeterministic transition to jump to some other factor sj, encoding qj, using hairpin deletion. The general spirit, however, stays the same, as we are still moving around a circular DNA template and select the transitions to be taken nondeterministically by the application of hairpin deletion, leaving only the letters labeling those transitions in the resulting words.

First, we need some section-specific definitions, as we are now considering and utilizing infinite repetitions of some circular DNA template. As mentioned in Section 2, a circular word wω is an infinite repetition of some word wΣ. In order to extend the notion of hairpin deletion to infinite words, we need some notion of intentionally stopping the transcription process on infinite words.

In practice there are, among others, two different ways to stop the transcription process and detach the produced RNA from the polymerase. First, there is rho-dependent transcription termination, which uses other molecules that bind to certain factors on the produced RNA [20]. Second, there is rho-independent transcription termination, which is based on the formation of a hairpin of certain length followed by a specific factor on the template DNA [3]. The rho-dependent model would be easier to embed in our definition, but it relies on external factors to work. Hence, to obtain a process that works as independently of external factors as possible, we will use the latter, rho-independent, model.

Most of the time, after the final hairpin, a factor containing only a certain number of 𝚄’s/𝚃’s is enough to result in the decoupling of the produced RNA sequence from the polymerase. To obtain a general model, we allow for arbitrary words to be used at this point, namely elements of some T ⊆ Σ*, called the set of terminating sequences. We extend the notion of log-hairpin deletion by adding T and a terminating stem length m ∈ ℕ to the properties S, i.e., writing S = (Σ, θ, c, C, n, T, m) instead of S = (Σ, θ, c, C, n), if the processed word is a circular word wω over w ∈ Σ*. The number m represents the minimal length of the hairpin or stem that must be formed as a suffix of the RNA produced, before reading a terminating sequence t ∈ T in wω. Using this, we can adapt the notion of log-hairpin deletion for circular words by adding the notion of termination.

Definition 4.

Let wω be a circular word over w ∈ Σ*. Let S = (Σ, θ, c, C, n, T, m) be a tuple of properties for log-hairpin deletion as defined before. Assume there exists a finite prefix w′t of wω with w′ ∈ Σ* and t ∈ T such that w′ ↪_S u for some word u ∈ Σ*. If u has a suffix s of length |s| = 2m such that s forms a stem, i.e., s = xθ(x) for some x ∈ Σ*, then we say that us⁻¹ is obtainable from wω by terminating log-hairpin deletion. (The requirement that s is only a stem of length 2m, and not a hairpin with a stem of that length, i.e., that no loop is formed, is a technical one chosen for simplicity. A long enough stem always allows a loop to be formed; hence, adapting the encoded suffix responsible for transcription termination in the construction in the proof of Theorem 5 should also be possible, by setting m short enough and adding enough letters to that suffix that do not interfere with the constructed context set.)
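The termination condition of Definition 4 reduces to a suffix check: do the last 2m letters form a stem x·θ(x)? A small sketch (our own helper; the Watson-Crick θ is inlined and us⁻¹ is computed by stripping the stem suffix):

```python
COMP = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(w):
    return "".join(COMP[b] for b in reversed(w))

def terminate(u, m):
    """If the last 2m letters of u form a stem x + theta(x), return u s^{-1}
    (u with that stem stripped); otherwise None (termination not possible)."""
    if len(u) < 2 * m:
        return None
    stem = u[-2 * m:]
    x = stem[:m]
    if stem[m:] != theta(x):
        return None
    return u[:len(u) - 2 * m]
```

For m = 2, the word 𝙰𝙶𝙶𝙰𝙲𝙶𝚄 ends with the stem 𝙰𝙲·θ(𝙰𝙲) = 𝙰𝙲𝙶𝚄 and terminates to 𝙰𝙶𝙶.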

The remaining part of this section provides a general framework for constructing working DNA templates. Properties needed for the construction to work are provided. The following result is the main result of this section and contains the general construction of a template word w that allows for the simulation of arbitrary regular languages represented by finite automata using terminating maximally parallel bounded log-hairpin deletion. We provide a basic construction for NFAs which naturally follows for DFAs and potentially other finite automata models such as GNFAs or GDFAs (which allow sequences as transition labels instead of just single letters).

Theorem 5.

Let A = (Q, Σ, q1, δ, F) be some NFA. There exist a word w ∈ Σ* and a properties tuple for log-hairpin deletion S = (Σ, θ, c, C, n, T, m) as defined before such that L(A) = [wω]↪p-log_S.

Proof.

Let A = (Q, Σ, q1, δ, F) be some NFA with Q = {q1,…,qo}, for some o ∈ ℕ, and Σ = {𝚊1,…,𝚊σ}, for some σ ∈ ℕ. We begin with an explanation of the basic idea. Then, we continue with the formal construction, which is afterwards given an intuitive, figure-based explanation. Due to space constraints, the full proof of correctness of the construction is given in Appendix A. Additionally, an example of an actual implementation using the RNA alphabet Σ𝚁𝙽𝙰={𝙰,𝚄,𝙲,𝙶} for a given NFA is provided in Appendix C.2.

Basic idea.

For each state qi ∈ Q, we construct a word si ∈ Σ* that represents this state and the outgoing transitions from it, as defined by δ. Each period w ∈ Σ* of the circular word wω, over which transcription is done, is essentially a concatenation of all si’s together with one final word se, i.e., we obtain w = s1s2⋯so se (see Figure 2). The suffix se handles transcription termination. Hairpin deletion will be used to simulate a transition between two connected states qi and qj, while reading a letter 𝚊 ∈ Σ. This is done by jumping from si to sj by removing everything but the letter 𝚊 between si and sj. This is implemented using a two-step process for which the context sets will be deliberately designed such that these hairpin deletion steps actually simulate the process of reading letters in the automaton A. To terminate transcription, in final states, a technical transition to the end of the template can be used that jumps to a factor t ∈ T at the end of w while producing a suffix consisting of a stem of size 2m at the end of the transcribed word. By that, we successively build prefixes of words in L(A) and may finalize the transcription process any time a final state is reached, effectively obtaining only words in L(A).

Figure 2: Representation of circular transcription (maximally parallel hairpin deletion over wω).

Construction.

We continue with the formal construction of w. After that, we give an intuitive sketch of its functionality. (It may help to read the intuition part in parallel with the construction part.) We need some general assumptions. θ is not defined specifically in the general case, but rather implied by the constraints we give in this proof. The logarithmic factor c is set to 1 in this construction, but larger numbers also work. The margin number n is set to 0 in this construction; such a constant may be added but is not necessary for the construction to work. Σ is copied from the definition of A, but a distinct alphabet may simplify the construction. For practical implementation purposes, we use this assumption to keep the possible alphabet size as low as possible. Consider the construction in the Appendix to see an actual working example of an encoding for an alphabet of size 4. T and m are not explicitly specified, but constraints for their construction are given.

In the final construction, we obtain a word w = s1⋯so se for words si ∈ Σ*, i ∈ [o], representing the states, and a technical final word se ∈ Σ* that is also responsible for transcription termination. From now on, for each newly introduced word, assume that its specification or characterization via constraints is given later in the proof (if it is not given immediately or before). Also, assume that any word that occurs as one side of a context in the context set C (which is to be constructed later) does not appear anywhere else in w as an overlapping factor.

We start with the construction of se. Let fs, fe ∈ Σ* be two words such that |fsfe| = 2m and fe = θ(fs); hence, fsfe forms a stem of length 2m. Let T = {t} be a set of terminating sequences containing only one word t ∈ Σ*. We set the suffix se to

se = θ(Send) Eend fe t θ(Sq1) Eq1

for words Send, Eend, Eq1 ∈ Σ*. Next, for each state qi ∈ Q, we construct a word si ∈ Σ* by setting

si = Si,1 ⋯ Si,k Si,k+1 ei,1 ⋯ ei,k ei,k+1 θ(Sqi+1) Eqi+1

for words Sqi+1, Eqi+1, Si,1, …, Si,k+1, ei,1, …, ei,k+1 ∈ Σ*, with k being the number of outgoing transitions of qi. If qi = qo, then Sqi+1 = Eqi+1 = ε is just an empty suffix. Assume some arbitrary ordering on the outgoing transitions of qi and the transition labels to be given by 𝚊i,1 to 𝚊i,k. Then, each ei,j, for j ∈ [k], is defined by

ei,j = θ(Si,1 ⋯ Si,j) Ei,j 𝚊i,j Sqi,j

for words Sqi,j, Ei,j ∈ Σ*, where qi,j denotes the state reached by the j-th transition of qi (so Sqi,j = Sqj′ for the index j′ ∈ [o] with qj′ = qi,j). If we have qi ∈ Q∖F, i.e., it is not a final state, then we set Si,k+1 = ei,k+1 = ε to be empty. If we have qi ∈ F, i.e., it is a final state, then Si,k+1 ∈ Σ⁺ is a nonempty word and ei,k+1 is given by

ei,k+1 = θ(Si,1 ⋯ Si,k+1) Ei,k+1 fs Send

for the word Ei,k+1 ∈ Σ* and the other words as defined above. Notice that the only differences to the other ei,j’s are, first, that instead of a letter 𝚊i,j we write the word fs and, second, that instead of the suffix Sqi,j we add the word Send. This concludes all structural elements.

We continue with the definition of the context set C over the words defined above. For each i ∈ [o], we assume (Sqi, Eqi) ∈ C. These are the contexts responsible for jumping between different si and sj using hairpin deletion. In addition, for each i ∈ [o] and each j ∈ [k] (or j ∈ [k+1] if qi ∈ F), with k being the number of outgoing transitions of the state qi as before, we assume (Si,1 ⋯ Si,j, Ei,j) ∈ C. Finally, we assume (Send, Eend) ∈ C, concluding the definition of C.

Again, we mention that we assume that each occurrence of a left or right context in w is exactly as given by the above construction; additional occurrences resulting from overlapping factors are excluded by assumption. Also, for each context (ℓ, r) ∈ C, we assume that the left context ℓ is chosen long enough so that it forms a valid hairpin with θ(ℓ), with respect to the logarithmic energy model, for each occurrence of ℓ together with the closest occurrence of the right context r after it. Keep in mind that each left context has a unique corresponding right context, so no nondeterministic choice can happen here. We continue with an intuitive explanation of the functionality of the construction.
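The shape of the resulting context set can be made explicit symbolically. In the sketch below, the strings "S_q1", "E_1,2", etc. are placeholders for the concrete factors of the construction, not actual RNA sequences, and `out_degree` maps each state index to its number of outgoing transitions (all names are our own):

```python
def build_context_set(num_states, out_degree, finals):
    """Symbolic context set following the construction in Theorem 5:
    one jump context per state, one choice context per (state, transition),
    one extra exit context per final state, and one termination context."""
    C = set()
    for i in range(1, num_states + 1):
        C.add((f"S_q{i}", f"E_q{i}"))                     # jump context
        k = out_degree[i] + (1 if i in finals else 0)     # +1: exit transition
        for j in range(1, k + 1):
            left = "".join(f"S_{i},{t}" for t in range(1, j + 1))
            C.add((left, f"E_{i},{j}"))                   # transition-choice context
    C.add(("S_end", "E_end"))                             # termination context
    return C
```

For a two-state NFA with two transitions out of q1, one out of the final state q2, this yields 7 contexts: one jump context per state, one choice context per transition plus one exit context for q2, and the termination context.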

Intuition.

Transcription starts in s1 and moves around wω over and over again (see Figure 2). Every time transcription is at the beginning of one of the factors si, i ∈ [o], a nondeterministic choice of one of the encoded transitions occurs. By the selection of one of the factors Si,1, Si,1Si,2, …, Si,1Si,2⋯Si,k as a left context, we are forced to use the corresponding right context Ei,j (if the prefix Si,1 ⋯ Si,j is chosen, for j ≤ k). Everything in between gets removed by bounded hairpin deletion, using the stem formed by Si,1 ⋯ Si,j with the corresponding θ(Si,1 ⋯ Si,j) and a loop containing everything in between. This represents the choice of an outgoing transition of qi. Immediately after the removed hairpin, the letter 𝚊i,j occurs in si. See Figure 3 for a visualization of this process.

Figure 3: Nondeterministic selection of a transition out of the encoded qiQ in si. This example assumes 3 outgoing transitions and the respective letters 𝙰, 𝙱, and 𝙲. Depending on the choice of the left context (dark color - left), a corresponding right context (medium color - right) can be chosen and a hairpin with the corresponding θ part (dark color - right) is formed, having everything in between as part of the loop (bright color - middle). Finally, the whole marked region is then removed by hairpin deletion, leaving the blue marked letter as a new prefix.

Keep in mind that the lengths of the Si,j, j ∈ [k+1], and Ei,j have to be adapted accordingly for this to work. Due to maximally parallel hairpin deletion, the next left context in w, in particular in si, has to be chosen for hairpin deletion. The immediately following left context is Sqi,j, from which we have to choose the unique right context Eqi,j to jump to sqi,j. This is the aforementioned jump between si and sqi,j. See Figure 4 for a visualization of this process.

Figure 4: Representation of a jump between 2 state encodings si and sj. Assuming some transition to be selected and the corresponding part to be removed by hairpin deletion (orange marked region), a deterministic left and right context selection results in everything between the letter 𝙱 and the state encoding sj being removed (purple marked region).

After these two steps (transition selection and jump to the next state), we can repeat this process. A prefix of a word in the language L(A) is successively obtained by maximally parallel bounded hairpin deletion. To terminate transcription, in a final state, we can use the same mechanism to choose the left context Si,1Si,2⋯Si,k+1 and the right context Ei,k+1 to obtain a suffix stem fsfe which is followed by t∈T in wω, which results in transcription termination. See Figure 5 for a visualization.

Figure 5: Assuming si encodes a final state with k outgoing transitions, then a transition k+1 can be taken to obtain fs as a suffix after some previously obtained word (orange marked region). From there, a deterministic choice of left and right contexts Send and Eend allows for a jump to se (purple marked region), resulting in the suffix fsfet. As t∈T and fsfe forms a stem of length 2m by assumption, transcription can be terminated here.

Hence, as termination can only be initiated from si’s that represent final states qi∈F, all words obtainable from wω by maximally parallel bounded hairpin deletion must be words in L(A). As all transitions are encoded and available to be chosen at the beginning of each si, we can also obtain all words in L(A), concluding this sketch. As mentioned before, due to space constraints, a full formal proof of correctness, i.e., that L(A)=[wω]↬𝚙-𝚕𝚘𝚐S, can be found in Appendix A. An engineered example that uses actual words over the RNA alphabet Σ𝚁𝙽𝙰={𝙰,𝚄,𝙲,𝙶} can be found in Appendix C.2.

For this construction to work, we clearly needed some assumptions regarding the hairpin deletion model to simulate co-transcriptional splicing. These are, in particular, the assumptions that we are able to do this in a parallel manner (modeling the independence of subsequent co-transcriptional deletions) and that we have some level of greediness in the system (maximally parallel bounded hairpin deletion). Without some greediness assumption regarding the left context, an infinite number of points where deletion could happen can be skipped, and we could start simulating a computation on A at the beginning of every iteration of wω. We continue with a general framework that implements the construction from Theorem 5 using the RNA alphabet Σ𝚁𝙽𝙰={𝙰,𝚄,𝙲,𝙶}, sketching the existence of working encodings.

3.1 Applying Theorem 5 for the RNA-Alphabet 𝚺𝚁𝙽𝙰={𝑨,𝑼,𝑪,𝑮}

For any working encoding, the following four questions need to be answered:

  1. How to encode the internal transition selection in each si?

  2. How to realize the transitions between states, i.e., the jump between arbitrary si and sj?

  3. How to realize termination of the parallel hairpin deletion process?

  4. Do the words that encode contexts (x,y)∈C appear only in the intended positions?

In the main part, we discuss these questions in a more general manner. A specific example of encoding a specific NFA using this general framework can be found in Appendix C.

Consider some NFA A=(Q,Σ𝚁𝙽𝙰,q1,δ,F) with Q={q1,…,qo}, for some o∈ℕ. Let qi∈Q, for i∈[o], be some state in Q. Assume qi has k transitions, for some k∈ℕ. For each j∈[k], let 𝚊i,j be the letter on the jth transition of qi and let qi,j∈δ(qi,𝚊i,j) be the resulting state from qi by taking the jth transition.

(1) First, we define θ to represent the strongest base pair bonds, i.e., θ(𝙰)=𝚄, θ(𝚄)=𝙰, θ(𝙲)=𝙶, and θ(𝙶)=𝙲. To encode qi and its outgoing transitions in a word si∈Σ𝚁𝙽𝙰*, generally, si will have the form described in the construction given in Theorem 5, i.e., if w.l.o.g. qi∉F,

si=Si,1⋯Si,kei,1⋯ei,kθ(Sqi+1)Eqi+1.

We can set

Si,1⋯Si,k=𝙰𝙰(𝙲)i𝙰𝙰(𝙶𝙰𝙰)k.

Specifically, we set Si,1=𝙰𝙰(𝙲)i𝙰𝙰𝙶𝙰𝙰 and, for each j∈[k] with j>1, we set Si,j=𝙶𝙰𝙰. For example, if qi has 3 transitions, then

Si,1Si,2Si,3=𝙰𝙰(𝙲)i𝙰𝙰𝙶𝙰𝙰𝙶𝙰𝙰𝙶𝙰𝙰,

and we have Si,1=𝙰𝙰(𝙲)i𝙰𝙰𝙶𝙰𝙰 and Si,2=Si,3=𝙶𝙰𝙰. Now, by the definition of θ, we obtain, for each j[k],

θ(Si,1⋯Si,j)=(𝚄𝚄𝙲)j𝚄𝚄(𝙶)i𝚄𝚄.
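This identity can be checked mechanically. The following sketch (our own illustration; names like S_prefix are not from the paper) treats θ as the reverse Watson-Crick complement, i.e., as an antimorphic involution:

```python
# Sketch: theta as the reverse Watson-Crick complement over {A, U, C, G}.
# Assumption (ours): theta is antimorphic, i.e., theta(uv) = theta(v)theta(u).
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(w: str) -> str:
    """Reverse complement of an RNA word."""
    return "".join(COMPLEMENT[x] for x in reversed(w))

def S_prefix(i: int, j: int) -> str:
    """S_{i,1}...S_{i,j} = AA C^i AA (GAA)^j, as in the construction."""
    return "AA" + "C" * i + "AA" + "GAA" * j

# theta(S_{i,1}...S_{i,j}) = (UUC)^j UU G^i UU, as claimed above
for i in range(1, 5):
    for j in range(1, 5):
        assert theta(S_prefix(i, j)) == "UUC" * j + "UU" + "G" * i + "UU"
```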

From before, we know that for each ei,j, j∈[k], we have

ei,j=θ(Si,1⋯Si,j)Ei,j𝚊i,jSqi,j.

We obtain θ(Si,1⋯Si,j) implicitly by the previous definition. For simplicity, we can set Ei,j=θ(Si,1⋯Si,j) (hence, equal to the θ(Si,1⋯Si,j) part). This already concludes the implementation of the contexts (Si,1⋯Si,j,Ei,j)∈C. The letter 𝚊i,j is just the letter of the transition. The final parts of si that need to be defined are Sqi,j and Eqi+1. More generally, we need to define the words for each context (Sqm,Eqm)∈C, for m∈[o]. Let om be a positive number for each state qm∈Q. Then, we set Sqm=𝙰𝙰𝙰(𝙲)om𝙰𝙰𝙰 and Eqm=θ(Sqm)=𝚄𝚄𝚄(𝙶)om𝚄𝚄𝚄. We discuss the numbers om later when discussing the second question from above. With that, we have defined all parts of si.
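Setting aside the length balancing required by the logarithmic energy model, the pieces defined so far can be assembled mechanically. The following sketch (with our own illustrative function names; the o-values are passed in as parameters) builds si for a non-final state:

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(w):
    # reverse Watson-Crick complement (antimorphic involution)
    return "".join(COMPLEMENT[x] for x in reversed(w))

def S_prefix(i, j):
    # S_{i,1}...S_{i,j} = AA C^i AA (GAA)^j
    return "AA" + "C" * i + "AA" + "GAA" * j

def S_state(o_m):
    # left context S_{q_m} = AAA C^{o_m} AAA; right context E_{q_m} = theta(S_{q_m})
    return "AAA" + "C" * o_m + "AAA"

def encode_state(i, transitions, o_next):
    """Assemble s_i for a non-final state q_i.

    transitions: list of (letter a_{i,j}, o-value of the target state q_{i,j});
    o_next: o-value of q_{i+1}, whose context ends s_i.
    """
    k = len(transitions)
    s = S_prefix(i, k)
    for j, (a, o_target) in enumerate(transitions, 1):
        # e_{i,j} = theta(S_{i,1}...S_{i,j}) E_{i,j} a_{i,j} S_{q_{i,j}},
        # with E_{i,j} chosen equal to theta(S_{i,1}...S_{i,j})
        s += theta(S_prefix(i, j)) * 2 + a + S_state(o_target)
    # trailing theta(S_{q_{i+1}}) E_{q_{i+1}}, again with E = theta(S)
    return s + theta(S_state(o_next)) * 2

s1 = encode_state(1, [("A", 2), ("U", 3)], o_next=2)
assert s1.startswith("AACAAGAAGAA")        # S_{1,1} S_{1,2}
assert "UUCUUGUU" * 2 + "A" in s1          # theta(S_{1,1}) E_{1,1} a_{1,1}
```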

To select any transition, pick a prefix Si,1⋯Si,j, j∈[k], find the corresponding right context Ei,j, and then notice that before Ei,j=θ(Si,1⋯Si,j), there is another occurrence of θ(Si,1⋯Si,j). We obtain the following hairpin structure with corresponding left and right contexts and the remaining letter 𝚊i,j:

𝙰𝙰(𝙲)i𝙰𝙰(𝙶𝙰𝙰)j [= Si,1⋯Si,j] ⋯ (𝚄𝚄𝙲)j𝚄𝚄(𝙶)i𝚄𝚄 [= θ(Si,1⋯Si,j)] (𝚄𝚄𝙲)j𝚄𝚄(𝙶)i𝚄𝚄 [= Ei,j] 𝚊i,j Sqi,j

If, due to the logarithmic energy model, the length of the inner part between Si,1⋯Si,j and θ(Si,1⋯Si,j) gets too large, then in the position of the last 𝙶 in Si,1⋯Si,j (and in the position of the first 𝙲 in θ(Si,1⋯Si,j)), we can increase the number of 𝙶’s (and 𝙲’s, respectively) until the stem supports the length of the loop. We might have to do this for multiple transitions in a single state encoding si until the values balance out, but due to the exponential increase in supported loop size per letter in the stem, we can always reach a length that works out. This concludes the question of how to encode the transition selection using the alphabet Σ𝚁𝙽𝙰.
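To make the balancing concrete, assume, purely for illustration, that a stem of length m supports a loop of length at most c·2^m for some constant c; the paper only uses this exponential growth qualitatively. Under that assumption, the number of additional 𝙶/𝙲 pairs needed grows only logarithmically in the loop length:

```python
import math

def extra_stem_letters(loop_len, stem_len, c=1.0):
    """Number of additional G/C pairs needed so that a stem of length
    stem_len + extra supports a loop of length loop_len, assuming a stem
    of length m supports loops up to c * 2**m (illustrative model only)."""
    need = math.ceil(math.log2(max(2.0, loop_len / c)))
    return max(0, need - stem_len)

# a stem of 8 base pairs already supports loops of up to 256 letters here
assert extra_stem_letters(256, 8) == 0
# doubling the loop length costs only one extra base pair
assert extra_stem_letters(512, 8) == 1
```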

(2) Next, we briefly discuss how to realize the jump between state encodings si and sm. Notice that we have already defined the encoding of the related contexts (Sqm,Eqm)∈C, m∈[o], by Sqm=𝙰𝙰𝙰(𝙲)om𝙰𝙰𝙰 and Eqm=θ(𝙰𝙰𝙰(𝙲)om𝙰𝙰𝙰)=𝚄𝚄𝚄(𝙶)om𝚄𝚄𝚄. As before with the number of 𝙶’s and 𝙲’s in the transition encodings, we left the number of 𝙶’s and 𝙲’s in this case open as well. Depending on the number of letters between each left and right context (Sqm,Eqm)∈C, a different number might be needed to obtain a stem that supports the length of the loop. In the logarithmic energy model, we can always find such numbers, as a linear increase in stem size results in an exponential increase of supported loop size. In the example case above, notice that after 𝚊i,j, there is a unique occurrence of Sqi,j=𝙰𝙰𝙰(𝙲)om𝙰𝙰𝙰, assuming qm=qi,j. The unique occurrence of Eqm is right in front of sm. By the construction, we obtain the following unique hairpin formation. Here, we have no overlapping left contexts, so no nondeterministic choice is possible. This hairpin deletion step must occur if the letter before it is used in an obtained word:

𝚊i,j 𝙰𝙰𝙰(𝙲)om𝙰𝙰𝙰 [= Sqm] ⋯ 𝚄𝚄𝚄(𝙶)om𝚄𝚄𝚄 [= θ(Sqm)] 𝚄𝚄𝚄(𝙶)om𝚄𝚄𝚄 [= Eqm] sm

This concludes the discussion on how to realize the jumps between state encodings si and sm.

(3) Finally, we need to encode the part that results in termination of the hairpin deletion process. For that, we define se, i.e., the words fs, fe, t, and in particular the words in the context (Send,Eend)∈C. In principle, we can consider se as its own state encoding, but without any transitions. So, similar to the contexts (Sqm,Eqm)∈C, for m∈[o], we can define, for some oe that is not equal to any other om, Send=𝙰𝙰𝙰(𝙲)oe𝙰𝙰𝙰 as well as Eend=θ(Send)=𝚄𝚄𝚄(𝙶)oe𝚄𝚄𝚄. Additionally, we set t=(𝚄)10, fs=(𝙰)4𝙶(𝙰)4, and fe=(𝚄)4𝙲(𝚄)4=θ(fs). In the encoding of each final state si, for some qi∈F, we handle Si,k+1, assuming qi has k outgoing transitions, and Ei,k+1 analogously to any other transition and add fs in the place where the letter of a transition would have been. This results in the following 2 hairpin deletion steps in a final state. Assuming w∈Σ* is a word in L(A) which has been obtained already as a prefix and qi being the final state reached after reading w, the following becomes a general possibility:

w 𝙰𝙰(𝙲)i𝙰𝙰(𝙶𝙰𝙰)k+1 [= Si,1⋯Si,k+1] ⋯ (𝚄𝚄𝙲)k+1𝚄𝚄(𝙶)i𝚄𝚄 [= θ(Si,1⋯Si,k+1)] (𝚄𝚄𝙲)k+1𝚄𝚄(𝙶)i𝚄𝚄 [= Ei,k+1] (𝙰)4𝙶(𝙰)4 [= fs] 𝙰𝙰𝙰(𝙲)oe𝙰𝙰𝙰 [= Send] ⋯
w (𝙰)4𝙶(𝙰)4 [= fs] 𝙰𝙰𝙰(𝙲)oe𝙰𝙰𝙰 [= Send] ⋯ 𝚄𝚄𝚄(𝙶)oe𝚄𝚄𝚄 [= θ(Send)] 𝚄𝚄𝚄(𝙶)oe𝚄𝚄𝚄 [= Eend] (𝚄)4𝙲(𝚄)4 [= fe] (𝚄)10 [= t]
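The two termination hairpins hinge on the identities fe=θ(fs) and Eend=θ(Send), which can be checked directly with θ as the reverse Watson-Crick complement (a small sanity check, not part of the construction):

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(w):
    # reverse Watson-Crick complement
    return "".join(COMPLEMENT[x] for x in reversed(w))

f_s = "A" * 4 + "G" + "A" * 4        # f_s = (A)^4 G (A)^4
f_e = "U" * 4 + "C" + "U" * 4        # f_e = (U)^4 C (U)^4
assert theta(f_s) == f_e             # so f_s f_e can close into a 2m-stem

o_e = 7                              # any o_e distinct from all o_m works
S_end = "AAA" + "C" * o_e + "AAA"
assert theta(S_end) == "UUU" + "G" * o_e + "UUU"   # E_end = theta(S_end)
```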

(4) We can see that, if only the intended hairpin deletion steps are possible, then this encoding effectively works, assuming the words in the contexts are pumped to be long enough. However, one must ensure that hairpin deletion and context recognition can only occur in the intended positions and that no other factor can be used for that purpose. By checking the encodings, one can make sure that no left or right context appears as a factor in unintended positions. See Appendix C.1 for a detailed examination of why this is the case.

This concludes all elements needed to encode an arbitrary NFA A in some circular word wω such that L(A)=[wω]↬𝚙-𝚕𝚘𝚐S. As mentioned before, in Appendix C.2, an exemplary encoding of a specific NFA using this framework is given.

This concludes this section. It has been shown that the mechanism of log-hairpin deletion, modeling co-transcriptional splicing, can be used to encode any (potentially infinite) regular language by encoding its DFA or NFA representation on a finite circular DNA word. As the construction grows with the size of the DFA or NFA at hand, minimizing the input NFA is a second problem that can be looked at. Due to the practical nature of this problem, even restricted models of NFAs could be considered.

4 Minimizing Input NFA’s

The construction from the previous section makes it possible to produce any given finite language from powers of a word w by parallel hairpin deletion. As our goal is to engineer DNA templates that efficiently encode a given set of sequences to be produced by co-transcriptional splicing, Theorem 5 shows that we can achieve our goal, and we can focus on optimizing the length of w, i.e., the period. To allow further efficiency gains and some leeway for possible lab implementations of the construction, we can relax the requirement that a given finite set is produced exactly, that is, without any other words obtained by the system. With implementation in mind, we require the “extra” sequences produced to not interfere with the words in the target set of sequences, i.e., that they do not share hybridization sites with our target.

We can formalize this as a set of forbidden patterns to be avoided as factors by the “extra” words. In some cases, we may also assume that sequences above given lengths can be efficiently filtered out from the resulting set. We propose the following problem(s).

Problem 1 (Small automaton for cover set avoiding forbidden patterns).

Given a set W={w1,…,wk} of target words and a set F={f1,…,fℓ} of forbidden patterns, such that each fi∈F is a factor of some wj∈W, find a smallest NFA M, such that:

  • W⊆L, and

  • (L∖W)∩Σ*FΣ*=∅,

where either

  • (exact)

    L=L(M), in the general case, or

  • (cover)

    L=L(M)∩Σ≤n for n=max{|w|∣w∈W}, in the length-bounded setting of the problem.

Problem 1 can be considered either as a minimization problem for the number of states or as a minimization problem for the number of transitions (notice that the size of the encoding in Section 3 primarily depends on the number of transitions). The following example shows that the problem is not trivial, in the sense that there exist target and forbidden pattern sets W,F for which the smallest NFA M satisfying the conditions is smaller than both the minimal NFA for W and the minimal NFA for W∪(Σ*∖Σ*FΣ*).

Example 6.

Let W={aab^kaa} and F={ab^ka}. It is easy to see that the minimal NFA for W must have at least k+5 states and k+4 transitions, as otherwise it would have a loop reading b’s and would accept an infinite language. Similarly, the minimal NFA for the language containing W but avoiding all other words having forbidden patterns from F, that is, W∪(Σ*∖Σ*FΣ*), has |Q|∈Ω(k) states and Ω(k) transitions, as otherwise, for accepting inputs of the form ab^jaa, with |Q|≤j<k, the block of b’s would be read by some cycle labeled with b’s. Together with the fact that such words ab^jaa must be accepted, we get that the machine would also accept ab^kaa, a forbidden word. However, a simple automaton M accepting aab*aa with 5 states and 5 transitions satisfies the conditions: W⊆L(M) and L(M) does not include any other word with a factor ab^ka.
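Example 6 can be verified by brute force for a concrete k, using the length-bounded slice of the language aab*aa of the 5-state automaton M (an illustrative script; the value of k and the length bound are our choices):

```python
k = 6
W = {"aa" + "b" * k + "aa"}
forbidden = "a" + "b" * k + "a"          # the single pattern in F

# length-bounded slice of L(M) for the 5-state automaton accepting aab*aa
n = max(len(w) for w in W)
L = {"aa" + "b" * j + "aa" for j in range(n - 3)}

assert W <= L                                        # the target set is covered
assert all(forbidden not in w for w in L - W)        # no extra word has a forbidden factor
```

The only factors of the form ab^m a in a word aab^j aa have m=j, so the forbidden factor ab^k a can only occur in the one target word itself.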

As the number of states implicitly gives an upper bound on the possible number of transitions of an NFA, the hardness of Problem 1(cover, states), which is the version of the problem where we are looking for the smallest cover automaton in terms of the number of its states rather than transitions, suggests that all versions of Problem 1 are NP-hard, as similar straightforward reductions are possible with simple target and forbidden sets. Due to space constraints, all proofs in this section can be found in the full version on arXiv.

Proposition 7.

Problem 1(cover, states) is NP-hard.

4.1 Practically Motivated Restricted NFA-models

As it is practically impossible to obtain infinite RNA sequences from circular DNA transcription, we propose the following NFA-like models that impose certain restrictions on valid computations. Encoding finite languages using those variants can reduce the size of the resulting machine w.r.t. the minimal classical NFA encoding. Initially, we also hoped that those restricted models would allow for more tractable minimization algorithms. In what follows, however, we show NP-hardness of the minimization problem for all models we considered. The first restriction comes from the fact that the number of times the template word is “read” in circular transcription might be limited. There are various ways of expressing this limitation in terms of the computation of the automata. First, one can restrict the number of times that a certain state of the automaton can be reached in a computation.

Definition 8.

A state-step-bounded nondeterministic finite automaton (SSB-NFA) A is a 6-tuple A=(Q,c,Σ,δ,q0,F) where Q is a finite set of states, c∈ℕ denotes the state-step-bound, Σ denotes a finite alphabet, δ:Q×Σ→2Q denotes the transition function, q0∈Q the initial state, and F⊆Q the set of final states.

The language L(A) of an SSB-NFA is the language of all words w∈Σ* where δ(q0,w)∩F≠∅ and w has some accepting computation in which each state qi occurs at most c times.

Next, in the encoding of NFAs onto templates for hairpin deletion proposed in Theorem 5, different states are encoded consecutively. So, simulating a transition from an earlier encoded state to a later encoded state still occurs on a single repetition of the template. Hence, we could also impose an order <Q on the states and restrict the number of times that qi follows qj in a computation if qi<Qqj, which effectively forces another repetition of wω to be used. This results in the notion of return-bounded nondeterministic finite automata (RB-NFAs). Their formal definition is given in Appendix B. Finally, another practical limitation might impose a bound on the length of the intron that is removed during hairpin deletion. This can be represented by imposing an order <Q on Q, setting a distance between every two subsequent elements over <Q, and setting a bound on the maximum forward distance between two subsequent states in a computation over the automaton. This results in the notion of distance-bounded nondeterministic finite automata (DB-NFAs). Their formal definition is given in Appendix B as well. The class of all SSB-NFAs, RB-NFAs, and DB-NFAs is denoted by CFA. We will investigate SSB-NFAs as an exemplary case and see that we can obtain NP-hardness results for the decision variant of the minimization problem for all models in the class CFA using very similar proofs.

4.2 Properties of State-Step-Bounded NFAs

SSB-NFAs restrict the number of times we are allowed to be in a specific state. Hence, for SSB-NFAs (and RB-NFAs) A, we know that L(A) is a finite language. One big advantage of SSB-NFAs is their potential to be significantly smaller than NFAs recognizing the same language. When encoding a finite language as an NFA, each word in the language must occur as the label of a simple path in the state graph, as any loop in the NFA with a path to a final state results in an infinite language. In SSB-NFAs, certain repetitions may be represented with a small loop. Consider the following example.

Example 9.

Let L={𝚊,𝚊𝚋𝚊,𝚊𝚋𝚊𝚋𝚊,𝚊𝚋𝚊𝚋𝚊𝚋𝚊,𝚊𝚋𝚊𝚋𝚊𝚋𝚊𝚋𝚊}. Let A=(Q,Σ,δ,q0,F) be a minimal NFA represented by the left state diagram in Figure 6 such that L=L(A). We see that there exists a (minimal) SSB-NFA B=(Q′,5,Σ,δ′,q0,F′) represented by the state diagram on the right in Figure 6 for which we also have L(B)=L=L(A), but its size is significantly smaller than that of A regarding the number of transitions and states.

Figure 6: One minimal NFA A=(Q,Σ,δ,q0,F) (left) and minimal SSB-NFA B=(Q′,5,Σ,δ′,q0,F′) (right) recognizing the same language L found in Example 9.
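The SSB-NFA B from Example 9 can be simulated directly. The sketch below (our own enumeration routine; the two-state transition table is one natural candidate for the right-hand automaton in Figure 6) exhaustively explores all computations in which no state occurs more than c times:

```python
def ssb_language(delta, q0, final, c):
    """Enumerate L(A) of an SSB-NFA by exhaustive search over all
    computations in which every state occurs at most c times."""
    words = set()

    def dfs(q, word, counts):
        if q in final:
            words.add(word)
        for (p, a), targets in delta.items():
            if p != q:
                continue
            for t in targets:
                if counts.get(t, 0) + 1 <= c:
                    nxt = dict(counts)
                    nxt[t] = nxt.get(t, 0) + 1
                    dfs(t, word + a, nxt)

    dfs(q0, "", {q0: 1})          # the initial state is counted as visited once
    return words

# Example 9: delta(q0, a) = {q1}, delta(q1, b) = {q0}, q1 final, bound c = 5
delta = {(0, "a"): {1}, (1, "b"): {0}}
L = ssb_language(delta, q0=0, final={1}, c=5)
assert L == {"a", "aba", "ababa", "abababa", "ababababa"}
```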

Indeed, for any NFA A accepting a finite language, we can find an SSB-NFA B whose size, in terms of the number of states or transitions, is at most that of A and which accepts the same language. Trivially, as an NFA recognizing a finite language has no loops on paths that reach final states, we can essentially copy the whole automaton and set the step-bound c to any value to obtain an SSB-NFA which recognizes the same language. As we are interested in finding minimal templates for contextual splicing, we propose the following decision variant of the SSB-NFA minimization problem.

Problem 2 (SSB-NFA-Min).

Let L be some regular language and let c,k∈ℕ be some positive integers. Does there exist some SSB-NFA A=(Q,c,Σ,δ,q0,F) such that L(A)=L with either |Q|≤k (SSB-NFA-Min-States) or |δ|≤k (SSB-NFA-Min-Transitions)?

We know that the answer to both problems is 𝚏𝚊𝚕𝚜𝚎 if the input language L is an infinite language. This can be efficiently checked for any representation, i.e., DFAs, NFAs, or regular expressions. Due to the sizes of different representations of regular languages, we obtain the following results.
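The finiteness check mentioned above can be sketched as follows for NFAs: L(A) is infinite if and only if some useful state (reachable from q0 and co-reachable to a final state) lies on a cycle (an illustrative implementation with our own naming):

```python
def accepts_infinite(states, delta, q0, final):
    """delta: dict mapping (state, letter) to a set of successor states."""
    # forward reachability from q0
    reach, stack = {q0}, [q0]
    while stack:
        q = stack.pop()
        for (p, _), ts in delta.items():
            if p == q:
                for t in ts:
                    if t not in reach:
                        reach.add(t)
                        stack.append(t)
    # backward reachability from the final states
    coreach, stack = set(final), list(final)
    while stack:
        q = stack.pop()
        for (p, _), ts in delta.items():
            if q in ts and p not in coreach:
                coreach.add(p)
                stack.append(p)
    useful = reach & coreach
    # cycle detection restricted to useful states (DFS with colors)
    color = {q: 0 for q in useful}      # 0 = unvisited, 1 = on stack, 2 = done

    def has_cycle(q):
        color[q] = 1
        for (p, _), ts in delta.items():
            if p == q:
                for t in ts:
                    if t in useful:
                        if color[t] == 1 or (color[t] == 0 and has_cycle(t)):
                            return True
        color[q] = 2
        return False

    return any(color[q] == 0 and has_cycle(q) for q in useful)

# a loop labelled b on a state that reaches a final state => infinite language
assert accepts_infinite({0, 1}, {(0, "a"): {1}, (1, "b"): {1}}, 0, {1})
# an acyclic NFA accepts a finite language
assert not accepts_infinite({0, 1}, {(0, "a"): {1}}, 0, {1})
```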

Proposition 10.

𝚂𝚂𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗 is in NP if the input language L is given as a finite list of words. If L is presented as a regular expression or NFA, then the problem is in PSPACE.

Similar to the decision version of the problem of finding minimal NFAs for finite languages (see [10, 1]), we can show that 𝚂𝚂𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗 is NP-hard by a reduction from the biclique covering problem.

Proposition 11.

𝚂𝚂𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗 (states/transitions) is NP-hard for all alphabets Σ with |Σ|≥2.

We can summarize the results of this section in the following theorem.

Theorem 12.

The 𝚂𝚂𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗 problem is NP-complete for all alphabets Σ with |Σ|≥2 if the input language is given as a list of words.

For the minimization problems of RB-NFAs and DB-NFAs, almost the exact same reduction as for SSB-NFAs can also be applied to obtain NP-hardness in these models. A sketch of the differences is also given in the full version on arXiv. For the following result, assume that 𝚁𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗 and 𝙳𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗 are defined analogously to 𝚂𝚂𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗.

Proposition 13.

𝚁𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗 and 𝙳𝙱-𝙽𝙵𝙰-𝙼𝚒𝚗 are NP-hard for all alphabets Σ with |Σ|≥2.

This concludes the results regarding the automata models in CFA. As even in these restricted cases we still have NP-hardness for minimization, this motivates identifying target languages with restricted structure for which efficient minimization algorithms may exist.

5 Conclusion

In this paper, we provided a framework which shows that any regular language can be obtained from circular words using maximally parallel bounded logarithmic hairpin deletion. This indicates that co-transcriptional splicing may be utilized to encode and obtain any set of RNA sequences representable by regular languages, even potentially infinite ones, from a circular DNA template of finite size. As the construction is based on the NFA representation of the encoded regular language, the problem of minimizing NFAs has been further investigated, in particular for well-motivated restricted NFA-like models (SSB-NFAs, RB-NFAs, and DB-NFAs). As only hardness results could be obtained so far, future research could work on identifying practically motivated classes of languages for which we can efficiently find small NFAs, SSB-NFAs, RB-NFAs, or DB-NFAs. In addition to that, another future challenge lies in applying hairpin deletion to obtain language classes with higher expressive power, e.g., context-free or context-sensitive languages. Whether this is possible is left as an open question for which we have no specific conjecture so far.

Question 14.

Let LG be some context-free language. Does there exist a word w∈Σ* and a properties tuple S for log-hairpin deletion, such that LG=[wω]↬𝚙-𝚕𝚘𝚐S?

References

  • [1] Jérôme Amilhastre, Philippe Janssen, and Marie-Catherine Vilarem. FA minimisation heuristics for a class of finite languages. In Oliver Boldt and Helmut Jürgensen, editors, Automata Implementation, pages 1–12, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
  • [2] Ann L Beyer and Yvonne N Osheim. Splice site selection, rate of splicing, and alternative splicing on nascent transcripts. Genes & development, 2(6):754–765, 1988.
  • [3] Yves d’Aubenton Carafa, Edward Brody, and Claude Thermes. Prediction of Rho-independent Escherichia coli transcription terminators: a statistical analysis of their RNA stem-loop structures. Journal of Molecular Biology, 216(4):835–858, 1990.
  • [4] Da-Jung Cho, Szilárd Zsolt Fazekas, Shinnosuke Seki, and Max Wiedenhöft. A formalization of co-transcriptional splicing as an operation on formal languages, 2025. doi:10.48550/arXiv.2504.13354.
  • [5] Gloria Del Solar, Rafael Giraldo, María Jesús Ruiz-Echevarría, Manuel Espinosa, and Ramón Díaz-Orejas. Replication and control of circular bacterial plasmids. Microbiology and molecular biology reviews, 62(2):434–464, 1998.
  • [6] Thomas R Einert and Roland R Netz. Theory for RNA folding, stretching, and melting including loops and salt. Biophysical journal, 100(11):2745–2753, 2011.
  • [7] Michael R Garey and David S Johnson. Computers and intractability, volume 29. W.H. Freeman New York, 2002.
  • [8] Cody Geary, Pierre-Étienne Meunier, Nicolas Schabanel, and Shinnosuke Seki. Oritatami: a computational model for molecular co-transcriptional folding. International Journal of Molecular Sciences, 20(9):2259, 2019.
  • [9] Cody Geary, Paul WK Rothemund, and Ebbe S Andersen. A single-stranded architecture for cotranscriptional folding of RNA nanostructures. Science, 345(6198):799–804, 2014.
  • [10] Hermann Gruber and Markus Holzer. Computational complexity of NFA minimization for finite and unary languages. In Remco Loos, Szilárd Zsolt Fazekas, and Carlos Martín-Vide, editors, LATA 2007. Proceedings of the 1st International Conference on Language and Automata Theory and Applications, volume Report 35/07, pages 261–272. Research Group on Mathematical Linguistics, Universitat Rovira i Virgili, Tarragona, 2007.
  • [11] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, 1979.
  • [12] Tao Jiang and Bala Ravikumar. Minimal NFA problems are hard. SIAM Journal on Computing, 22(6):1117–1141, 1993. doi:10.1137/0222067.
  • [13] Lila Kari, Elena Losseva, Stavros Konstantinidis, Petr Sosik, and Gabriel Thierrin. A formal language analysis of DNA hairpin structures. Fundamenta Informaticae, 71:453–475, 2006. URL: http://content.iospress.com/articles/fundamenta-informaticae/fi71-4-05.
  • [14] Lila Kari and Gabriel Thierrin. Contextual insertions/deletions and computability. Information and Computation, 131(1):47–61, 1996. doi:10.1006/INCO.1996.0091.
  • [15] David Maier. The complexity of some problems on subsequences and supersequences. Journal of the ACM (JACM), 25(2):322–336, 1978. doi:10.1145/322063.322075.
  • [16] Evan C. Merkhofer, Peter Hu, and Tracy L. Johnson. Introduction to Cotranscriptional RNA Splicing, pages 83–96. Humana Press, Totowa, NJ, 2014.
  • [17] Evan C Merkhofer, Peter Hu, and Tracy L Johnson. Introduction to cotranscriptional RNA splicing. Springer, 2014.
  • [18] Marina M O’reilly, Mark T Mcnally, and Karen L Beemon. Two strong 5’ splice sites and competing, suboptimal 3’ splice sites involved in alternative splicing of human immunodeficiency virus type 1 RNA. Virology, 213(2):373–385, 1995.
  • [19] Xavier Roca, Ravi Sachidanandam, and Adrian R Krainer. Determinants of the inherent strength of human 5’ splice sites. RNA, 11(5):683–698, 2005.
  • [20] V Stewart, R Landick, and C Yanofsky. Rho-dependent transcription termination in the tryptophanase operon leader region of Escherichia coli K-12. Journal of Bacteriology, 166(1):217–223, 1986.

Appendix A Proof of Correctness for the Construction in Theorem 5

In this section, a formal correctness proof for the construction given in Theorem 5 is provided. First, we show that L(A)⊆[wω]↬𝚙-𝚕𝚘𝚐S via a constructive argument. Then, we show that [wω]↬𝚙-𝚕𝚘𝚐S⊆L(A) by an inductive argument over the hairpin deletion steps that have to be taken in order for parallel hairpin deletion to terminate.

Formal proof of the first direction (L(A)⊆[wω]↬𝚙-𝚕𝚘𝚐S).

Assume u∈L(A). Assume qi1qi2⋯qi|u|+1 to be a sequence of states witnessing u∈L(A), i.e., such that qi1=q1, qi|u|+1∈F, and for all j∈[|u|], we have qij+1∈δ(qij,u[j]). Let kj denote the number of the transition taken to obtain qij+1 from qij with u[j]. By the construction above, we know that we can factorize wω into

v1 u[1] v2,ℓ v2,r u[2] ⋯ u[|u|−1] v|u|,ℓ v|u|,r u[|u|] v|u|+1,ℓ v|u|+1,r fs v|u|+2 fe t y wω

such that t∈T, y=θ(Sq1)Eq1, and

v1 = S1,1⋯S1,k1 θ(S1,1⋯S1,k1) E1,k1,
vj,ℓ = Sqij θ(Sqij) Eqij, for j∈[|u|+1]∖{1},
vj,r = Sij,1⋯Sij,kj θ(Sij,1⋯Sij,kj) Eij,kj, for j∈[|u|]∖{1},
v|u|+1,r = Si|u|+1,1⋯Si|u|+1,k+1 θ(Si|u|+1,1⋯Si|u|+1,k+1) Ei|u|+1,k+1, where k is the number of outgoing transitions of qi|u|+1, and
v|u|+2 = Send θ(Send) Eend.

Each such word v marks a single factor completely removed by a single hairpin-removal step. As each right context appears uniquely in w, we assume that the first following occurrence in wω is taken. As we always pick a left context right after a removed factor and as each u[i] cannot be part of a left context by assumption in the construction, we know that this factorization is valid for maximal parallel bounded hairpin deletion. Intuitively, v1 is a factor removed by hairpin deletion that determines the first transition that is taken. For each j∈[|u|+1]∖{1}, the factor vj,ℓ is a factor that can be completely removed by hairpin deletion and that brings us to the beginning of some factor si, i∈[|Q|], from which point on we can select the next transition. The factor vj,r, for j∈[|u|]∖{1}, similar to v1, again is a factor removed by hairpin deletion that determines the next transition that is taken. Finally, we continue with two last removed hairpins that result in transcription termination. As qi|u|+1 is a final state, the factor v|u|+1,r actually exists and can be removed by hairpin deletion. It is followed by fs. Then, in a last step, we can remove v|u|+2 to obtain a suffix fsfe in the transcribed word that is followed by t y wω, resulting in termination, i.e., resulting in u∈[wω]↬𝚙-𝚕𝚘𝚐S.

Formal proof of the second direction ([wω]↬𝚙-𝚕𝚘𝚐S⊆L(A)).

Next, we need to show that we cannot obtain any word that is not in L(A). Let u∈[wω]↬𝚙-𝚕𝚘𝚐S. By terminating hairpin deletion, we know that there exists a prefix wp t of wω, for t∈T and wp∈Σ+, such that wp ↬ u s, for some suffix s∈Σ2m of length 2m that forms a hairpin. For wp, we know by the definition of parallel bounded hairpin deletion that there exists some n∈ℕ such that

wp = u1 v1 ⋯ un vn un+1

with ui∈Σ*, for all i∈[n+1], and vj∈Σ+, for all j∈[n], for which we have vj ↬ ε, and such that u s = u1⋯un+1. By the definition of maximal parallel bounded hairpin deletion, we know that for all (x,y)∈C we have x∉Fact(ui), for all i∈[n+1]. For all j∈[n], we know that vj starts with some left context x for some (x,y)∈C.

First, we show inductively that we can only obtain letters that can be read during a computation of A. For that, we show that u1 must be empty, that v1 contains only a single deleted hairpin, that u2 is just a single letter that represents a label of some outgoing transition of q1, and that v2 consists of exactly two subsequently removed hairpins. (For simplicity, in this proof, we assume that a single vi may contain multiple subsequently removed hairpins, also resulting in the empty word. Following the formal model, between each two subsequently removed hairpins, a factor u would have to be added, which is set to the empty word. In this proof, we always assume that such a factorization is possible if we talk about subsequently removed hairpins in a single factor v.) By induction, and using the same arguments used for the base step, we obtain that all ui, for i∈[n−1]∖{1}, must be single letters representing labels of transitions in A, that all vi, with i∈[n−1] odd, contain just a single removed hairpin, and that all vj, with j∈[n−1] even, contain exactly two subsequently removed hairpins (hence, they could be split into a factor containing two distinct hairpins vj1 and vj2 that surround the empty word in the above factorization). After that, we prove the terminating condition, resulting in u actually being in L(A).

Suppose u1≠ε. Then v1 does not start at w[1]. But then, the next possible starting position for v1 is the next occurring left context x of some (x,y)∈C that does not start at w[1]. By construction, this is the left context Sqk for some state qk∈Q that is located after the first encoded transition. However, if v1 starts with that or a later occurring left context, then, e.g., S1,1∈Fact(u1), which is a contradiction as (S1,1,E1,1)∈C. So, u1=ε and v1 has a prefix S1,1⋯S1,k for some kth transition of state q1.

By construction, we know that there exists a unique right context E1,k such that (S1,1⋯S1,k,E1,k)∈C. Suppose v1 is factorized into v1=v1,1v1,2 (or more factors) such that, for both i∈[2], we have v1,i ↬ ε. Then, v1,1 would end with E1,k and v1,2 would start immediately after E1,k in w. As before, v1,2 would need to start with a left context x for some (x,y)∈C. However, by construction of w we know that after E1,k follows just the letter of the kth transition of q1, but no left context. So, this is a contradiction and we have v1=v1,1 as well as u2[1]=𝚊1,k, where 𝚊1,k represents the letter of the kth transition of q1.

Now, suppose that u2 has length |u2|>1. Then v2 cannot start with the left context Sδ(q1,𝚊1,k). But then, there are only 3 possibilities for the next occurring left context in wω. Either it is the left context Sδ(q1,𝚊1,k+1), i.e., the left context representing the jump to another state for the (k+1)th transition of q1, or, if q1 is also a final state, the left context Send which is used in the termination process, or, if q1 is not a final state and q1 only has k transitions, one of the left contexts used to determine the choice of transitions in the state q2, e.g., S2,1, S2,2, and so on. No matter the choice of the next left context, we observe that their occurrences are disjoint from the occurrence of Sδ(q1,𝚊1,k). Hence, u2 would have a prefix 𝚊1,kSδ(q1,𝚊1,k), which is a contradiction to the definition of maximal parallel bounded hairpin deletion. So, we can only have u2=𝚊1,k and v2 has the prefix Sδ(q1,𝚊1,k).

For readability purposes, until defined otherwise, from now on assume that qi=δ(q1,𝚊1,k). Hence, Sδ(q1,𝚊1,k)=Sqi. In contrast to v1, for v2, we have to show that it always contains exactly 2 subsequent hairpin deletion steps. First, as v2 has the prefix Sqi for which only the unique right context Eqi in (Sqi,Eqi)∈C exists, we know that the first hairpin removed in v2 starts with Sqi and ends with Eqi. By the definition of wω, we know that immediately after that, we are at the beginning of the factor si. As si is constructed analogously to s1, we know by arguments analogous to the ones made for u1 and v1 that we have to continue with some context (Si,1⋯Si,k,Ei,k)∈C, k referring to the selected transition. This results in v2 containing at least 2 subsequently removed hairpins. We know by the arguments from before that after the occurrence of Ei,k in wω, there follows some letter 𝚊i,k which represents the letter of the kth transition of qi. Also, we know by the same arguments that this occurrence of 𝚊i,k is not part of any left context in C. Hence, v2 consists of exactly two subsequent hairpin deletion steps.

In addition, we obtain that u3 has the letter 𝚊i,k as a prefix. By arguments analogous to those from before, we get that u3, and in particular every following ui, at least for i∈[n−1] (the remaining indices are handled in the terminating condition), is a single letter representing the letter of some transition of A. By induction, we also know that the order of those letters always represents a valid computation on A.

So, by now we know that u must always have a prefix that represents the prefix of some word in L(A). What is left to show is that termination can only occur when u actually is in L(A). This can be done by a combinatorial argument on the construction of wω.

First, we know that we can only terminate before the position of the factor t∈T in w. Also, we know that we need a hairpin stem of length 2m, obtained by hairpin deletion, immediately before that occurrence of t. By the construction of se in w, we know that t is preceded by the factor fe of length m. fe is not part of any context. Hence, un+1 has the suffix fe. We must obtain the suffix θ(fe)fe by hairpin deletion from wω. By construction, we know θ(fs)=fe, so θ(fe)=fs. By construction, we also know that fe is preceded by Eend, which is, by assumption, not a factor of fs. Hence, Eend needs to be removed by hairpin deletion, resulting in un+1=fe. This is only possible with the corresponding context (Send,Eend)∈C, which can only be found in the encoding words si of final states qi∈F. By construction, these occurrences of Send are preceded by fs. Also by construction, fs is preceded by some right context Ei,k+1, assuming the encoding si of some final state qi with k transitions. Using the inductive arguments from before, we know that we can obtain fs as part of some factor un if and only if we select the context (Si,1⋯Si,k+1,Ei,k+1) as the last removed hairpin in vn−1. Hence, un=fs.

Also by the inductive arguments from before, we know that vn−1 consists of exactly two subsequently removed hairpins, where the first one is enclosed by (Sqi,Eqi). For all other previous uj, for j∈[n−1], we know that they are all single letters representing a prefix of a valid computation on A. As we can obtain fs only in final states, we know that u1⋯un−1∈L(A). Also, we know that the suffix of length 2m is actually s=fsfe=unun+1. So, u=u1⋯un−1 and by that u∈L(A). This concludes this direction and the proof of this construction.

Appendix B Formal Definitions of RB-NFAs and DB-NFAs

Definition 15.

A return-bounded nondeterministic finite automaton (RB-NFA) A is a hexatuple A=(Q,c,Σ,δ,q1,F) where Σ, δ, q1, and F are defined as for usual NFAs, c denotes a return-bound, and Q={q1<⋯<q|Q|} is an ordered finite set of states. An RB-NFA accepts a word w∈Σ∗ if δ(q1,w)∩F≠∅ and there exists an accepting path q1=p1,…,pℓ with pℓ∈δ(q1,w)∩F that contains no more than c pairs pi>pi+1 for i∈[ℓ−1].
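As an illustration, RB-NFA acceptance can be decided by a subset-style simulation that additionally tracks how many returns (steps pi>pi+1 in the fixed state order) a run has used so far. The following is a minimal sketch, not taken from the paper: states are represented by their indices 1,…,|Q|, and δ is assumed to be given as a dictionary from (state, letter) pairs to successor sets.

```python
def rb_nfa_accepts(delta, final, c, word, start=1):
    """Check RB-NFA acceptance: some run from `start` over `word` must end
    in a final state while using at most `c` returns, i.e. steps from a
    higher-numbered to a lower-numbered state in the fixed order."""
    # frontier holds pairs (state, number of returns used on some run)
    frontier = {(start, 0)}
    for a in word:
        frontier = {(t, used + (1 if t < s else 0))
                    for (s, used) in frontier
                    for t in delta.get((s, a), ())
                    if used + (1 if t < s else 0) <= c}
    return any(s in final for (s, _) in frontier)

# Toy NFA: 1 -A-> 2, 2 -U-> 1, 2 -A-> 2, with final state 2.
# The step 2 -> 1 is the only return.
delta = {(1, "A"): {2}, (2, "U"): {1}, (2, "A"): {2}}
```

With return-bound 0, the word 𝙰𝚄𝙰 is rejected here, since its only run uses the return from state 2 to state 1, while it is accepted with return-bound 1.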

Definition 16.

A distance-bounded nondeterministic finite automaton (DB-NFA) A is a heptatuple A=(Q,fd,c,Σ,δ,q1,F) where Q is an ordered finite set of states, Σ, δ, q1, and F are defined as for usual NFAs, c denotes a distance-bound, and fd:Q×Q→ℕ is a distance function such that fd(qi,qi+1)∈ℕ for i∈[|Q|−1], fd(q|Q|,q1)∈ℕ, fd(qi,qj)=∑k∈[i..j−1]fd(qk,qk+1) if i<j−1, and fd(qi,qj)=fd(qi,q|Q|)+fd(q|Q|,q1)+fd(q1,qj) if j<i. A DB-NFA accepts a word w∈Σ∗ if δ(q1,w)∩F≠∅ and there exists an accepting path q1=qi1,…,qiℓ with qiℓ∈δ(q1,w)∩F satisfying the condition that fd(qij,qij+1)≤c for j∈[ℓ−1].
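The extended distance function is fully determined by the consecutive distances fd(qi,qi+1) and the wrap-around distance fd(q|Q|,q1), following the two cases of the definition. A sketch with our own naming (1-based state indices; the self-distance fd(qi,qi) is taken as 0 here, a case the definition leaves implicit):

```python
def make_fd(step):
    """step[i-1] = fd(q_i, q_{i+1}) for i < n, and step[n-1] = fd(q_n, q_1).
    Returns the extended distance function fd on arbitrary state pairs."""
    n = len(step)
    def fd(i, j):
        if i < j:        # forward: sum of the consecutive distances
            return sum(step[k - 1] for k in range(i, j))
        if j < i:        # backward: wrap around through q_n and q_1
            return fd(i, n) + step[n - 1] + fd(1, j)
        return 0         # self-distance, left implicit in the definition
    return fd

def db_nfa_accepts(delta, final, fd, c, word, start=1):
    """A run may only use transitions whose distance is at most c."""
    frontier = {start}
    for a in word:
        frontier = {t for s in frontier
                    for t in delta.get((s, a), ()) if fd(s, t) <= c}
    return bool(frontier & final)
```

The design mirrors the intended physical reading: distances grow monotonically along the circular order of the states, so a backward jump always pays for the full way around, as on the circular DNA template.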

Appendix C Additional Content Regarding the Example for Theorem 5 using the RNA Alphabet 𝚺𝑹𝑵𝑨={𝙰,𝚄,𝙲,𝙶}

C.1 Detailed Examination of Factors in the General Construction

This detailed examination is related to point (4) in Section 3.1.

For all contexts (Si,1⋯Si,j,Ei,j)∈C, we see that they are bordered by either 𝙰𝙰 or 𝚄𝚄, followed and preceded by either 𝙲 or 𝙶. For all Si,1⋯Si,j, we know that after the first 𝙰𝙰, there occur i many 𝙲's. We notice that there is only one position in the word where 𝙰𝙰(𝙲)i occurs, and that is at the beginning of si. In addition to that, now regarding Ei,j, we notice that there are two positions where the corresponding (𝙶)i𝚄𝚄 occurs, and these are either at the end of θ(Si,1⋯Si,j) or at the end of Ei,j. As we need an occurrence of a right context that is preceded by a valid hairpin, we see that the first occurrence has to be used in the hairpin formation between Si,1⋯Si,j and θ(Si,1⋯Si,j), and that the second occurrence can then be used as a right context. So, no other unintended placements are possible.

The same holds for each context (Sqi,Eqi)∈C, which are bordered by 𝙰𝙰𝙰 (resp. 𝚄𝚄𝚄), encapsulating a unique number oi of 𝙲's or 𝙶's. As before, the word used in the right context also occurs twice, once for the hairpin formation and once for the right context recognition. The first occurrence has to be used to form a valid hairpin and the second one, as the margin is set to 0, has to be a consecutive factor, i.e., the intended occurrence of Eqi. The same can be said analogously for (Send,Eend)∈C. Hence, this construction generally results in no unintended factors, as long as the numbers of 𝙲's and 𝙶's between the respective borders are pumped up enough.

One also has to make sure that the letters from the transitions 𝚊i,k are not part of some context. But this is prevented, as these are always preceded by 𝚄𝚄 at the end of Ei,k and followed by 𝙰𝙰 in Sqi,k. So, they cannot be part of any context. The words fs and fe might also interfere with some context, but as both are bordered by (𝙰)4 (resp. (𝚄)4), this cannot happen.
The termination word t=(𝚄)10 can also not be part of any context, as it is preceded by (𝚄)4 in fe and followed by 𝚄𝚄𝚄 in θ(Sq1) (see the construction in the proof of Theorem 5 to verify this).

C.2 Specific Example Construction

This section contains a specific example for the construction of a circular word wω for some NFA A for which we have L(A)=[wω]𝚙-𝚕𝚘𝚐, using the framework given in Section 3.1. Let A=(Q,Σ𝚁𝙽𝙰,q1,δ,F) be an NFA defined by the following graph representation:

Figure 7: NFA A with Q={q1,q2}, F={q2}, and δ={((q1,𝙰),q2),((q2,𝚄),q1),((q2,𝙰),q2)}. It accepts the regular language 𝙰(𝙰∪𝚄𝙰)∗.
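The transition behaviour of A can be checked with a plain subset simulation. A small sketch (state and letter names follow the figure; the dictionary encoding of δ is ours):

```python
def nfa_accepts(delta, start, final, word):
    """Subset simulation of an NFA given as (state, letter) -> successors."""
    current = {start}
    for a in word:
        current = {t for s in current for t in delta.get((s, a), ())}
    return bool(current & final)

# delta of the NFA A from Figure 7
delta = {("q1", "A"): {"q2"}, ("q2", "U"): {"q1"}, ("q2", "A"): {"q2"}}

accepted = [w for w in ("A", "AA", "AUA", "AU", "UA", "")
            if nfa_accepts(delta, "q1", {"q2"}, w)]
```

Here 𝙰, 𝙰𝙰, and 𝙰𝚄𝙰 are accepted, while 𝙰𝚄, 𝚄𝙰, and the empty word are not.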

We use the construction in Theorem 5 and encoding given above to obtain a word w for which we have [wω][Uncaptioned image]𝚙-𝚕𝚘𝚐S=L(A) for the properties tuple S=(Σ𝚁𝙽𝙰,θ,1,C,0,{t},9). Notice that the margin is of length 0, that the set of terminating sequences only contains t=(𝚄)10, that the log factor is 1, that the alphabet is Σ𝚁𝙽𝙰, and that θ is given as above.

First, we set the state encodings s1 and s2. Notice that q1 only has 1 outgoing transition while q2 has two outgoing transitions and is a final state. For s1, we obtain

s1 = 𝙰𝙰𝙲𝙰𝙰𝙶𝙰𝙰 [S1,1] · 𝚄𝚄𝙲𝚄𝚄𝙶𝚄𝚄 [θ(S1,1)] · 𝚄𝚄𝙲𝚄𝚄𝙶𝚄𝚄 [E1,1] · 𝙰 [𝚊1,1] · 𝙰𝙰𝙰(𝙲)6𝙰𝙰𝙰 [Sq2] · 𝚄𝚄𝚄(𝙶)6𝚄𝚄𝚄 [θ(Sq2)] · 𝚄𝚄𝚄(𝙶)6𝚄𝚄𝚄 [Eq2],

where each bracketed name labels the factor preceding it and the factor 𝚊1,1Sq2 is also denoted by e1,1.
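The displayed factors can be double-checked mechanically with a small helper for θ, the antimorphic involution on Σ𝚁𝙽𝙰 with 𝙰↔𝚄 and 𝙲↔𝙶; a sketch in which ASCII letters stand in for the typeset RNA letters:

```python
def theta(w):
    """Antimorphic involution: reverse the word, then complement A<->U, C<->G."""
    comp = {"A": "U", "U": "A", "C": "G", "G": "C"}
    return "".join(comp[a] for a in reversed(w))

S11 = "AACAAGAA"                       # S1,1 as displayed above
assert theta(S11) == "UUCUUGUU"        # matches the displayed theta(S1,1)
assert theta(theta(S11)) == S11        # theta is an involution
assert theta("AAAAGAAAA") == "UUUUCUUUU"   # theta(fs) = fe, as used later
```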

Notice that we set o2=6 in Sq2 and Eq2 due to the total size of the word. For s2, due to space constraints, we first give the abstract structure and then the assignment to each word. Notice that s2 is the last state encoding before appending se to w. Hence, no θ(Sq3)Eq3 has to be added in the end. We have

s2 = S2,1S2,2S2,3 · θ(S2,1)E2,1𝚄Sq1 [e1] · θ(S2,1S2,2)E2,2𝙰Sq2 [e2] · θ(S2,1S2,2S2,3)E2,3fsSend [e3]

for the following assignments, where each bracketed name labels the factor preceding it:

S2,1S2,2S2,3 = 𝙰𝙰𝙲𝙲𝙰𝙰𝙶𝙰𝙰 [S2,1] · 𝙶𝙰𝙰 [S2,2] · 𝙶𝙰𝙰 [S2,3]
θ(S2,1)E2,1𝚄Sq1 = 𝚄𝚄𝙲𝚄𝚄𝙶𝙶𝚄𝚄 [θ(S2,1)] · 𝚄𝚄𝙲𝚄𝚄𝙶𝙶𝚄𝚄 [E2,1] · 𝚄 · 𝙰𝙰𝙰(𝙲)5𝙰𝙰𝙰 [Sq1]
θ(S2,1S2,2)E2,2𝙰Sq2 = 𝚄𝚄𝙲𝚄𝚄𝙲𝚄𝚄𝙶𝙶𝚄𝚄 [θ(S2,1S2,2)] · 𝚄𝚄𝙲𝚄𝚄𝙲𝚄𝚄𝙶𝙶𝚄𝚄 [E2,2] · 𝙰 · 𝙰𝙰𝙰(𝙲)6𝙰𝙰𝙰 [Sq2]
θ(S2,1S2,2S2,3)E2,3fsSend = 𝚄𝚄𝙲𝚄𝚄𝙲𝚄𝚄𝙲𝚄𝚄𝙶𝙶𝚄𝚄 [θ(S2,1S2,2S2,3)] · 𝚄𝚄𝙲𝚄𝚄𝙲𝚄𝚄𝙲𝚄𝚄𝙶𝙶𝚄𝚄 [E2,3] · (𝙰)4𝙶(𝙰)4 [fs] · 𝙰𝙰𝙰(𝙲)7𝙰𝙰𝙰 [Send]

Notice that we set, in addition to o2=6, the numbers o1=5 and oe=7 for the words Sq1,Eq1, Send (thus also for Eend in se). The numbers are selected such that hairpin deletion regarding the logarithmic model works. Finally, se is encoded in the following manner:

se = 𝚄𝚄𝚄(𝙶)7𝚄𝚄𝚄 [θ(Send)] · 𝚄𝚄𝚄(𝙶)7𝚄𝚄𝚄 [Eend] · (𝚄)4𝙲(𝚄)4 [fe] · (𝚄)10 [t] · 𝚄𝚄𝚄(𝙶)5𝚄𝚄𝚄 [θ(Sq1)] · 𝚄𝚄𝚄(𝙶)5𝚄𝚄𝚄 [Eq1],

where, again, each bracketed name labels the factor preceding it.

Remember that θ(Sq1)Eq1 is needed to jump to s1. It is encoded at the end of se such that each word starts with the left context S1,1 in s1. In total, we set

w=s1s2se.

We have to check that each intended hairpin can actually be formed with regard to the length bound obtained from the logarithmic energy model. For that, we notice that each left context for the transition selection has length at least |S1,1|=8, which supports a loop of length 2^8=256; this clearly works out inside each of s1 and s2. Regarding jumps between s1, s2, and se, notice that the shortest context responsible for such a jump, Sq1, has length |Sq1|=11, which supports a loop of length 2^11=2048. That is longer than w itself, hence such a log-hairpin can also always be formed.
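Both loop-length bounds can be verified numerically by assembling w from the factors of the example. The following sketch uses our reading of the factors, with each E-word taken equal to the corresponding θ-word (as in s1) and ASCII letters standing in for the RNA letters:

```python
def theta(w):
    """Antimorphic involution on the RNA alphabet: reverse and complement."""
    comp = {"A": "U", "U": "A", "C": "G", "G": "C"}
    return "".join(comp[a] for a in reversed(w))

# Factors of the example construction
S11, Sq1, Sq2 = "AACAAGAA", "AAA" + "C" * 5 + "AAA", "AAA" + "C" * 6 + "AAA"
S21, S22, S23 = "AACCAAGAA", "GAA", "GAA"
fs, Send = "AAAAGAAAA", "AAA" + "C" * 7 + "AAA"
fe, t = "UUUUCUUUU", "U" * 10

s1 = S11 + theta(S11) + theta(S11) + "A" + Sq2 + theta(Sq2) + theta(Sq2)
s2 = (S21 + S22 + S23
      + theta(S21) + theta(S21) + "U" + Sq1
      + theta(S21 + S22) + theta(S21 + S22) + "A" + Sq2
      + theta(S21 + S22 + S23) + theta(S21 + S22 + S23) + fs + Send)
se = theta(Send) + theta(Send) + fe + t + theta(Sq1) + theta(Sq1)
w = s1 + s2 + se

# a left context of length L supports a loop of length 2**L
assert 2 ** len(S11) == 256 and len(s1) < 256 and len(s2) < 256
assert 2 ** len(Sq1) == 2048 and len(w) < 2048
```

With this reading, |s1| = 61, |s2| = 134, and |w| = 262, comfortably below both bounds.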

We obtain [wω]𝚙-𝚕𝚘𝚐S=L(A) by the proof of Theorem 5 and the fact that no factor occurs in an unintended position.