Abstract 1 Introduction 2 Preliminaries 3 Fault Detection and Fault Identification 4 Model-dependent Fault Detection 5 Conclusions References Appendix A Proofs of theorems

Fault Detection and Identification by Autonomous Mobile Robots

Stefano Clemente ORCID Department of Computer Science, Università degli Studi di Milano, Italy Caterina Feletti ORCID Department of Computer Science, Università degli Studi di Milano, Italy
Abstract

The Look-Compute-Move model (LCM) is adopted to study swarms of mobile robots that have to solve a given problem. Robots are generally assumed to be autonomous, indistinguishable, anonymous, homogeneous and to move on the Euclidean plane. Different LCM sub-models have been theorized to study different settings and their computational power. Notably, the literature has focused on four base models (i.e., 𝒪𝒪𝒯, 𝒮𝒯𝒜, 𝒞𝒪, 𝒰) that differ in memory and communication capabilities, and in different synchronization modes (e.g., fully synchronous FSYNCH, semi-synchronous SSYNCH).

In this paper, we consider fault-prone models where robots can suffer from crash faults: each robot may irremediably stop working after an unpredictable time. We study the general Fault Detection (FD) problem which is solved by a swarm if it correctly detects whether a faulty robot exists in the swarm. The Fault Identification (FI) problem additionally requires identifying which robots are faulty. We consider 12 LCM sub-models (𝒪𝒪𝒯, 𝒮𝒯𝒜, 𝒞𝒪, 𝒰, combined with FSYNCH, SSYNCH, and the round-robin RROBIN) and we study the (im)possibility of designing reliable procedures to solve FD or FI. In particular, we propose three distributed algorithms so that a swarm can collectively solve FD under the models 𝒰FSYNCH, 𝒞𝒪FSYNCH, and 𝒰RROBIN.

Keywords and phrases:
Autonomous mobile robots, Faulty robots, Look-Compute-Move, Fault detection, Fault identification, Round-robin
Copyright and License:
[Uncaptioned image] © Stefano Clemente and Caterina Feletti; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Distributed algorithms
Editors:
Kitty Meeks and Christian Scheideler

1 Introduction

Imagine having a swarm of firefighter robots still in an area, aimed at surveilling it from fires and extinguishing them. Such robots perform an infinite sequence of computational cycles: at each clock, some of them are activated, they look at the environment, they process the snapshot just taken and, if a fire has been sensed, they properly move to fight it. They are provided with limited, energy-saving message protocols for intra-swarm communication; external communication is allowed only in case of emergency, due to its consumption.

From a theoretical point of view, systems and scenarios like this have been deeply studied over the last two decades. Specifically, the Look-Compute-Move (LCM) model [19] has been adopted to study dynamic distributed systems and formalize swarm of elementary and mobile robots with computational capabilities (network of sensors, drones, mobile software agents…). In this model, an activated robot performs an LCM cycle, so it looks at its surroundings, computes its next position by executing a distributed algorithm, and then reaches it. Much literature has been interested in designing and analyzing distributed algorithms the swarm can use to solve a collaborative task, called problem. Typically, robots are required to arrange according to some conditions (Pattern Formation [11, 17, 27, 29], Gathering [10, 16, 18, 21], or Scattering [20, 23]), or to travel in the environment (Flocking [7], Exploration [3]). Generally, robots are assumed to be autonomous, homogeneous, indistinguishable, anonymous, and punctiform entities in the Euclidean plane. According to storage and communication capabilities, four settings have been proposed to study different computational powers: 𝒪𝒪𝒯 (oblivious and silent robots), 𝒮𝒯𝒜 (silent but with constant-size storage capabilities), 𝒞𝒪 (oblivious but with constant-size communication capabilities) and 𝒰 (with constant-size communication and storage capabilities). Transversally, different modes have been proposed in terms of robot activation and synchronization. In the semi-synchronous mode (SSYNCH), time is split into atomic rounds and at each round, an arbitrary non-empty subset of the swarm is activated and it executes an LCM cycle in perfect synchrony. The fully synchronous mode (FSYNCH), where all the robots are activated at each round, is the most studied SSYNCH sub-mode; instead, little attention was given to the round-robin mode (RROBIN), where all robots are activated one by one before being reactivated with the same sequential order [15].

Most of the literature has considered fault-free (aka correct) robots: robots have infinite precision in perception and computation, they correctly compute the target destination, they always reach the computed destination, and they never crash. However, such strong assumptions are unrealistic: therefore, literature has started considering fault-prone systems. Various types of faults have been considered in the LCM model [15]: robots may suffer from imprecise perceptions [9], inaccurate or non-rigid movements [2, 24], byzantine behaviors [1, 13, 14, 22], or crash faults after which they stop working forever [6, 13, 26]. For this purpose, fault-tolerant algorithms have been designed to solve problems under fault-prone systems.

In this work, we focus on systems prone to crash faults, i.e., where robots may stop working irremediably and thus do not execute their LCM cycle when activated. Let us go back to the introduction example: in the case of a firefighter swarm, a lower number of functioning robots than the scheduled number may be insufficient to fight a possible fire. Due to the criticism of the scenario, all the robots should be ready to intervene, thus it is important to design trusty systems where the not-crashed (aka, correct) robots can promptly and autonomously detect whether there exists a crashed robot and, in this case, send an external message to request human intervention and maintenance. This illustrative scenario exemplifies the importance of a distributed system to be able to autonomously detect and/or identify the presence of faulty entities within the system. Different approaches can be theorized to be applied when a fault is correctly detected:

  • robots stop working to avoid compromising the integrity of the system (swarm freezing);

  • robots try to solve the original task (fault tolerance) or a weaker version of it (best effort);

  • robots solve a different problem, e.g., try to fix faulty robots (fault recovery).

Note that freezing the swarm may be a prior step to fault recovery. Preliminary to these approaches, robots may have to solve the Fault Detection (FD) and Fault Identification (FI) problems. A swarm solves Fault Detection if the correct robots detect whether there exists at least a faulty robot in the swarm; Fault Identification requires the correct robots to identify which robots are correct and which are faulty. Here, we propose an initial study for the FD and FI problems under the LCM model, paving the way for further work that may study the consequent reactions to recover the systems from faults.

Related work and our contribution.

The research in faulty-prone robot swarms has focused on providing fault-tolerant algorithms for the classic problems [4, 5, 6, 12, 13, 15, 25, 26, 28, 30]. Yet, a fault-tolerant algorithm does not necessarily contain a fault detection module, hence it does not assume that the correct robots are aware of the presence of the faulty ones.

The concept of failure detector was first introduced by Chandra and Toueg in [8] for solving Consensus in asynchronous systems prone to crash faults. Although the authors considered a different distributed model than the LCM one (i.e., static system of n processes, each one equipped with a specific algorithm, and completely connected through trusty message channels), their work defines some properties (e.g., strong/weak completeness and accuracy) that a failure detector should reasonably enjoy, independently of the distributed system to which is applied. For the LCM model, the idea of failure detector has been considered in [28, 30], where the authors provide some crash-fault detector modules as part of solutions for the Flocking problem111The Flocking problem requires robots to move together on the plane while maintaining a pattern.. Thus, their failure detectors are strictly based on the algorithms provided for Flocking. Their idea is to make correct robots compare the positions of the others, and thus understand when a robot has crashed since it has not changed position for too many activations. For this comparison, robots are allowed to store a variable SPosPrevObser which contains the set of positions of robots in the system during the last previous observation, thus granting each robot an infinite persistent memory222If the swarm is composed of n robots, SPosPrevObser(2)n..

In this paper, we embark on a general study of how a swarm of robots can detect whether there are faulty robots in the swarm, and, possibly, identify them, independently of the primary problem the swarm is solving. Note that we aim to detect/identify faults even when the swarm has reached a static configuration since it has terminated its task or because the swarm must stay still in a particular configuration (e.g., a regular polygon and a point). One can criticize the sense of detecting faults in a swarm where apparently robots have to do nothing. The motivation lies in the fact that robots may be required to stay in a particular configuration ready to be activated by an event scheduler simulating an external event occurring in the environment (e.g., a fire for the firefighter swarm). Always being correct is a key requirement for trustworthy and fault-aware systems that can aptly detect and react to a failure.

Preparatory to our investigation, the first part of this work formalizes the occurrence of crash faults and defines the Fault Detection (FD) and Fault Identification (FI) problems. As in [8], we define two properties (i.e., reliability and punctuality) that a FD or FI algorithm should enjoy; yet, differently from [8], we work only on algorithms that never provide false positives, and we analyze the delays within which a fault is detected/identified. We take into account 12 robot models (𝒪𝒪𝒯, 𝒮𝒯𝒜, 𝒞𝒪, 𝒰, combined with the modes FSYNCH, RROBIN, SSYNCH), and we study generic solutions for FD or FI under such models. Note that, differently from [8, 28, 30], we consider models where robots can store and communicate at most a constant-size amount of data, thus they can neither use their persistent memories to track the past snapshots of the swarm [28, 30], nor send the list of their suspects to the other robots [8]. We prove a reliable FI procedure cannot exist under 𝒰FSYNCH. Instead, we provide three FD procedures for three models: 𝒰FSYNCH (punctual), 𝒞𝒪FSYNCH (punctual), and 𝒰RROBIN (reliable). We prove the impossibility of designing a reliable or punctual FD procedure under the other 9 models. Due to a lack of space, the proofs of the majority of our results have been moved to Appendix A. To the best of our knowledge, this is the first work addressing fault detection for the LCM model using a general approach.

2 Preliminaries

2.1 Models

This work investigates how FD and FI can be solved under different robot models. All the models we consider present some core features, while they differ in a set of variable features.

Core features.

A swarm of robots is modeled as a set of autonomous mobile robots ={r0,,rn1} acting in the Euclidean plane 2. Robots are indistinguishable (they cannot be distinguished by external appearance), anonymous (they are not provided with any id), homogeneous (they execute the same algorithm), and punctiform entities. Each robot has a local coordinate system and a sensor through which it perceives all the robots in the plane. Each robot r has a light 𝚕𝚒𝚐r whose color is chosen from a constant-size palette 𝔏. We assume that 𝔏 contains at least the blank color /b, the default one. All the robots in the swarm are provided with the same deterministic algorithm, which is executed every time a robot is activated. The algorithm also defines the colors in 𝔏 and a total order of 𝔏; such colors are used to update the light of each robot. By default, a robot is idle. When activated by the activation scheduler, a robot r executes a Look-Compute-Move cycle: it takes the snapshot of the whole swarm (Look), it executes the algorithm using the sole snapshot as input (Compute), and it can update the color of 𝚕𝚒𝚐r and subsequently it travels along a straight trajectory towards the computed destination (Move). After that, it returns idle, so that it can be activated by the scheduler again. The snapshot of r contains the positions (according to the local coordinate system of r) and colors of all the robots of the swarm: no other information about the swarm can be perceived (e.g., whether robots are idle/active, still/moving). We assume rigid movements, i.e., each robot reaches its computed destination during its Move step. The local coordinate systems of the robots in the swarm may not be the same (disoriented robots); moreover, the local coordinate system of any robot may not persist from one activation to another (variable disorientation). Robots are allowed to form multiplicities, i.e., to occupy the same point of the plane at the same time. However, we assume strong multiplicity detection, so that each robot can see all the robots (and their colors) which form a multiplicity.

Variable features.

We now present the features under study: memory, communication, and synchronization. Regarding robots’ memory and communication capabilities, we consider the four models mainly proposed in the literature. Formally, such differences are implemented in the visibility of the robot lights which can be used as a persistent memory and/or a communication means. In the 𝒪𝒪𝒯 model (w/o memory, w/o communication), the robots’ lights are visible to no robot in the swarm (equivalently, 𝔏={/b} for any algorithm). In the 𝒮𝒯𝒜 model (with memory, w/o communication), 𝚕𝚒𝚐r is visible only to r, for each r. In the 𝒞𝒪 model (w/o memory, with communication), 𝚕𝚒𝚐r is visible to all robots but r. In the 𝒰 model (with memory, with communication), each 𝚕𝚒𝚐r is visible to each robot in . Regarding the synchronization of robots, we consider only synchronous modes, so that time is divided into atomic rounds; a subset of robots is activated at each round, and they perform their LCM cycle in perfect synchrony. In particular, in the semi-synchronous mode (SSYNCH) an arbitrary non-empty subset of robots is activated at each round. We always assume the fairness condition: for each time t and for each robot r, there exists a finite time t>t when r is activated. The fairness condition allows us to measure time as a sequence of consecutive and finite epochs: an epoch is the minimal time frame within which all robots are activated at least once. We consider two sub-modes of SSYNCH. In the fully synchronous mode (FSYNCH), the whole swarm is activated at each round. In the round-robin mode (RROBIN), all robots are activated in sequence, one robot in each round, before being reactivated in the same order. An epoch always lasts one round for FSYNCH and n rounds for RROBIN. The choice of the subset of robots activated at every time is made by an activation scheduler.

Definition 1 (Activation Scheduler).

Given a swarm of robots , an activation scheduler is a function 𝔄:2 defining the subset of the swarm 𝔄(t) activated at time333The time domain we consider is , including 0. t0.

Notation.

We use the notation XY to indicate a model for robots that possess all the above core features and that has X{𝒪𝒪𝒯,𝒮𝒯𝒜,𝒞𝒪,𝒰} as communication-memory setting and Y{𝚂,𝙵,𝚁𝚁} as synchronization mode (i.e., SSYNCH, FSYNCH, RROBIN, resp.). We indicate with the set of the resulting 12 robot models. For a model XY, we will consider both the fault-free version (i.e., where all robots are assumed to be always correct) and the fault-prone version (i.e., where robots may suffer from crash faults).

Configurations and snapshots.

Given a global coordinate system Ξ^ on 2 (unknown by the swarm), we define the global state of a robot ri at time t as the pair ζ^i=(x¯^i,c^i) where x¯^i2 is the position of ri according to Ξ^, and c^i𝔏 is the light color of ri, at time t. If ri performs an LCM cycle at time t and ri changes its global state in (x¯^i,c^i) in the related Move step, we assume that such change becomes effective at time t+1. If a group of #>0 robots have the same state ζ^ at time t, we say they form a (global) color-multiplicity, denoted as {ζ^}#. A configuration of the swarm at time t is the set C={{ζ^0}#0,,{ζ^k}#k} defining all the distinct color-multiplicities that exist at time t (thus, j=0k#j=n).

From a robot’s point of view, the configuration of the swarm can be perceived differently: as a matter of fact, each robot r has its own local (and not persistent) coordinate system, and the color of the robots’ lights may be not visible to r. By convention, we always assume that when a light is not visible to a robot r, then r sees such light as /b-colored444Note that r may not see even its own light.. Let Ξ be the local coordinate system of r at time t. A local color-multiplicity {ζ}# is a set of #>0 robots with the same local state ζ=(x¯,c), where x¯ is their position according to Ξ, while c is their light color seen by r. A snapshot can be described as a “local configuration” from the point of view of r. Thus, the snapshot of r at time t is the tuple {ζ0}#0,,{ζh}#h containing all the distinct local color-multiplicities seen by r. We always assume that the first color-multiplicity {ζ0}#0 of a snapshot represents the local color-multiplicity containing the robot r itself. Note that a snapshot is represented and managed as an ordered data structure: such an order allows an activated robot to distinguish its state ζ0 from the others ζi>0 (which are ordered according to the local coordinate system of r and the total order of the colors in 𝔏).

2.2 Crash fault schedulers

A crash fault on a robot r at time t implies that r will no longer execute an LCM cycle from time t onward. In other words, r remains idle for any time tt. Until t, r is said to be correct. According to this, we formalize the occurrence of crash faults for a swarm of robots through a scheduler of crashes.

Definition 2 (Crash Scheduler).

A crash scheduler for a swarm is a function 𝔠:{} defining for each robot r the time where r crashes (i.e., stops working). If 𝔠(r)=, r never crashes.

We say that a crash scheduler 𝔠 is k-survival if |{r𝔠(r)=}|k, i.e., it keeps at least k robots always correct. We always assume that a crash scheduler is at least 1-survival.

Definition 3 (Suppression Scheduler).

Given a crash scheduler 𝔠 defined for a swarm , we define the related suppression scheduler as the function 𝔠:2 so that

𝔠(t):={r𝔠(r)t}.

For a given time t, 𝔠(t) is the set of robots crashed at a time tt according to 𝔠. Obviously, it holds that 𝔠(t)𝔠(t). When no ambiguity arises, we will use instead of 𝔠 by omitting to indicate the related crash scheduler.

A swarm is governed by both an activation scheduler and a suppression scheduler, which operate simultaneously, each independently of the other. Hence, it is convenient to define the behavior of the combined scheduler which determines, for each time t, the subset of robots that will actually become active and start performing their LCM cycle at time t.

Definition 4 (Combined Scheduler).

Given 𝔄, respectively an activation and a suppression scheduler defined on the same swarm , the combined scheduler is the function 𝔄:2 defined as

𝔄(t)𝔄(t)(t).

Note that, despite 𝔄 is always assumed fair, a combined scheduler 𝔄 guarantees fairness iff (t)= for each time t.

3 Fault Detection and Fault Identification

3.1 Problems and procedures

Let ={r0,,rn1} be a swarm of robots executing a given distributed algorithm 𝔸 for solving a problem P under a fault-free model M. If we consider the same swarm and the same problem under the fault-prone version of M, we are interested in making robots concurrently cope with two problems: the primary problem P, and the Fault Detection problem. The Fault Detection problem (FD) is an agreement-related problem asking for a swarm of robots to reach awareness about the presence of a faulty robot if it exists. Basically, the swarm is required to behave in this way:

  • if the first crash fault has occurred at time tcrash, there must exist a finite time tdetect when at least a robot detects there exists a fault and stays still. Afterward, all the other (correct) robots must reach the awareness of the presence of a fault and stay still;

  • until tdetect, the robots have to continue solving the primary problem by executing 𝔸.

As for the other agreement-related problems typical of distributed systems (e.g., Consensus, Broadcast, Leader Election), FD aims to make all robots in the swarm agree on the same information (i.e., the absence or the presence of a faulty robot) and the same reaction (i.e., solve the primary problem or freeze the swarm, resp.). The awareness of a crash can be formally implemented in each robot by setting a local boolean variable z whose value is updated at each LCM cycle by running a Fault Detection procedure during the Compute step.

    Fault Detection procedure
    Input: A snapshot σ of the swarm.
    Output: z𝖥𝖠𝖴𝖫𝖳 if a faulty robot is detected in σ; z𝖢𝖮𝖱𝖱𝖤𝖢𝖳 otherwise.

Formally, if a crash fault has occurred for the first time at tcrash, then tdetect is the first time when at least one robot detects the fault by executing a FD procedure, sets z𝖥𝖠𝖴𝖫𝖳, and stays still. Accordingly, a finite time tagreetdetect exists where all the correct robots of the swarm agree on the presence of the fault by setting their variable z to 𝖥𝖠𝖴𝖫𝖳 and thus stay still. As we will prove in this paper, there could be an intrinsic delay between tcrash and tdetect when the first robot detects that fault and/or tagree when the swarm reaches the agreement on that fault (thus solving FD).

The Fault Identification problem (FI) is more sophisticated than FD since it requires robots to distinguish the faulty robots from the correct ones in the swarm. Formally, if a robot ri has crashed at time tcrashi, then FI requires that there exists a finite time tdetecti when at least a robot identifies ri as faulty. Afterward, as for FD, all the correct robots are required to converge on the information “ri is faulty” by a finite time tagreeitdetecti.

To solve FI, each robot queries a Fault Identification procedure, which returns a vector z¯ so that zj=𝖥𝖠𝖴𝖫𝖳 (zj=𝖢𝖮𝖱𝖱𝖤𝖢𝖳, resp.) if the corresponding j-th robot in the snapshot σ has been identified as faulty (correct, resp.). Indeed, a FI procedure simulates a FD one. For the sake of clarity, we will use the function-like notation z¯(ri) to indicate the value in z¯ which refers to the robot ri: this allows us to ignore the local index of ri according to the snapshot σ.

    Fault Identification procedure
    Input: A snapshot σ of the swarm ={r0,,rn1}.
    Output: z¯{𝖥𝖠𝖴𝖫𝖳,𝖢𝖮𝖱𝖱𝖤𝖢𝖳}n, where z¯(ri)=𝖥𝖠𝖴𝖫𝖳 iff ri is identified as faulty in σ.

3.2 Reliability and puntuality

We here define and analyze some properties about FD/FI procedures. Let be a FD procedure for ={r0,,rn1}. We define Δ=tcrashtdetect as the detection delay, where tcrash is the time of the first crash in , and tdetect is the first time when returns 𝖥𝖠𝖴𝖫𝖳. If is a FI procedure, we define Δi=𝔠(ri)ti as the identification delay for a robot ri, where 𝔠(ri) is the crash time of ri and ti is the first time returns z¯(ri)=𝖥𝖠𝖴𝖫𝖳. We say that Δ and Δi are undefined if the related faults have never occurred.

We firstly state that, if is a FD (FI, resp.) procedure that never outputs a false positive, then a fault in the swarm (of ri, resp.) cannot be detected (identified, resp.) in the same round of its occurrence. This means that Δ>0 (Δi>0, resp.). We prove the statement for FD; the proof for FI is similar.

Lemma 5 (No Instantaneous Detection).

Let be a FD procedure which never outputs false positives. Then, if tcrash is the time of the first crash, then returns 𝖢𝖮𝖱𝖱𝖤𝖢𝖳 at time tcrash.

We say that a FD/FI procedure guarantees reliability if:

  • (i) (no false positive) it never detects/identifies a fault when it does not exist in the swarm (for FD) or in a particular robot (for FI);

  • (ii) (finite delay) it detects/identifies a fault with a finite delay after its occurrence;

  • (iii) (coherence) once detected/identified, it continues to detect/identify the fault in the following rounds.

Formally, let be a FD procedure for a swarm ={r0,,rn1}. We say that is reliable when, given any crash scheduler 𝔠 for , the following conditions hold:

  • (i) if returns 𝖥𝖠𝖴𝖫𝖳 at time t then 𝔠(t1);

  • (ii,iii) if tcrash=min{𝔠(ri)ri} then tdetect>tcrashtdetect and returns 𝖥𝖠𝖴𝖫𝖳 for any time ttdetect.

At the same time, if is a FI procedure, we say that it is reliable if we have:

  • (i) if returns z¯(ri)=𝖥𝖠𝖴𝖫𝖳 with i{0,,n1} at time t then 𝔠(ri)<t;

  • (ii,iii) if 𝔠(ri)=tcrash then ti>tcrashti so that returns z¯(ri)=𝖥𝖠𝖴𝖫𝖳 at any time tti.

We say that a FD/FI procedure is punctual if it is reliable and it holds the stronger condition that:

  • if 𝔠(ri)=tcrash then returns 𝖥𝖠𝖴𝖫𝖳 (for FD) or z¯(ri)=𝖥𝖠𝖴𝖫𝖳 (for FI) for any time ttcrash+1.

In other words, is punctual when it detects/identifies the fault starting from the subsequent round after a crash occurrence; thus, it never outputs false negatives (starting from tcrash+1).

3.3 Time conditions

In the previous section, we proved that a fault cannot be detected/identified at the same round when it occurs. Here, we provide further time conditions on when a fault can or cannot be detected/identified.

Definition 6 (Missing Activation).

Let 𝔄, be an activation and a suppression scheduler for a swarm , respectively. We say that there is a missing activation of r at time t if r𝔄(t)(t). We denote with 𝔣𝔪𝔞(r) the time of the first missing activation of r.

Definition 7 (Fault Ignorance Time, FIT).

Let 𝔄,𝔠 be an activation and a suppression scheduler for a swarm , respectively. Let r be a robot whose crash time is 𝔠(r) and 𝔣𝔪𝔞(r) is its first missing activation time, with 𝔣𝔪𝔞(r)𝔠(r). We define the Fault Ignorance Time (FIT) of r as the interval (r)=[𝔠(r),𝔣𝔪𝔞(r)].

The following lemma holds since, during (r), r is faulty but has not yet missed an activation according to 𝔄, thus its behavior would have been the same as if it had been correct. The related corollary imposes a lower bound on the time for r to be identified as faulty, without guaranteeing the actual possibility of identifying such a fault.

Lemma 8 (Missing Activation Rule).

Let r be a crashed robot. Then, the crash of r cannot be identified during (r) by a reliable FI procedure.

Corollary 9.

The crash of a robot r cannot reliably be identified at time t𝔣𝔪𝔞(r).

The Missing Activation Rule can be generalized for an FD procedure.

Lemma 10.

Let tdetect be the first time when a reliable FD procedure detects a fault in a swarm . Then, tdetect>minr{𝔣𝔪𝔞(r)}.

3.4 Combined Algorithm

Besides the reliability feature, we aim to design FD or FI procedures that are the most general as possible, i.e., which can detect/identify faults independently of the primary problem that a swarm must solve. Let M be a faulty-prone model: we say that M is a model-dependent FD (FI, resp.) procedure if it reliably solves FD (FI, resp.) under M for any swarm and any primary problem the swarm can solve. Thus, let 𝔸 be a solving algorithm for a (primary) problem P under the fault-free version of M. We want to combine 𝔸 with M so that the combined algorithm Ψ(𝔸,M) becomes the new algorithm executed by each robot in its Compute step. The combined algorithm Ψ(𝔸,M) must behave in this way:

  • Correctness: Ψ(𝔸,M) must continue solving P until no crash is detected/identified. If no crash occurs, Ψ(𝔸,M) eventually solves P.

  • Soundness: Ψ(𝔸,M) must freeze the swarm (i.e., no robot will move anymore) when a crash is detected/identified. If a crash occurs, Ψ(𝔸,M) detects/identifies it after a finite delay after its occurrence, and will continue to detect/identify it in the next rounds.

Augmented snapshots.

To solve FD/FI, M may adopt an ad hoc palette of colors 𝔏 during the evolution of Ψ(𝔸,M). For the sake of clarity and w.l.o.g., we can assume that, under the fault-prone model M, each robot r is embedded with two lights, the primary light 𝚕𝚒𝚐r which assumes the colors in 𝔏 provided by the primary algorithm 𝔸, and the service light 𝚜𝚎𝚛r which assumes the colors in 𝔏 provided by M. So, the global state of a robot r is the triple ζ^=(x¯^,c^,f^) where x¯^ is its position, c^𝔏 is the color of 𝚕𝚒𝚐r and f^𝔏 is the color of 𝚜𝚎𝚛r. For the sake of uniformity, we assume 𝔏={/b} under 𝒪𝒪𝒯; moreover, when the light of r is not visible to a robot, then this robot will see 𝚜𝚎𝚛r (as well as 𝚕𝚒𝚐r) as /b-colored. Thus, a combined algorithm Ψ(𝔸,M) takes in input an augmented snapshot σ={ζ0}#0,,{ζh}#h with ζi=(x¯i,ci,fi), and returns a triple (x¯,c,f) where x¯ is the position to be reached, and c (f, resp.) is the possible color of the primary (service, resp.) light to be set during the related Move step555Remind that hatted and unhatted notations are used to distinguish the global states of robots from the local views of them.. Remind that a robot may not update the color of its lights; in that case, the value c (f, resp.) returned by the algorithm will be denoted with the symbol .

Algorithm 1 Combined algorithm Ψ(𝔸,M) where M is a model-dependent FI procedure for a model M.

Let us present our combined algorithm Ψ, which uses 𝔸 and M as black-box procedures for its implementation. The pseudo-code of Ψ is listed in Algorithm 1. We here assume that M is a FI procedure (if it were a FD procedure, the implementation would be similar). Let r be an activated robot in a swarm of n robots that executes the algorithm Ψ using the just taken snapshot σ as input. At first, Ψ runs the FI procedure M(σ), which outputs the vector z¯{𝖢𝖮𝖱𝖱𝖤𝖢𝖳,𝖥𝖠𝖴𝖫𝖳}n, and the new color f𝔏 to be set for the service light of r. Then Ψ implements two different actions according to the value of z¯:

  • (correctess): If the vector z¯ contains only 𝖢𝖮𝖱𝖱𝖤𝖢𝖳 values, Ψ computes the restricted snapshot σ𝔸 that is obtained by σ by removing the service color fi from any state ζi. Then, Ψ runs the primary algorithm 𝔸(σ𝔸) and obtains the pair (x¯,c), i.e., the possibly new position and primary color for r: this result perfectly simulates the behavior of 𝔸 in the solution of the primary problem under the fault-free version of M.

  • (soundness): Otherwise, r is aware that some robots are faulty in the swarm, and it can identify them. In this case, Ψ returns the triple (x¯0,,f) where x¯0 is the current position r (i.e., r freezes itself).

This algorithm generalizes the case when M is a FD procedure: in that case, z¯ consists of only one element.

The next theorem allows the use of M or Ψ(𝔸,M) interchangeably in the following sections.

Theorem 11.

M is a reliable model-dependent FD or FI procedure for a model M if and only if Ψ(𝔸,M) guarantees correctness and soundness for any primary algorithm 𝔸 under M.

3.5 Fault Identification unreliability

Here, we present a witness problem through which we prove the impossibility of designing a reliable model-dependent FI procedure for 𝒰𝙵.

Problem 1 (SuperPoint).

Consider n robots lying on the same point p of the Euclidean plane. The problem asks them never to change position666The motivation of this kind of problem can be found in the Introduction..

The SuperPoint problem can be trivially solved under any fault-free model M thanks to a NOP algorithm 𝔸NOP (basically, robots do nothing)777Reference to the assembly instruction NOP, i.e., No Operation.. Let Ξ^ be a coordinate system whose origin lies in p. Let be a hypothetical reliable 𝒰𝙵-dependent FI procedure. In the following, we state some properties about Ψ(𝔸NOP,) while solving SuperPoint, assuming the local coordinate system of all the robots is always Ξ^.

Lemma 12.

Let be a swarm executing Ψ(𝔸NOP,) to solve SuperPoint as primary problem, where is a reliable model-dependent FI procedure for 𝒰𝙵. Let us assume all the robots always use the coordinate system Ξ^, and let be a suppression scheduler for . Then at each round t, all the robots in (t1) start their round with the same color.

Proof.

Let 𝔏 be the palette used by 𝔸NOP to solve SuperPoint under the fault-free 𝒰𝙵 model888Although the most efficient NOP algorithm can solve the problem with 𝔏={/b}, we here assume nothing else about 𝔸NOP other than that it effectively solves SuperPoint.. Suppose 𝔏 is the palette used by . By hypothesis, the snapshot of a robot r has this form: σ={(𝐎,b0)}#0,,{(𝐎,bh)}#h where 𝐎=(0,0) is the position of any robot according to Ξ^, and where bi=(ci,fi)𝔏×𝔏 represents the pair of colors for the primary light and the service light of the robots. Note that b0 represents the colors of the robot r itself. The lemma can be easily proved by induction: in fact, at time t=0, all the robots have the same color bi=(/b,/b) (by convention, we set that (1)=). Let us now assume the fact true for a time t, i.e., all the robots in (t1) have the same color, say 𝐛, during the t-th round. Since (t)(t1), all the active robots in the t-th Look step take the same snapshot σ={(𝐎,𝐛)}#0,,{(𝐎,bh)}#h, and accordingly they compute the same next color 𝐛 by executing Ψ(𝔸NOP,)(σ). Thus, at the beginning of the (t+1)-th round, all the robots in (t) have set the same color 𝐛. By the contrapositive of Section 3.5, we have:

Corollary 13.

Let be a reliable FI model-dependent procedure for 𝒰𝙵. Let us consider the problem SuperPoint for a swarm of robots executing and let us assume all the robots always use the coordinate system Ξ^. If at the t-th round two robots have two different colors, then at least one belongs to (t1).

Theorem 14.

There can be no reliable model-dependent FI procedure for 𝒰𝙵.

Proof.

By contradiction, let us assume that there exists a reliable FI procedure for 𝒰𝙵 which uses a k-size palette 𝔏. Let us consider the SuperPoint problem under 𝒰𝙵, with nk+2 robots, and let 𝔸NOP be the optimal algorithm solving SuperPoint with 𝔏={/b}. By definition, the problem starts from a configuration where all the robots have the same color. Let us assume all the robots always use the coordinate system Ξ^. Thus, for the sake of conciseness, we can omit to state the position (0,0) and the primary light color /b in the snapshots of the robots; instead, we only specify the colors of the service lights. As proved in Section 3.5, at each round t, all the active and correct robots take the same snapshot σ={f0}#0,,{fh}#h, execute (σ)=(z¯,f) and update their service color from f0 to f. Note in fact that f0 corresponds to the current service color of all the robots in (t1) (and thus in (t)).

Let be a suppression scheduler so that it crashes a number of robots [k,n2] in k different rounds within a finite time tk. Formally, there exists a series of k finite crash times t1<<tk such that (ti)(ti+1), and such that |(tk)|[k,n2]. Let t^ be the first time when the robots in (t^) execute (σ)=(z¯,𝐟) and 𝐟 is a color already present in the current snapshot σ. In other words, there is at least a faulty 𝐟-colored robot in the swarm at time t^. Note that this time exists and it holds t^tk. In fact, the worst case (i.e., t^=tk) is reached when, at each crash time ti, outputs a new service color not yet present in the swarm: after the k-th crash time tk, the swarm contains all the k colors of 𝔏. So, at time t^, the correct robots set their color as 𝐟.

Let r be an active robot at time t^+1 and let {𝐟}#0,,{fh}#h be the snapshot r takes. Note that, since the correct robots at time t^ are at least 2 by hypothesis, then #0>2. By applying , r cannot distinguish which among the #0 𝐟-colored robots are faulty or correct. This contradicts the hypothesis that is reliable.

4 Model-dependent Fault Detection

In this section, we present three model-dependent FD procedures respectively for 𝒰𝙵, 𝒞𝒪𝙵 and 𝒰𝚁𝚁. In all of them, we assume that the related 𝔏 always contains a special color ALARM which will be used as soon as a robot detects the presence of a faulty robot during its Compute step. Accordingly, in the related Move step, the robot will set its service light as ALARM to notify itself and/or the other robots that a faulty robot has been detected. When a robot r sees an ALARM in its snapshot, it automatically is aware that a fault has been detected in the previous rounds. Thus, r sets its 𝚜𝚎𝚛r to ALARM too, whereas it updates neither its position nor its 𝚕𝚒𝚐r: this causes the alarm to be broadcast throughout the correct robots of the swarm, and the swarm to freeze at the same time.

Eventually in Section 4.4, we prove that a reliable FD procedure cannot exist under the other robot models. Table 1 summaries the properties of our proposed model-dependent FD procedures.

Table 1: Proposed model-dependent FD procedures. X{𝒪𝒪𝒯,𝒮𝒯𝒜,𝒞𝒪,𝒰}, Y{𝙵,𝚁𝚁,𝚂}.
Model Reliable Min Survivals Detection Delay |𝔏|
𝒰𝙵 Yes 1 Δ=1 (Punctual) 3
𝒞𝒪𝙵 Yes 2 Δ=1 (Punctual) 3
𝒰𝚁𝚁 Yes 1 1Δ2n1 4
𝒞𝒪𝙵 Impossible 1
𝒞𝒪𝚁𝚁 Impossible 2
X𝚂 Impossible
{𝒮𝒯𝒜,𝒪𝒪𝒯}Y Impossible

4.1 Fault detection in 𝓛𝓤𝓜𝓘𝙵

Let us present a simple reliable FD procedure 𝒰𝙵 which can be executed by any swarm ={r0,,rn1} under 𝒰𝙵. This procedure tolerates up to n1 crash faults and uses 𝔏=2{ALARM} as the service palette, 0 being the initial value. Working under FSYNCH, all the correct robots become active simultaneously, and the 𝒰𝙵 procedure makes them update their service light cycling over the values of 2={0,1}. Let tcrash=mini{𝔠(ri)} be the first time when a robot crashes according to a crash time scheduler 𝔠 (which is 1-survival). Note that tcrash can be . Then, all the robots synchronously alternate their service colors during the rounds before tcrash. If one robot crashes at time tcrash and so it does not update its light, its failure will be detected in the successive round, as it will display a different light than any other activated robot. Let r be an activated robot at time tcrash+1. So r detects from its snapshot that all the robots with a service light different from itself have crashed at the previous round, and thus sets 𝚜𝚎𝚛r as ALARM. Because of this immediate detection, the proposed procedure is punctual. Note that, although r can identify the crashed robots at time tcrash+1, this procedure is not a model-dependent FI procedure since it cannot guarantee reliability in identifying the faulty robots in the following rounds (see Theorem 14 for the impossibility general result). Algorithm 2 and Figure 1(a) show the proposed FD procedure and the color update diagram of 𝚜𝚎𝚛r, respectively. Note that in Figure 1(a) (as well as for the next diagrams) the ALARM node represents a trap state, so that no outgoing transition is permitted: this property ensures the proposed procedure is coherent.

Algorithm 2 A punctual FD procedure for 𝒰𝙵.
(a) Algorithm 2 for 𝒰𝙵.
(b) Algorithm 3 for 𝒞𝒪𝙵.
Figure 1: Diagrams showing the color update of 𝚜𝚎𝚛r according to the proposed FD procedures. The gray nodes represent the impossibility for r to see the color of its own 𝚜𝚎𝚛r (i.e., f0).

4.2 Fault detection in 𝓕𝓒𝓞𝓜𝙵

The fact that a robot under 𝒞𝒪 cannot see its own lights (𝚕𝚒𝚐 and 𝚜𝚎𝚛) imposes the first main difference in the number of tolerated crashes with respect to the previous FD procedure.

Theorem 15.

There can be no reliable FD procedure for 𝒞𝒪𝙵 assuming 1-survival crash schedulers.

Assuming that no more than n2 robots can crash, we can aptly apply some slight but necessary changes to Algorithm 2 and provide a punctual FD procedure for 𝒞𝒪𝙵 (see Algorithm 3 and the corresponding color diagram in Figure 1(b)). Notably, robots alternate their service colors in 2: a robot can promptly detect a fault and set its service light to ALARM as soon as it sees two robots (different from itself) with two different service colors.

Algorithm 3 A punctual FD procedure for 𝒞𝒪𝙵.

4.3 Fault detection in 𝓛𝓤𝓜𝓘𝚁𝚁

Under 𝒰𝚁𝚁, we can design a reliable FD procedure even considering 1-survival crash time schedulers. However, we prove that designing a punctual FD procedure for this model is impossible.

Theorem 16.

There can be no punctual FD procedure for 𝒰𝚁𝚁, even considering (n1)-survival crash time schedulers.

Algorithm 4 A reliable FD procedure under 𝒰𝚁𝚁.

We now present our reliable FD procedure 𝒰𝚁𝚁 (see Algorithm 4 and Figure 2). In this case, we need one more value in the palette 𝔏: in particular, we make our robots cyclically take any value in {0,1,2} (i.e., 3), 0 being the default value. Note that it is possible to define a cyclic order [0,1,2] on 3, which would be impossible in 2; this order defines our color transition rule i(i+1)mod3. Thus, during the i-th epoch (with i0), our strategy is to make each activated robot set its 𝚜𝚎𝚛r from imod3 to (i+1)mod3.

Suppose a robot r crashes at its activation during the i-th epoch. So, its 𝚜𝚎𝚛r remains outdated to imod3. During the remainder of the i-th epoch, all the activated robots see robots whose service lights are colored as either imod3 or (i+1)mod3, thus they may have no means to understand if r has crashed or has not been activated yet. So they update their service light to (i+1)mod3. Let r be the first robot activated during the (i+1)-th epoch. Now, r can detect the fault since it sees a robot whose service light has the preceding value (according to the cyclic order [0,1,2]) than its own. So r sets its service light to ALARM, thus starting the agreement process among the other correct robots; in fact, in the following rounds, all the correct robots will set the ALARM light too, freezing the swarm.

To compute the maximum detection delay of 𝒰𝚁𝚁, consider the worst case. If (r0,,rn1) is the round-robin sequence by which robots are activated, suppose that all the robots except for rn1 crash during the first epoch, during their activation. So, the first crash (i.e., of r0) will be detected by rn1 in the second epoch, thus 2n1 rounds later. Instead, the minimum delay occurs when rn1 crashes (at the end of an epoch) and r0 detects its fault in the following round (at the beginning of the subsequent epoch).

Figure 2: Color update of 𝚜𝚎𝚛r in Algorithm 4.

4.4 Impossibility results

Using a similar argument as in Theorem 15, we prove that a reliable FD procedure for 𝒞𝒪𝚁𝚁 cannot exist assuming 1-survival crash schedulers. However, unlike 𝒞𝒪𝙵, such impossibility holds even with 2-survival crash schedulers.

Theorem 17.

There can be no reliable FD procedure for 𝒞𝒪𝚁𝚁 even assuming 2-survival crash schedulers.

The following theorem states the impossibility of distinguishing a not-yet-activated robot from a failed one in the SSYNCH mode.

Theorem 18.

There can be no reliable FD procedure for a model M under SSYNCH.

Problem 2 (Flip-Flop).

Consider two robots p,q lying on a point O and a third robot r lying on a distinct point x. Robot r must perform these two movements perpetually: reach the symmetric point x w.r.t. O, and then come back to x.

Figure 3: Flip-Flop problem.
Theorem 19.

There can be no reliable FD procedure for a model M under 𝒮𝒯𝒜.

Proof.

By contradiction, let be a reliable FD procedure using a palette 𝔏={f0,,fk1} of k1 colors for an 𝒮𝒯𝒜 model. We show that there exists a problem for which is not reliable. Consider the problem Flip-Flop that can be trivially solved under any fault-free 𝒮𝒯𝒜 model without the use of colors: let 𝔸 be a solving algorithm with 𝔏={/b}. It is self-evident that the crash of a robot which is required to be still by the primary problem cannot be detected without assuming communication. In fact, r can never detect whether a robot in O has crashed. However, this impossibility is not limited to the static robots: we prove that p (the same holds for q) cannot reliably detect the fault of r.

Let us denote the following coordinates: 𝐎=(0,0), 𝟏=(1,0), and 𝟏=(1,0). Let Ξ+ be a coordinate system which makes a robot perceives O in position 𝐎 and x in position 𝟏 (and thus x in position 𝟏), and let Ξ be a coordinate system under which O is in position 𝐎 and x in 𝟏 (and thus x in 𝟏). Let 𝔄 be an activation scheduler (any under SSYNCH,FSYNCH,RROBIN) for p,q,r. Consider the swarm under the combined scheduler 𝔄 where is a suppression scheduler that crashes no robot. Suppose p is activated at times t0,t1,t2,, and suppose that its coordinate system is Ξ+ when r lies on x and Ξ when r lies on x. In other words, p always perceives r in position 𝟏. Since is reliable and no robot has crashed, p will always obtain 𝖢𝖮𝖱𝖱𝖤𝖢𝖳 by executing at times t0,t1,. In particular, let σ0,σ1,,σh be the set of the snapshots taken with some periodicity by p along its activations t0,t1,, where 0h<k and where σi has this form999For the sake of conciseness, we here omit the notation for multiplicities.: (𝐎,/b,fi),(𝐎,/b,/b),(𝟏,/b,/b).

Now, consider the same swarm under the combined scheduler 𝔄 where only crashes r at time t0. W.l.o.g. suppose that r lies on x at t0, and suppose that from t0 onward p will always have Ξ+ as coordinate system, thus perceiving r in position 𝟏. Thus p will be activated at times t0,t1,t2, and it will perform (σ0),(σ1),(σh) at the same times and periodicity as under 𝔄, thus never detecting the fault of r. Contradiction achieved.

Corollary 20.

There can be no reliable FD procedure for a model M under 𝒪𝒪𝒯.

5 Conclusions

In this paper, we have investigated the ability of correct robots to detect or identify the presence of crashed robots in fault-prone swarms. We have defined the problems associated with this purpose, namely Fault Detection (FD) and Fault Identification (FI), and we have studied the (im)possibility of solving FD or FI under 12 robot models (𝒪𝒪𝒯, 𝒮𝒯𝒜, 𝒞𝒪, 𝒰, combined with FSYNCH, RROBIN, SSYNCH). We have formally defined two properties (reliability and punctuality) that a FD or FI procedure should satisfy, and we have soon proved that a reliable FI procedure cannot exist under the strong model 𝒰𝙵. So, we have provided three reliable (or even punctual) FD procedures for the three models 𝒰𝙵,𝒞𝒪𝙵 and 𝒰𝚁𝚁; for the others 9 models, we have proved that it is impossible to design a reliable FD procedure.

In this vein, future works should investigate the (im)possibility of developing reliable FD or FI procedures for models not considered in our work (e.g., operating under k-bounded activation schedulers), or finding weaker properties (w.r.t. reliability and punctuality) under which it is possible to solve FD or FI. Furthermore, this preliminary study about FD and FI paves the way for subsequent work studying fault recovery distributed algorithms, i.e., where correct robots must fix the faulty ones.

References

  • [1] Yotam Ashkenazi, Shlomi Dolev, Sayaka Kamei, Fukuhito Ooshita, and Koichi Wada. Forgive & forget: Self-stabilizing swarms in spite of byzantine robots. In Proc. of 7th International Symposium on Computing and Networking, CANDAR, pages 188–194. IEEE, 2019. doi:10.1109/CANDARW.2019.00041.
  • [2] Kaustav Bose, Archak Das, and Buddhadeb Sau. Pattern formation by robots with inaccurate movements. In Proc. of 25th International Conference on Principles of Distributed Systems, OPODIS, volume 217 of LIPIcs, pages 10:1–10:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.OPODIS.2021.10.
  • [3] Quentin Bramas, Stéphane Devismes, and Pascal Lafourcade. Infinite grid exploration by disoriented robots. In Proc. of 8th International Conference on Networked Systems, NETYS, volume 12129 of LNCS, pages 129–145. Springer, 2020. doi:10.1007/978-3-030-67087-0_9.
  • [4] Quentin Bramas, Sayaka Kamei, Anissa Lamani, and Sébastien Tixeuil. Stand-up indulgent gathering on lines. Theoretical Computer Science, 1016:114796, 2024. doi:10.1016/J.TCS.2024.114796.
  • [5] Quentin Bramas, Anissa Lamani, and Sébastien Tixeuil. Stand up indulgent gathering. Theoretical Computer Science, 939:63–77, 2023. doi:10.1016/j.tcs.2022.10.015.
  • [6] Quentin Bramas and Sébastien Tixeuil. Wait-free gathering without chirality. In Christian Scheideler, editor, Proc. of 22nd International Colloquium On Structural Information and Communication Complexity, SIROCCO, volume 9439 of LNCS, pages 313–327. Springer, 2015. doi:10.1007/978-3-319-25258-2_22.
  • [7] Davide Canepa and Maria Gradinariu Potop-Butucaru. Stabilizing flocking via leader election in robot networks. In Proc. of 9th International Symposium on Stabilization, Safety, and Security of Distributed Systems, SSS, volume 4838 of LNCS, pages 52–66. Springer, 2007. doi:10.1007/978-3-540-76627-8_7.
  • [8] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. J. ACM, 43(2):225–267, 1996. doi:10.1145/226643.226647.
  • [9] Reuven Cohen and David Peleg. Convergence of autonomous mobile robots with inaccurate sensors and movements. SIAM J. Comput., 38(1):276–302, 2008. doi:10.1137/060665257.
  • [10] Gianlorenzo D’Angelo, Gabriele Di Stefano, Ralf Klasing, and Alfredo Navarra. Gathering of robots on anonymous grids and trees without multiplicity detection. Theoretical Computer Science, 610:158–168, 2016. doi:10.1016/J.TCS.2014.06.045.
  • [11] Shantanu Das, Paola Flocchini, Giuseppe Prencipe, and Nicola Santoro. Forming sequences of patterns with luminous robots. IEEE Access, 8:90577–90597, 2020. doi:10.1109/ACCESS.2020.2994052.
  • [12] Shantanu Das, Riccardo Focardi, Flaminia L. Luccio, Euripides Markou, and Marco Squarcina. Gathering of robots in a ring with mobile faults. Theoretical Computer Science, 764:42–60, 2019. doi:10.1016/J.TCS.2018.05.002.
  • [13] Xavier Défago, Maria Gradinariu, Stéphane Messika, and Philippe Raipin Parvédy. Fault-tolerant and self-stabilizing mobile robots gathering. In Proc. of 20th International Symposium on Distributed Computing, DISC, volume 4167 of LNCS, pages 46–60. Springer, 2006. doi:10.1007/11864219_4.
  • [14] Xavier Défago, Maria Potop-Butucaru, and Philippe Raipin Parvédy. Self-stabilizing gathering of mobile robots under crash or byzantine faults. Distributed Computing, 33(5):393–421, 2020. doi:10.1007/S00446-019-00359-X.
  • [15] Xavier Défago, Maria Potop-Butucaru, and Sébastien Tixeuil. Fault-tolerant mobile robots. In Chapter 10 of [19], pages 234–251. Springer, 2019. doi:10.1007/978-3-030-11072-7_10.
  • [16] Stefan Dobrev, Paola Flocchini, Giuseppe Prencipe, and Nicola Santoro. Multiple agents rendezvous in a ring in spite of a black hole. In Proc. of 7th International Conference on Principles of Distributed Systems, OPODIS, volume 3144 of LNCS, pages 34–46. Springer, 2003. doi:10.1007/978-3-540-27860-3_6.
  • [17] Caterina Feletti, Carlo Mereghetti, and Beatrice Palano. O(logn)-time uniform circle formation for asynchronous opaque luminous robots. In Proc. of 27th International Conference on Principles of Distributed Systems, OPODIS, volume 286 of LIPIcs, pages 5:1–5:21, 2023. doi:10.4230/LIPICS.OPODIS.2023.5.
  • [18] Paola Flocchini. Gathering. In Chapter 4 of [19], pages 63–82. Springer, 2019. doi:10.1007/978-3-030-11072-7_4.
  • [19] Paola Flocchini, Giuseppe Prencipe, and Nicola Santoro, editors. Distributed Computing by Mobile Entities, Current Research in Moving and Computing, volume 11340 of LNCS. Springer, 2019. doi:10.1007/978-3-030-11072-7.
  • [20] Taisuke Izumi, Daichi Kaino, Maria Gradinariu Potop-Butucaru, and Sébastien Tixeuil. On time complexity for connectivity-preserving scattering of mobile robots. Theoretical Computer Science, 738:42–52, 2018. doi:10.1016/J.TCS.2018.04.047.
  • [21] Sayaka Kamei, Anissa Lamani, Fukuhito Ooshita, Sébastien Tixeuil, and Koichi Wada. Gathering on rings for myopic asynchronous robots with lights. In Proc. of 23rd International Conference on Principles of Distributed Systems, OPODIS, volume 153 of LIPIcs, pages 27:1–27:17, 2019. doi:10.4230/LIPICS.OPODIS.2019.27.
  • [22] Anisur Rahaman Molla, Kaushik Mondal, and William K. Moses Jr. Efficient dispersion on an anonymous ring in the presence of weak byzantine robots. In Algorithms for Sensor Systems, ALGOSENSORS, volume 12503 of LNCS, pages 154–169. Springer, 2020. doi:10.1007/978-3-030-62401-9_11.
  • [23] Moumita Mondal and Sruti Gan Chaudhuri. Uniform scattering of robots on alternate nodes of a grid. In Proc. of 23rd International Conference of Distributed Computing and Networking, ICDCN, pages 254–259. ACM, 2022. doi:10.1145/3491003.3493231.
  • [24] Takashi Okumura, Koichi Wada, and Xavier Défago. Optimal l-algorithms for rendezvous of asynchronous mobile robots with external-lights. Theoretical Computer Science, 979:114198, 2023. doi:10.1016/J.TCS.2023.114198.
  • [25] Debasish Pattanayak, Klaus-Tycho Foerster, Partha Sarathi Mandal, and Stefan Schmid. Conic formation in presence of faulty robots. In Algorithms for Sensor Systems, ALGOSENSORS, volume 12503 of LNCS, pages 170–185. Springer, 2020. doi:10.1007/978-3-030-62401-9_12.
  • [26] Pavan Poudel, Aisha Aljohani, and Gokarna Sharma. Fault-tolerant complete visibility for asynchronous robots with lights under one-axis agreement. Theoretical Computer Science, 850:116–134, 2021. doi:10.1016/J.TCS.2020.10.033.
  • [27] Giuseppe Prencipe. Pattern formation. In Chapter 3 of [19], pages 37–62. Springer, 2019. doi:10.1007/978-3-030-11072-7_3.
  • [28] Samia Souissi, Yan Yang, and Xavier Défago. Fault-tolerant flocking in a k-bounded asynchronous system. In Proc. of 12th International Conference on Principles of Distributed Systems, OPODIS, volume 5401 of LNCS, pages 145–163. Springer, 2008. doi:10.1007/978-3-540-92221-6_11.
  • [29] Ichiro Suzuki and Masafumi Yamashita. Distributed anonymous mobile robots: Formation of geometric patterns. SIAM J. Comput., 28(4):1347–1363, 1999. doi:10.1137/S009753979628292X.
  • [30] Yan Yang, Samia Souissi, Xavier Défago, and Makoto Takizawa. Fault-tolerant flocking for a group of autonomous mobile robots. Journal of Systems and Software, 84(1):29–36, 2011. doi:10.1016/J.JSS.2010.08.026.

Appendix A Proofs of theorems

Lemma 3.2 (No Instantaneous Detection).

Let be a FD procedure which never outputs false positives. Then, if tcrash is the time of the first crash, then returns 𝖢𝖮𝖱𝖱𝖤𝖢𝖳 at time tcrash.

Proof.

By contradiction, suppose that detects the fault at time tcrash. Let σ be the snapshot through which (σ) returns 𝖥𝖠𝖴𝖫𝖳. However, σ would be the same if no robot has crashed at time tcrash. This contradicts the hypothesis that never outputs false positives.

Lemma 3.3 (Missing Activation Rule).

Let r be a crashed robot. Then, the crash of r cannot be identified during (r) by a reliable FI procedure.

Proof.

Assume all the robots in have the same coordinate system at any activation. Consider an activation scheduler 𝔄 and a suppression scheduler 𝔠 such that (r)=[𝔠(r),𝔣𝔪𝔞(r)] is the FIT of a robot r, with 𝔠(r)𝔣𝔪𝔞(r). By contradiction, let be a FI reliable procedure and let t[𝔠(r),𝔣𝔪𝔞(r)] be the time when another robot r identifies r as crashed by executing (σ) where σ is the snapshot of r taken at time t. Now, let us consider another suppression scheduler 𝔠 which is identical to 𝔠 except that the crash time of r is 𝔠(r)>t. According to the combined scheduler 𝔄𝔠, all the robots perform the same actions as under 𝔄𝔠 until t1. So, robot r at time t takes the same snapshot as under 𝔄𝔠 and thus identifies r as faulty by executing (σ). However, under 𝔠, robot r will be crashed at 𝔠(r)>t. This contradicts the hypothesis that is reliable.

Lemma 3.3.

Let tdetect be the first time when a reliable FD procedure detects a fault in a swarm . Then, tdetect>minr{𝔣𝔪𝔞(r)}.

Proof.

The proof is similar as in Section 3.3. Assume all the robots have the same coordinate system at any activation. Consider an activation scheduler 𝔄 and a suppression scheduler 𝔠 for , so that t𝔣𝔪𝔞=minr{𝔣𝔪𝔞(r)} and 𝔣𝔪𝔞=argminr𝔣𝔪𝔞(r) is the subset of robots which first miss an activation (at time t𝔣𝔪𝔞). By contradiction, let tdetectt𝔣𝔪𝔞 be the first time when a correct robot r detects a fault exists in the swarm by executing a reliable FD procedure . Now, let us consider another suppression scheduler 𝔠 which is identical to 𝔠 except that the crash time of any robot r is 𝔠(r)>tdetect. According to the combined scheduler 𝔄𝔠, all the robots perform the same actions until tdetect1. So, robot r at time tdetect takes the same snapshot as under 𝔄𝔠 and thus detects a fault by executing . However, under 𝔠, no robot has crashed before tdetect. This contradicts the hypothesis that is reliable.

Theorem 11.

M is a reliable model-dependent FD or FI procedure for a model M if and only if Ψ(𝔸,M) guarantees correctness and soundness for any primary algorithm 𝔸 under M.

Proof.

Below, we use instead of M to lighten the notation. We prove both the directions:

()

By hypothesis, detects/identifies a fault after a finite delay (so (σ) returns a 𝖥𝖠𝖴𝖫𝖳 value after a finite delay from the fault occurrence). Moreover, after such a delay, will always detect/identify the fault in the subsequent rounds. If returns at least a 𝖥𝖠𝖴𝖫𝖳 value, then Ψ(𝔸,) freezes the robot. Since this happens for any correct robot activated in the swarm, the whole swarm will freeze (thus Ψ(𝔸,) is sound).

By hypothesis, will always return a 𝖢𝖮𝖱𝖱𝖤𝖢𝖳 value if no fault occurs: this makes Ψ(𝔸,) behave exactly as 𝔸 (thus Ψ(𝔸,) is correct).

()

By hypothesis, if a crash occurs, then Ψ(𝔸,) makes all correct robots detect/identify it in finite delay, and freeze the swarm always maintaining coherence about the detected/identified faults in the following rounds. Let us assume that 𝔸 is an algorithm that makes any activated robot change its position at any activation. Thus, according to Algorithm 1, Ψ(𝔸,) must stop entering into the “if” body and instead it must continue executing the “else” statement from a certain time onward. Thus, must guarantee (ii) to detect/identify the crash after a finite delay and (iii) to continue detecting/identifying in the next rounds (coherence).

By hypothesis, if no crash occurs, then Ψ(𝔸,) continues solving the primary problem. So, it must always execute the “if” body. Thus, must always return 𝖢𝖮𝖱𝖱𝖤𝖢𝖳, guaranteeing to provide no false positives (i).

Theorem 15.

A reliable FD procedure for 𝒞𝒪𝙵 cannot exist assuming 1-survival crash schedulers.

Proof.

By contradiction let be a reliable model-dependent procedure for 𝒞𝒪𝙵 which tolerates up to n1 crashes. Let us consider a swarm executing Ψ(𝔸,) for a given primary algorithm 𝔸. Let 𝔠1 be a 1-survival crash scheduler such that 𝔠1(r)=, while 𝔠1(r)=t for any r{r}. Let us assume r has a fixed local coordinate system, and let us assume that 𝔸 makes r always stay still in its original position. By hypothesis, r is activated at time t, it executes (σ) and detects no fault in exists (by Section 3.3). From time t onward, r will always be the only robot to be activated. Since r cannot see its lights’ color and since neither its coordinate system nor position changes round by round, the taken snapshot will always be σ. Thus, r will never detect the fault. Contradiction achieved.

Theorem 16.

A punctual FD procedure for 𝒰𝚁𝚁 cannot exist, even considering (n1)-survival crash time schedulers.

Proof.

By contradiction, let be a punctual FD procedure for 𝒰𝚁𝚁. Let ={r0,,rn1} be a swarm of robots under 𝒰𝚁𝚁 executing Ψ(A,) for a given primary algorithm 𝔸. Consider the two combined schedulers 𝔄00 and 𝔄11 defined as

𝔄0(t)={r0+t},0(t)={r0} (i.e., r0 crashes at time 0)𝔄1(t)={r1+t},1(t)= (i.e., no robot crashes)

for any t (indices are considered in modulo n). So, by hypothesis, r1 is activated at time 1 under 𝔄00, and it correctly detects that r0 has crashed at time 0 by executing . Formally, r1 executes (σ) where σ is the snapshot taken by r1 at time 1 using Ξ as coordinate system, and returns a 𝖥𝖠𝖴𝖫𝖳.

Now, let us consider the same swarm under 𝔄11 and suppose that r1 has Ξ as its coordinate system. So, r1 is activated at time 0, takes the same snapshot σ and executes (σ) which still returns 𝖥𝖠𝖴𝖫𝖳. However, no crash has occurred. Contradiction achieved.

Theorem 17.

A reliable FD procedure for 𝒞𝒪𝚁𝚁 cannot exist even assuming 2-survival crash schedulers.

Proof.

By contradiction, let be a reliable FD procedure for 𝒞𝒪𝚁𝚁. W.l.o.g. assume that 𝔏={f0,,fk1} for a constant k>0, with f0 being the default color. Let us consider the SuperPoint problem where n>k robots lie on the same point and have to stay still, and let us consider the algorithm 𝔸NOP which trivially solves the problem making robots maintain their initial position and primary color /b. Assume all the robots always have the same egocentric coordinate system considering the super-point as the origin 𝐎=(0,0). Assume that (r0,,rn1) is the round-robin activation sequence of the robots, and let 0tk be the first time where a robot sets its 𝚜𝚎𝚛 light to a color f¯𝔏 which is already set in at least another robot, say rh. Note that this time always exists since |𝔏|=k<n. Formally, rt is activated at time t and computes Ψ(𝔸NOP,)(σ)=(𝐎,/b,f¯), where σ contains the multiplicity {(𝐎,/b,f¯)}1. Moreover, assume that at time t, all the robots except for rh and rt crash, while the two remaining robots remain alive forever. So, from any t>t, the two survival robots will be activated in alternate rounds, will take the same snapshot σ, and will always execute Ψ(𝔸NOP,)(σ)=(𝐎,/b,f¯), thus never detecting the fault. Contradiction achieved.

Theorem 18.

A reliable FD procedure for a model M under SSYNCH cannot exist.

Proof.

By contradiction, let be a reliable FD procedure under 𝒰𝚂 (i.e., the strongest SSYNCH model). Let ={r0,,rn1} be a swarm under 𝒰𝚂. Suppose all the robots use the same coordinate system Ξ. Suppose the swarm works under a SSYNCH activation scheduler 𝔄 and a crash activation scheduler 𝔠 so that r0 is the only robot to crash, and it crashes at time t1. Formally, 𝔠(r0)=t1 and 𝔠(ri0)=. We assume w.l.o.g. that r1 is the first robot to correctly detect the fault at time t2>t1 by running (σ) where σ is the snapshot of r1 at time t2. Moreover, let r0𝔄(t1) and r0𝔄(t) for any t1<tt2. Now consider the schedulers 𝔄 and 𝔠 so that

𝔄=𝔄 except for: 𝔄(t1)=𝔄(t1){r0}𝔠(ri)=i

Under 𝔄𝔠, the robots that execute their LCM are the same as under 𝔄𝔠 until t2, so the evolution of the swarm is the same. Specifically, r1 is activated at time t2, takes the same snapshot σ, and executes (σ) which still detects the fault. However, no crash has occurred. Contradiction achieved.