
Are Diagnostic Concepts Within the Reach of LLMs?

Anna Sztyber-Betley, Warsaw University of Technology, Poland
Elodie Chanthery, LAAS-CNRS, INSA, University of Toulouse, France
Louise Travé-Massuyès, LAAS-CNRS, University of Toulouse, France
Silke Merkelbach, Fraunhofer IEM, Paderborn, Germany
Karol Kukla, Warsaw, Poland
Maxence Glotin, LAAS-CNRS, INSA, University of Toulouse, France
Alexander Diedrich, Helmut-Schmidt-University, Hamburg, Germany
Oliver Niggemann, Helmut-Schmidt-University, Hamburg, Germany
Abstract

Model-based diagnosis is a cornerstone of system health monitoring, allowing for the identification of faulty components based on observed behavior and a formal system model. However, obtaining a useful and reliable model is often an expensive and manual task. While previous work aimed at generating a formal model, in this paper we propose a methodology that uses large language models (LLMs) to generate Minimal Structurally Overdetermined (MSO) sets. MSO sets are specific subsets of the model equations from which diagnosis tests can be obtained. We investigate two directions: (i) the ability of LLMs to generate MSO sets for hybrid systems, similar to those generated by the well-known Fault Diagnosis Toolbox (FDT), and (ii) the automated generation of MSO sets for Boolean circuits, as well as the computation of the diagnoses. We thus show how both dynamic and static systems can be analysed by LLMs and how their output can be used for effective fault diagnosis. We evaluate our approach on a set of arithmetic and logic circuits, using OpenAI’s LLMs 4o-mini, o1, and o3-mini.

Keywords and phrases:
Fault Diagnosis, Large Language Models, LLMs, Model Based Diagnosis, MSO, Redundancy Relations, Conflicts, Diagnoses
Funding:
Elodie Chanthery: The work is supported by ANITI through the French “Investing for the Future – P3IA” program under the Grant agreement no. ANR-19-P3IA-0004.
Louise Travé-Massuyès: The work is supported by ANITI through the French “Investing for the Future – P3IA” program under the Grant agreement no. ANR-19-P3IA-0004.
Copyright and License:
© Anna Sztyber-Betley, Elodie Chanthery, Louise Travé-Massuyès, Silke Merkelbach, Karol Kukla, Maxence Glotin, Alexander Diedrich, and Oliver Niggemann; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Computing methodologies Knowledge representation and reasoning
Supplementary Material:

Software  (Source code): https://github.com/asztyber/LLM_diagnostic_concepts [18]
  archived at Software Heritage: swh:1:dir:d265fcfc25d623bff2fa34d48590c7eb66d08995
Acknowledgements:
This work has benefited from participation in Dagstuhl Seminar 24031 “Fusing Causality, Reasoning, and Learning for Fault Management and Diagnosis”.
Editors:
Marcos Quinones-Grueiro, Gautam Biswas, and Ingo Pill

1 Introduction

Model-based diagnosis (MBD) is a cornerstone of system health monitoring, allowing for the identification of faulty components based on observed behavior and a formal system model. One common approach within MBD is structural analysis [8]. Structural analysis is used to obtain Analytical Redundancy Relations (ARRs) for fault diagnosis. Performing structural analysis means abstracting the system model by keeping only the links between equations and variables. The main advantages are that it can be applied to large-scale systems, linear or non-linear, and even under uncertainty. Structural analysis uses the computation of Minimal Structurally Overdetermined sets (MSOs), i.e. sets that contain one equation more than the number of variables. Due to this overdetermined property, they play a key role in detecting and isolating faults [4]. Traditionally, the computation of MSOs is performed using dedicated tools such as the Fault Diagnosis Toolbox (FDT) [5], which apply algorithmic methods grounded in analytical redundancy and structural analysis. On the other hand, the main disadvantage of using tools such as the FDT is that they require a correct and reliable model of the underlying system.

With the recent rise of large language models (LLMs), which demonstrate impressive capabilities in reasoning, synthesis, and symbolic manipulation, a question naturally arises: can these models be repurposed to perform structured engineering tasks, such as generating MSO sets directly from system diagrams, thus reducing the need to rely on manually specified models? If so, this could open new avenues for intuitive, rapid prototyping and integration of natural language interfaces into diagnostic systems.

In this work, we propose an exploratory evaluation of LLMs’ ability to generate MSO sets from engineering documentation, comparing their outputs to those produced by the Fault Diagnosis Toolbox [5]. We aim to assess both the correctness and completeness of the proposed MSO sets, as well as to identify the strengths and limitations of using LLMs in such a formal and structured context. We perform a similar analysis for the generation of conflicts and diagnoses. Our goal is not only to measure performance, but also to understand the potential and boundaries of LLMs in a field traditionally dominated by rule-based and algorithmic approaches.

We carry out the tests on the OpenAI models 4o-mini, o1 (https://openai.com/o1/), and o3-mini (https://openai.com/index/openai-o3-mini/). The o-series of models shows leading performance in math, coding, and science questions, and 4o-mini is a time- and cost-effective alternative. (Shortly before the submission of the paper, the new models o3 and o4-mini were released, so further improvement can be expected.)

Figure 1: The overall methodology. Model generation was covered in previous work [12]. In both approaches MSOs are generated. The broad approach is applicable to many types of systems, in particular hybrid systems. The dedicated approach is for logic circuits.

Recent advancements in LLMs have spurred their application in fault diagnosis across a range of complex systems. These models are being explored not only for their fault detection capabilities but also for their ability to provide interpretability, adaptability, and generalization in data-driven diagnostics. Several works have proposed hybrid frameworks that integrate LLMs with traditional diagnosis tools to improve explainability and operator support. For instance, the authors of [3] combine a physics-based diagnostic tool with an LLM to enhance fault interpretability in nuclear power plants, demonstrating improved transparency and operator interaction through natural language explanations. In software engineering, AutoFL [6] employs LLMs for fault localization, enhancing developer trust by providing rationales for fault hypotheses and navigating large code-bases using prompt engineering. Similarly, LLMAO [21] introduces a fine-tuned LLM architecture capable of fault localization at a line level without relying on test coverage data, surpassing traditional ML-based methods. From an industrial systems perspective, various methods have been proposed to adapt LLMs to multi-modal and time-series data. FD-LLM [11] aligns LLMs with engineering data using modal alignment and prompt learning to handle overlapping features in complex equipment diagnostics. FaultExplainer [7] uses LLMs to generate plausible and actionable explanations, but also highlights its limitations, including reliance on PCA-selected features and some hallucinations. In cyber-physical systems, FLEX [13] and FaultLines [14] demonstrate how open-source LLMs can perform anomaly and fault detection using retrieval-augmented generation and prompt engineering. These works reveal that model performance is closely tied to prompt design and textual encoding strategies.

LLMs also show promise in handling challenges such as cross-domain generalization and data scarcity. The authors of [19] present an LLM-based framework for bearing fault diagnosis that combines textual representation of sensor data with fine-tuning strategies to achieve superior performance across different datasets and operational conditions. The AAD-LLM framework [17] further extends this adaptability by enabling multi-modal, zero-shot anomaly detection in manufacturing systems, using pre-trained LLMs without requiring retraining or fine-tuning. Its authors also survey the state of the art on the use of LLMs for time series forecasting. Overall, the field is evolving toward more robust, interpretable, and adaptive diagnostic solutions leveraging LLMs. However, none of these works links LLM results to the concepts developed in the field of model-based diagnosis. This is the major contribution of our article.

This article is organized as follows: Section 2 formulates the problem and Section 3 recalls some important concepts for diagnosis. Section 4 briefly details previous work. Section 5 presents our two main contributions: the Broad Approach for creating MSOs from piping and instrumentation diagrams, and a Dedicated Approach for arithmetic and logic circuits. Finally, Section 6 presents our empirical evaluation. At the end of the article we discuss our results and present some future research directions.

2 Problem Formulation

Figure 1 illustrates our methodology. Input to the methodology are P&ID diagrams and circuit diagrams, i.e. a common form of engineering documentation for use-cases in the process industry and hardware design. Block 1 “Model Generation” has been described in [12]. It generates a set of equations that can be used by the Fault Diagnosis Toolbox to obtain MSO sets. Section 4 briefly recalls the model generation process.

For our solution in this article, we used two independent prompting approaches:

  1. A Broad Approach (green box) that handles continuous systems and generates MSO sets from a set of differential equations.

  2. A Dedicated Approach to arithmetic and logic circuits (orange box) that generates MSO sets and conflicts, and finally obtains diagnoses for the considered system.

Blocks 2 and 3 are used for the Broad Approach, described in Section 5.2. Blocks 4, 5, and 6 are used for the Dedicated Approach to arithmetic and logic circuits, described in Section 5.1. Each block uses an LLM for its reasoning.

In the case of arithmetic and logic circuits, checking the consistency of observations with the model is straightforward and leads to unambiguous answers. In the case of continuous, dynamical systems, residual evaluation is more challenging because of noise and dynamical fault profiles. Therefore, for circuits, we perform all the steps that result in diagnoses (Blocks 4 to 6), and in the Broad Approach we only generate MSO sets (Blocks 2 and 3).

3 Background

A system model $\Sigma(z,x,f)$ typically comprises both unknown variables, denoted by $x$, and known variables $z$, which are measured using sensors.

The variables $x$ are partitioned into two categories: differential variables $x_1$ and algebraic variables $x_2$. Faults affecting the system are explicitly represented by a dedicated vector of parameters $f$. We denote the sets of known variables, unknown variables, and faults by $\mathcal{Z}$, $\mathcal{X}$, and $\mathcal{F}$, respectively.

A commonly used representation, known as a semi-explicit Differential Algebraic system of Equations (semi-explicit DAE), for general systems in the continuous-time domain is as follows:

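$$
\Sigma(z,x,f):
\begin{cases}
\dfrac{dx_1(t)}{dt} = h(x_1(t), x_2(t), z_c(t), f), \quad x_1(t_0) = x_0, \\
0 = l(x_1(t), x_2(t), z_c(t), f), \\
z_s(t) = g(x_1(t), x_2(t), f).
\end{cases}
\tag{1}
$$

Here, $x_1(t) \in \mathbb{R}^{n_{x_1}}$ and $x_2(t) \in \mathbb{R}^{n_{x_2}}$ denote the vectors of unknown variables. The vectors $z_s(t) \in \mathbb{R}^{n_{z_s}}$ and $z_c(t) \in \mathbb{R}^{n_{z_c}}$ represent the measured outputs and controlled inputs, respectively, both considered known. In systems without external actuation, $z_c(t)$ may be zero.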

The functions h, l, and g can be linear or nonlinear and typically depend on a set of system parameters, denoted by 𝒵p (e.g., tank diameter, nominal flow rate, etc.).

Interestingly, dynamic systems (represented by differential equations) and static systems (represented by algebraic equations) are sub-classes of DAEs.

For any vector ν, we define ν¯ as the augmented vector comprising ν and its time derivatives up to a certain (unspecified) order.

Definition 1 (Analytical Redundancy Relations (ARR)).

ARRs are relations $\Sigma'(\bar{z}) = \Sigma''(\bar{f})$ obtained from $\Sigma(z,x,f)$ by formally eliminating the unknown variables $x$. While $\Sigma''(\bar{f})$ is the internal form, which depends on the faults and is not known, $\Sigma'(\bar{z})$ is the computation form, which can be computed from the known variables and their derivatives.

$\Sigma'(\bar{z})$ defines a set of ARRs. A single ARR takes the form $arr_i(\bar{z}') = r_i$, where $r_i$ is a scalar signal named residual and $\bar{z}'$ is a subvector of $\bar{z}$. It can be used as a residual generator.

Definition 2 (Residual generator for $\Sigma(z,x,f)$).

A relation of the form $arr_i(\bar{z}') = r_i$, with input $\bar{z}'$, a subvector of $\bar{z}$, and output $r_i$, a scalar signal named residual, is a residual generator for the model $\Sigma(z,x,f)$ if, for all $z$ consistent with $\Sigma(z,x,f)$, it holds that $\lim_{t \to \infty} r_i(t) = 0$.

In simple terms, a residual generator produces a signal called a residual, which ideally remains zero when the system operates correctly. Any deviation from zero may indicate the presence of a fault.

From the system model, it is possible to derive Analytical Redundancy Relations (ARRs) by exploiting the inherent analytical redundancy. This is typically achieved through variable elimination techniques, resulting in relations that involve only the known variables. ARRs serve as the foundation for diagnosis tests, enabling the verification of whether the measurements $z$ are consistent with the system model $\Sigma(z,x,f)$. If a fault is present in the system, it is expected to affect some of the measurements, and consequently, the residuals. If no such influence is observable, the fault is said to be non-detectable, and as such, it cannot be identified [9].

Building on the foundational work by Cassar and Staroswiecki [1], and Travé-Massuyès et al. [20], structural analysis has emerged as a powerful tool for deriving ARRs. This method abstracts the system model by focusing solely on the structural relationships between equations and variables, disregarding the specific functional forms. One of the key advantages of structural analysis is its applicability to large-scale systems, whether linear or nonlinear, and even in the presence of model uncertainty. In the structural approach, the structure of a system $\Sigma(z,x,f)$, or structural system, can be represented by a biadjacency matrix crossing variables and equations, named the Incidence Matrix.

Definition 3 (Incidence matrix).

The Incidence Matrix of a system $\Sigma(z,x,f)$ is an $n_e \times (n_x + n_z)$ binary matrix, where $n_x = n_{x_1} + n_{x_2}$ is the number of unknown variables, $n_z = n_{z_s} + n_{z_c}$ is the number of known variables, and $n_e$ is the number of equations. Each row stands for an equation and each column for a variable. A 1 in position $(i,j)$ means that the equation of row $i$ contains the variable of column $j$.

When applied to fault diagnosis, structural analysis helps identify subsets of equations that exhibit redundancy. The structural redundancy ρ of a set of equations is defined as the difference between the number of equations and the number of unknown variables within that subset.

Of particular diagnostic relevance are the Proper Structurally Overdetermined (PSO) sets, which are sets of equations that together contain just enough redundancy to allow fault detection through consistency checks.

A just determined part refers to a set of equations where the number of equations matches the number of unknown variables, allowing for a unique solution. In contrast, an underdetermined part has fewer equations than unknowns, making it impossible to uniquely determine the values of all variables.

Definition 4 (PSO set).

A subset of equations $\psi \subseteq \Sigma(z,x,f)$ is a PSO set if its number of equations is greater than the number of unknown variables, and this overdetermination applies to the entire set (i.e., the set is structurally redundant and contains no just-determined or underdetermined parts).

Another concept of interest is that of the minimal subsets of equations $\psi \subseteq \Sigma(z,x,f)$ that possess exactly one degree of structural redundancy, i.e., $\rho_\psi = 1$. Such subsets are referred to as Minimal Structurally Overdetermined (MSO) sets.

Definition 5 (MSO set).

A subset of equations $\psi \subseteq \Sigma(z,x,f)$ is an MSO set of $\Sigma(z,x,f)$ if (1) $\rho_\psi = 1$, and (2) no proper subset of $\psi$ is overdetermined. The set of MSO sets of $\Sigma$ is denoted $\Psi$.
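
Definition 5 can be checked mechanically on small models. The following Python sketch enumerates MSO sets by brute force from an incidence structure that maps each equation to its unknown variables. It is not the algorithm used by the FDT; the function names `redundancy` and `mso_sets` and the polybox labeling (A1 sums the outputs of M1 and M2, A2 those of M2 and M3) are assumptions made for illustration. The search is exponential in the number of equations, so it is only meant for examples of the size considered here.

```python
from itertools import combinations


def redundancy(eqs, incidence):
    """Structural redundancy: number of equations minus number of distinct unknowns."""
    unknowns = set().union(*(incidence[e] for e in eqs))
    return len(eqs) - len(unknowns)


def mso_sets(incidence):
    """Enumerate MSO sets by increasing size (brute force, small models only)."""
    names = sorted(incidence)
    found = []
    for size in range(1, len(names) + 1):
        for eqs in combinations(names, size):
            if redundancy(eqs, incidence) != 1:
                continue
            # A set with redundancy 1 is minimal unless it contains a smaller
            # MSO set; smaller sets were all examined in earlier iterations.
            if any(set(m) <= set(eqs) for m in found):
                continue
            found.append(eqs)
    return found


# Polybox structure (assumed labeling): equation name -> unknown variables.
polybox = {
    "M1": {"x01"},
    "M2": {"x02"},
    "M3": {"x03"},
    "A1": {"x01", "x02"},
    "A2": {"x02", "x03"},
}
print(mso_sets(polybox))
```

On this structure the sketch finds three MSO sets: {A1, M1, M2}, {A2, M2, M3}, and {A1, A2, M1, M3}.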

Certain MSO sets have been shown to support the formulation of diagnostic tests [9], making them essential components for residual generation and fault detection. These particular subsets are called Fault-Driven Minimal Structurally Overdetermined (FMSO) sets [15].

Let $\mathcal{F}_\varphi$ denote the set of faults involved in a given subset of equations $\varphi \subseteq \Sigma(z,x,f)$. FMSO sets can then be defined as follows.

Definition 6 (FMSO set).

A subset of equations $\varphi \subseteq \Sigma(z,x,f)$ is an FMSO set of $\Sigma(z,x,f)$ if (1) $\varphi$ is an MSO set and (2) $\mathcal{F}_\varphi \neq \emptyset$. The set of FMSO sets of $\Sigma$ is denoted $\Phi$.

Given an FMSO set $\varphi$, $\mathcal{F}_\varphi$ is defined as the fault support of $\varphi$, and the physical components modeled by the equations in $\varphi$ define its component support $COMP_\varphi$.

In most cases, FMSO sets can be transformed into Analytical Redundancy Relations (ARRs) through sequential elimination. Due to their structural properties, all unknown variables appearing in an FMSO set $\varphi$ can be algebraically resolved using $|\varphi| - 1$ equations. These expressions can then be substituted into the remaining $|\varphi|$-th equation, yielding an ARR formulated off-line. This ARR is subsequently employed on-line as a diagnostic test.
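
As a worked illustration of sequential elimination, consider the polybox equations used later in Table 1 and the set made of the two multiplier equations $x_{01} = a \cdot c$ and $x_{02} = b \cdot d$ together with the adder equation $x_{01} + x_{02} = f$. The residual name $r$ is chosen here for illustration only.

```latex
\begin{align*}
x_{01} &= a \cdot c, \qquad x_{02} = b \cdot d
  && \text{(the two unknowns are resolved with $|\varphi| - 1 = 2$ equations)} \\
x_{01} + x_{02} &= f
  && \text{(remaining equation)} \\
r &= f - (a \cdot c + b \cdot d)
  && \text{(ARR in computation form, involving known variables only)}
\end{align*}
```

When the measurements are consistent with the three equations, $r = 0$; a non-zero value indicates that at least one of them does not hold.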

Building upon the findings in [2], which establish a connection between model-based approaches from the control community (FDI) and those from the artificial intelligence community (DX), the concept of an R-conflict, originally introduced by Reiter [16], can be related to FMSO sets under the assumption of Model Representation Equivalence.

Definition 7 (R-conflict).

The component support COMPφ of an FMSO set φ is an R-conflict if the diagnosis test derived from φ fails when evaluated with the measurements obtained from the physical system. A minimal R-conflict is an R-conflict that does not strictly include (set inclusion) any R-conflict.

An R-conflict indicates that at least one component within the R-conflict must be faulty to explain the observed behavior; equivalently, it is not possible for all components in the R-conflict to be functioning normally. Interestingly, the set of minimal diagnoses can be generated from the set of minimal R-conflicts.

Definition 8 (Minimal diagnosis).

A set of physical components $\Delta$, modeled by some of the equations of $\Sigma(z,x,f)$, is a minimal diagnosis if and only if it is a minimal hitting set of the collection of minimal R-conflicts. (Footnote 4: A hitting set for a collection $\mathcal{C}$ of sets is a set $H \subseteq \bigcup \{S \mid S \in \mathcal{C}\}$ such that $H \cap S \neq \{\}$ for each $S \in \mathcal{C}$. A hitting set is minimal if and only if no proper subset of it is a hitting set for $\mathcal{C}$.)
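
Definition 8 translates directly into a small search procedure. The sketch below computes minimal hitting sets by brute force over candidate sets of increasing size; it is one simple way to do it, not the pseudo-code given to the LLM in our prompts. The function name `minimal_diagnoses` and the two example conflicts (those of the classic faulty polybox scenario) are assumptions for illustration.

```python
from itertools import combinations


def minimal_diagnoses(conflicts):
    """Minimal hitting sets of a collection of conflicts (brute force by size).

    With no conflicts the result is an empty list, i.e. nothing to diagnose.
    """
    conflicts = [set(c) for c in conflicts]
    components = sorted(set().union(*conflicts))
    found = []
    for size in range(1, len(components) + 1):
        for cand in combinations(components, size):
            cand = frozenset(cand)
            # Skip supersets of hitting sets found earlier: they are not minimal.
            if any(d <= cand for d in found):
                continue
            # A hitting set intersects every conflict.
            if all(cand & c for c in conflicts):
                found.append(cand)
    return found


example_conflicts = [{"M1", "M2", "A1"}, {"M1", "M3", "A1", "A2"}]
for diagnosis in minimal_diagnoses(example_conflicts):
    print(sorted(diagnosis))
```

For these two conflicts the minimal diagnoses are {A1}, {M1}, {A2, M2}, and {M2, M3}.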

Figure 2 illustrates the different concepts that have been introduced on the well-known polybox example.

Figure 2: Illustration of the concepts using the polybox example. In the broad approach we output only the MSOs, while in the dedicated approach we output diagnoses.

The notion of FMSO set also plays a pivotal role in the formal definitions of structural detectable and structural isolable faults. In the following, we revisit these fundamental definitions as presented in [9].

Definition 9 (Detectable fault).

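A fault $f$ is structurally detectable in the system $\Sigma(z,x,f)$ if there exists an FMSO set $\varphi \in \Phi$ such that $f \in \mathcal{F}_\varphi$.

Definition 10 (Isolable faults).

Given two structurally detectable faults $f$ and $f'$ of $\mathcal{F}$, $f \neq f'$, $f$ is structurally isolable from $f'$ if there exists an FMSO set $\varphi \in \Phi$ such that $f \in \mathcal{F}_\varphi$ and $f' \notin \mathcal{F}_\varphi$.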
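
Definitions 9 and 10 reduce to membership tests on the fault supports of the FMSO sets. The following sketch assumes, for illustration, one fault per polybox component (so that each fault support coincides with the equation names of the FMSO set); the function names are hypothetical.

```python
def detectable(fault, supports):
    """Definition 9: some FMSO set has the fault in its fault support."""
    return any(fault in s for s in supports)


def isolable(fault, other, supports):
    """Definition 10: some FMSO set involves `fault` but not `other`."""
    return any(fault in s and other not in s for s in supports)


# Fault supports of the three polybox FMSO sets (assumed: one fault per component).
supports = [
    {"M1", "M2", "A1"},
    {"M2", "M3", "A2"},
    {"M1", "M3", "A1", "A2"},
]
print(detectable("M1", supports))
print(isolable("M1", "M2", supports))
print(isolable("M1", "A1", supports))
```

Under this assumption, a fault in M1 is isolable from a fault in M2, but not from a fault in A1, because every FMSO set that involves M1 also involves A1.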

The Fault Diagnosis Toolbox (FDT) [5] is a software framework designed for the analysis and synthesis of fault diagnosis in systems that are typically modeled with differential-algebraic equations. By leveraging a structural representation of the system, the toolbox enables the automatic extraction of FMSO sets. From these sets, it allows one to generate Analytical Redundancy Relations (ARRs), which can subsequently be used as diagnosis tests.

4 Model generation

Previous work [12] proposes an approach to create mathematical models for process industry systems using multi-modal large language models (MLLM). It presents a five-step prompting approach that uses a piping and instrumentation diagram (P&ID) and natural language prompts as its input. For clarity, we will briefly sketch the previous approach in this section.

The five steps of the prompting approach are the following:

  1. Read Diagram – Let the MLLM read the diagram image data and represent it in some partially specified intermediate format. Partial specification is performed in the prompt.

  2. Identify Sensors – The LLM creates a table containing existing sensors, their type, and their placement from the input diagram and from the output from step 1 in the form of CompsConnections. Context about P&IDs, possibly occurring sensors, and the placement of the sensors in the diagram are provided in the system message.

  3. Create the mathematical Equations – The mathematical equations are created by the LLM using provided variable names and assumptions.

  4. Sensor Matching and Variable Assignment – In this step, all the results from the previous steps are merged, the faults are added to the equations and the resulting mathematical model is created.

  5. Format Model for Fault Diagnosis Toolbox – In the final step, the mathematical model is transformed into a form suitable for the FDT.

All steps provide the MLLM with the P&ID as input. In addition, the steps take a system message and a user prompt as input. The system message contains the generic task for the respective step. However, taking the current abilities of MLLMs into account, it is still domain dependent. The user message contains the subset of the available external information that is specific to the system. The prompts can be found on GitHub (https://github.com/silkeme/DX24_Model_Creation_LLMs); the repository contains the prompts, all mentioned information about the systems used for the evaluation, the resulting mathematical models, and all intermediate outputs. The approach aims to work for water tank systems with standard components, such as tanks, pumps, valves, flow indicators, and level indicators.

5 Generating MSO Sets, Conflicts, and Diagnoses

As described before, for our solution in this article, we use two independent prompting approaches:

  1. Dedicated approach – designed to handle arithmetic and logic circuits,

  2. Broad approach – able to handle continuous systems described by differential equations.

While the Dedicated Approach outputs proper diagnoses, the Broad Approach outputs the more general FMSO sets from which diagnoses can be computed using actual observations and tools such as the Fault Diagnosis Toolbox.

5.1 Dedicated approach to arithmetic and logic circuits

To generate the diagnoses, we test the following approach consisting of three steps:

  1. Generation of MSO sets – We provide system equations and the sets of X, Z, and F variables as inputs. The model is asked to output all Minimal Structurally Overdetermined (MSO) Sets. This is illustrated by block 4 in Figure 1. Table 1 presents the prompt. The prompt contains a theoretical description and the Polybox system.

  2. Generation of Conflicts – We provide the system description, the MSO sets, and observations as inputs. The LLM is asked to evaluate all MSOs for consistency and to generate all conflicts. The prompt contains the theoretical description, detailed steps, and hints for the Python implementation (we test variants with code interpreter allowed) and three simple examples. This is illustrated by block 5 in Figure 1.

  3. Generation of Diagnoses – We provide the set of conflicts as inputs. The model is asked to generate all minimal diagnoses. The prompt contains the theoretical description, pseudo-code of the algorithm, and the Polybox example. This is illustrated by block 6 in Figure 1.

Figure 3 illustrates the inputs and outputs of each step in the Dedicated Approach (we follow the example from Figure 2). We start with a system description given by a set of equations (system model $\Sigma(z,x,f)$, Equation 1) and generate MSO sets (Definition 5) using the prompt given in Table 1. Next, the residual generators (Definition 2) resulting from each MSO are evaluated given the values of the known variables $z$. A residual value different from 0 indicates the existence of an R-conflict (Definition 7). The output of the second step is the set of conflicts. In the last step, the conflicts are used to generate minimal diagnoses (Footnote 4).
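
The conflict-generation step can be sketched for the polybox as follows. This is a hand-written illustration, not the code produced by the LLM: the residuals are derived manually from the polybox equations of Table 1, the adder labels follow the assumed convention that A1 sums the outputs of M1 and M2, each equation is assumed to model exactly one component, and the observation values are the classic textbook ones rather than values taken from our test set.

```python
# Hand-derived residual generators, one per polybox MSO set.
# Each takes a dict of observations and returns a scalar residual.
RESIDUALS = {
    ("M1", "M2", "A1"): lambda o: o["f"] - (o["a"] * o["c"] + o["b"] * o["d"]),
    ("M2", "M3", "A2"): lambda o: o["g"] - (o["b"] * o["d"] + o["c"] * o["e"]),
    # x02 can be computed in two ways: f - a*c and g - c*e.
    ("M1", "M3", "A1", "A2"): lambda o: (o["f"] - o["a"] * o["c"])
    - (o["g"] - o["c"] * o["e"]),
}


def r_conflicts(observations):
    """Return the component support of every MSO whose residual is non-zero."""
    # Exact comparison with 0 is adequate for integer-valued circuits only.
    return [set(mso) for mso, residual in RESIDUALS.items()
            if residual(observations) != 0]


faulty = {"a": 3, "b": 2, "c": 2, "d": 3, "e": 3, "f": 10, "g": 12}
nominal = dict(faulty, f=12)
print(r_conflicts(faulty))
print(r_conflicts(nominal))
```

With the faulty observations, the first and third residuals are non-zero, giving the conflicts {M1, M2, A1} and {M1, M3, A1, A2}; with the nominal observations no conflict is raised.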

The prompts for the computation of conflicts and diagnoses can be found on GitHub (https://github.com/asztyber/LLM_diagnostic_concepts/blob/main/prompts/dedicated/prompts_Karol.py); the file contains three prompts for the three mentioned steps. Code to run the experiments can be found at https://github.com/jadlownik/FaultDiagnosis/tree/develop.

Table 1: Prompt for computation of Minimal Structurally Overdetermined (MSO) Sets (Dedicated Approach).
Perform an analysis of the equation system equations and identify all Minimal Structurally Overdetermined (MSO) sets.
Definitions:
1. Unknown variables: These are only the variables that start with “x”.
2. Structural redundancy: This is the difference between the number of equations and the number of unique unknown variables in those equations:
R = (number of equations) - (number of unique unknowns).
3. PSO (Properly Structurally Observable): A subset is PSO if its structural redundancy is greater than 0.
Conditions:
1. The selected set of equations must have exactly one structural redundancy.
2. None of its proper subsets can be PSO (all must have redundancy 0).
Output: JSON format: { "mso": [...] }, where each inner list represents a valid MSO set.
RESPONSE MUST BE IN JSON FORMAT. DO NOT GENERATE CODE AND RETURN IT IN RESPONSE.
<example>
<input>
equations = {
'M1': 'a * c = x01',
'M2': 'b * d = x02',
'M3': 'c * e = x03',
'A2': 'x01 + x02 = f',
'A1': 'x02 + x03 = g',
}
</input>
<output>
{"mso" = [['M1', 'M2', 'A1'], ['M2', 'M3', 'A2'], ['M1', 'M3', 'A1', 'A2']] }
</output>
</example>
Figure 3: Inputs and outputs of each step in Dedicated Approach.

5.2 Broad approach

To generate the MSO sets, the approach consists of two steps:

  1. Generation of the Incidence Matrix – As input, we provide system equations. The model is asked to return the incidence matrix as a Python dictionary. This is illustrated by block 2 in Figure 1. The prompt for the variant without code interpreter is shown in Table 2.

  2. Generation of MSO sets – We provide the incidence matrix including only unknown variables as input. The model is asked to find all the MSO sets for the provided model in two sub-steps. First, find all the PSO sets (Definition 4), then, within these PSO sets, find the minimal ones. Two versions were implemented, one with code interpreter and one without code interpreter. This is illustrated by block 3 in Figure 1. The prompt for the variant without code interpreter is shown in Table 3.

Figure 4: Inputs and outputs of each step in Broad Approach.

Figure 4 illustrates the inputs and outputs of the two steps for finding the MSO sets. In the first step, the incidence matrix is computed (Definition 3). Only the part containing unknown variables x is computed, which is the important part for MSO generation. In the second step, MSOs (Definition 5) are computed from the incidence matrix via PSO sets.
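
For reference, the first step has a simple deterministic counterpart when the equations are available as plain strings. The sketch below returns the dictionary format requested in Table 2. It assumes string equations and an explicit list of unknown variable names, whereas the actual FDT models store symbolic expressions; the function name `incidence` and the small tank-like model are hypothetical.

```python
import re

# Identifiers: a letter or underscore followed by letters, digits, underscores.
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def incidence(equations, unknowns):
    """Map each equation name to the sorted unknown variables it contains."""
    unknowns = set(unknowns)
    return {
        name: sorted(set(IDENTIFIER.findall(expr)) & unknowns)
        for name, expr in equations.items()
    }


# Hypothetical model: known variables (q_in, y), parameters (k, A) and
# function names (sqrt) are filtered out because they are not in `unknowns`.
equations = {
    "e1": "q_out - k*sqrt(h)",
    "e2": "dh - (q_in - q_out)/A",
    "e3": "y - h",
}
matrix = incidence(equations, unknowns=["h", "dh", "q_out"])
for name, variables in matrix.items():
    print(name, variables)
```

Filtering against the list of unknowns is the part that matters for MSO generation: known variables, faults, and parameters must not appear in the result.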

The full set of prompts (including the variant using the code interpreter) can be found on GitHub (https://github.com/asztyber/LLM_diagnostic_concepts/blob/main/prompts/broad/); the directory contains prompts and code to run them.

Table 2: Prompt for computation of Incidence Matrices (Broad Approach).
Your job is to create an incidence matrix of the provided model. In the incidence matrix, each row represents an equation and each column represents an UNKNOWN variable. Note that unknown variables are stored under key ’x’ in the provided model.
Return the incidence matrix. Please return the matrix as a Python dictionary in the following format: ’eq0’: [’unknown var in eq0’, …], ’eq1’: [’unknown var in eq1’, …], …
Each line in the dictionary should correspond to one equation (do not write everything on the same line). Only return the dictionary.
Be careful with the equations of type ’fdt.DiffConstraint’. Change those equations by a simple equality relation. If you have fdt.DiffConstraint(dt,t), change it by t - dt. Make sure to complete your analysis before responding to me.
Table 3: Prompt for computation of Minimal Structurally Overdetermined (MSO) Sets from Incidence Matrices (Broad Approach).
Your job is to find all the MSO (Minimal Structurally Overdetermined) sets for the provided model, represented here by the provided incidence matrix. In the incidence matrix, each line represents an equation with the unknown variables that belong to this equation. Note that the matrix contains only the unknown variables. MSO sets consist of equations and contain one more equation than the number of unknown variables.
Proceed in two parts. First, find all the PSO (Proper Structurally Overdetermined) sets. Then, within these PSO sets, collect those that are minimal (PSO sets that do not contain any subsets that are PSO) to keep only the MSO sets. Store all the MSO sets in a python dictionary and return only the dictionary to me. Use one line for each MSO set.
Additionally, include the number of MSO sets found in the dictionary. You MUST only return the dictionary. Don’t make up an example, work with the provided incidence matrix. Make sure to complete your analysis before responding to me.

6 Test and Validation

All approaches were evaluated through the OpenAI API. We evaluated the following models: gpt-4o-mini-2024-07-18 (referred to in the following as 4o-mini), o3-mini-2025-01-31 (o3-mini), and o1-2024-12-17 (o1). 4o-mini was evaluated with access to the code interpreter, so the model could write Python code and run it to generate the answer. (We tried gpt-4o and gpt-4o-mini, but the accuracy was close to zero on unseen examples. Performance of gpt-4o and gpt-4o-mini with code interpreter was similar in initial experiments, so we decided to evaluate the full set of examples on gpt-4o-mini because of costs.) o1 and o3-mini are reasoning models (https://openai.com/index/learning-to-reason-with-llms/), trained with reinforcement learning to perform complex reasoning and performing all computations in an internal chain of thought. Note that we cannot tell for sure that these models did not use a code interpreter in the background as well.

Experiments were performed with temperature 0 and top_p = 0 for 4o-mini. The reasoning effort for o1 and o3-mini was ’high’ for the Dedicated Approach and ’medium’ for the Broad Approach. Tests were repeated 10 times for each test case (a test case includes the example description and the set of observations) for 4o-mini and o3-mini. Tests for o1 were performed only once because of costs (one iteration cost around $30).

Each LLM reasoning step was evaluated independently, i.e. for each blue box in Figure 1 we provided correct inputs, irrespective of the result of the previous step. That way, we can evaluate the accuracy of each algorithm step reliably.

We carried out tests on 23 arithmetic and logic circuits with different levels of complexity (Figure 5). The circuits contain adders, multipliers, and AND, OR, NOR, NAND, and XOR gates. Table 4 summarizes the characteristics of the examples. (An image for each example can be found at https://github.com/jadlownik/FaultDiagnosis/tree/develop/images and descriptions at https://github.com/asztyber/LLM_diagnostic_concepts/tree/main/examples.) The number of minimal conflicts and diagnoses can vary for a given example depending on the observations. (The results contain readable Excel files containing generated and correct MSOs, conflicts, and diagnoses.) For each example, we assumed all inputs and outputs are measured. We only considered component faults. Sensor faults and sensor equations were omitted.

Figure 5: Selected examples with different levels of complexity: (a) Example 10, (b) Example 11, (c) Example 12, (d) Example 13.
Table 4: Example Characteristics.
Example |X| |Z| |F| Relations MSOs Conflicts Diagnoses
Example 2 10 7 5 5 3 0–3 0–8
Example 3 13 8 7 7 2 0–2 0–12
Example 4 13 8 7 7 2 0–2 0–12
Example 5 13 8 7 7 2 0–2 0–12
Example 6 17 11 8 8 3 0–3 0–21
Example 7 13 9 6 6 3 0–2 0–5
Example 8 18 10 11 11 6 0–5 0–45
Example 9 29 17 15 15 6 0–6 0–165
Example 10 3 3 1 1 1 0–1 0–1
Example 11 7 5 3 3 1 0–1 0–3
Example 12 10 7 5 5 3 0–2 0–5
Example 13 13 8 7 7 2 0–2 0–12
Example 14 13 8 7 7 2 0–2 0–12
Example 15 13 8 7 7 2 0–2 0–12
Example 16 17 11 8 8 3 0–3 0–21
Example 17 13 9 6 6 3 0–3 0–9
Example 18 18 10 11 11 6 0–5 0–57
Example 19 29 17 15 15 6 0–5 0–153
Example 20 12 9 5 5 3 0–3 0–8
Example 21 12 9 5 5 3 0–3 0–8
Example 22 7 5 3 3 1 0–1 0–3
Example 23 3 3 1 1 1 0–1 0–1
Example 24 7 5 3 3 1 0–1 0–3

As shown in Figure 6, we evaluate the models’ performance using complementary metrics that capture different aspects of accuracy in incidence matrix generation. For each equation, precision measures how many of the model’s predicted variables are correct, while recall measures how many of the correct variables the model identified. The F1 score combines these into a single metric, penalizing both missing and spurious variables. The accuracy metric is the strictest, requiring perfect variable sets for each equation.
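The per-equation metrics can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the paper: the function names, the dictionary representation of an incidence matrix (equation name mapped to a set of variable names), and the handling of empty sets are assumptions, and the variable names in the usage example are hypothetical.

```python
def equation_scores(predicted, expected):
    """Precision, recall, F1 and exact-match flag for one equation's variable set."""
    predicted, expected = set(predicted), set(expected)
    tp = len(predicted & expected)  # correctly predicted variables
    # Assumption: an empty prediction or expectation scores 0 for that ratio.
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1, predicted == expected


def example_scores(predicted_matrix, expected_matrix):
    """Average the per-equation scores over all equations of one example."""
    rows = [
        equation_scores(predicted_matrix.get(eq, ()), variables)
        for eq, variables in expected_matrix.items()
    ]
    n = len(rows)
    # The exact-match flag is averaged as 0/1, giving the accuracy.
    return tuple(sum(float(row[i]) for row in rows) / n for i in range(4))


# Hypothetical two-equation example: the prediction for e2 wrongly adds
# a known variable z1 to the unknown variables.
expected = {"e1": {"x1", "x2"}, "e2": {"x2", "x3"}}
predicted = {"e1": {"x1", "x2"}, "e2": {"x2", "x3", "z1"}}
precision, recall, f1, accuracy = example_scores(predicted, expected)
```

In this toy case the spurious variable lowers precision and F1 only slightly, but it halves the accuracy, which shows why accuracy is the strictest of the four metrics.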

4o-mini often includes known variables and faults among the variables of the incidence matrix, which results in poor scores.

Figure 6: Performance comparison of three models (4o-mini, o1, and o3-mini) on incidence matrix generation. The bars show mean values with standard deviations computed across all examples and runs. For each model, we evaluate four metrics: F1 score (harmonic mean of precision and recall), precision (ratio of correctly identified variables to all variables predicted in an equation), recall (ratio of correctly identified variables to all variables that should be in an equation), and accuracy (percentage of equations where all variables match exactly). The precision, recall, and F1 scores are first computed for each equation separately and then averaged across all equations in an example. The accuracy represents a stricter metric, as it only counts equations where the model identified exactly the right set of variables, with no missing or extra variables.

The ground truth for incidence matrices, MSO sets, and residual evaluation was computed using the Fault Diagnosis Toolbox [5]. The correct minimal diagnoses were computed with an implementation of a minimal hitting set algorithm.
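To illustrate how minimal diagnoses follow from minimal conflicts, the sketch below enumerates minimal hitting sets by brute force. It is not the implementation used to compute the ground truth. The function name and the component names are hypothetical, and returning an empty list when there are no conflicts is a convention chosen here (the ranges in Table 4 start at zero diagnoses). The enumeration is exponential in the number of components, so it is only suitable for small examples.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts):
    """Return all minimal sets that intersect every conflict."""
    conflicts = [frozenset(c) for c in conflicts]
    if not conflicts:
        return []  # convention assumed here: no conflicts, no diagnoses
    components = sorted(set().union(*conflicts))
    found = []
    # Candidates are tried by increasing size, so every hitting set that
    # does not contain an earlier result is guaranteed to be minimal.
    for size in range(1, len(components) + 1):
        for candidate in combinations(components, size):
            candidate = frozenset(candidate)
            if any(smaller <= candidate for smaller in found):
                continue  # contains a smaller hitting set: not minimal
            if all(candidate & conflict for conflict in conflicts):
                found.append(candidate)
    return found


# Hypothetical conflicts over adders (A1, A2) and multipliers (M1, M2, M3).
diagnoses = minimal_hitting_sets([{"A1", "M1", "M2"}, {"A1", "A2", "M1", "M3"}])
```

For these two conflicts the result contains two single-fault diagnoses (A1 and M1, which appear in both conflicts) and two double-fault diagnoses ({A2, M2} and {M2, M3}).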

The evaluation metrics (Precision, Recall, and F1 score) were computed by comparing the generated MSO sets with the ground truth MSO sets. Precision measures the fraction of correctly identified MSO sets among all generated sets ($\frac{TP}{TP+FP}$), while Recall indicates the fraction of ground truth MSO sets that were successfully found ($\frac{TP}{TP+FN}$) [Footnote 13: In the case where the number of ground truth solutions is zero (P=0) and the generated solution is also empty (FP=0), we set Precision and Recall to 1, to avoid penalizing correct answers for empty solutions.]. The F1 score is the harmonic mean of Precision and Recall ($\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$). The metrics for conflicts and diagnoses were computed in an analogous way.
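The set-level comparison can be sketched as below, treating each MSO set (or conflict, or diagnosis) as an unordered set of element names. The function name is hypothetical. The case where both collections are empty follows the convention stated in the text. Scoring 0 when only one side is empty is an assumption made here.

```python
def set_metrics(generated, ground_truth):
    """Precision, recall and F1 between two collections of sets."""
    generated = {frozenset(s) for s in generated}
    ground_truth = {frozenset(s) for s in ground_truth}
    if not generated and not ground_truth:
        # Convention from the text: an empty answer to an empty
        # ground truth is a correct answer.
        return 1.0, 1.0, 1.0
    tp = len(generated & ground_truth)  # sets found exactly
    fp = len(generated - ground_truth)  # spurious sets
    fn = len(ground_truth - generated)  # missed sets
    # Assumption: a ratio with an empty denominator scores 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical MSO sets over equations e1..e4: one match, one spurious, one missed.
scores = set_metrics([{"e1", "e2"}, {"e2", "e3"}], [{"e2", "e1"}, {"e3", "e4"}])
```

A generated set counts as correct only if it matches a ground truth set exactly, so a set with a single wrong element contributes both a false positive and a false negative.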

Figures 8, 10, and 12 show F1 scores for the different models (4o-mini, o1, and o3-mini). Each data point represents the mean F1 score for examples with the same number of solutions (MSOs, conflicts, or diagnoses), with error bars indicating the standard deviation across these examples. The dashed lines show the linear trend of performance as the number of solutions increases. Figures 7, 9, and 11 show F1 scores sorted by the number of relations in the example.
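The aggregation behind such plots can be sketched as follows. This is an illustration under assumptions rather than the plotting code used for the figures: the function names are hypothetical, the population standard deviation is used, and the trend line is fitted to the group means by ordinary least squares. The input values are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_by_solution_count(results):
    """Map each number of solutions to the (mean, std) of its F1 scores."""
    groups = defaultdict(list)
    for count, f1 in results:
        groups[count].append(f1)
    return {
        count: (mean(values), pstdev(values))
        for count, values in sorted(groups.items())
    }


def linear_trend(points):
    """Least-squares slope and intercept through a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up (number of solutions, F1) pairs, for illustration only.
runs = [(0, 0.0), (0, 1.0), (1, 1.0), (2, 1.0), (2, 0.5)]
grouped = group_by_solution_count(runs)
slope, intercept = linear_trend([(c, m) for c, (m, _) in grouped.items()])
```

With these toy values, a low mean score in the zero-solution group is enough to produce a positive slope, so a rising trend line does not by itself indicate better performance on harder examples.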

Figure 7: Performance comparison of MSO generation across different versions (4o_mini, o1, o3_mini) relative to the number of relations in examples.
Figure 8: Performance comparison of MSO generation across different versions (4o_mini, o1, o3_mini) relative to the number of MSOs in examples.
Figure 9: Performance comparison of minimal conflicts generation across different versions (4o_mini, o1, o3_mini) relative to the number of relations in examples.
Figure 10: Performance comparison of minimal conflicts generation across different versions (4o_mini, o1, o3_mini) relative to the number of conflicts in examples.
Figure 11: Performance comparison of minimal diagnoses generation across different versions (4o_mini, o1, o3_mini) relative to the number of relations in examples.
Figure 12: Performance comparison of minimal diagnoses generation across different versions (4o_mini, o1, o3_mini) relative to the number of diagnoses in examples.

The bar plots (Figures 13, 14, 15) show the aggregated performance metrics (F1 score, Precision, and Recall) for each model (4o-mini, o1, and o3-mini) across all test examples. Each bar represents the mean value of the corresponding metric, with error bars indicating the standard deviation.

7 Discussion

An analysis of the obtained results leads to the following conclusions.

For incidence matrix generation (Figure 6), we can observe that o1 and o3-mini achieve perfect accuracy. 4o-mini tends to include additional variables (not in X but correctly assigned to the equations); otherwise, the results are correct. Incidence matrix generation is the simplest algorithmic task.

For MSO generation (Figure 7), 4o-mini has problems with examples containing only one relation. Otherwise, its performance is relatively stable as the number of relations increases. 4o-mini has access to a code interpreter, which makes handling different input sizes easier. The performance of o1 and o3-mini decreases with the number of relations in the example. We can observe similar tendencies depending on the number of MSOs to be generated (Figure 8). The plots in Figure 7 and Figure 8 show results aggregated over the two prompting approaches (Dedicated and Broad). Figure 13 compares the performance of the different models and prompting approaches. The performance of the Dedicated and Broad Approaches is similar, even though the dedicated prompts include much more domain knowledge. Additionally, the Broad Approach includes a correct incidence matrix as an input for MSO generation, but this does not influence the results significantly. o1 shows the best performance across all metrics, while 4o-mini is the worst despite its access to the code interpreter.

Performance in conflict generation, depending on the number of relations and conflicts, is presented respectively in Figure 9 and Figure 10. Trend lines show increasing performance with increasing complexity, but this is caused by poor performance when the number of conflicts is zero (i.e. there are no inconsistencies in the system description and observation). o1 performs best for conflict generation (Figure 14).

Diagnoses generation is the most computation-heavy task. We observe a clear decrease in the performance with increasing complexity (Figure 11 and Figure 12). o3-mini performs best for diagnoses generation (Figure 15).

As LLMs were not primarily designed to handle algorithmic and computational tasks, the performance of reasoning models is surprisingly good. o3-mini achieves F1=0.917±0.249 on conflicts generation and F1=0.784±0.322 on diagnoses generation. The examples were generated manually and were unavailable in the training data (correct solutions were released after the tested models’ release).

8 Conclusion

This paper explores the use of large language models (LLMs) as a novel tool for performing structural analysis in model-based diagnosis (MBD), specifically focusing on the automatic generation of Minimal Structurally Overdetermined (MSO) sets, from which diagnosis tests can be obtained, from engineering documentation. The generation of conflicts and diagnoses is also explored. The authors evaluate the performance of various OpenAI LLMs (including o1, o3-mini, and 4o-mini) by comparing their generated MSOs, conflicts, and diagnoses against those produced by traditional tools. The paper situates its contribution within a growing body of work on LLM-based fault diagnosis but emphasizes its unique connection to foundational MBD concepts. An empirical evaluation highlights both the potential and limitations of LLMs in structured engineering tasks, offering insights into their applicability and future integration into hybrid diagnostic frameworks.

The main goal of this work is to evaluate LLM capabilities in the context of MBD fault diagnosis. While we find some results surprisingly good, at this point LLMs do not provide an alternative to diagnosis algorithms. The current level of accuracy is not sufficient to replace traditional tools. Additionally, because the results are obtained through API calls, the time needed to obtain them depends on the network load and other factors and cannot be bounded robustly. Another practical consideration is the cost. Lastly, the quality of the results varies considerably between runs.

Several avenues can be explored to extend this work. First, broader benchmarking on more diverse and complex systems, including hybrid systems or distributed systems, would help assess the generalization capabilities and scalability of LLMs in diagnostic tasks. Another promising direction involves fine-tuning or instruction-tuning language models on domain-specific corpora to improve their performance. Additionally, integrating multi-modal inputs – such as P&ID diagrams, circuit schematics, or CAD models – via vision-language models may allow more intuitive and direct processing of engineering data.

Furthermore, future work could explore the use of the recent Model Context Protocol (MCP) in diagnostic frameworks involving distributed architectures. MCP provides a structured way to manage model contexts and execution flows across different computational components. By integrating MCP with language models, it becomes possible to coordinate diagnostic reasoning over a set of specialized servers, each responsible for distinct functionalities – such as structural analysis, variable elimination, or diagnosis generation from conflicts. This would enable a modular and scalable diagnostic pipeline, where LLMs interact with context-aware services rather than operating as monolithic agents. Moreover, this modular structure is particularly well suited for diagnosis in distributed systems, where different subsystems may operate independently, expose only partial observability, or follow different modeling paradigms. MCP can enable seamless orchestration of context-aware services across these subsystems, allowing LLMs to query relevant components dynamically and integrate partial diagnosis results into a coherent global view. Such an approach enhances scalability, reusability, and flexibility in large-scale or heterogeneous diagnosis environments, while also opening new opportunities for intelligent fault management across distributed infrastructures.

Figure 13: Comparison of F1 score, Precision, and Recall for MSO generation across different models. Each bar shows the mean value with standard deviation.
Figure 14: Comparison of F1 score, Precision, and Recall for minimal conflicts generation across different models. Each bar shows the mean value with standard deviation.
Figure 15: Comparison of F1 score, Precision, and Recall for minimal diagnoses generation across different models. Each bar shows the mean value with standard deviation.

References

  • [1] J-Ph Cassar and M Staroswiecki. A structural approach for the design of failure detection and identification systems. IFAC Proceedings Volumes, 30(6):841–846, 1997.
  • [2] Marie-Odile Cordier, Philippe Dague, Michel Dumas, François Lévy, Jacky Montmain, Marcel Staroswiecki, and Louise Travé-Massuyes. A comparative analysis of ai and control theory approaches to model-based diagnosis. In ECAI, pages 136–140, 2000.
  • [3] Akshay J Dave, Tat Nghia Nguyen, and Richard B Vilim. Integrating llms for explainable fault diagnosis in complex systems. arXiv preprint arXiv:2402.06695, 2024. doi:10.48550/arXiv.2402.06695.
  • [4] Teresa Escobet, Anibal Bregon, Belarmino Pulido, and Vicenç Puig. Fault Diagnosis of Dynamic Systems. Springer, 2019.
  • [5] Erik Frisk, Mattias Krysander, and Daniel Jung. A toolbox for analysis and design of model based diagnosis systems for large scale models. IFAC-PapersOnLine, 50(1):3287–3293, 2017.
  • [6] Sungmin Kang, Gabin An, and Shin Yoo. A quantitative and qualitative evaluation of llm-based explainable fault localization. Proceedings of the ACM on Software Engineering, 1(FSE):1424–1446, 2024. doi:10.1145/3660771.
  • [7] Abdullah Khan, Rahul Nahar, Hao Chen, Gonzalo E Flores, and Can Li. Faultexplainer: Leveraging large language models for interpretable fault detection and diagnosis. arXiv preprint arXiv:2412.14492, 2024.
  • [8] Mattias Krysander, Jan Åslund, and Mattias Nyberg. An efficient algorithm for finding minimal overconstrained subsystems for model-based diagnosis. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 38(1):197–206, 2007. doi:10.1109/TSMCA.2007.909555.
  • [9] Mattias Krysander, Jan Åslund, and Mattias Nyberg. An efficient algorithm for finding minimal overconstrained subsystems for model-based diagnosis. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 38(1):197–206, 2008. doi:10.1109/TSMCA.2007.909555.
  • [10] Karol Kukla. jadlownik/FaultDiagnosis. Software, swhId: swh:1:dir:b8a40cc86347d60c8efb7ce10df9343400eee88e (visited on 2025-10-20). URL: https://github.com/jadlownik/FaultDiagnosis/tree/develop, doi:10.4230/artifacts.24965.
  • [11] Lin Lin, Sihao Zhang, Song Fu, and Yikun Liu. Fd-llm: Large language model for fault diagnosis of complex equipment. Advanced Engineering Informatics, 65:103208, 2025. doi:10.1016/J.AEI.2025.103208.
  • [12] Silke Merkelbach, Alexander Diedrich, Anna Sztyber-Betley, Louise Travé-Massuyès, Elodie Chanthery, Oliver Niggemann, and Roman Dumitrescu. Using multi-modal llms to create models for fault diagnosis. In The 35th International Conference on Principles of Diagnosis and Resilient Systems (DX’24), volume 125, 2024.
  • [13] Herbert Muehlburger and Franz Wotawa. Flex: Fault localization and explanation using open-source large language models in powertrain systems (short paper). In 35th International Conference on Principles of Diagnosis and Resilient Systems (DX 2024), pages 25:1–25:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/OASIcs.DX.2024.25.
  • [14] Herbert Mühlburger and Franz Wotawa. Faultlines-evaluating the efficacy of open-source large language models for fault detection in cyber-physical systems. In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), pages 47–54. IEEE, 2024. doi:10.1109/AITEST62860.2024.00014.
  • [15] CG Pérez-Zuniga, E Chanthery, L Travé-Massuyès, and J Sotomayor. Fault-driven structural diagnosis approach in a distributed context. IFAC-PapersOnLine, 50(1):14254–14259, 2017.
  • [16] Raymond Reiter. A theory of diagnosis from first principles. Artificial intelligence, 32(1):57–95, 1987. doi:10.1016/0004-3702(87)90062-2.
  • [17] Alicia Russell-Gilbert, Alexander Sommers, Andrew Thompson, Logan Cummins, Sudip Mittal, Shahram Rahimi, Maria Seale, Joseph Jaboure, Thomas Arnold, and Joshua Church. Aad-llm: Adaptive anomaly detection using large language models. In 2024 IEEE International Conference on Big Data (BigData), pages 4194–4203. IEEE, 2024. doi:10.1109/BIGDATA62323.2024.10825679.
  • [18] Anna Sztyber Betley. asztyber/LLM_diagnostic_concepts. Software, swhId: swh:1:dir:d265fcfc25d623bff2fa34d48590c7eb66d08995 (visited on 2025-10-20). URL: https://github.com/asztyber/LLM_diagnostic_concepts, doi:10.4230/artifacts.24966.
  • [19] Laifa Tao, Haifei Liu, Guoao Ning, Wenyan Cao, Bohao Huang, and Chen Lu. Llm-based framework for bearing fault diagnosis. Mechanical Systems and Signal Processing, 224:112127, 2025.
  • [20] Louise Travé-Massuyes, Teresa Escobet, and Xavier Olive. Diagnosability analysis based on component-supported analytical redundancy relations. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 36(6):1146–1160, 2006. doi:10.1109/TSMCA.2006.878984.
  • [21] Aidan ZH Yang, Claire Le Goues, Ruben Martins, and Vincent Hellendoorn. Large language models for test-free fault localization. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pages 1–12, 2024.