Model-driven development and verification of fault tolerant systems
Javed, Kashif (2017-05-24)
Javed, Kashif
Turku Centre for Computer Science (TUCS)
24.05.2017
Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.
Julkaisun pysyvä osoite on
https://urn.fi/URN:ISBN:978-952-12-3553-5
https://urn.fi/URN:ISBN:978-952-12-3553-5
Tiivistelmä
Dependability is an ability of a computer-based system to deliver services that can be justifiably trusted. There is a wide range of computer-based systems that provide services that are critical for our society, e.g., nuclear power plants, transportation, healthcare, etc. Ensuring dependability of such systems constitutes an important engineering goal.
Dependability is an integrated notion that encompasses various system characteristics including reliability, safety, availability, etc. For a wide class of systems, the main engineering goal is to ensure a high degree of reliability – a probability of a system functioning correctly over a given period under a given set of operating conditions. Since the occurrence of faults can disrupt correct system behavior, to achieve the required reliability, we need to employ fault tolerance techniques.
The main goal of fault tolerance is to ensure that the system can deliver its services despite the occurrence of faults. Fault tolerance typically introduces some form of architectural or computational redundancy and hence, increases the complexity of the system design. Therefore, to ensure a correct implementation of fault tolerance mechanisms, we should develop the techniques that facilitate a structured analysis of system failure modes, systematic design of error recovery and reconfiguration procedures, as well as bolster system verification and validation.
Model-driven engineering allows the designers to cope with system complexity and analyze system behavior at different levels of abstraction. In our thesis, we aim at studying how to represent various static and dynamic aspects of fault tolerance in the model-driven development.
Often behavior of complex fault-tolerant systems is structured using the notion of operational modes – mutually exclusive sets of the system behavior. As a reaction on faults as well as different internal and external conditions, the system switches between its operational modes. The design of mode transition logic in distributed fault tolerant systems is a challenging and error-prone task. On the one hand, we need to ensure that all components of the system are put in the states required by a certain mode. On the other hand, we should verify that the components maintain these states while the system is stable, i.e., before the conditions for triggering a transition to another mode are reached. To facilitate the design of complex mode-rich fault tolerant systems, in our thesis, we demonstrate how to model distributed systems with centralized and distributed mode management as well as verify their mode transition logic. We validate the proposed approach in the aerospace domain.
To guarantee that the reconfiguration performed in response to the changed operating conditions achieves the required goal, fault-tolerant systems should incorporate the appropriate monitoring capabilities and rely on a set of explicitly defined rules for triggering adaptation. Usually, these rules establish a connection between globally observed system properties and behavior of system components, i.e., span over several architectural layers. To facilitate the development of such complex systems, in our work, we propose the generic patterns for architecting adaptive fault tolerant systems in a layered hierarchical manner. We demonstrate how the proposed patterns can be utilized in the context of data intensive fault tolerant systems.
Development and verification of fault-tolerant distributed systems also require the use of advanced verification technologies as well as techniques for supporting a disciplined systematic analysis of system failure modes. In our thesis, we demonstrate how to perform formal verification of a fault tolerant routing protocol and systematically identify system failure modes and recovery procedures using Failure Modes and Effect Analysis approach. While reliance on formal techniques increases confidence in the correctness of the implementation of the fault tolerance mechanisms, the use of a systematic inductive technique for identifying failure modes improves the completeness of the analysis.
The research work performed in our thesis aims at creating a support for explicit integration of fault tolerance consideration into the model-driven system development. We aim at creating a potentially industry-relevant approach and hence, utilize the modeling and verification techniques that are used in current industrial practice. Moreover, we validate our approach in a number of case studies from different domains.
Dependability is an integrated notion that encompasses various system characteristics including reliability, safety, availability, etc. For a wide class of systems, the main engineering goal is to ensure a high degree of reliability – a probability of a system functioning correctly over a given period under a given set of operating conditions. Since the occurrence of faults can disrupt correct system behavior, to achieve the required reliability, we need to employ fault tolerance techniques.
The main goal of fault tolerance is to ensure that the system can deliver its services despite the occurrence of faults. Fault tolerance typically introduces some form of architectural or computational redundancy and hence, increases the complexity of the system design. Therefore, to ensure a correct implementation of fault tolerance mechanisms, we should develop the techniques that facilitate a structured analysis of system failure modes, systematic design of error recovery and reconfiguration procedures, as well as bolster system verification and validation.
Model-driven engineering allows the designers to cope with system complexity and analyze system behavior at different levels of abstraction. In our thesis, we aim at studying how to represent various static and dynamic aspects of fault tolerance in the model-driven development.
Often behavior of complex fault-tolerant systems is structured using the notion of operational modes – mutually exclusive sets of the system behavior. As a reaction on faults as well as different internal and external conditions, the system switches between its operational modes. The design of mode transition logic in distributed fault tolerant systems is a challenging and error-prone task. On the one hand, we need to ensure that all components of the system are put in the states required by a certain mode. On the other hand, we should verify that the components maintain these states while the system is stable, i.e., before the conditions for triggering a transition to another mode are reached. To facilitate the design of complex mode-rich fault tolerant systems, in our thesis, we demonstrate how to model distributed systems with centralized and distributed mode management as well as verify their mode transition logic. We validate the proposed approach in the aerospace domain.
To guarantee that the reconfiguration performed in response to the changed operating conditions achieves the required goal, fault-tolerant systems should incorporate the appropriate monitoring capabilities and rely on a set of explicitly defined rules for triggering adaptation. Usually, these rules establish a connection between globally observed system properties and behavior of system components, i.e., span over several architectural layers. To facilitate the development of such complex systems, in our work, we propose the generic patterns for architecting adaptive fault tolerant systems in a layered hierarchical manner. We demonstrate how the proposed patterns can be utilized in the context of data intensive fault tolerant systems.
Development and verification of fault-tolerant distributed systems also require the use of advanced verification technologies as well as techniques for supporting a disciplined systematic analysis of system failure modes. In our thesis, we demonstrate how to perform formal verification of a fault tolerant routing protocol and systematically identify system failure modes and recovery procedures using Failure Modes and Effect Analysis approach. While reliance on formal techniques increases confidence in the correctness of the implementation of the fault tolerance mechanisms, the use of a systematic inductive technique for identifying failure modes improves the completeness of the analysis.
The research work performed in our thesis aims at creating a support for explicit integration of fault tolerance consideration into the model-driven system development. We aim at creating a potentially industry-relevant approach and hence, utilize the modeling and verification techniques that are used in current industrial practice. Moreover, we validate our approach in a number of case studies from different domains.