Failure Modes for distributed applications

FMECA

If anything can go wrong, it will.

Murphy’s first law

Table of contents
· What is FMECA or FMEA?
· Why do we need it?
· Tangible benefits of FMECA
· Type of FMECA/FMEA?
· Levels of FMECA
· How to do Brainstorming for FMECA?

Waking up in the middle of the night due to the alerts caused by the failure of applications is never a pleasant experience.

Nor is it pleasant for the business, which suffers damage to its reputation and sometimes has to pay regulatory penalties because of application downtime.

One of the reasons for this post is to share my experience of following a structured approach to analyse and execute various failure modes in a distributed system (Kafka) on GCP.

It has helped us to identify possible failure causes and their effects, and to implement failure detection mechanisms that proactively monitor the system and prevent failures.

Distributed systems are made up of many moving components.

Distributed system architectures are complex due to the many moving components which are prone to various types of failures.

Analysing possible failures upfront with a structured approach is important to avoid possible downtimes in a live system. The process I have followed is known as FMECA or FMEA.

What is FMECA or FMEA?

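Failure Mode, Effects and Criticality Analysis (FMECA), often referred to simply as “failure modes”, is the process of reviewing all the components and integrations of a system architecture to identify potential failure modes, their causes and their effects. FMEA (Failure Mode and Effects Analysis) is the same exercise without the criticality ranking.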

Why do we need FMECA?

The intent of the Failure Mode, Effects & Criticality Analysis methodology is to increase knowledge of risks and prevent failures.

By exercising several failure modes on a production-like system and putting preventive and detective measures in place for them, we improve the reliability, resilience, safety and quality of the application’s architecture.

This comprehensive exercise lets us establish the relationship between each cause of failure and its effects, along with the criticality of the corrective actions.

In my experience, the more failure scenarios the analysis covers, the fewer midnight escalations occur compared to a system where no FMECA has been conducted.

Tangible benefits of FMECA

The benefits fall into the categories below.

1. Design and Development Benefits

  • Increased Reliability, Maintainability, Serviceability.
  • Early identification of single failure points (SFPs) and system interface problems.
  • Provides a documented method for selecting a design with a high probability of success.
  • Criteria for early planning of test cases for the architecture.
  • Reduced development time and re-design.

2. Operations Benefits

  • More effective Control Plans.
  • Improved Verification and Validation testing requirements.
  • Provides an effective method for evaluating the effects of proposed changes to the design.
  • A basis for in-flight troubleshooting procedures and for locating performance monitoring.
  • Optimised preventive and predictive maintenance.
  • Provides runbooks and artefacts to execute in case of failures.

3. Cost Benefits

  • Recognise failure modes in advance (when they are less costly to address).

Type of FMECA/FMEA

Omission Failure

This class of failure is usually around communication between senders and receivers in the application system.

For example, the server responds too slowly to the receiver because the communication link between sender and receiver is slow. This type of failure usually occurs when the communication links between components are overloaded or misconfigured.

The response from the streaming application to the consumer will be slow in the above case due to the high latency between the streaming application and the Kerberos server.

Crash Failure

This class of failure is where the process stops responding due to continuous omission failure.

For example, the server process stops responding to new requests from consumers because the resources on its host are exhausted. Before a process reaches the crash-failure state, it starts showing symptoms of slowness due to omission failures.

Timing Failure

This type of failure occurs when the response from the process is too slow, causing delays in processing for client applications. Performance failures usually fall under this category.

Response Failure

This is the class of failure where a process responds with incorrect data or an incorrect response. Most data-related failures fall under this category.

Arbitrary Failure

This class of failure is considered the worst type, because any kind of error may occur. A process may produce arbitrary responses at arbitrary times. For example, a channel can produce duplicate messages or corrupt messages. This type of failure is harder to detect and has a profound impact on the system.
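To make the classes above more concrete, here is a minimal sketch of how a monitoring probe could label a single observed request. The `Observation` fields, the `classify` function and the 500 ms deadline are assumptions made for illustration and are not part of any particular monitoring tool. Arbitrary failures are left out on purpose, since they usually cannot be recognised from a single observation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    OMISSION = "omission"
    CRASH = "crash"
    TIMING = "timing"
    RESPONSE = "response"


@dataclass
class Observation:
    """One observed request; every field is a hypothetical monitoring input."""
    process_alive: bool             # did a health check find the process running?
    latency_ms: Optional[float]     # None means no reply arrived at all
    payload: Optional[str] = None   # what the process returned
    expected: Optional[str] = None  # what it should have returned, if known


def classify(obs: Observation, deadline_ms: float = 500.0) -> Optional[FailureClass]:
    """Return the failure class for one observation, or None if it looks healthy."""
    if not obs.process_alive:
        return FailureClass.CRASH        # the process has stopped responding entirely
    if obs.latency_ms is None:
        return FailureClass.OMISSION     # the request or the reply was lost
    if obs.latency_ms > deadline_ms:
        return FailureClass.TIMING       # a reply arrived, but too late
    if obs.expected is not None and obs.payload != obs.expected:
        return FailureClass.RESPONSE     # a reply arrived in time, but it is wrong
    return None
```

The order of the checks matters in this sketch: a crashed process also produces no reply, so the crash check comes before the omission check.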

Levels of FMECA

It is a wise idea to break down distributed application architecture into various layers stacked over each other. The application layer being the topmost is the most impacted layer in the architecture. Any failure which occurs in layers beneath it will affect the application layer.

Below are examples of layers that commonly exist in a distributed application architecture.
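The idea that a failure in a lower layer surfaces in the layers above it can be sketched in a few lines. The layer names here are illustrative placeholders, not a definitive list, and the sketch assumes that a failure propagates to every layer stacked above the one that failed.

```python
from typing import List

# Hypothetical layers, ordered from bottom to top. The names are placeholders.
LAYERS: List[str] = ["infrastructure", "network", "platform", "data", "application"]


def impacted_layers(failed_layer: str, layers: List[str] = LAYERS) -> List[str]:
    """Return the failed layer plus every layer stacked above it."""
    start = layers.index(failed_layer)  # raises ValueError for an unknown layer
    return layers[start:]
```

Whatever layer is passed in, the application layer is always the last entry of the result, which is why it is the most impacted layer.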

Where to introduce FMECA in the application lifecycle?

I recommend conducting FMECA in an environment that has all the third-party integrations enabled. The environment should also be production-like, if not production itself, so that destructive failure modes can be executed frequently to test the resilience of the system.

Sample scenarios to provide an example of failure modes

The following is a basic high-level design of Kafka on GKE (Kubernetes). The architecture is there to provide a point of view on FMECA; the approach is similar for analysing failure modes in general. In production there are many more integrations and communication paths with other components, such as a private PKI for security or LDAP for authentication, which make the system even more complex.

Each arrow in the above diagram is a communication/integration channel between components.

Each component and each integration adds complexity to the system and can be prone to failure.
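One way to make sure no component or channel is forgotten is to derive the list of things to analyse directly from the arrows of the design. The sketch below assumes a handful of hypothetical channels; the component names are illustrative and do not reproduce the actual diagram.

```python
from typing import List, Tuple

# Hypothetical channels (the "arrows"), written as (source, destination).
CHANNELS: List[Tuple[str, str]] = [
    ("producer", "kafka-broker"),
    ("kafka-broker", "consumer"),
    ("kafka-broker", "kerberos"),
    ("kafka-broker", "ldap"),
]


def failure_candidates(channels: List[Tuple[str, str]]) -> List[str]:
    """List every component and every channel as an item to analyse."""
    # Each distinct endpoint of a channel is a component that can fail.
    components = sorted({name for pair in channels for name in pair})
    items = [f"component:{name}" for name in components]
    # Each channel itself can fail as well (slow, overloaded, misconfigured).
    items += [f"channel:{src}->{dst}" for src, dst in channels]
    return items
```

With the four channels assumed here, the function yields nine items, five components and four channels, and each of them deserves at least one failure scenario.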

In order to collate and determine the severity of the failure modes, we need to capture the failure scenarios and their resulting effects on the rest of the system.

I have used a framework and an approach which I will share in the next post on “How to brainstorm for failure modes”.

Following are the sample elements from the framework to capture the failure mode.

  1. Priority: Criticality of the failure based on the severity of business and technical impacts.
  2. Failure Description: Section describing the failure scenario.
  3. Business Impact: Impact on business function.
  4. Technical Impact: A failure scenario that can lead to business impact if not handled within a certain time period.
  5. Prevention: Measures to prevent failure modes, for example, by providing enough resources like CPU, memory, disk etc., depending on the failure mode.
  6. Detection: Measures to monitor component resources proactively to prevent failures.
  7. Resolution/Investigation: Steps to investigate and resolve the failure modes, for example, runbooks for support people.
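The elements above map naturally onto a small record type. The following is a minimal sketch, assuming that priority is a number where 1 means most critical; the two sample entries are hypothetical and only show how the fields could be filled in.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    priority: int          # assumption: 1 is the most critical
    description: str       # the failure scenario
    business_impact: str   # impact on the business function
    technical_impact: str  # what breaks technically if it is not handled in time
    prevention: str        # measures to prevent the failure
    detection: str         # measures to detect it proactively
    resolution: str        # steps or runbook to investigate and resolve it


def by_priority(modes: List[FailureMode]) -> List[FailureMode]:
    """Return the failure modes ordered from most to least critical."""
    return sorted(modes, key=lambda mode: mode.priority)


# Hypothetical entries, for illustration only.
REGISTER = [
    FailureMode(
        priority=2,
        description="Authentication server responds slowly",
        business_impact="Consumers receive data late",
        technical_impact="Request latency grows until clients time out",
        prevention="Size the authentication service for peak load",
        detection="Alert on authentication latency",
        resolution="Runbook: check the link and the server load",
    ),
    FailureMode(
        priority=1,
        description="Broker disk fills up",
        business_impact="Messages cannot be published",
        technical_impact="Broker stops accepting writes",
        prevention="Provide enough disk and set retention limits",
        detection="Alert on disk usage thresholds",
        resolution="Runbook: free or extend disk, then restart the broker",
    ),
]
```

Sorting the register with `by_priority(REGISTER)` puts the most critical scenarios at the top, which is the order in which I would work through prevention and detection measures.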
