Step-by-step process to identify failure modes in distributed systems

By the end of this article, you will have a step-by-step guide on how to brainstorm failure modes, and a framework to list and prioritize failure modes for any application architecture.

Example application architecture

Let’s begin, shall we?

Step 1:

Understand the application architecture, and have its functional and nonfunctional requirements handy. I know it sounds obvious, but I cannot emphasize it enough.

Step 2:

Start with the topmost layer of the architecture: the application layer.

Step 3:

Make a list of each layer and component of the architecture.
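
As a minimal sketch of such a list, the inventory can be kept as a simple mapping from layer to components. The layer names and the producer/consumer services below are assumptions for illustration; only the Kafka broker, Schema Registry, and Control Center are components named in this article.

```python
# Hypothetical inventory: layer names and groupings are assumptions,
# not the article's actual architecture diagram.
ARCHITECTURE = {
    "application": ["producer service", "consumer service"],
    "messaging": ["Kafka broker", "Schema Registry"],
    "operations": ["Control Center"],
}


def list_components(architecture):
    """Flatten the layer -> components mapping into (layer, component) pairs,
    keeping the topmost layer first (dicts preserve insertion order)."""
    return [
        (layer, component)
        for layer, components in architecture.items()
        for component in components
    ]


for layer, component in list_components(ARCHITECTURE):
    print(f"{layer}: {component}")
```

Keeping the topmost layer first in the mapping mirrors the order suggested in Step 2.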

Step 4:

Focus on one component at a time.

Possible failures in Kafka broker
Schema Registry Failure
Control Center Failures
Example of possible failure modes of the architecture
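
One way to keep the brainstorm focused on one component at a time is to cross each component with a short list of generic failure prompts. The sketch below does that; the prompt list is an assumed starter set, not an exhaustive taxonomy, and `brainstorm_prompts` is a hypothetical helper name.

```python
from itertools import product

# Components named in this article.
COMPONENTS = ["Kafka broker", "Schema Registry", "Control Center"]

# Assumed starter prompts; extend them for your own architecture.
FAILURE_PROMPTS = ["process crash", "network partition", "disk full", "slow responses"]


def brainstorm_prompts(components, prompts):
    """Return one 'component: prompt' line per combination, grouped by component."""
    return [f"{component}: {prompt}" for component, prompt in product(components, prompts)]


for line in brainstorm_prompts(COMPONENTS, FAILURE_PROMPTS):
    print(line)
```

Each generated line is only a question to discuss; whether it is a realistic failure mode is for the team to decide.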

Step 5:

List all the failure modes brainstormed in the above exercise in a worksheet with the columns mentioned below.
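
If the worksheet lives in a CSV file, a small script can create it. This is a sketch under assumptions: the column names below are illustrative placeholders chosen for this example, and the two rows are sample entries.

```python
import csv
import io

# Assumed column set for illustration; adapt it to your own worksheet.
COLUMNS = ["id", "component", "failure_mode", "technical_impact", "category", "priority"]


def write_worksheet(rows, out):
    """Write the rows as CSV; columns missing from a row are left empty."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"id": "FM-1", "component": "Kafka broker", "failure_mode": "single broker process stops"},
    {"id": "FM-2", "component": "Schema Registry", "failure_mode": "registry unreachable"},
]

buffer = io.StringIO()  # swap for open("failure_modes.csv", "w", newline="") to write a file
write_worksheet(rows, buffer)
print(buffer.getvalue())
```

The empty columns are filled in during the later steps (impact, category, priority).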

Step 6:

List the technical impacts of each failure mode.

Step 7:

Categorise common failure modes in the worksheet.

Why categorise, you ask?
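
A minimal sketch of the categorising step: grouping worksheet rows by category puts failure modes that share a category side by side, which is one plausible reason to categorise. The category names and rows are hypothetical examples.

```python
from collections import defaultdict

# Hypothetical worksheet rows; categories are illustrative only.
worksheet = [
    {"component": "Kafka broker", "failure_mode": "broker process stops", "category": "process failure"},
    {"component": "Schema Registry", "failure_mode": "registry process stops", "category": "process failure"},
    {"component": "Control Center", "failure_mode": "cannot reach brokers", "category": "network failure"},
]


def group_by_category(rows):
    """Map each category to the 'component: failure mode' entries that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(f'{row["component"]}: {row["failure_mode"]}')
    return dict(groups)


for category, entries in group_by_category(worksheet).items():
    print(category, "->", entries)
```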

Step 8:

Execute failure modes as per the priority defined by the business.

Step 9:

Capture error and warning logs while executing failure scenarios.
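
As a rough sketch of capturing those logs, the function below keeps only ERROR and WARN lines that fall inside the window in which a failure scenario was executed. The log format (`<ISO timestamp> <LEVEL> <message>`) is an assumption, and the sample lines are invented for the example, not real output from any component.

```python
from datetime import datetime

LEVELS = {"ERROR", "WARN"}


def capture(lines, start, end):
    """Return ERROR/WARN lines whose timestamp lies within [start, end]."""
    captured = []
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # not in the assumed "<timestamp> <LEVEL> <message>" shape
        timestamp, level, _message = parts
        try:
            when = datetime.fromisoformat(timestamp)
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if level in LEVELS and start <= when <= end:
            captured.append(line)
    return captured


# Invented sample lines for illustration.
sample = [
    "2024-01-01T09:59:00 ERROR before the scenario started",
    "2024-01-01T10:00:05 INFO scenario started",
    "2024-01-01T10:00:10 WARN connection retry",
    "2024-01-01T10:00:20 ERROR request timed out",
]
window_start = datetime(2024, 1, 1, 10, 0, 0)
window_end = datetime(2024, 1, 1, 10, 5, 0)
captured = capture(sample, window_start, window_end)
print("\n".join(captured))
```

Recording the start and end time of every scenario makes it easier to attach the captured lines to the matching worksheet row.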

Step 10:

Collaborate with the business to prioritize the failure modes according to their functional and nonfunctional impact.
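
One simple way to turn that conversation into an ordering is to score each failure mode and sort. The sketch below assumes a 1 to 5 scale for functional and nonfunctional impact and adds the two scores; both the scale and the scores shown are placeholders, and the business may prefer a different weighting.

```python
# Assumed 1-5 impact scores agreed with the business; values are placeholders.
failure_modes = [
    {"id": "FM-2", "failure_mode": "registry unreachable", "functional": 3, "nonfunctional": 2},
    {"id": "FM-1", "failure_mode": "broker process stops", "functional": 5, "nonfunctional": 4},
    {"id": "FM-3", "failure_mode": "Control Center down", "functional": 1, "nonfunctional": 3},
]


def prioritise(rows):
    """Sort failure modes by combined impact score, highest first."""
    return sorted(rows, key=lambda r: r["functional"] + r["nonfunctional"], reverse=True)


for rank, row in enumerate(prioritise(failure_modes), start=1):
    print(rank, row["id"], row["failure_mode"])
```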
