Updated: Mar 3
The following piece does not come from a management textbook. It is the result of the cumulative years of experience of a few consultants working with companies that are trying to improve their work practices.
One of the most common organizational challenges we’ve experienced is finding true causes. This deep-rooted problem stems from the incorrect ways organizations distinguish a root cause from a technical cause. Root cause identification is about finding “What happened, how it happened, and why it happened”, in that specific order.
You cannot identify how something happened unless you know what happened, and similarly you cannot identify why it happened unless you know how it happened, or at least what happened. Once you understand this, you can make good progress in identifying the root cause quickly and accurately, and in arriving at a permanent solution. What happened is the most neglected part of this equation, even though it is the correct starting point for your analysis. No wonder so many teams resort to “trial and error” practices: they have not really established what the problem is.
Let’s take an example: the New York hub is experiencing dropped website connections, but nobody else in its global company is experiencing this fault. If you look for possible causes at this stage without understanding the dropping out situation, you might attempt many “trial and error” solutions with a very poor success rate. Standing back for a few minutes to review and consider the facts would allow you to understand exactly WHAT happened.
ABC Router dropping out
Finding the WHAT gets you to describe the problem more accurately.
HOW it happened
Dropping out is due to increased volume beyond the threshold
When you know the HOW, you can remove the limitation to restore functionality. You would also be able to take action to avoid recurrence. The HOW normally refers to an event in time. (We refer to this as the TECHNICAL CAUSE.)
WHY it happened
The router specs are inadequate for volume spikes
The WHY normally points to the root cause, which would be a “condition that exists”. You need to fix the router situation in such a way that it can handle volume spikes.
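For teams that track incidents in a tool, the WHAT, HOW, WHY order from the router example can be sketched as a small record that refuses to accept a root cause before the technical cause is known. This is only an illustration: the class name ProblemAnalysis and its methods are hypothetical, and the sketch assumes the strict reading of the order (HOW before WHY).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemAnalysis:
    """Hypothetical record that is filled in WHAT, then HOW, then WHY."""

    what: str                    # verified description of what happened
    how: Optional[str] = None    # technical cause: an event in time
    why: Optional[str] = None    # root cause: a condition that exists

    def __post_init__(self) -> None:
        # The analysis cannot start without a problem description.
        if not self.what.strip():
            raise ValueError("Describe WHAT happened before anything else")

    def set_how(self, how: str) -> None:
        self.how = how

    def set_why(self, why: str) -> None:
        # Assumption of this sketch: WHY is only accepted once HOW is known.
        if self.how is None:
            raise ValueError("Establish HOW it happened before asking WHY")
        self.why = why


# The router example from the text, entered in the required order.
analysis = ProblemAnalysis(what="ABC Router dropping out")
analysis.set_how("Dropping out is due to increased volume beyond the threshold")
analysis.set_why("The router specs are inadequate for volume spikes")
```

Calling set_why before set_how raises an error, which mirrors the point that a team jumping straight to causes has skipped a step.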
How do we do this in our own work environments? Someone needs to take the lead (when our consultants are involved, we normally do this in a facilitation role), ensure we have the correct WHAT happened facts, and get the most informed and involved sources to contribute. Employ a repeatable, deliberate approach (a framework with a process) to gather the correct, verified data to answer the questions of HOW it happened and WHY it happened.
Client Success Story
An international banking IT team had been struggling with the slow performance of a trading website for 6 days. They had the best minds on it, mostly directors, who we believe are too far removed from the problem to be of any use. They were trying to resolve the “SLOW PERFORMANCE OF THE WEBSITE”, which resulted in many trial and error actions without success.
We got involved and wanted to know the facts of what was actually happening, and what was not supposed to happen. After going into the logs, we noticed that a particular type of trading transaction was being entered but did not execute. We settled on the description “Futures transactions not executing”. The team took immediate action to circumvent this fault and restored the functionality of futures trading. However, we still had to solve the WHY, or the root cause. Having overcome the immediate problem with an interim action, the team now had more time to get to the bottom of it by finding the company condition that caused the problem in the first place.