Implementing “Andon (行灯)” in DevOps
Stop to Speed Up 😃
Let’s first understand what the word “Andon (行灯)” means in Japanese.
Andon (行灯): a fixed paper-enclosed lantern; a paper-covered wooden stand housing an (oil) lamp.
Dictionary: https://jisho.org/search/行灯
The Toyota Production System (TPS) introduced the word “Andon” to manufacturing. The “Andon Cord” is a Lean manufacturing principle and tool used to notify management, maintenance, and other workers of a quality or process problem. The concept revolves around a device with signal lights that indicate which assembly-line workstation has a problem. Alerts are normally activated manually by a worker using a pull cord (the Andon cord) or a button, or they may be activated automatically by the production equipment itself. The idea is that stopping the system gives you an immediate opportunity to improve or to find a root cause, as opposed to letting the defect move further down the line unresolved.
What happens when such signals are ignored? In “The High-Velocity Edge”, Steven Spear describes a horrifying story of missed opportunities leading up to the 2003 NASA Columbia space shuttle disaster. The short version of the story is that the thermal protection system on the left wing was damaged just after launch, but the damage didn’t become an issue until reentry 16 days later. After the disaster, the investigation board charged with reviewing the accident found at least eight attempted signaling events to notify the crew and request a “go-see” of the damage. Nothing was done, and that led to the disaster.
Now, what does the Andon Cord mean in Software Development?
During development, bugs are talked about openly, but the project moves forward regardless, and the bugs continue down the process.
This is where DevOps comes in and can incorporate the Andon Cord concept. Two specific areas come to mind. First, if an organization has truly flipped the testing pyramid and put full automated testing in place, in conjunction with Continuous Integration, Specification by Example, and a tool such as SonarQube, this is the first place the Andon Cord concept can be employed. By forcing all new code to run the gauntlet of full automated testing (driven by Specification by Example) and SonarQube’s quality gates, you make sure it meets the behaviors specified by the customer and that the code quality is in line with expected standards.
The second area where DevOps can function as a sort of Andon Cord is A/B testing. When an organization has put a fully automated delivery pipeline in place, it can quickly get code out to subsets of its customers (often the same day the code was developed) in order to create feedback loops that confirm it is building what its clients want.
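To make the first area a bit more concrete, here is a minimal Python sketch of a pipeline step that behaves like an Andon Cord. Everything in it is an assumption for illustration: the `BuildReport` fields, the 80% coverage threshold, and the function names are hypothetical, and it does not call SonarQube or any real CI tool. A real pipeline would fill the report from its own test runner and quality gate results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BuildReport:
    """Hypothetical summary a CI job might collect for one commit."""
    tests_failed: int
    coverage_percent: float
    new_critical_issues: int


def andon_reasons(report: BuildReport, min_coverage: float = 80.0) -> List[str]:
    """Return the reasons to stop the line; an empty list means the build may proceed."""
    reasons = []
    if report.tests_failed > 0:
        reasons.append(f"{report.tests_failed} automated test(s) failed")
    if report.coverage_percent < min_coverage:
        reasons.append(
            f"coverage {report.coverage_percent:.1f}% is below {min_coverage:.1f}%"
        )
    if report.new_critical_issues > 0:
        reasons.append(f"{report.new_critical_issues} new critical issue(s)")
    return reasons


def run_gate(report: BuildReport) -> int:
    """Print the outcome and return an exit code; non-zero means the cord was pulled."""
    reasons = andon_reasons(report)
    if not reasons:
        print("Gate passed: the line keeps moving.")
        return 0
    print("ANDON: stopping the pipeline.")
    for reason in reasons:
        print(f" - {reason}")
    return 1
```

A CI job could end with `sys.exit(run_gate(report))`, so that a non-zero exit code stops the line the same way a pulled cord would.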
Are any industry giants really using it? The answer is YES.
Amazon and the Cord:
The Andon Cord has become a metaphor for some modern-day Web-Scale organizations as well. Jeff Bezos, the CEO of Amazon, described in a 2013 letter to Amazon’s shareholders a practice he called the Customer Service Andon Cord. This was an established practice of metaphorically pulling an Andon Cord when they noticed that a customer was overpaying or had overpaid for a service. Amazon would heuristically scan its systems looking for these kinds of potential customer service mismatches. These were considered defects at Amazon because the company had a vision of always being customer-centric. It would automatically refund a customer, without the customer even asking, if the service delivery was suboptimal. I have had this happen to me on a few occasions while watching a movie on Amazon Prime: the next day I received an email telling me they had refunded my movie rental cost due to poor quality.
Netflix and The Chaos Cord:
Another example of an Andon Cord metaphor used in Web-Scale businesses is at Netflix. Netflix has an interesting way of exercising its Andon Cord, although the company doesn’t actually call it an Andon Cord.
At Netflix, they inject failure into their systems on purpose by intentionally trying to break systems in production. They have developed what is now famously called Chaos Monkey. Chaos Monkey is a process that randomly kills live running production servers. This behavior is known by everyone who works at Netflix. It’s part of their culture. There are no surprises about this practice. Developers plan their code and systems accordingly. According to Adrian Cockcroft, one of the primary architects behind Netflix’s IT infrastructure, not knowing about Chaos Monkey when coming into a job interview at Netflix was pretty much an immediate no-hire decision.
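The idea is easy to play with in a toy model. The sketch below is not Netflix’s Chaos Monkey; it is a small in-memory simulation with hypothetical names (`Fleet`, `unleash_monkey`) that picks one healthy instance at random, kills it, and lets you check whether the service would still have enough capacity.

```python
import random
from typing import List, Optional


class Fleet:
    """In-memory stand-in for a group of redundant service instances."""

    def __init__(self, names: List[str]) -> None:
        self.healthy = set(names)

    def terminate(self, name: str) -> None:
        """Mark an instance as gone; unknown names are ignored."""
        self.healthy.discard(name)

    def can_serve(self, min_healthy: int = 1) -> bool:
        """True while at least min_healthy instances are still running."""
        return len(self.healthy) >= min_healthy


def unleash_monkey(fleet: Fleet, rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen healthy instance and return its name.

    Returns None when there is nothing left to terminate.
    """
    if not fleet.healthy:
        return None
    # Sort first so that a seeded rng gives a repeatable choice.
    victim = rng.choice(sorted(fleet.healthy))
    fleet.terminate(victim)
    return victim
```

Passing in a seeded `random.Random` keeps an experiment repeatable, which helps when you want to replay the exact failure that exposed a weakness.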
Summary,
- NO permission is needed to pull the cord (open to everyone)
- DO NOT bury an issue in useless paperwork and never-ending meetings (Act with priority)
- NO defect is too small
- Even if the cord is pulled by mistake, the response should never be negative (Build Trust)
- It’s not the tool that matters; the culture and behavior behind the tool are what is important (Build Culture)
- Solving the problem is NOT the goal; understanding how to solve the problem is the goal
Furthermore, the process of solving the issues can be guided by a practice described by Dr. W. Edwards Deming called Plan Do Check Act (PDCA). In the PDCA loop, you plan (P) a countermeasure, implement the countermeasure (D), check or study the results (C), and act on the results (A): either the problem is fixed or you start the next countermeasure.
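The loop described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `pdca` function, the idea of passing countermeasures as (name, action) pairs, and the `is_fixed` check are all hypothetical and not part of any standard PDCA tooling.

```python
from typing import Callable, Iterable, Optional, Tuple

# A countermeasure is a (name, action) pair; both are hypothetical placeholders.
Countermeasure = Tuple[str, Callable[[], None]]


def pdca(
    countermeasures: Iterable[Countermeasure],
    is_fixed: Callable[[], bool],
) -> Optional[str]:
    """Try countermeasures in order.

    Returns the name of the countermeasure that fixed the problem,
    or None if every planned countermeasure was tried without success.
    """
    for name, action in countermeasures:  # Plan: pick the next countermeasure
        action()                          # Do: implement it
        if is_fixed():                    # Check: study the results
            return name                   # Act: the fix holds, so adopt it
        # Act: not fixed yet, so go around the loop with the next countermeasure
    return None
```

For example, with countermeasures such as “add a retry” and “raise the timeout”, and an `is_fixed` check that reads the current error rate, the function stops at the first countermeasure that brings the error rate under the target.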
Another point is that implementing an Andon Cord in an organization is not something you do overnight. It takes a continuous improvement roadmap to get there, with behavior reinforcement built into the process. It takes a fierce commitment to the practice of improvement and an equally skilled leadership coaching approach. If you want to investigate the concept more deeply, I recommend Mike Rother’s book “Toyota Kata”.
Last but not least,
Do we need an actual physical device? Maybe not. Email or Slack can work as a virtual “Andon Cord”. Personally, I would like to do a small DIY project, “Andon using Raspberry Pi 3”, which is basically building a small and easy Andon cord with a Raspberry Pi 3 so that the team can access it through the web and signal problems. I will work on this and share the implementation details in another blog post.
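In the meantime, the logic behind such a virtual cord can be sketched in plain Python. This is not the Raspberry Pi implementation; the `AndonBoard` class and its methods are hypothetical names chosen for illustration, and the state lives only in memory. A web endpoint, a Slack bot, or GPIO-driven lights could all sit on top of something like this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Signal:
    """One pull of the cord: where it was raised, by whom, and why."""
    station: str
    raised_by: str
    reason: str
    raised_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AndonBoard:
    """Tracks which stations (teams, services, pipelines) have an open signal."""

    def __init__(self) -> None:
        self._open: Dict[str, Signal] = {}

    def pull(self, station: str, raised_by: str, reason: str) -> Signal:
        """Raise a signal. There is no permission check, on purpose.

        A second pull on the same station replaces the earlier signal.
        """
        signal = Signal(station, raised_by, reason)
        self._open[station] = signal
        return signal

    def resolve(self, station: str) -> bool:
        """Clear the signal for a station; returns False if nothing was open."""
        return self._open.pop(station, None) is not None

    def light(self, station: str) -> str:
        """Red while a signal is open, green otherwise."""
        return "red" if station in self._open else "green"

    def open_signals(self) -> List[Signal]:
        """Open signals, oldest first."""
        return sorted(self._open.values(), key=lambda s: s.raised_at)
```

Note that `pull` deliberately has no role or permission check, in line with the first point in the summary: the cord is open to everyone.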