Chaos Engineering to Improve System Resiliency

George Ukkuru, Head - Quality Engineering, UST Global

An Agile Scrum Master, George has close to two decades of experience, during which he worked with well-known tech companies such as Caravel, Sunquest Information Systems and SAP Labs India before joining UST Global in 2008.

In today’s tech-savvy world, random glitches in systems have become harder to predict, and companies can scarcely afford them. These random failures hit a company’s bottom line, which makes downtime a key performance indicator for engineers. A glitch can be a networking fault in one of the data centers, a misconfigured server, a node shutting down, or any other kind of failure that propagates across systems. Such outages often have catastrophic results and cause severe downtime in the regular functioning of a system.

A single hour of outage can cost a company millions of dollars. According to Gartner, the average cost of IT downtime is $5,600 per minute, which works out to roughly $336,000 per hour. Because each business operates differently, the cost of downtime can range from $140,000 to $540,000 per hour. Rather than wait for an outage to happen, organizations should proactively identify system weaknesses and apply chaos engineering practices to mitigate the risks.

Chaos Engineering studies how large-scale systems respond to random events. It is a disciplined approach to identifying failures before they become outages. By testing how a system responds under stress, engineers can quickly identify and fix faults. The ultimate purpose of chaos engineering is to limit the chaos of outages caused by random events by carefully investigating ways to make a system more robust. In practice, planned experiments are run against a system to observe how it responds when such an event occurs.

Chaos Engineering originated at Netflix, which needed to be resilient against random host failures while migrating to AWS (Amazon Web Services). This resulted in the release of Chaos Monkey by Netflix in 2011. Additional failure injection modes were then added on top of Chaos Monkey, which allowed teams to test more failure states and build resilience to them. In 2014, Netflix introduced a new role, the Chaos Engineer, and announced Failure Injection Testing (FIT), a tool built on the concepts of the Simian Army to build resilience against random events. Gremlin was later founded to offer failure injection as a commercial product. With many organizations moving to cloud and microservice architectures, the need for chaos engineering has increased in recent years. Large technology companies such as Amazon, Netflix, LinkedIn, Facebook, Microsoft, and Google practice Chaos Engineering to improve the reliability of their systems.

Chaos Engineering works on the principle of running thoughtful experiments within the system, which bring out insights into how the system responds to failures. There are three steps involved:

Step 1: Identify a fault that can be injected and form a hypothesis about the expected outcome by mapping it to IT or business metrics.

Step 2: Execute a test to measure the parameters around the availability and resilience of the system. The tests focus on creating a failure, for example by increasing CPU utilization or inducing a DNS outage.

Step 3: Determine the outcome of the test. If the metrics are impacted, the test is halted and the failure is analyzed. The chaos experiment is considered successful only if a failure occurs, because that is when it exposes a weakness. If the system is found to be resilient, the test is repeated with a larger blast radius, that is, with the fault applied to a larger share of the system.
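The three steps can be sketched as a small loop. This is only an illustration under stated assumptions, not the implementation of any particular tool: `run_experiment`, `ExperimentResult` and `SimulatedService` are hypothetical names, the "system" is a toy in-memory stand-in, and the metric is an error rate that must stay at or below a threshold for the hypothesis to hold.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    blast_radius: float    # fraction of the system that received the fault
    observed: float        # metric value measured while the fault was active
    hypothesis_held: bool  # True if the metric stayed within the threshold


def run_experiment(
    inject_fault: Callable[[float], None],
    remove_fault: Callable[[], None],
    read_metric: Callable[[], float],
    threshold: float,
    blast_radii: List[float],
) -> List[ExperimentResult]:
    """Run one experiment, widening the blast radius while the system copes."""
    results = []
    for radius in blast_radii:
        inject_fault(radius)          # Step 2: create the failure
        try:
            observed = read_metric()  # Step 2: measure while the fault is active
        finally:
            remove_fault()            # always roll the fault back
        held = observed <= threshold  # Step 1: the hypothesis being checked
        results.append(ExperimentResult(radius, observed, held))
        if not held:                  # Step 3: halt on impact and analyze
            break
    return results


class SimulatedService:
    """Toy stand-in (an assumption): errors appear once too many nodes fail."""

    def __init__(self, tolerated_fraction: float):
        self.tolerated_fraction = tolerated_fraction
        self.failed_fraction = 0.0

    def inject(self, fraction: float) -> None:
        self.failed_fraction = fraction

    def remove(self) -> None:
        self.failed_fraction = 0.0

    def error_rate(self) -> float:
        # No errors until the failed share exceeds what the service tolerates.
        return max(0.0, self.failed_fraction - self.tolerated_fraction)


if __name__ == "__main__":
    service = SimulatedService(tolerated_fraction=0.25)
    for result in run_experiment(
        service.inject, service.remove, service.error_rate,
        threshold=0.01, blast_radii=[0.05, 0.1, 0.25, 0.5, 1.0],
    ):
        print(result)
```

In a real experiment the three callables would wrap an actual fault injector and a monitoring query. The loop stops at the first blast radius where the metric breaches the threshold, so larger radii are never attempted once a weakness is found.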

Once the experiment is complete, the insights obtained show how the system behaves in the real world during random failures. This helps engineering teams fix issues or define rollback plans. Introducing Chaos Engineering in an organization brings both business and technical benefits. For the business, Chaos Engineering helps prevent significant revenue losses and improves incident management response, on-call training for engineering teams, and the resiliency of the systems. From a technical point of view, data obtained from chaos experiments leads to a better understanding of system failure modes, improved system design, and a reduction in repeated incidents and on-call burden.

Many tools are available that let companies practice Chaos Engineering. Chaos Monkey, Gremlin, and the Simian Army are a few that can be readily adopted in an organization. Organizations can also build their own Chaos Engineering tools using code from open source tools. The process may be time-consuming and expensive, but it gives complete control over the tool, offers options to customize it, and can be more secure.
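To give a sense of what a home-grown tool might start from, here is a minimal sketch of random instance termination in the spirit of Chaos Monkey. It is not Chaos Monkey's actual code: `pick_victims` and `unleash` are hypothetical names, the instance groups are plain in-memory lists, and `terminate` is a callback that a real tool would replace with a call to its cloud provider or orchestrator.

```python
import random
from typing import Callable, Dict, List, Optional, Set


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    opted_out: Optional[Set[str]] = None,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Choose at most one instance per group, skipping opted-out or empty groups."""
    opted_out = opted_out or set()
    rng = rng or random.Random()
    victims = {}
    for group, instances in groups.items():
        if group in opted_out or not instances:
            continue
        # Each eligible group is hit independently with the given probability.
        if rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


def unleash(
    groups: Dict[str, List[str]],
    terminate: Callable[[str], None],
    probability: float = 0.5,
    opted_out: Optional[Set[str]] = None,
    rng: Optional[random.Random] = None,
    dry_run: bool = True,
) -> Dict[str, str]:
    """Pick victims and terminate them, unless this is a dry run."""
    victims = pick_victims(groups, probability, opted_out, rng)
    for instance in victims.values():
        if not dry_run:
            terminate(instance)  # hypothetical hook into real infrastructure
    return victims


if __name__ == "__main__":
    fleet = {"web": ["web-1", "web-2", "web-3"], "db": ["db-1"]}
    # Dry run by default: report what would be terminated without acting.
    print(unleash(fleet, terminate=print, opted_out={"db"}))
```

Defaulting to a dry run and supporting an opt-out list are design choices made here to keep the blast radius under control; a production tool would also need scheduling, audit logging, and access control.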

Predicting system failures has become difficult because of complex application architectures. As the cost of downtime is high, organizations should take a proactive approach to preventing crashes by applying chaos engineering practices.