Today, via a post on their Tech Blog, Netflix announced the long-awaited release of their failure-inducing "Chaos Monkey" tool:
Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don’t, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
In other words, Chaos Monkey is a tool used to simulate failures in "cloud" services so that the operators can be better prepared for unexpected outages. By inducing failures in the system, developers are able to implement fixes and contingencies on their own terms, rather than waiting for a serious problem to develop before being able to deploy countermeasures.
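To make the scheduling behavior described in the announcement concrete, here is a minimal Python sketch. It is not Netflix's code and does not call AWS. The holiday calendar, the group and instance names, and the choice of exactly one instance per group are all assumptions made for illustration. It only models the default window (non-holiday weekdays, 9am to 3pm) and a random pick from each group.

```python
import random
from datetime import date, datetime, time

# Hypothetical holiday calendar; the announcement does not say which days count.
HOLIDAYS = {date(2012, 12, 25)}

# Default window from the announcement: 9am to 3pm.
WINDOW_OPEN, WINDOW_CLOSE = time(9, 0), time(15, 0)


def in_chaos_window(now, holidays=HOLIDAYS):
    """Return True if `now` is a non-holiday weekday inside the window."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:
        return False
    return WINDOW_OPEN <= now.time() < WINDOW_CLOSE


def pick_victims(groups, rng):
    """Pick one instance at random from each non-empty group.

    `groups` maps a group name to a list of instance IDs. One instance
    per group is an assumption made for this sketch.
    """
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances
    }


def run_once(groups, now, rng=None):
    """Return the instances that would be terminated at time `now`."""
    if rng is None:
        rng = random.Random()
    if not in_chaos_window(now):
        return {}  # outside the window: do nothing
    return pick_victims(groups, rng)
```

Calling `run_once({"web": ["i-1", "i-2"]}, datetime(2012, 7, 30, 10, 0))` on that Monday morning returns one of the two hypothetical instance IDs under the key `"web"`. The same call at 8am or on a Saturday returns an empty dictionary. A real deployment would replace the returned dictionary with actual termination requests to the cloud provider.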