Resilience Testing Distributed Systems with Fuzzy Monkey Testing

Fuzzy Monkey testing distributed systems - Red Colobus from Wikipedia

One of the keys to good software is good testing. There are well-known testing suites for back end code – things like junit and py.test. There are also good front-end testing tools – things like Selenium. But for testing distributed systems there aren’t so many well-known tools – because the problem is quite different, and harder. In this blog post we’ll cover the “Fuzzy Monkey” methodology used for testing three different successful distributed systems (including the Assimilation Suite) – its history and how and why it works.

Finding what’s hidden in plain sight

Back in the 90s I was involved with about 100 other people in a project to develop a new voice mail system – software, hardware and firmware. The hardware was a completely new design, and the software was about 70% new. Along the way we stumbled into something that improved our end quality in a way that can reasonably be described as stunning. What we discovered was how to ask questions in a way that brought important things that “everyone knows” (and are effectively hidden in plain sight) to the attention of those who can do something about it.