Resilience Testing Distributed Systems with Fuzzy Monkey Testing

Fuzzy Monkey testing distributed systems - Red Colobus from Wikipedia

One of the keys to good software is good testing. There are well-known testing suites for back-end code – things like JUnit and py.test. There are also good front-end testing tools – things like Selenium. But for testing distributed systems there are far fewer well-known tools, because the problem is quite different, and harder. In this blog post we’ll cover the “Fuzzy Monkey” methodology used for testing three different successful distributed systems (including the Assimilation Suite): its history, and how and why it works.

There Is Always Another Way


When you seem to be stuck with no way out, if you’re willing to admit you were wrong, perhaps you can find another way to solve your problem. Although the story I tell below is a software development, manufacturing and product management story, the moral applies in lots of places. Solve the problem you actually have, not the one you think you have. I learned something important from this story that I value to this day – there’s always another way.

Finding what’s hidden in plain sight

Back in the 90s I was involved, along with about 100 other people, in a project to develop a new voice mail system – software, hardware and firmware. The hardware was a completely new design, and the software was about 70% new. Along the way we stumbled into something that improved our end quality in a way that can reasonably be described as stunning. What we discovered was how to ask questions in a way that brought important things that “everyone knows” (and that are effectively hidden in plain sight) to the attention of those who can do something about them.

Welcome to Assimilation Systems Limited

I’m Alan Robertson, founder of Assimilation Systems Limited – this is my first blog post about the company. Let’s start this first post with a little history of the project and the company.

I founded the Assimilation Project back in 2010, as a result of thinking about a really big supercomputer (over 2 million cores) that I was working with, which had a very unusual networking architecture. It was a very cool and odd computer. Along the way, I puzzled over how one could effectively monitor it given this non-traditional networking topology, without using the built-in monitoring hardware (which would be like cheating). After a while, I realized I knew how to make monitoring on normal computers scale in a way that seemed really interesting. Being a techie at heart, I was really jazzed and decided I had to implement it…