Testing Distributed Systems With Fuzzy Monkey Testing

Fuzzy Monkey testing distributed systems - Red Colobus from Wikipedia

One of the keys to good software is good testing. There are well-known testing suites for back-end code – things like JUnit and py.test. There are also good front-end testing tools – things like Selenium. But for testing distributed systems there aren’t so many well-known tools – because the problem is quite different, and harder. In this blog post we’ll cover the “Fuzzy Monkey” methodology used for testing three different successful distributed systems (including the Assimilation Suite) – its history and how and why it works.

The best-known of these three packages is Pacemaker – which has had a well-deserved reputation of being rock solid for many years. A good reason why it’s rock solid is that it’s well-tested. This property of being well-tested was a significant contributing factor to why Pacemaker eventually replaced a series of three different high-availability packages created by a certain well-known and well-funded vendor who tested their products manually. In this article, I describe this methodology, and our experience with it.

Why Testing Distributed Systems Is Hard

With normal testing tools, you set up a test case, then you have some assertions about expected behavior, and you run through the list of tests. An important thing about distributed systems that’s different is that you can run the exact same test case twice and get two different correct answers.

This is because events in distributed systems don’t happen synchronously. Events occur when they do, and they are observed stochastically at a later time. Events might occur in an order (A, B), but be observed as having happened in the order (B, A). Moreover, if there are several observers and more complex event sequences, they might be observed in a different order by each observer. Of course it’s worse than that – if you try and inject two different events into a distributed system in rapid sequence, they might not actually complete in the order you attempted to inject them in. As though that weren’t bad enough, the correct action to take often depends on the current state of the distributed system.
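As a small illustration of why order-sensitive assertions are brittle here, the following sketch compares what two hypothetical observers saw. The event names and the helper same_events_any_order are made up for this example; they are not part of CTS.

```python
from collections import Counter

def same_events_any_order(expected, observed):
    """True if both sequences contain the same events, ignoring order."""
    return Counter(expected) == Counter(observed)

injected = ["node1 down", "node2 up"]
seen_by_observer_1 = ["node1 down", "node2 up"]
seen_by_observer_2 = ["node2 up", "node1 down"]   # same events, reversed

# A strict list comparison rejects a perfectly valid observation...
strict_ok = (injected == seen_by_observer_2)

# ...while an order-tolerant comparison accepts both observers.
tolerant_ok = all(same_events_any_order(injected, seen)
                  for seen in (seen_by_observer_1, seen_by_observer_2))
```

Here strict_ok is False even though nothing went wrong, while tolerant_ok is True. A missing event still fails the tolerant check, so the comparison is relaxed only with respect to ordering.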

If you have written distributed system software, you already know that the hardest bugs to find and fix are timing bugs – and if you haven’t, perhaps the discussion above has given you a hint about what a problem timing issues are in distributed systems.

There’s an additional issue which can interfere as well – it’s actually hard to tell when a distributed system test is done. In a unit test framework, typically a test is complete when a function returns a value, or a query returns a result – something like that. But in distributed systems recovering from changes in configuration (like bringing in a new system, or having one crash), it can be difficult to tell when the system has finished making the transition to its new stable state. This is a consequence of all the things we mentioned before – its asynchronous nature, the fact that you don’t know how it will actually respond to the stimulus of a test, nor what order things will happen in.

Since timing problems are often the most difficult to find, that’s where you want to expend most of your effort. Unsurprisingly, trying to find timing problems is exactly the case where the results are most often uncertain. One time you hit before a certain timing window, one time you hit after it, and one time you hit it square on.

So, pretty clearly you want a test methodology that has these properties:

tests are tolerant of the order events are observed
tests are run repeatedly
tests change the configuration of the system under test as they run – so that repeated tests occur in different circumstances
tests have a way of telling when they’re done
tests have a way of telling if they failed or succeeded

Testing Distributed Systems with CTS

CTS is a Fuzzy Monkey style random test tool I wrote for the Linux-HA (Pacemaker) project. It’s based on syslog messages and works like this:

Each system in the test setup redirects its logs (over TCP) to the machine running the test control software.
Each iteration, CTS selects a test at random from its test bank and runs it
Each test takes an action and expects a set of regular expressions to be found in the consolidated system log. These regular expressions might be constant but are more commonly generated by the test case. The normal case is that these messages might be observed in any order, but a test can specify that they must occur in a certain order.
Most test cases randomly choose an appropriate system or systems to operate on. It might be a system which is up, or is down, or has some other specific criteria.
The software under test is written so that certain strings (typically “ERROR:”) only occur when the software runs into a situation that it is not prepared to deal with. Such events go by different names, like “This Can’t Happen” or “Oh Darn!” events (feel free to substitute your favorite expletive here).
The software under test is expected to be written in a defensive fashion and detect violations of conditions that should never occur and log “Oh Darn!” messages to syslog when they occur.
When all the regular expressions for a test have been found, the test is complete.
At the end of the test, the distributed system should be in a new stable state, so project-defined audits are performed to ensure that the system is in a self-consistent state.
Tests are typically “white box” tests. Because the intent is to generate timing issues, a clear understanding of how the software works and its expected weak points can help to generate more effective tests. As always, having a test creator who is unmerciful and a bit diabolical is a good thing.
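The log-matching part of the steps above can be sketched roughly as follows. This is a minimal illustration, not the actual CTS LogWatcher code: the class name LogWatch, its parameters, and the sample log lines are all assumptions made for the example. A real watcher would read from the consolidated syslog and give up after a timeout.

```python
import re

class LogWatch:
    """Tracks which expected patterns have been seen in a stream of log lines."""

    def __init__(self, expected, ordered=False, bad=(r"ERROR:",)):
        self.pending = [re.compile(p) for p in expected]
        self.ordered = ordered
        self.bad = [re.compile(p) for p in bad]
        self.bad_lines = []          # "Oh Darn!" messages seen so far

    def feed(self, line):
        """Consume one log line; return True once every expected pattern matched."""
        if any(b.search(line) for b in self.bad):
            self.bad_lines.append(line)
        # In ordered mode only the next pattern in sequence is eligible.
        candidates = self.pending[:1] if self.ordered else self.pending
        for pat in candidates:
            if pat.search(line):
                self.pending.remove(pat)
                break
        return not self.pending

node = "node2"  # a test would normally pick this at random
watch = LogWatch([rf"{node} .*stopping", rf"{node} .*started"])
log = [
    "node2 heartbeat: started",
    "node1 heartbeat: link ok",
    "node2 heartbeat: stopping",
]
done = [watch.feed(line) for line in log]
```

With the default any-order matching, the watcher reports completion on the third line even though the messages arrived in the "wrong" order. Constructed with ordered=True, the same log would never complete, which is how a test can insist on a specific sequence.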

A test is said to succeed if the following conditions are met:

All its expected “success” messages are received in the expected timeframe
No “Oh Darn!” events were observed
The system passes the post-test audits. For Linux-HA, the post-test audits were fixed and verified things like ensuring that each resource was running in exactly one place. In the case of the test suite I wrote for the Assimilation Suite, each test could also specify its own Neo4j database query which had to produce a certain output – typically the expected number of rows.
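Put together, the three conditions amount to a simple verdict function. The sketch below is only an assumption about how one might structure it; the names Outcome, verdict, and exactly_one_place are hypothetical, and the audit is a toy stand-in for the "each resource runs in exactly one place" check.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Outcome:
    all_expected_seen: bool                               # every success pattern matched in time
    bad_lines: List[str] = field(default_factory=list)    # "Oh Darn!" messages observed

def exactly_one_place(placement):
    """placement maps a resource name to the list of nodes currently running it."""
    return all(len(nodes) == 1 for nodes in placement.values())

def verdict(outcome, audits):
    """Return (passed, reasons); audits maps an audit name to a no-argument callable."""
    reasons = []
    if not outcome.all_expected_seen:
        reasons.append("timed out waiting for expected messages")
    if outcome.bad_lines:
        reasons.append(f"{len(outcome.bad_lines)} unexpected error message(s)")
    for name, audit in audits.items():
        if not audit():
            reasons.append(f"audit failed: {name}")
    return (not reasons, reasons)

# The logs looked fine, but the webserver ended up running on two nodes.
placement = {"ip_addr": ["node1"], "webserver": ["node1", "node2"]}
passed, reasons = verdict(Outcome(all_expected_seen=True),
                          {"one-place": lambda: exactly_one_place(placement)})
```

In this example the test fails on the audit alone, which is the point of running audits after the log messages look fine.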

A typical test suite might consist of a few dozen tests which are run repeatedly – for a production grade release, thousands of tests might be run, and they might take hours or even several days to complete. As the tests run, they tend to perturb the configuration of the test environment, so that the next time test A runs, it may be running in a different configuration than before. Because the systems to operate on are chosen at random, even if the environment is the same, the test is unlikely to be the same.
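The outer loop this describes can be sketched in a few lines. Everything here is a toy under stated assumptions: the cluster is just a dictionary, stop_node and start_node are hypothetical tests that only flip a state flag, and run_suite is not the CTS driver. The point is that each test picks its own target at random and leaves the environment changed for whichever test runs next.

```python
import random

def stop_node(cluster, rng):
    """Stop a randomly chosen node that is currently up (no-op if none is up)."""
    up = sorted(n for n, state in cluster.items() if state == "up")
    if up:
        cluster[rng.choice(up)] = "down"
    return True  # a real test would watch the logs and run audits here

def start_node(cluster, rng):
    """Start a randomly chosen node that is currently down (no-op if none is down)."""
    down = sorted(n for n, state in cluster.items() if state == "down")
    if down:
        cluster[rng.choice(down)] = "up"
    return True

def run_suite(tests, cluster, iterations, seed=None):
    """Run randomly chosen tests; return a (test name, passed) pair per iteration."""
    rng = random.Random(seed)   # recording the seed makes the choices repeatable
    results = []
    for _ in range(iterations):
        test = rng.choice(tests)
        results.append((test.__name__, test(cluster, rng)))
    return results

cluster = {"node1": "up", "node2": "up", "node3": "up"}
results = run_suite([stop_node, start_node], cluster, iterations=20, seed=42)
```

A fixed seed reproduces the sequence of tests and targets, but in a real distributed system it does not reproduce the timing, so a failing run may still not fail the same way twice.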

My History in Testing Distributed Systems

Around 1990, I was involved in testing the Definity Audix voice mail system – in particular its voice play/record subsystem. To test it effectively, I created a similar random testing methodology which I discussed briefly in another blog post. Because the intent was to break things unmercifully, we called that form of testing “Bamm-Bamm” testing. That method was so effective that once the voice system could run my tests successfully for five minutes, no more bugs were found in the field. Since I like to do what works, that success significantly affected my approach when creating CTS for the Linux-HA project.

In 1998, I started work on the Linux-HA project (now called Pacemaker). At the time, I was nearly the only developer on the project. In 2000, I realized that too many bugs were getting past me. In particular, I put out two releases in rapid succession both of which had bugs that I should have caught. I didn’t catch them because good cluster testing is hard to do by hand – and I wasn’t very good at it. I’d had it with putting out broken software. So, I sat down and thought about how to test it – and came up with CTS. Over time we observed that it was much better at exercising the system and finding bugs than seemed likely. One of my big concerns was that we would have bad things occur during the tests that we didn’t detect. That turned out not to be the case. Eventually, this software got adopted by the Corosync project, and more recently the methodology was adopted by the Assimilation project. These tests are largely centered on resilience testing – recovering from machines or services failing.

What We Learned Testing Distributed Systems

Even before I left the Linux-HA project in 2008, we had had many years of experience using CTS. Here are a few surprising things that we ran across:

Having really nice fast identical hardware delayed the detection of bugs. CTS seemed to find certain bugs faster when we had a mix of fast and slow machines in the test environment. Although this is bad practice to follow in a production environment, it turned out to be a great thing to do in the test environment.
Sometimes tests weren’t done when we thought they were done. In other words, after the last message went into syslog, the system continued to churn for a bit. This would occasionally result in the next test failing. It was usually hard to figure out what caused that test to fail. Sometimes there was already another message in syslog we could wait on, and sometimes we had to move one or add a new one to the code being tested.
Breaking tests was easy (not an uncommon experience). If we had made substantial changes to a test, or the assumptions the tests made, it was useful to perform a non-random test where each test was run once. That way you could validate that the tests themselves all still worked without waiting for one of them to fail a few hours into a test run.
Sometimes a certain failure could be seen when test B happened right after test F ran. When we observed that, we added a test that consisted of F-followed-by-B. Then we could do a test run with just that one test over and over until it failed.
You can’t use standard unreliable UDP for log messages. You really need to use rsyslog, or syslog-ng with TCP log forwarding.
Getting real hardware set up with all the right log forwarding was painful and tended to be easy to accidentally break. This was the basic assumption of CTS – that the systems were pre-configured by hand before the tests start. In the Assimilation Suite test software, I dynamically create appropriate Docker containers when the tests start. Given how fast Docker images start up, this works out very well. Although I’ve run into a number of Docker bugs, it has been well worthwhile.
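Two of the lessons above, running every test exactly once as a sanity pass and chaining F-followed-by-B into a single repeatable test, are easy to sketch. The helpers sequence, run_each_once, and repeat_until_failure are hypothetical names for this illustration, and toy_f and toy_b are stand-ins that mimic a test leaving the system still churning.

```python
def sequence(*tests):
    """Build a composite test that runs the given tests back to back."""
    def composite(state):
        # all() stops at the first failing step, like a test run aborting early.
        return all(t(state) for t in tests)
    composite.__name__ = "_then_".join(t.__name__ for t in tests)
    return composite

def run_each_once(tests, state):
    """Non-random sanity pass: run every test once, in a fixed order."""
    return {t.__name__: t(state) for t in tests}

def repeat_until_failure(test, state, max_runs):
    """Run one test repeatedly; return the run number that failed, or None."""
    for run in range(1, max_runs + 1):
        if not test(state):
            return run
    return None

def toy_f(state):
    state["churning"] = True      # pretend F leaves the system still settling
    return True

def toy_b(state):
    ok = not state.get("churning", False)
    state["churning"] = False
    return ok

sanity = run_each_once([toy_b, toy_f], {})
f_then_b = sequence(toy_f, toy_b)
failed_on = repeat_until_failure(f_then_b, {}, max_runs=100)
```

Each toy test passes on its own in the sanity pass, yet the composite fails on its first run. On a real cluster the failure would depend on timing, so it might take many repetitions to show up.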

Why call it Fuzzy Monkey Testing?

CTS is the name of the software I invented for Linux-HA (Pacemaker), and is also used by Corosync. For the Assimilation Project, I kept and enhanced the Python LogWatcher module from CTS, but for a variety of reasons (like using Docker) I rewrote the higher level software. I didn’t call it CTS, not only because I rewrote the top layer, but because the Assimilation Suite operates across an entire computing environment, not just a single cluster.

“Fuzzy” comes from the term Fuzz Testing (coined by Barton Miller in 1988), because we run random tests and try hard to fuzz the timing and configuration. “Monkey” comes from the name Chaos Monkey (coined by Netflix circa 2010), since we also try to break things and watch what happens. We do a bit of each: we fuzz many things about the tests, and mostly the tests break things and watch how the system recovers – like Chaos Monkey for your test environment. I hadn’t really thought of it as being fuzz testing until I described it to Emily Ratliff – who recognized it immediately. Although CTS was created within a few years of the founding of Netflix, it predates Chaos Monkey by many years. The combination of Fuzz and Monkey seemed to go together naturally to create the term “Fuzzy Monkey” – although fuzzy chaos has a certain appeal ;-).

Results from Testing Distributed Systems

In the case of Linux-HA/Pacemaker, once we developed a test to exercise a bug in a certain capability, that bug never showed up again in the field. This was typically accomplished with a smallish number of test scenarios – numbered in the dozens rather than in the hundreds or thousands. Given how vague the general outline of the methodology is, how simple it is, and how imprecise the success criteria are, it has always surprised me how well it worked. But experience suggests that it works very well indeed. Apparently these soft criteria (certain messages appearing in logs), combined with the negative criterion of “no bad messages” and the sanity audits, are quite effective in showing that the system worked – even in the presence of many possible reactions to a test.

Certainly it solved the problem I created it for – keeping me from feeling like an utter idiot ;-). It’s worth noting that I had problems in testing the Assimilation Suite until I introduced Fuzzy Monkey testing into that project. As it turns out, this saved me from a number of bugs getting out into the wild. If you run this in your test environment, then your system is far more likely to survive the chaos the real world (or the Chaos Monkey) brings your way.
