Improving Software Quality
Back in the 90s I was involved with about 100 other people in a project to develop a new voice mail system – software, hardware and firmware. The hardware was a completely new design, and the software was about 70% new. Along the way we stumbled into something that improved our end quality in a way that can reasonably be described as stunning. This post tells the story of how we learned to ask questions in a way that brought important things that “everyone knows”, but that are hidden in plain sight, to the attention of those who can do something about them.
Setting The Stage – Quality Concerns
We were just a few months away from our first controlled introduction (CI), or beta customer, and things didn’t really feel right. Lots of the system was best described as squishy – that is, not really solid. The squishy components passed their unit tests, and the system tests would run continually for days or weeks without failures, but then something would fail. These squishy components included key subsystems like voice record/play – vital to the system. For a voice mail system, this is like having a bank that usually records your paycheck correctly. Not good enough.
About this time, our second level manager (2LM) declared that everyone would work 60 hours a week on site until our CI. As it turns out, my job had been to lead the device driver team (we had about 18 new UNIX device drivers, and I was the only experienced UNIX device driver writer), and all of these drivers were quite solid, for reasons worthy of another article. But returning to this article… As was my personal style, I had been winding down and celebrating no longer being able to break anything by spending time in what are sometimes called Beneficial Scholastic (BS) discussions, avoiding anything that looked like work for a few days.
So, when our 2LM declared that we were all to work 60 hours a week, I wondered what I could possibly do with 60 hours a week, and I asked her what we were supposed to do with those hours. Her reply? Do the same as you are doing – only more so. Those of you who know me know I have a certain fondness for BS discussions, but I was certain that not even I could spend 60 hours a week in BS, not to mention that it would be a bit on the unproductive side.
I knew there was a good bit of squishy code out there – you could feel it in the air – in the hallway and water-cooler talk. If I was going to spend 60 hours a week on-site I wanted it to be for something that mattered. I wanted to get the voice play/record code fixed – now that would matter. But it wasn’t my code – and deciding to start fixing or writing tests for someone else’s code is definitely not the kind of thing to endear you to others. I also knew that this wasn’t the only squishy code out there. So I decided to try another approach.
In parts of this company (like many) you can occasionally run into a shoot-the-messenger mentality hiding not very far down. I had an idea on how to redirect our effort in a much more profitable way – but I didn’t want to get shot as the bearer of bad news. So I asked our 2LM – “If I knew how we should be spending our time most effectively, would you want to know?”. To her credit, she replied “Of course!”.
Knowledge Hidden in Plain Sight
So I went off to carry out my idea. My idea was simple – I interviewed most of the people in the organization and set the emotional stage before asking them a few questions, wrote down the answers and put them into a short (5-page) report. As we were about to find out, everyone already knew what needed doing, but somehow that knowledge was dispersed in the organization in such a way that it couldn’t be acted on. It was, in effect, hidden in plain sight.
For the purposes of this posting, imagine March 23 was the cut date for our first customer. My interviews with the project engineers went something like this:
Imagine it’s the morning of March 23.
You come in to work.
Most of the offices are dark…
It’s very quiet…
Those people who are here, just came to print copies of their resumes…
After the somber effect of this settled in, I asked three questions:
Why did we fail?
What should we do to prevent this?
Can I use your name?
Interestingly enough, every single person said yes to the last question. By far the most common answers to the first question were of the form “My software caused us to fail”. The most common answers to the second question were of the form “We need to test it unreasonably – beat the excrement out of it”.
This was an interesting set of results, indicative, in my opinion, of a great organization. People understood the gravity of what they were trying to do, they took responsibility for their own bugs, they knew what to do to fix them, and they very much wanted to.
I wrote the answers I got in a memo, and added a graph to it, since my management (at all levels) loved graphs. Now, it isn’t a graph that any mathematician would love. It was more of an illustration, but this is OK: I was showing it to management, not to mathematicians. The original graph is lost to antiquity, but I’ve drawn an approximation to it below.
This graph represents the universal truth that “some things are better than other things” – and also the common truth that some things are good enough, but some things are not. There are basically two lines on the graph separated by a small margin. This would be the result of applying uniform effort across the project. Note that few things rise up to cross the “good enough” line. My proposal then, was to apply our efforts only in the areas that were below the “good enough” line, and fill in only the areas below the line – indicated by light blue in the illustration. Seems perfectly obvious. In fact, it is perfectly obvious. What wasn’t obvious was which things weren’t good enough.
Since I was still worried about becoming a victim of “kill the messenger”, I went back to my 2LM and asked her if she still wanted to know what we should be doing. She said “Of course”. So I handed her my memo, and went home to hide where I would be hard to find. By the next day, I was dying of curiosity and went to see how my memo had been taken. I looked and looked but could find no managers at all. I finally got up the nerve to ask my 2LM’s assistant where everyone was. She said they were all off-site discussing the “Robertson memo”. I was completely floored. To my shock, and to the everlasting credit of my 2LM, she reorganized the entire project around the recommendations I had collected in my memo.
If you take staff off the good enough things, and put them on the things that aren’t good enough, then lots of normal rules on who does what get broken down. People are, by definition, going to be working on areas that aren’t “theirs”. And because it’s being done over the whole project, it’s not a condemnation of anyone in particular, and is thereby socially acceptable.
More Testing – The Right Kind in the Right Places
Not all the suggested things were testing, but most were. The project had good unit tests. It had good system tests. Many of the recommendations for new tests turned out to be automated and merciless subsystem tests – which were subsequently nicknamed Bamm-Bamm tests. Since I had done the interviews and written the memo, I claimed first choice on which area to work in. I chose to work on the voice play/record testing. I wrote a set of unreasonable, merciless automated tests, which set up the hardware in cross-connected mode to allow for monitoring, grabbed the voice APIs at a low level and exercised them randomly. Play, skip ahead, skip back, speed up, slow down, record, play a touch tone, etc. – all randomly, and without any kind of rhyme, rhythm, or restraint. This setup could also use more voice channels than were visible on the real system, and it didn’t have to be connected to a PBX. That meant it could run without tying up so much expensive test hardware, and it could produce more load than could ever possibly exist in the real system. The first bug it found reproduced reliably in 5 minutes; the same bug had taken at least a week to reproduce in the system test environment.
Once the voice subsystem could run these tests for an hour without a failure, no one could ever find any more bugs in it. This from a subsystem where previously it had taken weeks to reproduce a problem – even once. Like the first bug, the subsequent bugs these tests found reproduced themselves in a few minutes. This is a huge difference. You could reproduce a problem a few times, create a fix, and try it out, all in a morning instead of a few months. For a test methodology inspired by this experience, see this article on testing distributed systems.
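For readers who want a feel for what a Bamm-Bamm test looks like, here is a minimal sketch in Python. It is not the original test code, which ran against real hardware and is long gone. Everything in it is an assumption made for illustration: the FakeVoiceChannel class, its operation names, its limits, and the run_stress function are all hypothetical stand-ins. The idea it shows is the one described above: pick a channel at random, pick an operation at random, check that the subsystem is still sane, and repeat without mercy. The fixed seed is my addition, so that a failing run can be replayed exactly.

```python
import random

# Hypothetical operation names, loosely following the list in the text.
OPERATIONS = ["play", "record", "skip_ahead", "skip_back",
              "speed_up", "slow_down", "touch_tone"]


class FakeVoiceChannel:
    """Toy stand-in for one voice channel (an assumption, not the real API)."""

    def __init__(self, length=1000):
        self.length = length      # message length in arbitrary units
        self.position = 0         # current play/record position
        self.speed = 1.0          # playback speed multiplier
        self.state = "idle"

    def play(self):
        self.state = "playing"

    def record(self):
        self.state = "recording"

    def skip_ahead(self):
        self.position = min(self.length, self.position + 50)

    def skip_back(self):
        self.position = max(0, self.position - 50)

    def speed_up(self):
        self.speed = min(2.0, self.speed * 1.25)

    def slow_down(self):
        self.speed = max(0.5, self.speed / 1.25)

    def touch_tone(self):
        self.state = "tone"

    def check_invariants(self):
        """Raise if the channel has wandered into an impossible state."""
        if not 0 <= self.position <= self.length:
            raise RuntimeError(f"position out of range: {self.position}")
        if not 0.5 <= self.speed <= 2.0:
            raise RuntimeError(f"speed out of range: {self.speed}")


class BuggyVoiceChannel(FakeVoiceChannel):
    """Same channel with a deliberately planted bug: skip_back forgets to clamp."""

    def skip_back(self):
        self.position -= 50


def run_stress(channel_class, channels=8, steps=5000, seed=0):
    """Hammer the channels with random operations.

    Returns None if every step passed, otherwise a tuple
    (step, channel_index, operation, error_message) for the first failure.
    """
    rng = random.Random(seed)  # seeded so a failure can be replayed exactly
    pool = [channel_class() for _ in range(channels)]
    for step in range(steps):
        index = rng.randrange(channels)
        operation = rng.choice(OPERATIONS)
        try:
            getattr(pool[index], operation)()
            pool[index].check_invariants()
        except Exception as exc:
            return (step, index, operation, str(exc))
    return None


if __name__ == "__main__":
    print("good channel: ", run_stress(FakeVoiceChannel, seed=1))
    print("buggy channel:", run_stress(BuggyVoiceChannel, seed=1))
```

Running it against the deliberately broken channel should report a failure almost immediately, and the same seed should report the same failure every time. That is the toy version of going from "weeks to reproduce" to "minutes to reproduce".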
This kind of result was certainly dramatic, but there were many others performing similar work on other weak subsystems. This all sounds wonderful, but what was the actual result? Our 2LM decided to delay our CI by a week, to let us finish our testing and fixing. Our CI site was already a heavy user of our voice mail systems – and we were going to replace their current system with the new one – which would not look any different to them. So, they already had a culture of significant use of voice mail, and they would hit the system hard on the first Monday after the cutover.
And Then A Miracle Occurs…
We installed the system and migrated their data over to it, and then everyone sat around and watched – waiting to respond to that first crash – which we assumed would come by 10:30 their local time. But that crash didn’t come – and it didn’t come – and it still didn’t come. We were all pretty shocked. We assumed that they must have had a company holiday, so we pulled the traffic logs and compared them to their previous weeks of traffic. They were doing what they had always done, and it was just working. It was a full 6 weeks before this heavy user of voice mail had their first complaint of any kind. And it was 6 months before our first crash in the field. Given how much new hardware and software there was, this was more than a little surprising. This reliability continued, and certainly appeared to be better than that of any other product put out by this company (one known for reliable products). More interesting confirmation of this came years later.
For a variety of internal political reasons, there were really only two major feature releases of this product – and after that it went mostly into maintenance mode – except for recording new languages and getting certified in new countries. However, a number of years later, some of the chips on the board weren’t going to be available any more. Given the political issues surrounding the project, I assumed that would be the end of it – after all, the corporate hierarchy had tried to kill it numerous times, and it was a niche product in a small part of our portfolio. As it turns out, the product was incredibly profitable (almost obscenely profitable) – and in spite of the efforts to minimize it and kill it, had become responsible for a very significant portion of the profit of our division – and politics or no, somehow no one was willing to leave all that money on the table. Of course, because of the political issues, all the development staff had fled to places that weren’t going to kill their careers.
The Results Over Time
To keep this revenue stream going, the company had to gather up some of the original staff to create an updated version of the product. We had a new VP who didn’t know about all the political issues of the past, and he got the new team together in a meeting to learn about this project. He asked “Who all is working on the product now?” One person raised his hand, then someone said, “Yeah, but he’s so good, he only works on it half time”. Of course, everyone laughed. But it was the truth – this 100-person software/hardware/firmware project, which was out there in the field making tons of money, had been maintained completely by one person half-time for the last 5 years. It simply worked – all the time – almost without fail. The only known bug was due to a hardware design issue relating to where some signal paths ran. About once a year one particular DSP would crash because of crosstalk in this set of signal paths. No one would let him redesign the board to fix it, so he put in a simple workaround to catch it quickly and restart the DSP. This was really the only outstanding issue of any substance in the product. Of course, the redesign also brought further cost reductions as a result of 10 years of hardware evolution, so the product became even more profitable than it had been before.
When I saw the stunned look on that VP’s face as he realized that, despite the laughter, this wasn’t an in-joke, I finally grasped the magnitude of what we had accomplished. It was written all over his face that he really didn’t believe it was possible for a popular and profitable product of this size to work so well that it almost never needed fixing – for more than five years. It isn’t exactly a common occurrence – “unheard of” probably isn’t too strong a term.
Although there are lots of interesting questions and reflections in the original version of this blog post, what I find fascinating is how this is proof again of the power of asking the right questions.
As we have worked to develop the Assimilation System Management Suite, we have always been determined to ask ourselves the right questions. The result is a magically easy-to-use and incredibly powerful system management solution that scales like nothing else.