Last weekend, I had the honor of giving the opening keynote on Friday at the 2015 Ohio LinuxFest and a session presentation on the Assimilation project the next day. Both talks were very well-received, but the reception the Assimilation talk got from its standing-room-only audience was extraordinary. That talk was entitled “How to Discover What You Don’t Know Before It Bites You Where It Hurts”. So it seems good to give a summary of the talk and why I think the audience resonated with it so strongly.
Problems addressed in the Ohio LinuxFest talk
At the start of the talk I gave an overview of the problems we’re working on, along with relevant statistics, and I think that really set the stage and helped the audience understand how we fit into the problems they see every day. This introductory slide made the following points:
- 30% of all break-ins come through “lost” systems (Verizon)
- 90% have had failures of unmonitored services (Turnbull)
- 80% are unable to keep systems in compliance (Verizon)
- 30% start monitoring only after a problem (Turnbull)
- 30% of all systems are doing nothing useful (Koomey)
- Many sites have trouble scaling monitoring (Turnbull)
- Larger site admins often don’t know dependencies
- Documentation is incomplete, out of date, expensive to maintain
As I walked through and elaborated on these items, often noting that the real world was likely worse than this, the audience nodded in agreement, or laughed when I put a point in a humorous context. It seems that the audience agreed that these are well-recognized problems that “bite you where it hurts”.
This is evidence that there is general agreement among system administrators that these are real problems. I went on from there to talk about what the Assimilation software does to help with these problems, how it all works, and then did a demo.
Details on the problems covered in the Ohio LinuxFest talk
Since the audience resonated so strongly with these problems, it seems good to cover them in more detail – so here it is ;-).
30% of all break-ins come through “lost” systems
There are lots of mechanisms for “losing” systems – trial systems that never go into production, failed IT projects, services that are incompletely taken out of service, and so on. It was quite apparent that the audience “gets” how this can happen, and that it does happen. In the demo, I showed how we locate lost systems and help you make this problem go away.
90% have had failures of unmonitored services
This means that these sites provided services which were not monitored, and which subsequently failed and caused problems. An interesting fact about Turnbull’s survey is that it concentrated on users of Puppet and Chef – leading-edge configuration management tools. The audience agreed that these people are almost certainly better off than the industry as a whole. I later demonstrated automatically monitoring services and instantly locating unmonitored services.
80% are unable to keep systems in compliance
This is from a Verizon survey about PCI compliance. In other words, these customers have eaten the elephant and gotten into compliance, but found that they didn’t stay there. I made the point that discovering non-compliance because an attacker got in through it, or because an auditor wrote a finding about it, is a pretty high-stress way to find out. In the demo, I showed how we automate finding non-compliance instantly – long before your attacker or auditor can find it. When you find out quickly, the people involved still know what they did and why they did it, you can engage in rational adult conversation, and you can create “teachable moments” where everyone has the opportunity to learn and improve the process.
30% start monitoring only after a problem
What this means is that 30% of those in Turnbull’s survey admitted that they only start monitoring a service after it fails. As before, the audience agreed that the “real world” is likely worse than this. No one likes this behavior, but given the realities of the systems engineering job, it’s far more common than anyone is comfortable with. The zero-configuration automatic monitoring in the later demo showed how we make this problem go away.
30% of all systems are doing nothing useful
That is, 30% of all servers (outside of places like Google and Amazon) are either providing no services, or providing services which no one is using – making them zombie-like. A lot of this comes from not knowing what your servers are doing. Koomey also states that most data center operators don’t know how many servers they have, much less what those servers are doing. There was a lot of head-nodding and laughter when I quoted Koomey. I later demonstrated the detail the Assimilation software provides about your data center – including client dependencies which help you discover these kinds of “space heater” systems.
Many sites have trouble scaling monitoring
When I asked James Turnbull about this statement, he said he didn’t have specific statistics on it, but that sites with 100 or more servers often begin to have trouble scaling their monitoring – network congestion and so on – and often need proxies or other complex mechanisms to scale. This is consistent with my own conversations with people. Although I talked extensively about simple scaling in the talk, and the audience clearly understood its value, this item was accidentally omitted from this particular slide.
Larger site admins often don’t know dependencies
Again, the audience resonated with this. I certainly got the impression that they didn’t feel they understood what their systems were doing, much less how those systems were related to each other through dependencies. Knowing dependencies helps you understand the impact of a change or a maintenance interval. I later gave slides showing how dependencies are discovered.
Documentation is incomplete, out of date, expensive to maintain
I have to say that this item brought the most laughter from the audience. The very idea that documentation could possibly be correct seemed to strike them as funny. I followed up with my usual summary of the three types of documentation:
- Documentation you do not have – this is the most common type
- Documentation which is out of date/incorrect today
- Documentation which will be out of date tomorrow
I mentioned the usual problem of documenting which cables go where in a switch closet, and how mistakes get made there because of incorrect (or missing) documentation. The audience identified this as a common problem. Since we keep that information correct and up to date, it replaces the usual manual documentation and helps eliminate that common error. In a later slide I covered how this shows up in the dependency graph as a wiredto relationship.
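To make the idea of a wiredto relationship a little more concrete, here is a minimal sketch of how you might query such a graph. This is not the Assimilation project’s actual API or schema: the node labels (NIC, SwitchPort, Switch), the property names, and the owns relationship are hypothetical; the only things taken from the text above are the wiredto relationship name and the assumption that the graph lives in a Neo4j database.

```python
# Hypothetical sketch: answering "which switch port is this NIC cabled to?"
# from a Neo4j dependency graph. Node labels, property names, and the "owns"
# relationship are illustrative assumptions, not Assimilation's real schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (nic:NIC)-[:wiredto]->(port:SwitchPort)<-[:owns]-(switch:Switch)
RETURN nic.macaddr AS mac, switch.name AS switchname, port.name AS portname
"""

with driver.session() as session:
    for record in session.run(QUERY):
        # Each row is one cable: a NIC and the switch port it is wired to.
        print(f"{record['mac']} -> {record['switchname']}:{record['portname']}")

driver.close()
```

The point of keeping this as graph relationships rather than a spreadsheet is exactly the one made above: the data is discovered and kept current automatically, so the answer to “what is plugged into port 14?” never depends on someone remembering to update a document.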
All in all, I was extraordinarily pleased with how the talk went. One attendee told me, “you owned the room”. Although I’m a good speaker and know how systems engineers think, I believe most of the credit goes to having a good message about a really cool product. When you do a good job of presenting a great message to the right audience, you wind up with a great response.