SMART Alerting for security and operations


S.M.A.R.T. goals are a good idea when setting goals for yourself or for your company. In this article I discuss how to make your operational and security IT alerting more effective and less noisy by creating SMART alerts: Specific, Measurable, Actionable, Relevant and Timely. The idea is that alerts for both cybersecurity and operational issues should be SMART.

As we explore these concepts, you’ll see how they relate to each other. They are all about making sure you fix the things that need fixing and don’t waste resources when acting on alerts. The bane of anyone dealing with alerts is noise, and most noisy alerts violate one or more of the S.M.A.R.T. principles.

SMART Alerting – Specific

To the degree possible, an alert should point you at a specific cause, not just a symptom. In a complex service involving dozens or hundreds of servers, it’s valuable to have your alerting indicate where to look for the cause, not just that there is a problem of some kind. The more specific the problem indication, the shorter the mean time to repair (MTTR). Back in the MVS mainframe days, there were error messages which said things like “Error in above program. Please correct all errors and resubmit job”, which told you nothing about where to look. Likewise, if the only kind of test you make of a complex service is an end-to-end user-level test, it can be difficult to determine where the problem is. You want some high-level tests of critical services, but to speed problem resolution, you also want component-level tests of as many components as possible.

To understand how a series of alerts are related, it’s important to know the dependencies between services. For example, was this end-to-end failure caused by the failure of a particular database? Knowing how all the pieces fit together (that is, knowing the dependencies) is critical to making an alerting system specific.
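As a minimal sketch of how dependency information can make alerts more specific, the Python below takes a set of failing services and keeps only those whose own dependencies are all healthy. The service names and the `DEPENDS_ON` map are hypothetical, invented purely for illustration, and the check only looks at direct dependencies.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "web-checkout": ["order-api"],
    "order-api": ["orders-db", "payment-gateway"],
    "orders-db": [],
    "payment-gateway": [],
}


def probable_causes(failing: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the failing services that have no failing direct dependency.

    A failing service whose dependency is also failing is treated as a
    symptom, so it is left out of the result.
    """
    return {
        svc
        for svc in failing
        if not any(dep in failing for dep in depends_on.get(svc, []))
    }


failing_now = {"web-checkout", "order-api", "orders-db"}
print(sorted(probable_causes(failing_now, DEPENDS_ON)))  # ['orders-db']
```

Here three alerts collapse into one pointer at the database. A real system would also need component-level tests for every node in the map, since an unmonitored dependency cannot show up as a cause.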

SMART Alerting – Measurable

Since a computer is observing the events it alerts on, an alert is in some sense measurable by definition. A more useful definition is that you know how important the alert is.

That means you need to know what service the alert is associated with, what business process the service is associated with, the importance of the affected service to its business process, and the particular business process’ importance to the organization. For commercial enterprises, this often comes down to “how important is this business process to revenue”.
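One way to picture this chain of importance is as a simple score. The sketch below is only an illustration: the `AlertContext` fields, the 0-to-1 weights, and the choice to multiply them are all assumptions, not a standard formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertContext:
    """Hypothetical context attached to an alert."""
    service: str
    business_process: str
    service_weight: float  # importance of the service to its process, 0..1
    process_weight: float  # importance of the process to the organization, 0..1


def importance(ctx: AlertContext) -> float:
    """Combine both weights; multiplication is an assumed, simplistic choice."""
    return ctx.service_weight * ctx.process_weight


alerts = [
    AlertContext("wiki-search", "internal docs", 0.5, 0.2),
    AlertContext("orders-db", "online checkout", 1.0, 0.9),
]

# Most important alert first.
for ctx in sorted(alerts, key=importance, reverse=True):
    print(f"{ctx.service}: {importance(ctx):.2f}")
```

For a commercial enterprise, the process weight could be derived from how much revenue depends on that process.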

For security-related alerts the criteria are different. An alert that says that a never-before-seen process is now running is likely higher priority than one which says a system has not been patched against a newly discovered vulnerability.

SMART Alerting – Actionable

If an alert is Actionable, that means it is something which should be acted on. From the perspective of an alert, that means it should be acted on soon. If you get an alert and have no idea what you could do about it that would make a difference, then it may not be actionable. Indeed, many noisy alerts are things you can’t act on. Suppose an alert says an application is using lots of resources. Is this a problem? What would you do about it that would make a difference? The degree to which you know what to do when an alert comes in is the degree to which it is actionable. The least actionable alert is one where the action is “investigate possible causes”. An alert which says “someone enabled the telnet daemon” is one where you know exactly what to do.

Note that part of making an alert actionable is to get the alert to the right people – the ones who can do something about it. Security alerts should be sent to security teams, and service alerts should be sent to the service management teams. Typically the more specific you can be

SMART Alerting – Relevant

Alerting systems can be noisy: they can give indications of things which aren’t actually problems. You don’t want your alerting to be guilty of “crying wolf”. Performance measures often fall into this category. Alerting on things like CPU busy is usually meaningless, and measures like length of run queue are much better indications. Even there, my experience says that common guidelines like “load average should be less than 0.7” are too noisy for alerting. Long-term trend analysis is a different problem from alerting. You want to look at lots of things for trend analysis and capacity management, but capacity analysis should happen periodically. It should not cause alerts.

A friend of mine puts it this way: “There are only two speeds of computers – fast enough and not fast enough”. Quantifying what’s fast enough can be difficult. You want a measure which defines “not fast enough” for the service at hand.
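One possible way to pin down “not fast enough” is to compare a response-time percentile against a limit chosen for that particular service. The sketch below assumes latency is the right measure and that a 95th-percentile check suits the service; both the function names and the limit are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples, with pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def not_fast_enough(latencies_ms: Sequence[float], limit_ms: float,
                    pct: float = 95.0) -> bool:
    """True when the chosen percentile of response times exceeds the limit."""
    return percentile(latencies_ms, pct) > limit_ms


# Hypothetical response times (milliseconds) for one service.
recent = [120, 130, 110, 140, 125, 135, 115, 128, 132, 138]
print(not_fast_enough(recent, limit_ms=500))
```

The point is that the check is tied to what users of the service experience, not to a generic resource figure like CPU busy.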

Security alerts have a reputation for being noisy. One of the big challenges in security is to automate as much as possible and to separate the relevant from the noise. You don’t want anything important to be missed, but given the increasing pressure from adversaries, you can’t be chasing irrelevant things. Tools which indicate significant anomalies with minimal or zero noise are well worth having.

SMART Alerting – Timely

A timely alert is one that is given fast enough so that the actions you want to take to respond to it happen fast enough for the probable importance of the problem. Some problems are important for reasons related to the organization’s goals, and some are important because of past history or political considerations. Regardless of the underlying reasons, it’s important to give alerts at the right time.

Some kinds of alerts are relevant at all hours of the day because of the critical nature of the issue or service involved. By contrast, some alerts should only be given during normal office hours. If you get a middle-of-the-night wake-up call for something of minor importance, you’re not likely to be very happy.

Another factor that comes into play is the process used to investigate and repair the issue. If an event will be handled in a committee meeting, then a more leisurely approach to alerting is called for.
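The office-hours rule can be sketched in a few lines of Python. The 09:00 to 17:00 weekday window, the `critical` flag, and the two delivery labels are assumptions made for illustration; a real policy would come from your own on-call arrangements.

```python
from datetime import datetime, time

# Assumed office hours: Monday to Friday, 09:00 up to (not including) 17:00.
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 0)


def in_office_hours(now: datetime) -> bool:
    """True on weekdays between the assumed office start and end times."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def delivery(critical: bool, now: datetime) -> str:
    """Critical alerts go out immediately; others wait for office hours."""
    if critical or in_office_hours(now):
        return "notify-now"
    return "hold-until-office-hours"


three_am = datetime(2024, 1, 3, 3, 0)  # a Wednesday, 03:00
print(delivery(critical=False, now=three_am))  # hold-until-office-hours
print(delivery(critical=True, now=three_am))   # notify-now
```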

Of course, it’s easier to give the right alerts at the right time if the events detected are specific and measurable.

Summary

We’ve given a whirl to the idea that, like goals, alerts should be specific, measurable, actionable, relevant, and timely. What’s your take on this? How have you seen alerts fail the SMART tests?

If you want to look into a system which knows dependency information, and whose security alerts are very SMART, then naturally you should look into the Assimilation System Management Suite ;-).
