If you have ever watched TV, you’ve probably seen a detective show or two. The plot usually involves some kind of crime, and then a team of citizens, detectives, coroners, or whomever the show is about spends the next 40 minutes working out the who, what, when, where, and why of the crime committed. You see these detectives gathering evidence, interviewing witnesses, seeking the advice of experts or people who had been involved in a similar incident before, and then miraculously, at 52 minutes past the hour, they figure out what happened, who did it, and why that person did it.
I’m sure there are variations here and there, but from what I can tell, this is how 95% of these shows operate.
Now, in most cases, these crimes happen even though there are proactive measures in place to help prevent them. Police are out on the street actively patrolling, there are cameras almost everywhere (in big cities at least) that you would assume are being monitored, and there are call centers that take tips from regular people about potential crimes in progress or ‘suspicious’-looking people who are imminently going to commit crimes. Still, despite all of these measures, the doctor’s brother’s cousin’s college roommate’s florist still gets killed.
If you think about it, monitoring the modern-day IT environment is very similar. You have some baseline or standard monitoring that gets deployed to systems based on what operating system they are running (the usual CPU, memory, disk, and network type stuff, I’m sure); you may have some baseline database or web services monitoring applied for those systems. In your eyes, everything is covered because this baseline stuff is out there and you’re sending out alerts — alerts that are likely going straight to someone’s junk folder.
Despite this proactive approach, the crimes still happen. Customers call in saying that an application is down. Bob in accounting can’t access the one file that is needed to ensure that payroll goes out tomorrow. Jennifer in manufacturing is freaking out because shipping labels aren’t printing, which means that product isn’t being delivered to customers. Still, from a monitoring standpoint, nothing seems to be wrong. There are a few full disks and a few servers down here and there, but heck, that happens all the time.
Down the hall, a team of people are reluctantly coming together to figure out why labels aren’t being printed. They don’t want to be there. They don’t know why they’re being called in. They’re disoriented and annoyed because they had to drop what they were doing and go to the dreaded ‘War Room’ to try to fix some problem that they’re sure is someone else’s fault. Sooner or later, the problem is found and life is good, even though your company just wasted 20 man-hours of productivity.
20 hours is nothing, right? Now, let’s assume that the same situation plays out twice every week. That’s 40 hours — we’re now up to one full headcount wasted per week. This is a big deal.
Back to our CSI theme: even though monitoring was in place, we were still notified about a problem by a customer (the 911 call). Even though monitoring was in place, it was still an all-hands-on-deck scavenger hunt to try to figure out what broke and how to fix it. Most of the people involved didn’t need to be there, but because there was no real insight into the environment, everybody got the call.
You, as the monitoring person in your company, need to be a detective. It’s not enough to just monitor servers and some services. You need to monitor how and why these things all work together, and in order to do that, you need to be a detective. You need to figure out what that server does and why it is important. You need to figure out that there is a process running with a name that you don’t recognize (i.e., not part of the standard build). You need to set up a monitor for that process and figure out who to send the alerts to. You need to understand that if this particular webserver has more than 100 connections, it fails, so you need to figure out how to monitor the number of connections and then alert someone when it reaches 90 (better yet, figure out a way to overcome that limit).
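To make that last example concrete, here is a minimal sketch of the threshold logic in Python. The 100-connection failure point and the 90-connection warning level come from the scenario above; everything else (the function name, the idea of returning an alert level) is a hypothetical illustration, and the collector that produces the actual connection count is up to your monitoring tool of choice:

```python
# Hypothetical threshold check for the webserver scenario above.
# The hard limit (100) and warning level (90) come from the story;
# wiring in a real connection-count collector is left to your tooling.

def connection_alert(current_connections, hard_limit=100, warn_at=90):
    """Return an alert level based on how close we are to the known break point."""
    if current_connections >= hard_limit:
        return "critical"   # the server is already at its failure point
    if current_connections >= warn_at:
        return "warning"    # alert someone *before* it breaks
    return "ok"

# The alert fires at 90, giving someone time to act before the
# 100-connection failure point is reached.
print(connection_alert(85))   # ok
print(connection_alert(92))   # warning
print(connection_alert(105))  # critical
```

The point isn’t the code itself — it’s that you can only write a check like this after you’ve done the detective work and learned where the break point is.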
The modern-day monitoring expert needs to understand how all of the different IT building blocks (networks, servers, applications, storage, databases, etc.) come together to deliver services. More importantly, you need to understand how to identify the break points in all of this interconnectivity. Depending on the culture of your company, this may be as easy as reading the release-to-production documentation or just asking the application owner when the monitoring request is submitted. Or, more likely, you need to do this yourself. You need to be a detective: either have the access and knowledge to gather the data, or have the kind of personality that lets you interview someone to get the information you need to do your job, which in turn should help them do theirs.
Like it or not, you’re the IT CSI detective.
It’s a tall order. Luckily, you have vendors like NetIQ whose people are ready to help you out however they can. I’m one of them and I love this stuff.
Disclaimer: As with everything else at NetIQ Cool Solutions, this content is definitely not supported by NetIQ, so Customer Support will not be able to help you if it has any adverse effect on your environment. It just worked for at least one person, and perhaps it will be useful for you too. Be sure to test in a non-production environment.