Let me start by saying that this is a true story. The names of the players and type of industry have been changed to protect the innocent, but other than that the story is pretty much as I found it when I first met the customer.
Sometimes the most mundane pieces of machinery are overlooked, yet they can have a huge effect on our business. In this case the customer was a manufacturer with three production lines in a single factory up in Canada.
For those of you who don’t know, Canada gets very hot in the summer and quite cold in the winter. Just store that bit of information in the back of your mind for now – it becomes relevant later.
Back to the story – two of the production lines each produced a widget every 27 seconds, and the third produced one every 50-odd seconds. Each widget was worth about $4,000 in profit on the first two lines and roughly $7,000 on the third (these numbers are estimates, as the actual figures were never given to me).
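To put those figures in per-minute terms, here is a rough back-of-the-envelope sketch. It assumes "50 odd seconds" means exactly 50 and that every widget coming off a line earns the estimated profit; the function name is mine, purely for illustration.

```python
SECONDS_PER_MINUTE = 60


def profit_per_minute(cycle_seconds: float, profit_per_widget: float) -> float:
    """Profit a line earns (or loses, when stopped) for each minute of running."""
    widgets_per_minute = SECONDS_PER_MINUTE / cycle_seconds
    return widgets_per_minute * profit_per_widget


# Lines one and two: a widget every 27 seconds at about $4,000 each.
fast_line = profit_per_minute(27, 4000)

# Line three: a widget every 50 seconds (assumed) at roughly $7,000 each.
slow_line = profit_per_minute(50, 7000)

print(f"27-second line: ${fast_line:,.0f} per minute")
print(f"50-second line: ${slow_line:,.0f} per minute")
```

Either way, a single stopped line works out to somewhere between $8,000 and $9,000 for every minute it sits idle.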
The factory in question was plagued with inexplicable outages: a production line would suddenly just stop because the mainframe running some mission-critical systems had failed. The line would start up again just as suddenly, with little explanation.
So anyway, in came the BSM consulting team to see if we could help them get this under control and manage it more proactively. The first thing they explained was their suspicion that the mainframe, which had to operate within a certain temperature range, was overheating. However, this was hard to understand, as there was an industrial HVAC system in the datacenter that was supposed to keep things cool.
With an approximate loss of $8,000 for every minute of outage, I figured that this was a good place to start. I grabbed the manufacturer and model of the HVAC system and after a quick search found that not only did it have an onboard SNMP server, but that the MIB was freely available for download. I compiled and installed the MIB, set up a once-a-minute poll to grab the ambient temperature, created a service out of it and then set an SLO that would be breached whenever the temperature came within 5 degrees Celsius of the upper or lower operating limit. Upon breach of the SLO, an automation kicked off that sent a pager message to the IT tech on duty, warning that a further drop or rise in temperature would cause a production failure. Off we went to lunch while the data gathered.
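For the curious, the logic of that check is simple enough to sketch in a few lines of Python. This is not the BSM tooling we actually used, just an illustration of the idea. The operating range of 10 to 35 degrees Celsius is made up, and `read_temperature` and `notify` are hypothetical stand-ins for the SNMP poll and the pager automation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class OperatingRange:
    """Temperature limits in degrees Celsius (illustrative values only)."""
    low_c: float
    high_c: float


def check_temperature(temp_c: float, limits: OperatingRange,
                      margin_c: float = 5.0) -> Optional[str]:
    """Return a warning if temp_c is within margin_c of either limit, else None."""
    if temp_c >= limits.high_c - margin_c:
        return (f"Ambient {temp_c:.1f} C is within {margin_c:.0f} C of the upper "
                f"limit ({limits.high_c:.1f} C); a further rise risks a production failure")
    if temp_c <= limits.low_c + margin_c:
        return (f"Ambient {temp_c:.1f} C is within {margin_c:.0f} C of the lower "
                f"limit ({limits.low_c:.1f} C); a further drop risks a production failure")
    return None


def poll(read_temperature: Callable[[], float], notify: Callable[[str], None],
         limits: OperatingRange, interval_s: float = 60.0,
         cycles: Optional[int] = None) -> None:
    """Poll once per interval and notify on every breach.

    read_temperature and notify are hypothetical hooks: one would wrap the
    SNMP query against the HVAC unit, the other would page the duty tech.
    cycles=None polls forever; a number stops after that many polls.
    """
    done = 0
    while cycles is None or done < cycles:
        warning = check_temperature(read_temperature(), limits)
        if warning is not None:
            notify(warning)
        done += 1
        if cycles is None or done < cycles:
            time.sleep(interval_s)


if __name__ == "__main__":
    # Pretend the HVAC unit reports 33 C against a made-up 10-35 C range.
    poll(lambda: 33.0, print, OperatingRange(low_c=10.0, high_c=35.0),
         interval_s=0, cycles=1)
```

In a real deployment `read_temperature` would wrap whatever SNMP library you use to query the ambient-temperature object from the vendor's MIB, and `notify` would hand the message to your paging system.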
Just as I was biting into my cafeteria sandwich, the duty tech’s pager went off! Ambient temperature was within 3 degrees Celsius of failure and rising.
So we ran to the datacenter and (remember that bit about Canada being hot in the summer?) found that the door had been propped open with a chair, so that some of the nice cool air from the HVAC system could be used in the office next door while the janitor was cleaning there!
So we had found our culprit (in defense of the janitor – he had no idea of the consequences of his actions) and in under 2 hours had solved the big mystery.
When we presented the cause of the problem to the plant’s CTO, he asked how we were going to fix it.
“Easy,” I said. “I’m taking an SNMP poll of your air conditioner, and have created an SLA against it for future use!”
To which he replied, somewhat perplexed, “You did WHAT to my HVAC system???”
The moral of the story is that when you are thinking about your environment, and how to protect your business, you need to take a holistic view. It’s not just hardware and software, but people and often mundane equipment too.
Business Service Management provides that holistic view, presented in a way that all your users, including your sanitation staff, can understand (in this case a lock change on the door and a simple cardboard sign saved thousands of dollars).
So don’t overlook the small stuff – it can save or cost you plenty in both dollars and angst!