Having worked in IT for many years, I can safely say we all agree on one thing: end-users need stuff. They need new stuff (because new stuff is shiny), and they need old stuff to work (because that stuff should already work). In IT we tend to avoid talking to people, not because we are anti-social (although we might be), but because we get tired of being tracked down and told about all of the generic unhappiness that exists.
This is especially true when you're the CIO, because problems that escalate to the CIO are generally the really bad stuff, and often the stuff that has no easy answer. The worst CIO help desk call is not from the CEO (CEO calls are bad, but usually solvable fairly quickly). Nope, the worst call usually comes from the CFO, saying something truly scary like: "The finance department can't see revenue and the sales team can't book new orders."
This happened to me a few years ago. Time is money, and any disruption in that continuum comes with a price tag. The company I was working for at the time was not heavily dependent on the Internet for business in the same way that Amazon or Netflix are, but it doesn't matter. When your salespeople cannot book orders and your finance folks can't see revenue, well, you'd better find and fix that problem tout de suite!
Before I get into what kicked off this domino effect of disruption, let me first say how badly I wanted a tool at that time that would've allowed me to zoom out and examine the entirety of our company's IT footprint and zero in on the root cause.
Now, back to the problem…
A member of our IT crew made minor adjustments to our database servers. These were primarily SQL Servers that handled department-level transactions for finance. Those minor adjustments caused Oracle ERP to flood, which set off a slew of other network floods that spread out to impact our web servers, both front-end and internal. They also impacted our connectivity to salesforce.com.
That's when I heard from the CFO. We pulled together an emergency team of technical folks from across the company and began the big snipe hunt.
I'm an old-school IT guy, so we started from first principles: "What was the last thing we changed?"
Because the adjustments that were made on these SQL Servers were so tiny, we had no idea they had even occurred. Those changes translated into communication problems that impacted Oracle because Oracle was sourcing files off the SQL Servers and we didn't realize that they were directly connected. A connection that was previously assumed to be insignificant, to the point it was totally ignored, turned out to have massive consequences.
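To make that hidden dependency concrete, here is a minimal Python sketch of how a "who talks to whom" list can be walked to find everything downstream of a changed server. It is only an illustration: the host names and most of the flows are made up for the example (the one link taken from the story is Oracle reading from the finance SQL Servers), and `downstream_of` is a hypothetical helper, not part of any product.

```python
from collections import defaultdict, deque

# Hypothetical flow records: (client, server) means the client depends on the server.
# Only the oracle-erp -> sql-finance link comes from the story; the rest are assumed.
observed_flows = [
    ("oracle-erp", "sql-finance"),
    ("web-frontend", "oracle-erp"),
    ("web-internal", "oracle-erp"),
    ("salesforce-sync", "web-internal"),
]

def downstream_of(changed, flows):
    """Return every system that depends, directly or indirectly, on `changed`."""
    # Index the flows by server so we can look up who depends on each one.
    dependents = defaultdict(set)
    for client, server in flows:
        dependents[server].add(client)

    # Breadth-first walk outward from the changed system.
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for client in dependents[node]:
            if client not in seen:
                seen.add(client)
                queue.append(client)
    return seen

impacted = downstream_of("sql-finance", observed_flows)
print(sorted(impacted))
```

With a map like this, "What was the last thing we changed?" turns into "What depends on the thing we changed?", which is the question we could not answer that day.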
Pretty quickly, we were able to establish that the problem was on the web server side. But it still took us about three hours to figure out what had happened. Once we knew, we pulled those SQL Servers and reversed the changes that had set off the avalanche of problems. The problem cleared up in about 20 minutes.
This brings me back to what we do here at ExtraHop. IT infrastructure (hardware, software) is an interconnected mass of interdependent functions that all work in concert to deliver a service. For years we have heard the big monitoring companies claim that they can provide a "glass bottom boat" which allows customers to float above their entire IT environment and see everything that's going on.
Many IT monitoring technologies have tried, but nobody has really been able to deliver on that promise until now. Not only can ExtraHop see everything going on across the numerous vertical technology stacks, from server to storage to app to network to end-user, but we can also help you understand the interdependencies.
If I'd had an ExtraHop appliance at my previous company when this happened, I would have been able to look back in time and identify who did what and when. It would have taken minutes as opposed to hours to solve.
Troubleshooting methodologies are well understood by IT today. But issues still take a lot of time to resolve. With ExtraHop installed somewhere on my network, passively observing everything as it traverses the wire, I'm able to see spikes in various types of communication from a database server to a web server or a custom app. More importantly, it allows the business to get back to business that much faster.
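For readers who want a feel for what "seeing a spike" means, here is a rough Python sketch that flags any sample far above its recent baseline. It assumes you already have per-minute request counts between two hosts. The numbers, the `find_spikes` helper, the window size, and the threshold are all invented for illustration, and this is not a description of how ExtraHop works internally.

```python
from statistics import mean, stdev

def find_spikes(counts, window=5, threshold=3.0):
    """Return indexes whose value exceeds the mean of the preceding
    `window` samples by more than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Floor sigma at 1.0 so a perfectly flat baseline does not flag tiny wobbles.
        if counts[i] > mu + threshold * max(sigma, 1.0):
            spikes.append(i)
    return spikes

# Made-up per-minute request counts from a database server to a web server.
requests_per_minute = [100, 104, 98, 101, 99, 102, 100, 950, 97, 103]
print(find_spikes(requests_per_minute))  # -> [7]
```

A simple rolling baseline like this is crude, but it shows the idea: the flood stands out only if something is watching the traffic all the time, not after the CFO calls.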
Perhaps even more crucial is ExtraHop's ability to help IT identify and fix problems before they get reported. Wouldn't it be great to have a dashboard that told me "Oracle is running REALLY slow right now"?
It sure beats having those annoying end-users coming up to us saying, "Hey, the internet is slow," and having to work backwards from there to make the diagnosis.
This is a companion discussion topic for the original entry at https://www.extrahop.com/community/blog/2016/you-cannot-fix-what-you-cannot-see/