We recently had the opportunity to participate in The Briefing Room and discuss some of the new features of ExtraHop 5.0 and our Stream Analytics platform, and how they impact cybersecurity, network performance, and business analytics.
ExtraHop Senior VP of Marketing Erik Giesa covered a ton of great material in his conversation with Bloor Group CEO Eric Kavanagh and analyst Mark Madsen, including:
- Why point solutions and baked-in security features are not strategically viable for protecting your enterprise in the long term. (7:40)
- Exactly how the technical architecture of ExtraHop's streaming analytics platform works, with detailed questions from Mark Madsen. (32:16)
- How processing around 9 petabytes (!) each week across the ExtraHop install base gives us deep industry knowledge that we can use to help customers avoid common performance pitfalls. (67:34)
Watch the full recording of the webinar for the full story on stream analytics, and follow along with the complete transcript below.
Full Transcript of The Briefing Room: Live Wire, The Next Dimension for Streaming Analytics
Eric Kavanagh: Ladies and gentlemen, it is four o'clock Eastern Time on a Tuesday, which means it must be time, once again, for the Briefing Room! Yes, indeed.
My name is Eric Kavanagh. I will be your humble yet excitable host of the show that is designed to get right down to brass tacks and figure out what is all the latest cool stuff in the world of enterprise technology.
The topic for today is, I think one of the most exciting we've covered all year. The exact title is "Live Wire, the Next Dimension for Streaming Analytics."
Yes, indeed, folks. There is a slide about yours truly found on Twitter @eric_kavanagh. I always try to respond and I always try to follow back.
The mission is, you know, in the briefing to reveal the essential characteristics of enterprise software. We do this by hosting live analyst briefings.
Today, you're going to hear one of my best friends in the business, Mark Madsen, interview one of my favorite vendors. I'm so excited, I can't even tell you. The topic is Innovators. January, next month, is Cloud, then analytics.
I love this. Why? This cracks me up. For those of you who have been around for a while, there was a movie, and I'm afraid to admit it was 1979: "When a Stranger Calls." It was a horror flick.
It was this lady watching the kids and this prank caller kept calling her, and saying terrible things. Finally, she got through to the cops, and they had traced the call, the line was: "We've traced the call, it's coming from inside the house."
I can tell you that everyone my age who saw that was scared witless when we heard that. I thought to myself it's an interesting analogy to what we're talking about today, which is this whole concept of wire intelligence.
ExtraHop is the vendor. We'll hear from them shortly.
Wire intelligence is a fantastic concept because it deals with everything that's going across the wire. Everything going around on your network is what wire intelligence uses, and if you sit back and think about it, what doesn't go across your network at some point in time?
In a large organization, even a midsized company. Everything that's happening in your company is going over the wire at some point in time. What if you could get a crystal clear view for everything that's going around on your network? What could you do?
The first time I took a briefing from ExtraHop was several years ago. When I realized their methodology, I was floored. I thought, "This is brilliant, this is the best way yet, for understanding what's going on in your organization."
There are lots and lots of use cases you could build around this, to understand who's doing what, when, where, how, why, all this fun stuff. For a large organization especially, hugely valuable, because what are your other mechanisms of action? How else can you find out what's going on? There are lots of things you can do, you can look at this server or that server, you can look at the applications.
You've got all kinds of security baked into various components of your application architecture. You can look at any one of those little point solutions to get some ideas of what's happening.
If you can actually tap the wire and see everything that's going across, and then rebuild a vision of that network and all the applications and all the data that's flying around, that is one heck of a strategic view of enterprise data. Think about the security breaches we've come across in the last few years.
Goodness gracious the amount of financial and other damage that's done is just borderline incomprehensible. If organizations are really concerned about maintaining the integrity of their systems, the privacy of their customer data especially sensitive data, this is the thing they should be looking at.
I look beyond that too, being an analytics guy, and I see tremendous potential for being able to leverage wire intelligence to do all kinds of things, to figure out who is doing their job, who is not doing their job, to troubleshoot.
Think about how difficult troubleshooting can be when servers go down, when systems go down. The price to a large organization can be massive. Getting to the bottom of that, the whole root cause analysis process, can be really quite problematic.
I've heard some really interesting stories from ExtraHop. McKesson is the first one that I heard, a couple of years ago, where they were able to solve what was a huge problem in almost no time, because they've got that strategic view of everything going across the wire.
Mark Madsen is going to be our analyst today. He's with Third Nature, of course, and he speaks all over the world. He's always traveling somewhere. He is well known for his understanding not just of the whole business intelligence world, but also analytics of course, and architecture.
He's really an expert in terms of information architecture. That's why I wanted to see if we can get him for this event in particular, because it's really right in his forte.
ExtraHop you see offers a real time wire data and IT operations analytics (ITOA) platform. They've added some new stuff recently, which really takes it to the next dimension hence the title today.
Their wire data analysis could be used for performance troubleshooting, security detection, optimization, and business analytics. Like I said the security is obviously huge. I love the business analytics side of this equation.
We are going to learn more about that today from Erik Giesa. He's a Senior Vice President of Marketing and Business Development at ExtraHop. He's been around for a number of years. He used to work at F5 Networks where he defined product marketing and solution strategy for all F5 products.
We are very pleased to have him on the line today. With that I'm going to hand the keys over to Erik and let him take it away. Erik, just click on that slide, and use the down arrow on your keyboard, and it's all you.
Erik Giesa: Great. Thank you very much Eric. Thanks for that great intro. Excited to be here, because what I love about this format is we get an opportunity to dig into the details.
Not just talk high level, but actually talking about the architecture and the application of the technology to solve very interesting, pervasive, real world types of problems. We'll have a use case example, taking you through this so that you really get a sense of the novel technology and approach here.
The fundamental principle, though, that drives us here at ExtraHop and I'm a technologist at heart, is that with so much dynamism in today's modern IT, and the complexity between hybrid cloud and virtualization, SDN initiatives, to even emerging opportunities seen around IoT.
The whole objective is, can we run our IT environment better in a way that is much more data driven, and not like what Eric had characterized earlier, at these point solution or point tool approaches, this notion of data driven ops?
One of the fundamental principles that we recognize is that in IT, we don't see with our eyes. We see with data, but data alone isn't sufficient. It's, "What can you do with that data, how can you present it, and how comprehensive is it?" to drive the ability to rapidly diagnose issues like Eric was indicating.
Or use the same data set to inform how we could optimize and tune those databases, for example, or come up with offload strategies, whether it's offload to a CDN or a memcache farm, or whether it's to optimize the way we're running our directory services.
This requires a degree of intelligence, insight, and comprehensiveness that you typically won't find in point solutions.
Then you take it even further, and you say, "The problem around security...," those breaches that Eric was talking about, the vast majority of them occurred because somebody got through the most common way: phishing attacks.
Once they've penetrated, the attackers have gotten very stealthy, and the vulnerabilities are in east-west traffic. That's where you start to see the patterns. Most security solutions out there are consuming information via logs, which are a good source of information, but they're insufficient.
A log is a system's self-reported information. It's predetermined by a developer who's determining what events to log in the first place. They're not going to see everything.
The second thing is they're oftentimes disabled or manipulated in a way that they stop reporting. As a sole source of security insight, they're insufficient. In order to get a better perspective, there's an alternative that we're proposing that, like Eric indicated, sees everything and shines a light on that.
The most interesting aspect is how we see customers using our technology for real time business analytics. Like Eric said, if you think about it, the big thought is, "All business, all technology eventually transacts on the network."
Network data alone, though, is insufficient. It's all data that traverses, if you can make sense of it, tremendous power's at your fingertips to be able to see better with the data, and so we'll go through this, that principle.
When I say more than network data, applications and their payloads transact on the network. With the evolving landscape of IT and hybrid cloud, when you think about things like Web services or micro services, those are discrete elements.
Some of those things you control, some of them you're subscribing to via third parties, and we know that one Web service request, for example, could spawn 50 other dependencies requests associated with that.
These are things that, for example, you might not typically be able to log, or they might not be within your control.
If you have the means to observe them, autodiscover and classify them, you can put them into this framework of understanding how things interrelate: their performance, and all the other contextual information that Eric had mentioned up front, the who, the what, the how often, the how much, the when, all that contextual richness.
It doesn't stop there. It's also the client information, and when you think about IoT even. Maybe some of you aren't considering this yet, but think about this.
When you're talking about HVAC systems, physical security systems, and other smart sensors that are deployed or could be deployed throughout an environment, and whether they're in your data center, too, that are monitoring the heating and cooling, the lights, the power, all that kind of good stuff.
They could be talking different protocols to different database systems, but inevitably, they will be transacting on the wire, and if you have a means to mine that, you have a simpler way to potentially get at the data and visualize it in real time.
Then of course, business data as well, transacts on the wire via the application and payloads. It's a rich, rich, rich environment that previously had been untapped, because most vendors, when they look at the network, they architected themselves with the idea of doing continuous packet capture.
That's pretty much worthless in today's modern IT environments, except for specific use cases. For compliance purposes, maybe, you need to capture all those packets.
You certainly can't analyze everything and put it into context by doing continuous packet capture or searches on those packet captures and using protocol analyzers. What's required is a new approach, and it's referred to as stream analytics.
There are technologies out there that look at individual streams, and might decode a specific protocol like HTTP or Oracle TNS only, or Microsoft SQL, for example.
What we're talking about is the ability to get all unstructured packets, this network data, that comes. It could be out of order. It could be fragmented. It's multiple streams, multiple transactions in a flow. They could be asynchronous as well as synchronous, client side, server side. We just need a copy of those packets, and our real time stream processor is what transforms that in real time. We do the reassembly. We do the decoding of all the protocols out of the box.
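To make the reassembly step concrete, here is a minimal Python sketch. It is not ExtraHop's implementation. The `Segment` type, the flow key, and the `reassemble` function are hypothetical names chosen for illustration, and the sketch assumes segments carry a byte offset (as TCP sequence numbers do) and never partially overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    flow: tuple      # hypothetical flow key, e.g. (client, server, port)
    seq: int         # byte offset of this payload within the stream
    payload: bytes


def reassemble(segments):
    """Group interleaved segments by flow and return each flow's bytes in order."""
    by_flow = {}
    for seg in segments:
        # Keying on the offset means a retransmitted segment overwrites its copy.
        by_flow.setdefault(seg.flow, {})[seg.seq] = seg.payload

    streams = {}
    for flow, parts in by_flow.items():
        data = bytearray()
        expected = min(parts)
        for seq in sorted(parts):
            if seq != expected:
                break  # gap in the stream: stop at the first missing byte range
            data += parts[seq]
            expected = seq + len(parts[seq])
        streams[flow] = bytes(data)
    return streams
```

Usage: feeding it segments from two flows that arrive interleaved and out of order yields one contiguous byte string per flow, which is the input a protocol decoder would need.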
We decode 40-plus different protocols, and we have the capability to do universal payload analysis as well, that gives you the flexibility to, in essence, analyze proprietary or custom protocols as well.
We transform that, and I've got an example of what I mean by this, into structured wire data. We give structure to all this unstructured data in motion that's traversing the network, and we can then put it into the metadata, the results set. We're not storing the packets.
We're transforming them into individual user sessions and flows. We're doing our measurements, our data extraction. We're programmable so you can extract. We also have a lot of native... over 3,000 metrics we collect out of the box. We're highly efficient and highly scalable.
To give you an example of this, in a single appliance of ours, we broke the performance barrier. We can process, analyze, up to 40 gigabits per second. That would translate into over 432 terabytes of packets a day being analyzed.
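The daily figure follows directly from the line rate. A quick check in Python, assuming decimal units (1 gigabit = 10^9 bits, 1 terabyte = 10^12 bytes) and a link that stays at the full 40 gigabits per second all day; the function name is just for illustration:

```python
def terabytes_per_day(gigabits_per_second: int) -> float:
    """Convert a sustained line rate into terabytes of traffic per day."""
    bytes_per_second = gigabits_per_second * 10**9 // 8   # 8 bits per byte
    seconds_per_day = 24 * 60 * 60                        # 86,400 seconds
    return bytes_per_second * seconds_per_day / 10**12


print(terabytes_per_day(40))  # 432.0
```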
A human, obviously, can't do that, but a machine can, with the right underlying architecture. Just capturing the metadata of what's transpiring, and the auto classification, auto categorization, putting these things into the right buckets, would be meaningless unless you could transform that data into meaningful views.
We put these into three primary use category buckets. There's DevOps and IT operations, because we're going to show you everything that's transpiring, from that first DNS request to the last byte served out of storage: what files, what application servers, what Web servers, what directory services were involved, all the way through the load balancer, the firewall, the switches, all that kind of good stuff.
It's because we're observing, decoding, and classifying everything that we see and capture, and we do this at an alarming rate.
That's one of our unique attributes, compared to other products on the market, is this ability to do this at scale, but also the ability to do it for nearly any protocol, because that is the common denominator that all technology and businesses share.
It doesn't matter what hypervisors you happen to be running, what networking technology you're using, what cloud you're running in, what type of Web services or their constructs.
They all will share communication over the network. They're using protocols to do that, and so if you have the ability to mine that, decode those and mine those in real time, you can do powerful things.
The same is true for security. "See all, report all," is really the mantra, and looking at it from a transaction record log viewpoint, so reporting the transaction details. Again, I've got some examples of this to show you. Then we get into the business operations, and what's so cool about this is most of our customers engage with us because usually they have a performance issue. It starts with something like, "I'm using thin clients or VDI. It's Citrix. It's VMware Horizon View. It's always getting blamed. What's going on?"
Being able to decode those protocols and see what's going in the virtual channels is one of the things we do.
But the real value add on top of that is relating that to all the back end elements that a VDI administrator might not be responsible for: the directory services, the storage, the databases, the actual back end applications that are being served up via those thin clients.
It's having all the richness of that viewpoint that allows you to dramatically reduce the cost of managing the environment, of maintaining the environment, and those war room type sessions, those inefficiencies that Eric talked about at the beginning.
In this single platform, it's tremendously powerful. When we see people progress, they rapidly progress beyond the proactive monitoring approach in DevOps, in bringing teams together to looking at seeing how they can leverage a platform to provide real time business analytics as well as security, so very powerful.
What we mean, why this is so powerful, and Eric brought this up at the beginning. We've heard people say, "Packets don't lie," and that certainly is true, but to get to the truth via just packets is next to impossible, simply because of the volume and the variety. How you stitch all these things together.
You need a machine to do that. That's what wire data is, is the transformation of those unstructured network packets into structured data so that you can extract meaningful elements, whether those packets...
There might be 77 different packets involved in an individual user transaction or a systems transaction. Pulling that together and extracting and measuring the meaningful elements is what makes this very, very powerful.
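As a rough illustration of what "extracting and measuring the meaningful elements" could look like for one protocol, here is a small Python sketch that turns already-reassembled HTTP request and response bytes into one structured record. The function name and the record's field names are assumptions made for this example, not the platform's actual schema, and the parsing only handles well-formed HTTP/1.x start lines.

```python
def http_record(client: str, server: str, request: bytes,
                response: bytes, elapsed_ms: float) -> dict:
    """Build one structured transaction record from reassembled HTTP bytes."""
    # The request line looks like: GET /path HTTP/1.1
    request_line = request.partition(b"\r\n")[0].decode("ascii")
    method, uri, _version = request_line.split(" ", 2)

    # The status line looks like: HTTP/1.1 200 OK
    status_line = response.partition(b"\r\n")[0].decode("ascii")
    status = int(status_line.split(" ", 2)[1])

    return {
        "client": client,
        "server": server,
        "method": method,
        "uri": uri,
        "status": status,
        "response_bytes": len(response),   # measured, not self-reported
        "process_time_ms": elapsed_ms,     # supplied by the caller's timing
    }
```

However many packets carried the exchange, the result is a single row that can be searched, grouped, and charted.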
To put this into context, because a lot of our earlier customers used to get confused by the messaging from other vendors who might have done log data, and said, "We get insights out of logs, and we use our logs for application life cycle management and DevOps and security."
That's certainly a good source of information, but it has limitations. To rely solely on that, we believe, is a missed opportunity on the part of customers, because typical Web logs, depending upon where they're coming from, have some limitations.
Number one, they're self reported. The logs are only going to be as good as the developer who's determined what events to anticipate to log. They're not going to see everything. It's simply impossible.
The second thing is that it varies greatly depending upon the vendor themselves. Wire data has a different characteristic. It's everything observed.
Again, we're reassembling that from the data communications, all those packets into a structured format. When you compare something like a typical weblog to a Web transaction record via wire data, it's much richer.
Here's the other thing. It doesn't require any modification to applications or infrastructure to collect this. That information is traversing the wire, it's just having the means to extract it.
The way we do that, and this is where we get into some of the fun stuff and the novel approach, when we talk about a stream processor, we really mean it's a stream processor, and you can Google this and look up architectural considerations for stream processors to build one, from a computer science standpoint.
Our underlying architecture is the ability to reassemble those packets into per-user sessions or per-client and server flows and sessions and transactions, and then decode them and do the content analysis that allows us to then stream our built in metrics to our streaming data store.
What's new, and Eric alluded to this at the beginning, is we've added the ability, now to also globally or selectively record the actual transaction record details.
To stream that to a big data store, in essence, that's turnkey. It's an appliance that's integrated in our platform that allows you to search, explore, mash up, visualize and correlate these different data elements and group by things like by client, by server, by application, by URI, by whatever element...
Again, I'll have an example of this. In essence, you can think of its most simplistic fashion, think of how it's like Google Search for everything that's transacted on your network but not the packets, the actual resulting structured wire data set. We do this at scale. It's a very novel architecture, and just to do a little aside, what enabled us to do this, it's partly our pedigree.
Our two founders were guys that I worked with when I was at F5. They're the two chief architects that developed the world's first full, highly scalable application proxy that was version 9 of BIG-IP TMOS.
I was there with Jesse and Raja at that time. They took somewhat similar concepts and said, "We need something that can sit out of band, be non-invasive, and make sense of all that data in motion."
That was the genesis of the underlying architecture, and make no mistake, there is no other vendor that is doing this in real time and at scale for all these types of protocols.
It puts tremendous power at the fingertips of organizations. Now, to make this real, this is a typical multi tiered transaction. On the left, there, you see the behavior of the action.
It's a typical shopping cart example that's going to involve payments, order confirmation and also database queries. What you see as the output is...Actually you can see this for yourselves if you go to our online demo. I took this right from our online demo as an example. You see the top here is the rich information that could be had for real user monitoring, our real user monitoring capabilities. This isn't even the exhaustive set, but you can see that those transactions could have covered dozens of packets.
But we've reassembled for this per user, this one client, the ability to look at what URI did they hit, how much bandwidth was consumed, the end user experience, that perceived load time on that client, the browser of the client and the order confirmation. Was it considered true?
Then you look down, you see the payment processing. This key transits the network in the form of application payloads. In this case it's the orbital payment processing protocol, which is trivial for us to parse and extract the information in real time.
Then you see the database commits, and we can see the actual procedure and method used. Now think about how you can slice and dice this data. It can be used for capacity planning, performance and triage in troubleshooting. It can also be extracted and used for real time business analytics. It can even be used for fraud protection, as part of a fraud and heuristics framework. We can stream any of our data set to a third party data store at the same time we're doing our analysis.
That's one of the genesis of this whole notion of IT operations analytics. Let's look at this results set. That was the raw, what you saw in the structured format, how we pre process and produce that.
Now let's look at it within the context of a real world use case. Here you see the actual UI. Dashboards are very flexible to build based upon the underlying data sets. Pick and choose what you want to visualize. Here you see we're looking at real payment transactions in real time. We can be looking at and correlating, "What's the performance," and by performance we're measuring time from that first, initial request to the last byte served to fulfill that transaction.
We're tracking the performance so we can map it against the revenue being produced, by state, by card type, et cetera. This is just an example. You could have sliced and diced and displayed any of these elements. But now, in this real world use case, we had a large payment processor as a customer of ours say, "We've been getting these complaints from some of our merchants saying that we are causing duplicate orders."
That's like trying to find a needle in a haystack. How could you? Is it pervasive? Is it happening to a lot of merchants? How much revenue is involved? How would you even begin to figure this type of thing out? With our platform and wire data analytics, it becomes almost trivial. You pivot from this view to what is, in essence, then, our ability to search. What you see here is you select the record type.
I want to search and pivot on all orbital based transactions, and now we've got this rich data set that's all been structured. Remember, we pulled this from all those raw data packets out there, discerning between orbital and HTTP and DNS and NFS and Microsoft SQL, all that good stuff, and being able to put in there, in a well structured format, the metadata, the important information about that transaction.
And say, "OK, let's look at what's going on here: how many transactions, and what are they? I know I need to pivot off of, I want to look for, order IDs that appear more than once." It becomes a simple visual query search with an X drop.
I can look and I can see here, "Oh, I have one order ID that has been processed, a unique order ID, more than once. OK, what happened? Let's pivot off of that."
I can see it happened at the same time, the same order ID, the same amount and the same merchant. It was a duplicate order, but it's not on our end, because it's happened only to this one merchant.
In three steps I went from, "How do I boil the ocean with millions of transactions and answer a very difficult question pivoting on the same common data set," to look at not only our revenue and how that's going, but am I getting duplicate orders of any one type.
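The three pivots just described (select the record type, find order IDs seen more than once, see which merchants are involved) amount to a group-and-count over the structured records. A minimal Python sketch, assuming each transaction record is a dictionary with hypothetical `order_id`, `merchant`, and `amount` fields; the real record schema and query interface are not shown in this talk:

```python
from collections import defaultdict


def find_duplicate_orders(records):
    """Return {order_id: [records]} for every order ID seen more than once."""
    by_order = defaultdict(list)
    for rec in records:
        by_order[rec["order_id"]].append(rec)
    return {oid: recs for oid, recs in by_order.items() if len(recs) > 1}


def merchants_affected(duplicates):
    """Return the set of merchants that appear in any duplicated order."""
    return {rec["merchant"] for recs in duplicates.values() for rec in recs}
```

If `merchants_affected` comes back with a single merchant, as in the story above, that points to a problem on the merchant's side rather than a pervasive one in the payment processor.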
Makes it very, very powerful. Keep in mind, too, this is done with zero modifications to the apps or infrastructure. It's simply mining the richest, untapped vein of data, which is your network in real time.
This would be worthless if you could only deploy it in data centers that you owned, or it was so expensive and so complicated that you couldn't put it out in your branches. We've got not only our physical appliances, our virtual appliances, we have our cloud appliances, as well for Amazon and Azure.
We support Cavium, Avaya, Hyper-V, ESX, and we also have our physical appliances. They scale to all varying degrees, from 1 gigabit per second of real time analysis all the way up to 40 gigs in a single appliance. It's meant to be plug and play. All we need is a copy of the network traffic. A network tap, a span agg capability, and we can start making sense of all that data in motion.
The end result is an architecture and a platform that allows us and our customers to make truly data-driven decisions across the board. I just gave one small example, but what I encourage people to do is to see for yourself.
We have a working version that's running in AWS that we host off of our website. Go play with our demo. Register for that and see for yourself the types of things that we can do.
It's pretty powerful. With that, I'm excited to turn it back over to Eric and Mark, because these guys have been in the industry a long, long time. They know the difference between reality and fiction, and this next generation approach. With that, Eric, back to you.
Eric Kavanagh: All right, good, and let me hand it over to Mark. I know he's got some of his usual entertaining, creative slides. Mark, I've handed you the keys, go ahead take it away, and go right into the briefing after that.
Mark Madsen: Great, thanks Eric. Hello everybody, I am Mark Madsen. I'm going to make a few comments. Skip over this part, skip over that part, and just a little bit about data collection and streaming data. We don't want to hear extra stuff. I won't go over a lot of, actually I won't go over anything they went over. I will skip over those bits. To set the context for the remainder of the webcast and the discussion about the hows and whys of pulling data off of networks rather than message queues.
The context is that most data flows on networks today. It's typically used after collection and a great deal of it is not stored, or what is stored, say the transaction that got carried in a message somewhere.
Some portions of it are in message queues, in systems. You can capture these events, you can analyze them. You can do things with them at various latencies and most non streaming systems are pulling the scope, dealing with data from minutes, to hours, to days.
The main thing is that data live in many different places simultaneously. The Web form on the application that you entered the data into.
A service call, which gets put possibly into a message in a queue of some sort, which then gets persisted into some other repository, makes it to a transaction processing system where only the transaction details are recorded and all the other ancillary bits of what was going on in and around that event are not recorded.
It's persisted in many places, in different latencies. Some are wired and some aren't. It's normal. There's a real desire in IT to try and make all of the data either go into one intergalactic message queuing system or one giant database or data platform.
Historically, neither of those things work. The giant, centralized, Death Star model, typically doesn't work. You have to spend time designing for all possible uses, and all possible structures and types of data up front. That's as true of a relational database as it is of a data cluster. It's also true of your message queuing systems. The only advantage there is that, when you want to create new types of messages, you can program them and put them out there on the bus or into the queues, into the streams, and not have a problem. One of the problems with the market today, the tools market, is that it is still thinking about things in the Death Star, intergalactic, streaming system.
We've had streaming and messaging technologies for decades now, MQSeries, TIBCO, venerable, old vendors, as well as newer things. Most of the time, today in the market, they're viewed as standalone systems.
We're not dealing with standalone systems. They're not independent, but when you look at the discussions of the architectures that are put forth, they talk about things as if this is all new and completely distinct from everything else.
You build a new Web application, customer self service portal, sensor data collection platform, whatever it might be, and you install a bunch of new tech. That lives there, and it operates, and it runs in its own infrastructure, maybe in the cloud maybe on premises, but it works by itself and it's independent.
Thing is, you want to be able to mix this data. In particular, when it comes to streaming data, there is data out there that is probably in older ESPs, in older message queuing systems.
Everything's always built around this notion that you're in a startup, or a brand new company, with no legacy infrastructure or you're building a standalone application.
Say, a real time recommendation engine that is going to pull a bunch of data from customer activity and feed next best offer recommendations to a call center or a website.
That's a nice, standalone use, but when you start going general purpose you start to run into difficulties of mismatches in technology and so forth.
The big question for me has always been, "What about the existing environment?" This is a picture of information flows in just one industry. It's actually not even an industry, it's a segment. The cheese making.
If you think about it at the corporate level, which is all the gray boxes, there's all these information flows that go back and forth. Some of them are Web forms. You log into somebody else's payables or receivables ordering system.
They could be EDI transactions. Yes, EDI transactions do still exist. Service calls, they could be Web services out there, where somebody's calling a service that you've exposed on your website. They could be emails.
All of this stuff is flowing back and forth. That's outside the company. That's not inside. You can't make this into a Kafka queue or a stream. You can't impose anything. You have to live with what those messages are.
Even inside your own company, if you start taking it apart and you look at these giant mega boxes of, say, SAP, or Oracle, or something like that.
These monolithic ERP systems, then you have custom built applications, software as a service applications living out there. You also have systems which are producing logs or taking events and publishing them out there, and not actually physically logging the files.
This is the reality of IT. It's complex, and messy, and a multitude of technologies that are both built in the new world of services, messages, and streams, and not built in that world. The stuff that's not built in that world is not neat. It's not bussy, it doesn't fit on an enterprise service bus. Most streaming tech, tools and products on the market are built, not around the notion of crusty old stuff that they have to insert themselves into, but net new.
It's really great to install Kafka, but what happens when you have MQ and TIBCO in the mix? What about stuff like SAP and people who are using, say, NetWeaver to move things around, which is even worse?
This is what has to be integrated with as you build new applications or try to tie new applications to old applications or integrate data.
My favorite quote from Douglas Adams was, "In an infinite universe, the one thing sentient life cannot afford is to have a sense of proportion." These sorts of diagrams show you the sense of proportion.
You have to focus down on your business, on your particular use case. In every business process, it's multiple applications that are communicating with each other.
Typically these are older client server applications, which means there's a database behind them and they're layered applications. There's a user interface, there's some logic, and there's a database. These older systems, typically when they need to communicate, don't do it live and in real time.
They tend, often, to have integration programs written that move from the persistence layer rather than picking off the network. There's application architecture renovation that is happening in the cloud market today. The new stuff is being built with services, messages, and so forth. You don't typically have to go at the persistence layer to get data from one app to another. You make calls or push and read messages. In this world, you might at best have CDC.
That's Change Data Capture, if you're not familiar with that, which is a non intrusive way of reading transactional logs of database systems as they occur, then taking those and putting them somewhere else where they need to be. Possibly even replicating those transactions and reformatting them onto a bus.
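[Editor's note: the Change Data Capture idea described above can be sketched roughly as follows. This is an illustration only, not something shown in the webinar. The log entries, field names, and the `publish` callback are hypothetical; a real database transaction log is binary and vendor specific.]

```python
import json

# Hypothetical, simplified transaction log: one entry per committed row change.
# "lsn" stands in for a log sequence number that orders the changes.
TRANSACTION_LOG = [
    {"lsn": 1, "op": "INSERT", "table": "orders", "row": {"id": 10, "total": 25.0}},
    {"lsn": 2, "op": "UPDATE", "table": "orders", "row": {"id": 10, "total": 30.0}},
    {"lsn": 3, "op": "DELETE", "table": "orders", "row": {"id": 10}},
]


def capture_changes(log, last_lsn, publish):
    """Publish every log entry newer than last_lsn and return the new position.

    The application that wrote the rows is never touched: changes are read
    from the log after the fact and reformatted as messages for a bus.
    """
    for entry in log:
        if entry["lsn"] <= last_lsn:
            continue  # already captured on an earlier pass
        message = json.dumps(
            {"source": entry["table"], "change": entry["op"], "data": entry["row"]},
            sort_keys=True,
        )
        publish(message)
        last_lsn = entry["lsn"]
    return last_lsn


# A plain list stands in for the message bus in this sketch.
bus = []
position = capture_changes(TRANSACTION_LOG, 0, bus.append)
```

[Remembering `position` between passes is what keeps the capture incremental: a second pass over the same log publishes nothing new.]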
That's one method. The other is, there's a lot of stuff, especially when we talk about streaming data, sensor data, and log based architectures. There's plenty of material on log centered approaches or log model architectures, which you can look at.
The core problem with them is that they capture, just like transactional processing systems, what the developer chose to record as significant events or data. These things are consciously programmed items. That means you are still relying on somebody to model the important data and capture that. You can never perfectly anticipate every data need or use. The only option for you as an application developer would be to log 100 percent of everything all the time, which nobody does because that would be as big as the application you're trying to build.
Then there's the other problem of these kind of approaches. You build something, you create an event, you publish it, it gets logged somewhere or it goes onto a stream and the stream captures it and batches it up. A log is nothing more than a batch of events, so logs are kind of like really slow streams.
What happens when you have to modify or change these things? Where do they come from? You've got to track down the application, inject a message. What happens when somebody wants to consume and re-publish when you add new things? There are these kinds of problems. Reusability, basically, is what I'm talking about. Then there's latency. Potentially, log files are slow, because they're basically batching up a bunch of stuff that you inject.
That's the way they are, because you're shoving the events into a streaming system. Then there's applications like SAP and so forth, that work internally by themselves. You can't control what they log.
One of the ways you can approach non intrusive data collection is to pull it off the wire, as ExtraHop was talking about. You can think about trying to capture wire data in the same way you think about Change Data Capture for databases, which is also a way of getting transactions as they occur in real time.
We're really talking about trying to mature streaming infrastructure to the point where we get the kind of reusability of data, not of code but of data that we need. It allows us to integrate legacy data and systems into the environment.
This problem of SAP or Oracle databases, or MySQL databases, built in client server fashion, is that the data still does flow across the network. You just have no mechanism to see it right now.
You could reach out there onto the network and pull that stuff without having to write batch extract programs or shove triggers into databases to do it. With that we're going to go into the actual Q&A. Sorry, I always forget about the Creative Commons piece.
Now we're going to go into the Q&A part of this, and I'd like to go back, actually, to the slide here, I'm finding it, for ExtraHop. I'd like to jump into some things.
There's a question that I have, and I think a lot of attendees have already sent in. The network data that's out there on the network, let's see, where should I start? Start on this one, yeah.
Number one, what about encrypted packets? The SSL encrypted or application encrypted packets, if you're just listening to the network, isn't that data opaque?
Erik G: Yes, that's a great question. One of the key things, and we had to solve this problem early on in our days because one of our top verticals when we first started out were a lot of e commerce companies. When we're inside the network, say in your data center or your cloud, we can decrypt the data at line rate. What we do require, though, is the private key. That works for most environments.
However, with the emergence of things like Perfect Forward Secrecy, PFS, that we obviously cannot decrypt. The way to address that is, typically people do have a proxy like a Palo Alto firewall or an F5 BIG-IP that behaves like the client because it's in line.
Those can decrypt. Then port mirror the data off to us as well. That's one way. I know not everybody is doing PFS. There's ways to support both. For regular encryption methods, we decode that. We can decrypt and analyze the payloads and reconstruct everything.
Mark: Related to that is, with the data being encrypted, then you are decrypting it and doing something with it which we'll get to later, there are also some responsibilities for data persistence that we've already built mechanisms up for in terms of compliance and security, who has or can see data.
How do you address some of that, when we're dealing with this? Is that something that goes to somebody else?
Erik G: No. When we store data in our own streaming data store, there's a number of controls, compliance factors, that we employ to make sure that's protected.
First of all, our underlying OS itself doesn't have any root access capabilities. We've got a hardened kernel. We call it our own micro kernel. People would refer to it as an OS. It's not, in the traditional sense, but we own all that. That's proprietary.
There's no way to get in via Linux or anything like that, and worrying about keeping that up to date. The second thing is, the way we store the data is protected in and of itself. Through encryption as well as the way we hash the data.
Out of the box we don't collect any sensitive information. Out of the box it's about 4000 metrics that we'll collect. The metrics that we collect out of the box will be what you'd expect. What's the Web or HTTP transaction type? How much data was exchanged? Who were the clients accessing this information?
There's nothing sensitive, natively, that we collect. When you use our application inspection triggers to extract revenue information or user name, you're going to stream that to either our own search appliance, which is based on Elasticsearch.
That information is then, obviously, sensitive. Supporting encryption at rest is a key element that we don't do natively but you can do. The second thing is, you brought this point up earlier, Mark, it's a very good point. No one system is going to have all your data. If you're doing something like using Kafka, or Hadoop, or MongoDB, or you're standing up Elasticsearch, we also have the ability to stream any of our data set, including your sensitive data, to a third-party data store.
That, too, we'll encrypt in flight. Once it's stored that's up to that third party, whatever system you're doing, to make sure you're doing encryption at rest and you've got the necessary authentication controls on that. One of the cool things that we do, because we can see authentication access to all systems, a lot of our customers create a dashboard within ExtraHop that will say, "Here's my sensitive data, where my PII is." "I want to see all people, all clients accessing this, what they've accessed and the logins, who got root access, who was SA, et cetera, how many attempts...."
What they do becomes part of our visibility. It's twofold. We take precautions on our own, the way we store data, but we also can monitor the environment where data lives to make sure that you're tracking to see only the people who are authorized to access it. Then, what did they do? Did that answer the question well enough?
Mark: Yeah, definitely. A question related to this and also on your slide is, you say you have a streaming data store. What's the underlying technology you're using to record the data and retrieve it?
Erik G: That's good. It's our own proprietary technology for the streaming data store. That was simply because of scale.
Traditional or legacy technologies that try to extract data from the network all have an underlying architecture based on continuous packet capture, which obviously doesn't scale and is limited in what it can analyze. They can do some protocol analysis, but they're limited in that.
We had to fundamentally come up with something that would allow us to write and read from disk really fast. When we say real time, we mean real time, down to the second.
That required our own proprietary file structures, streaming data store that would allow us to seek, search, and retrieve, and write information very quickly both to disk as well as in memory. It's proprietary.
Mark: If I'm to make use of this data and feed it into other systems, do I have the ability to use, what did you call it? Application inspection triggers, to feed data to my own persistence layer or to key it into my own queuing system, like Kafka, which I might also be using for persistence?
Erik G: Excellent question. One of our fundamental philosophies is there's been a lot of vendors out there that have got novel ways to get at data, but they almost prevent people from doing it because they, in essence, charge a data tax, which discourages sharing of the data and information mashup.
We know that we're not the only source. For example, your great discussion around logs. Logs, in and of themselves, if that was your sole source, that's not, necessarily, a good idea. That's what you're conveying. You're also conveying that they are valuable, but you just don't want to rely on them. We believe that these datasets belong together, and they should be able to be set free and shared for a mashup. One of the limitations, for example, for ExtraHop is, if it doesn't hit the wire, we don't see it.
So getting host level view is valuable, whether it be an agent or logs. The value is getting those datasets together. We call it open data stream. We pioneered this almost three years ago where we can stream globally or a select part of our dataset to a third party, once we've done the processing.
At the same time, we're streaming it to our own data store. We could be doing multi channel streaming to...it could be MongoDB. It could be Elasticsearch. We transform the wire data to a well formatted JSON document that goes to the MongoDB or Elasticsearch.
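[Editor's note: to make "a well formatted JSON document" concrete, here is a minimal sketch of turning one decoded transaction into a document that a store such as MongoDB or Elasticsearch could ingest. The record fields and the `to_json_document` helper are assumptions for illustration; they are not ExtraHop's actual schema or API.]

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HttpTransaction:
    """Hypothetical fields decoded from one HTTP transaction on the wire."""
    client_ip: str
    server_ip: str
    method: str
    uri: str
    status_code: int
    processing_time_ms: float


def to_json_document(txn: HttpTransaction, timestamp_ms: int) -> str:
    """Flatten a decoded transaction into a single JSON document."""
    doc = asdict(txn)
    doc["timestamp_ms"] = timestamp_ms
    doc["record_type"] = "http"  # lets the downstream store tell record kinds apart
    return json.dumps(doc, sort_keys=True)


document = to_json_document(
    HttpTransaction("10.0.0.5", "10.0.1.20", "GET", "/checkout", 200, 12.5),
    timestamp_ms=1449000000000,
)
```

[The same document could be sent to more than one destination, which is the multi channel streaming idea described above.]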
We could stream it to a proprietary third party like Splunk, SumoLogic, or Lucono. We integrate our dataset with other platforms like AppDynamics in their agent based technology, and we don't charge for data.
It's your data. You should be able to do what you want with it, whenever you want. What we've just added, which is super cool, is we have a lot of our customers asking for this, "Can you stream it to Kafka?" which then provides us all other sorts of endpoints, in which the data can be streamed and stored.
That just happened. We just announced that at the beginning of November as part of our 5.0 release. There's lots of different ways to stream any part of our dataset. It could be Web transaction information.
It could be the orbital payment processing. It could be EDI. It could be any one of those things that you'd want streamed to a third party or a SIEM platform for security analytics, for example, too.
Mark: One of the things that I'm interested in is pulling data off the network like some of these legacy things. Do you have the ability to translate some of the older protocols like some of the SAP, say, application protocols or the database wire protocols that go between the client driver and the database?
Erik G: Yeah. Let's start with the last one, the databases. I believe we support the widest variety right now. We can decode TNS Oracle.
We can show you in real time the method, even down to the stored procedure can be visible on a per transaction basis with no disruption to the database itself. We do that also for Sybase and DB2. We do it for obviously Microsoft SQL, Postgres, MySQL, MongoDB, and others.
That's natively out of the box. But there's a lot of proprietary protocols that we don't natively support or that isn't considered an industry standard.
It's proprietary to that vendor or internally developed by a company. We've had this for quite some time, since about version 3.7 about two years ago. We introduced the concept of universal payload analysis. We proved this out when we first did support for Microsoft MQ. We didn't support Microsoft MQ or RabbitMQ out of the box, originally.
But we developed, and customers can do this too, a parser that would allow us to extract key elements from that protocol using our universal payload analysis. What that is, it's our API into our stream processing that's programmable.
If it's TCP based we can use our customization programmability to do that. Now, it won't be as deep as what we would get if we were doing it natively, but it will get you a long way there.
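[Editor's note: the idea of writing a custom parser for a protocol that is not supported natively can be sketched as below. The wire format here is invented for illustration, and the code does not use ExtraHop's API; it only shows the general shape of extracting key elements from a raw payload.]

```python
import struct

# Hypothetical message layout, big-endian with no padding:
#   2 bytes message type, 4 bytes queue id, 2 bytes body length, then the body.
HEADER = struct.Struct("!HIH")


def parse_message(payload: bytes):
    """Extract key elements from one message, or return None if incomplete."""
    if len(payload) < HEADER.size:
        return None  # not enough bytes for a header yet
    msg_type, queue_id, body_len = HEADER.unpack_from(payload)
    body = payload[HEADER.size:HEADER.size + body_len]
    if len(body) < body_len:
        return None  # body truncated; wait for more data
    return {
        "type": msg_type,
        "queue_id": queue_id,
        "body": body.decode("utf-8", errors="replace"),
    }


sample = HEADER.pack(1, 42, 5) + b"hello"
parsed_message = parse_message(sample)
```

[A parser like this only sees what it was written to extract, which matches the caveat above that a custom parser will not go as deep as native protocol support.]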
Mark: You said must be TCP based, what about something that's running off the UDP packet?
Erik G: I'm sorry, we can do UDP as well. Anything UDP based too, TCP, UDP based, yeah.
Mark: I'm curious with the payload analysis and being able to take a session call, maybe an Oracle session call in the database, where I don't have any streaming data, but I want to capture that data as it occurs, as transactions occur.
How do you maintain the metadata that describes the packet contents? The information I need to be able to parse and structure that?
Erik G: Good point. In our event processing, it's determined natively. When we say we support a protocol like DNS, there are going to be predefined elements that we've determined are important, like errors and process timing, record types and things of that nature, that we'll record and store.
If it's an asynchronous transaction, that could be, there's a request, there's a delay, there's a response. We have a native session table, where we'll store that metadata for a time period. It's finite, because the session table can scale up to about 30,000 elements, is its max.
It's not like you can store forever in the session table, but it's very powerful for those types of transactions where you want to correlate many different elements with that transaction, that could happen asynchronously.
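[Editor's note: a bounded, time-limited session table for correlating a request with a later response might look like the sketch below. Only the figure of about 30,000 entries comes from the discussion; the class, its method names, the expiry window, and the eviction policy are assumptions, not a description of ExtraHop's implementation.]

```python
from collections import OrderedDict


class SessionTable:
    """Keeps per-transaction metadata for a limited time and a limited count."""

    def __init__(self, max_entries=30_000, ttl_seconds=30.0):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # key -> (expires_at, metadata), oldest first

    def _expire(self, now):
        # Entries are kept in insertion order and share one TTL, so the oldest
        # entry is always the first to expire.
        while self._entries:
            key = next(iter(self._entries))
            expires_at, _ = self._entries[key]
            if expires_at > now:
                break
            del self._entries[key]

    def add(self, key, metadata, now):
        self._expire(now)
        self._entries.pop(key, None)  # re-adding moves the key to the newest slot
        self._entries[key] = (now + self.ttl_seconds, metadata)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry

    def take(self, key, now):
        """Return and remove the metadata for key, or None if absent or expired."""
        self._expire(now)
        entry = self._entries.pop(key, None)
        return None if entry is None else entry[1]

    def __len__(self):
        return len(self._entries)


# Correlate an asynchronous request with its response by transaction id.
table = SessionTable()
table.add("txn-42", {"request_at": 100.0, "uri": "/checkout"}, now=100.0)
request = table.take("txn-42", now=100.25)
latency = 100.25 - request["request_at"]
```

[Because the table is finite, a response that arrives after its request has expired or been evicted simply cannot be correlated, which is the trade-off described above.]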
Mark: Just one more question, then we'll turn it over to general Q&A. You mentioned earlier that this was an appliance, so given that I have things that are hosted or say running in Amazon or whether it's AWS or something else. How do you deal with communication of that nature, that's app to app, maybe outside?
Erik G: If it's in Amazon, we have an AMI that would run in Amazon and be monitoring. There is no central cloud tap, there's no VPC tap. What we've done there in order to get access to the data, is we have what we call a software tap. What it was, it was a rewrite of RPCAP, the open source version.
We rewrote it, so it was highly optimized, we released it to GitHub, so you can get it yourself if you want. That allows us, we'd be packaged up in some kind of container or a recipe if it's Chef, or a manifest if it's Puppet. That software tap would be deployed with every instance that you're deploying.
All it does, it doesn't consume any memory or disk, it's just a super efficient packet forwarder, for that host. For the packets coming into and out of that host to ExtraHop endpoint, for correlation analysis with everything else running in there. That's how we do Amazon.
If you've got a third party service, like a SaaS app, that your end users are using, or it's hosted and ExtraHop isn't deployed over there, the visibility that we'll provide is from the client that's requesting that app.
Let's say, in your own environment, at your offices, we can see the time it takes, the response time, the amount of data, if we're inside that network. We can see everything coming in and out.
What we wouldn't be able to do though, is be able to correlate with the hosters, or SaaS vendors' own database application server network infrastructure unless we were also deployed there.
However, SaaS vendors are some of our biggest customers and we've got very many of the largest ones, and even a lot of the small ones that are using us for their own internal purposes. What we've added in version 5.0 is a way to expose data via APIs to our platform, and discretely to others.
We're working with a lot of our SaaS vendors to say, "Hey, you know what? You can open up very discrete data feeds on a per-customer basis," for example, so that people can get a sense of the contribution to the SaaS based app correlated with their internal.
I kind of need a whiteboard for this, but the point is, is we're working on that to set the data free. Make it secure, but expose it discretely to parties that are interested.
Mark: We're at the top of the hour. I'm going to turn it back over to Mr. Kavanagh.
Eric K: Yes, indeed, a lot of good content here. There were a couple questions from the audience that have already been answered.
One, you were just kind of talking about this a second ago, but I'll throw it back to you just for some clarification, Erik. The question really is, "How are you actually pulling information off of the network?"
I think as you described, what you do is you install an appliance, the appliance taps into the network and it literally just starts siphoning off packets that are going across. You use those packets to essentially recreate a design that shows the application architecture and also the information flow. Is that right?
Erik G: That's correct, yeah. We are utterly dependent upon getting a good, high fidelity packet feed. It could be a SPAN aggregation system like what Arista offers as a feature of their network fabric. It could be a port mirror off of a core switch. It could even be ERSPAN, which is enabled in things like VMware's vCenter; anything greater than version 5.1 has that capability to forward packets from a network segment to ExtraHop for analysis.
Eric K: Good. A couple of other really specific questions here. One, Cindy is asking, "Do you have any machine learning capabilities, or predictive analytical capabilities, to help companies understand not just what's happening but what might happen in the near future?"
Erik G: We currently have trending capabilities and baselining, but that's not machine learning so I don't want to over position it. However, whoever asked that question is very smart and can anticipate what the future holds. That's about as much as I can talk about that. But yes, you can imagine we're a great platform for applying machine learning for looking at anomalies from a security standpoint, business process transactions, client behavior, all sorts of good things. Not yet, is the question, or the answer.
Eric K: Good. That's fine. Here was another interesting question, one that Cindy was asking, "Are there any data aggregators for industry benchmarking of security related info that we can join?" I'll kind of throw a little bit of after context into that. One of the more interesting comments I've heard recently in the security space, is this whole trend around sharing intelligence between different corporations, certainly within certain industries, even things as simple as DNS numbers, or IP addresses, or so forth. What can you say about benchmarking for industries who want to try to understand where they fit in the big picture?
Erik G: That is awesome. One of the things that we didn't discuss here, is that we have a pretty novel service that we call "Atlas" that is a way for us to assist our customers.
It's a subscription based service where we actually have human intelligence, where they work with our customers remotely, and analyze and run some of our intelligence scripts to look at chronic and acute issues and behaviors, and produce a report based on that.
It's kind of a combination of human and machine learning, but based on that we have so many customers. In fact, in any week we're probably processing up to about nine petabytes of data across our install base.
We're target rich in terms of industry benchmarks both from a performance as well as security. We have not yet exposed that to the world. That requires specific architecture and security controls and things like that. You can imagine how rich this would be.
Again, one of our mantras is, "Set the data free." Let customers do what they want with it, and if they wanted to participate in this type of sharing, and get back as a result this kind of benchmarking data, we're working to enable that.
We also want to share with third parties. This is one reason why we've proven out our integration, using wire data to enrich FireEye's TAP platform, for example.
DNS is a great example. That was one of our forte cases, because they don't have any visibility into that. That is our direction: to share this information, extract, and also to bring into our own platform best practices and benchmarks that we can incorporate to provide better analysis.
Yes, we're doing that partially via our Atlas services, but we want to and we plan to do that broader in the market over time.
Eric K: One of the things that I thought was so interesting about your solution too is the ability to visually navigate the information flow. I think you kind of alluded to that today, but we didn't see it. I guess it's going to be in that demo perhaps, that you referenced.
To me that was one really fascinating aspect of your tool because it allows people to A, examine, but B, really explore what's going on inside the network.
This troubleshooting side of the equation, it can be so frustrating and so complex, and so demoralizing quite frankly. What I see you all as having built, is a really powerful way to visually explore and understand what's going on in the network, right?
Erik G: You're right, yeah. That was one of the novel concepts. It's no secret that the core technology of our search is actually based on Elasticsearch. Standing up Elasticsearch, the open source capability, is it takes time and effort to maintain it.
You also have to determine how to format the records to get that data in there. What we did was, we wrapped Elasticsearch. We optimized it, too, so it would really scale up to 65,000 messages per second on a per node basis.
We wrapped it with this notion of visual query language, because people didn't want to have to understand and write these complex queries. They simply want to point and click. Something simple like, "Search on client IP." You could do a search on that. Then it will return everything that that client IP did.
Now you want to pivot to cut through the noise, and you want to be able to say, "Boy, I want to see all flows for this client IP." Then, which transactions? Then by HTTP, what things did they access by HTTP?
If I see all the protocols that client has used, I want to drill into one. What were they doing with FTP? You click on FTP, and you've got the filter.
Then you can start seeing, "Oh, they not only were accessing a Russian host via FTP, they were transferring data from the environment. Here's the filename that was being transferred."
It's that simple to point and click to explore the data. I really encourage people to see for themselves, because it is almost like magic. People don't believe us when we tell them. They've just got to see it.
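[Editor's note: the point-and-click pivot described above amounts to applying one filter after another to the same set of records. The sketch below imitates that with plain Python over made-up records; the field names and the `search` helper are hypothetical and are not the product's query language.]

```python
from collections import Counter

# Made-up transaction records standing in for what a search would return.
RECORDS = [
    {"client_ip": "10.0.0.5", "protocol": "HTTP", "detail": "/login"},
    {"client_ip": "10.0.0.5", "protocol": "FTP", "detail": "payroll.csv"},
    {"client_ip": "10.0.0.9", "protocol": "HTTP", "detail": "/index"},
    {"client_ip": "10.0.0.5", "protocol": "HTTP", "detail": "/reports"},
]


def search(records, **filters):
    """Keep only the records whose fields match every given filter."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]


# Step 1: everything one client IP did.
by_client = search(RECORDS, client_ip="10.0.0.5")

# Step 2: which protocols that client used, and how often.
protocols = Counter(r["protocol"] for r in by_client)

# Step 3: "click on FTP" to narrow the same result set further.
ftp_transfers = search(by_client, protocol="FTP")
```

[Each step narrows the previous result rather than starting a new query, which is what makes the exploration feel like pivoting.]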
Eric K: [laughs] That's funny. Well folks, we do archive all these webcasts. I'm actually going to end this last Briefing Room of 2015 on a humorous note, since "Star Wars" is coming out, and Mark Madsen brought in the concept of the Death Star and the intergalactic nature of how tools are viewed these days.
This, I have to say, is one of the most amusing slides I've seen on the Web. It's from Twitter handle of @depresseddarth. If you go to @depresseddarth, you'll see this slide. It just cracked me up.
Folks, we want to thank you so much for your time and attention. We will give you some contact information for you to reach out to our folks at ExtraHop, and also to Mark Madsen. Someone is asking that question to get in touch with him. We will certainly do that.
With that, we're going to bid you farewell, folks. Thank you for yet another wonderful year. Year six of the Briefing Room is drawing to a close right now. Year seven begins in just a few weeks. We hope to see you then, and please share this with your friends and colleagues.
We do absolutely archive all these webcasts for later viewing, so feel free to check them out at your leisure, perhaps over the long break. With that, we'll bid you farewell, folks.
Thanks again. We'll talk to you next year, in 2016, year seven of the Briefing Room. Take care, bye bye.
This is a companion discussion topic for the original entry at https://www.extrahop.com/community/blog/2016/live-wire-the-next-dimension-for-stream-analytics-webinar/