The idea of "technical debt" is one of the most important things I learned from The Phoenix Project. Described as "a novel about IT, devops, and helping your business win," the book has deeply influenced the IT operations community. In the simplest terms, technical debt is the result of not doing things right in the first place. Here's Erik, a lean-methodology guru in the book, describing technical debt:
"… like financial debt, the compounding interest costs grow over time. If an organization doesn't pay down its technical debt, every calorie in the organization can be spent just paying interest, in the form of unplanned work."
As illustrated in The Phoenix Project, the accumulation of technical debt results in constant firefighting and an inability to implement new projects quickly. A less recognized yet equally damaging result is the increased waste and noise in the IT infrastructure, causing:
- Unnecessary infrastructure purchases
- Greater load on critical resources
- Low signal-to-noise ratio
- Security vulnerabilities
- More places for malware to hide
Supporting Continuous Improvement with ExtraHop
ExtraHop makes it easier to measure and reduce the technical debt in your IT environment. By analyzing their wire data (all L2-L7 communications, including full bidirectional payloads), IT organizations can identify waste and inefficiency in their IT infrastructure, along with the details needed to fix those problems.
Many organizations use ExtraHop to support a continuous improvement environment, applying methodologies adapted from lean manufacturing. ExtraHop's Atlas Services remote analysis reports are a good fit for these "lean IT" efforts. IT organizations receive regular analysis across all tiers of their environment, identifying both acute and chronic issues, and then use these reports to create work items for their kanban-type scheduling systems.
By dedicating resources to paying down their technical debt—fixing misconfigurations, adjusting settings, optimizing scripts, decommissioning legacy systems, etc.—these IT organizations are freeing up capacity, increasing goodput, addressing issues proactively, and improving signal-to-noise ratios so that it is easier to spot anomalous behavior.
Real-World Examples of Paying Down Technical Debt
The examples below show how organizations are using remote analysis reports from ExtraHop to make significant improvements to their IT infrastructure.
DNS
The chart below shows a decrease in DNS errors from August to October for one organization, dropping from an 11.6 percent error rate to less than 1 percent across their entire environment! In fact, when they first started receiving Atlas reports, this organization had a 21 percent error rate for DNS requests. DNS is often taken for granted and can be a silent killer of application performance, with failed lookups adding seconds to transactions as they resolve. Yet, shockingly, it is common to see DNS error rates as high as 50 percent in IT environments.
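To make the DNS figures concrete, here is a minimal Python sketch of the arithmetic behind an error rate and its relative improvement. The function names and the request counts are hypothetical, chosen only to reproduce the percentages quoted above; they are not taken from an Atlas report or from any ExtraHop API.

```python
def error_rate(errors: int, total: int) -> float:
    """Return errors as a percentage of all requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * errors / total


def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Return how far the error rate fell, as a percentage of the starting rate."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return 100.0 * (before_pct - after_pct) / before_pct


# Hypothetical counts: 116 failed lookups out of 1,000 gives the 11.6 percent rate.
august_rate = error_rate(errors=116, total=1000)

# The post says "less than 1 percent", so 1.0 is used as an upper bound here.
# The real improvement is therefore at least this large.
reduction_from_august = relative_reduction(11.6, 1.0)
reduction_from_first_report = relative_reduction(21.0, 1.0)

print(f"August error rate: {august_rate:.1f}%")
print(f"Reduction since August: at least {reduction_from_august:.1f}%")
print(f"Reduction since first report: at least {reduction_from_first_report:.1f}%")
```

Under these assumptions, the drop from 11.6 percent to under 1 percent removes more than nine in ten DNS errors.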
TCP
Because ExtraHop recreates the TCP state machines for every sender and receiver in real time, the platform can understand TCP mechanisms, such as throttling. Monitoring solutions that only inspect L4 headers cannot do this. The screens below show how the previously mentioned organization decreased out-of-order segments and tinygrams by 90 percent.
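As a rough illustration of what these two metrics mean, the following Python sketch counts out-of-order segments and tinygrams in one direction of a single TCP flow. It is a deliberately simplified heuristic, not a reconstruction of the TCP state machine and not ExtraHop's method. The `Segment` type, the `count_anomalies` function, and the 64-byte tinygram threshold are all assumptions made for the example.

```python
from typing import Iterable, NamedTuple, Tuple


class Segment(NamedTuple):
    seq: int     # sequence number of the first payload byte
    length: int  # payload size in bytes


# Hypothetical cutoff: payloads at or below this size count as tinygrams.
TINYGRAM_MAX_BYTES = 64


def count_anomalies(segments: Iterable[Segment]) -> Tuple[int, int]:
    """Return (out_of_order, tinygrams) for segments listed in arrival order.

    Simplifications: sequence-number wraparound is ignored, and a
    retransmission is counted as out of order because it also arrives
    with a sequence number below data already seen.
    """
    out_of_order = 0
    tinygrams = 0
    highest_end = None  # one past the highest byte seen so far

    for seg in segments:
        if seg.length == 0:
            continue  # pure ACKs carry no payload
        if seg.length <= TINYGRAM_MAX_BYTES:
            tinygrams += 1
        if highest_end is not None and seg.seq < highest_end:
            out_of_order += 1
        end = seg.seq + seg.length
        if highest_end is None or end > highest_end:
            highest_end = end

    return out_of_order, tinygrams


flow = [
    Segment(1000, 1460),
    Segment(3920, 1460),  # arrives ahead of the segment that fills the gap
    Segment(2460, 1460),  # fills the gap, so it is counted as out of order
    Segment(5380, 10),    # small payload, counted as a tinygram
]
print(count_anomalies(flow))
```

A product that tracks full connection state can distinguish reordering from retransmission and account for window behavior, which this sketch does not attempt.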
HTTP
The chart below comes from a different IT organization that subscribes to Atlas Services remote analysis reports. The chart shows a 9.5-fold reduction in HTTP errors, most of which were internal server errors (HTTP status code 500), after the problems were identified and fixed. This is a large environment with upwards of 3,000 web transactions per second at peak periods, and analyzing large amounts of data at the level of detail that ExtraHop does is no trivial task. For details on how ExtraHop does this, read our blog post, Monitoring at Scale: Questions You Should Ask Your Vendor.
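The kind of calculation behind a figure like "9.5-fold" can be sketched in a few lines of Python. The status codes and the before and after rates below are hypothetical sample data, and the function names are made up for this example; none of it comes from the organization's report.

```python
from collections import Counter
from typing import Iterable


def status_class_counts(codes: Iterable[int]) -> Counter:
    """Group HTTP status codes into classes such as '2xx' and '5xx'."""
    return Counter(f"{code // 100}xx" for code in codes)


def server_error_rate(codes: Iterable[int]) -> float:
    """Return the percentage of responses that are 5xx server errors."""
    codes = list(codes)
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return 100.0 * errors / len(codes)


def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Hypothetical sample of ten responses, two of them internal server errors.
sample = [200, 200, 500, 404, 200, 500, 200, 200, 200, 200]
print(status_class_counts(sample))
print(server_error_rate(sample))

# Hypothetical rates: falling from 19 percent to 2 percent is a 9.5-fold reduction.
print(reduction_factor(19.0, 2.0))
```

Stating the reduction as a factor rather than a percentage makes the comparison independent of traffic volume, which matters in an environment where request rates vary widely between peak and off-peak periods.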
Database
The following chart shows an even more dramatic reduction in database errors at another organization that subscribes to the Atlas reports. After the delivery of the March report detailing the "(ORA-28000) the account is locked" errors at the database tier, the organization fixed the issue. The second graph shows that not only were errors almost eliminated, but that the variability in database processing time dropped precipitously on March 12, when the fix was implemented. Read about Oracle database monitoring with ExtraHop.
LDAP
The chart below shows a dramatic drop-off in LDAP requests and errors after an organization changed a general configuration that was causing a bad LDAP query. In this case, not only were fewer errors served, but load on the LDAP server fell fivefold. This is a great example of how previously unnoticed waste and inefficiency in the environment can be eliminated with ExtraHop. Although the effects of the bad LDAP query may have been tolerable to users, it was causing unnecessary load and could have masked anomalous activity indicating an acute performance issue or even a brute-force attack against the Active Directory server. Read about LDAP monitoring with ExtraHop.
Make It Easier to Take the Doctor's Orders
Like looking after your physical health, addressing the technical debt in your IT environment is easy to put off. However, ExtraHop can make continuous improvement much easier: first by providing you with the visibility you need across all tiers, and second with our periodic remote analysis reports that can identify low-hanging fruit for optimization and tuning.
Check out the sample Atlas remote analysis report below and then visit the web page to learn more.
This is a companion discussion topic for the original entry at http://www.extrahop.com/post/blog/paying-down-technical-debt-in-your-it-infrastructure/