This tip of the week looks at some low-level details of the Domain Name System (DNS). What are SERVFAIL and NXDOMAIN messages? What are truncated DNS errors? How do these three conditions contribute to the slowness of applications, desktops, servers and almost anything else that uses TCP/IP networking?
Understanding three key DNS metrics (SERVFAIL, NXDOMAIN and truncated messages) is critical to improving response times across your entire infrastructure.
First, let’s understand exactly what the two error messages mean, and then we will look at truncation.
SERVFAIL means the resolver could not get a usable answer for the name. Typically the fully qualified domain name (FQDN) that was looked up belongs to a domain that does exist, and the root and top-level domain (TLD) name servers have delegation information for it, but the authoritative name servers are not answering queries for the domain. For example, you have an application server that is trying to call a public API to look up some information. The application queries for api.example.com, a SERVFAIL error comes back and the call fails. Depending on how the application is written, you may or may not see the error or be able to react to it. You may never know that the call is failing, or it may take a long time to work out that DNS is causing the failure.
NXDOMAIN means the name that was looked up does not exist. For a whole domain, this typically means that the TLD name servers are not providing any authoritative name servers for it, either because the domain was never registered or because it has expired and been put on hold. These errors are not only fatal for applications and clients trying to connect, they can also be extremely hard to diagnose. Think about it this way: in a web browser, if you try to reach a site that does not exist you may get an error such as “DNS_PROBE_FINISHED_NXDOMAIN”, so you check your spelling or the address and fix the issue. Now think about applications or servers running in your environment. If they are receiving NXDOMAIN because a domain they depend on has expired, you may or may not know about it. Is the application keeping detailed logs? If the affected service is not critical but a side service whose failure degrades an application without causing an outage, you may not catch the error until a customer reports it or until someone else fixes it.
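On the wire, both errors are simply a 4-bit response code (RCODE) in the DNS header: 2 for SERVFAIL and 3 for NXDOMAIN. The sketch below, in Python using only the standard library, shows how that header can be decoded from raw bytes. The function name parse_dns_header and the hand-built sample headers are illustrative assumptions, not part of ExtraHop or of any DNS library.

```python
import struct

# A few standard response codes from the DNS header (RFC 1035).
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def parse_dns_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte header at the start of a DNS message.

    Hypothetical helper for illustration only.
    """
    if len(packet) < 12:
        raise ValueError("a DNS message needs at least a 12-byte header")
    # Six unsigned 16-bit fields in network byte order:
    # ID, flags, and the question/answer/authority/additional counts.
    msg_id, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", packet[:12])
    rcode = flags & 0x000F  # the response code is the low 4 bits of the flags word
    return {
        "id": msg_id,
        "is_response": bool(flags & 0x8000),  # QR bit
        "truncated": bool(flags & 0x0200),    # TC bit
        "rcode": rcode,
        "rcode_name": RCODE_NAMES.get(rcode, f"RCODE{rcode}"),
        "answers": ancount,
    }


# Hand-built sample headers (assumed values, not captured traffic):
# flags 0x8182 = response, RD, RA, RCODE 2; flags 0x8183 = same with RCODE 3.
servfail = struct.pack("!6H", 0x1234, 0x8182, 1, 0, 0, 0)
nxdomain = struct.pack("!6H", 0x1234, 0x8183, 1, 0, 0, 0)

print(parse_dns_header(servfail)["rcode_name"])  # SERVFAIL
print(parse_dns_header(nxdomain)["rcode_name"])  # NXDOMAIN
```

An application that logged this one field whenever a lookup failed would make the two cases much easier to tell apart.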
So when you are looking at the ExtraHop DNS dashboards and see SERVFAIL or NXDOMAIN errors, you know that devices in your environment are having issues. When you trace the affected devices with ExtraHop, the type of error will help you solve the issue.
For SERVFAIL errors, you need to track down the authoritative name servers and find out why they aren’t responding.
For NXDOMAIN errors, you need to track down the registrar with a tool such as “whois” and find out why the domain is no longer available.
The dashboards not only tell you that there is a problem, they also give you a strong hint about how to solve it.
Let’s look at one last area of concern for DNS: truncation. In a lot of ways this is the most interesting of the DNS errors. With DNS truncation, the query can complete and services can connect to their intended destination without anyone noticing a problem. However, if you are seeing DNS Truncated messages, you are running into a situation that can be far more insidious. Truncation occurs when a response is too large to fit in a single UDP message (512 bytes by default, or the larger buffer size the client advertises through EDNS0). The server sets the truncated (TC) flag in its reply, and the client has to repeat the request over TCP.
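To make the UDP-to-TCP handoff concrete, here is a minimal sketch in Python using only the standard library. Truncation is signalled by a single flag bit (TC) in the DNS header, and a DNS message sent over TCP is prefixed with its length as a 2-byte integer. The helper names build_query, is_truncated and frame_for_tcp, and the hand-built reply, are assumptions made for illustration. The sketch deliberately stops short of opening sockets, so it shows the message handling rather than a complete resolver.

```python
import struct


def build_query(name: str, msg_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a minimal DNS query (QTYPE 1 = A record, class IN)."""
    # Flags 0x0100 = recursion desired; one question, no other records.
    header = struct.pack("!6H", msg_id, 0x0100, 1, 0, 0, 0)
    # Each label is encoded as a length byte followed by the label text.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!2H", qtype, 1)


def is_truncated(message: bytes) -> bool:
    """Return True when the TC bit (0x0200 in the flags word) is set."""
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & 0x0200)


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message


query = build_query("api.example.com")

# Hand-built reply header with the TC bit set (assumed values, not captured
# traffic): flags 0x8380 = response, TC, RD, RA, RCODE 0.
truncated_reply = struct.pack("!6H", 0x1234, 0x8380, 1, 0, 0, 0)

if is_truncated(truncated_reply):
    # A real client would now open a TCP connection to the same server on
    # port 53 and send this framed copy of the original query.
    tcp_payload = frame_for_tcp(query)
    print(len(query), len(tcp_payload))  # 33 35
```

The second round trip, plus the TCP handshake that precedes it, is where the extra latency comes from.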
There are two things to investigate. First, check whether the DNS server is configured to serve DNS responses over TCP. If it is not, enable and allow TCP fallback, make sure the TCP DNS port (53) is not blocked, and remediate the configuration as necessary to eliminate the errors. Second, track how much DNS truncation you are seeing, especially from applications. Each truncated lookup makes the application ask for DNS first by UDP and then again by TCP, which can add delay. This can be investigated and corrected by using local hosts files, changing the query itself or simply forcing TCP.