Hey gang,

We addressed metric types in a previous TotW, but I wanted to dig into Datasets and Samplesets a little bit: what they look like, and when you should use them. We also published a video that explores some of these same concepts with ExtraHop data as well.

So let's get started! To simplify our examples, let's say that we are monitoring database processing time, and we've recorded the following for several transactions:

```
tprocess = [1,1,2,1,3,1,1,1,5,1,2,8,1,1,1]
```

## Dataset

A dataset is what is known as a frequency table. With every recorded processing time, we check if it already exists as a value in the table. If it does, we increment the frequency by 1. If it doesn't, we add the value to the table and give it a frequency of 1. Think of the frequency as a counter for the number of times we have seen a value. Building the dataset for our observed tprocess above, we get something that looks like the following:

```
dataset(tprocess) = [
{
'value' : 1,
'freq' : 10
},
{
'value' : 2,
'freq' : 2
},
{
'value' : 3,
'freq' : 1
},
{
'value' : 5,
'freq' : 1
},
{
'value' : 8,
'freq' : 1
}
]
```

The data structure is an array of objects, where the array is sorted by the value. If we next observed a processing time of 4, the data structure would be modified to look like:

```
dataset(tprocess'') = [
{
'value' : 1,
'freq' : 10
},
{
'value' : 2,
'freq' : 2
},
{
'value' : 3,
'freq' : 1
},
{
'value' : 4,
'freq' : 1
},
{
'value' : 5,
'freq' : 1
},
{
'value' : 8,
'freq' : 1
}
]
```
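The frequency-table bookkeeping described above is easy to sketch in Python. This is a hypothetical illustration of the concept, not the actual product implementation:

```python
from collections import Counter

def dataset(samples):
    """Build a frequency table: one entry per distinct value, sorted by value.

    Each entry records how many times that value was observed.
    """
    counts = Counter(samples)
    return [{"value": v, "freq": counts[v]} for v in sorted(counts)]

tprocess = [1, 1, 2, 1, 3, 1, 1, 1, 5, 1, 2, 8, 1, 1, 1]
table = dataset(tprocess)
# table[0] == {"value": 1, "freq": 10}
# len(table) == 5; observing a new value (like 4) adds a sixth entry
```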

### So what advantages does a dataset offer?

The data is lossless. We know the exact processing times observed, though not the order in which they were observed. With lossless data, we can calculate any percentile: the median, the 95th percentile, even the 67th percentile if we want to.

### What are the disadvantages?

The biggest disadvantage is the resource consumption of the dataset. For data which is highly clustered (e.g., database request processing time, where we expect most calls to be sub-10ms), the frequency table will be relatively small, with large frequency values. However, if the data is widely dispersed, say the RTT for all global clients, a dataset can be extremely large, because there will be a new entry for every new RTT value.

## Sampleset

A sampleset is a summarization of the observed data that can be used to calculate statistical trends in the data. A sampleset contains three data points, which together can be used to calculate the mean and an estimation of the standard deviation. If you're interested in the math behind these numbers, you can read more about the mean and standard deviation (we use a streaming algorithm) but the three data points are:

- The count, or total number of samples/data points
- The sum, a sum of all of the measured data. An example would be the total processing time for every request
- The sum2, or the sum of the squared measurements

Using the example data above, we get the following sampleset data structure:

```
sampleset(tprocess) = {
"count": 15,
"sum": 30,
"sum2": 116
}
```

From this we can calculate the mean and the standard deviation:

```
mean(tprocess) = sum/count
               = 30/15
               = 2

stddev(tprocess) = sqrt(count*sum2 - sum*sum)/count
                 = sqrt(15*116 - 30*30)/15
                 = sqrt(1740 - 900)/15
                 = sqrt(840)/15
                 = 1.93
```

For the math inclined, note that this formula gives the population standard deviation; the sample standard deviation (which divides by count - 1 rather than count) for this data is exactly 2. The difference between the two becomes less significant with more data points.
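The arithmetic above can be sketched in Python. The function names here are illustrative, not ExtraHop APIs, and `stddev` uses the same population formula as the worked example:

```python
import math

def sampleset(samples):
    """Summarize samples into the three sampleset values."""
    return {
        "count": len(samples),
        "sum": sum(samples),
        "sum2": sum(x * x for x in samples),
    }

def mean(ss):
    return ss["sum"] / ss["count"]

def stddev(ss):
    # Population standard deviation, recovered from the three summary values
    n, s, s2 = ss["count"], ss["sum"], ss["sum2"]
    return math.sqrt(n * s2 - s * s) / n

tprocess = [1, 1, 2, 1, 3, 1, 1, 1, 5, 1, 2, 8, 1, 1, 1]
ss = sampleset(tprocess)  # {"count": 15, "sum": 30, "sum2": 116}
mean(ss)                  # 2.0
stddev(ss)                # ~1.93
```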

If the next processing time measurement recorded is 4, the sampleset data structure would then be:

```
sampleset(tprocess'') = {
"count": 16,
"sum": 34,
"sum2": 132
}
```

```
mean(tprocess'') = sum/count
                 = 34/16
                 = 2.125

stddev(tprocess'') = sqrt(count*sum2 - sum*sum)/count
                   = sqrt(16*132 - 34*34)/16
                   = sqrt(2112 - 1156)/16
                   = sqrt(956)/16
                   = 1.93
```

Notice that adding a new value didn't increase the size of the sampleset data structure. The actual (sample) standard deviation of the data is now 1.996.
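This constant-size property is the whole point: folding a new measurement into a sampleset touches only three numbers. A minimal sketch (illustrative names, not an ExtraHop API):

```python
def update(ss, x):
    """Fold one new measurement into the sampleset in place — O(1) time and space."""
    ss["count"] += 1
    ss["sum"] += x
    ss["sum2"] += x * x
    return ss

ss = {"count": 15, "sum": 30, "sum2": 116}
update(ss, 4)  # {"count": 16, "sum": 34, "sum2": 132}
```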

### So what advantages does a sampleset offer?

- The data is highly compressed. Each new data point does not significantly alter the size of the data structure, meaning they are ideal for use when we're identifying normal behavior across many objects, like URI processing time or RTT by client IP.
- Calculating mean and standard deviation can be more valuable than median and percentiles when looking at data that has a common tendency with random fluctuations. The standard deviation describes the spread of the fluctuations in a way that is difficult to achieve with percentile slicing.

### What are the disadvantages?

- The data is lossy. There is no way to identify the max or min, for example, because the data is a summarization of all of the observed samples.
- If the data doesn't have a single common tendency, or is skewed (i.e., the data doesn't follow a normal distribution), the mean and standard deviation are not the optimal methods for identifying behavior.

## Conclusions

Hopefully you now have a better understanding of what's happening under the covers for Samplesets and Datasets.

Because of the added granularity, datasets are great for top-level metrics. They let you calculate percentiles, and can accurately summarize all types of data. However, they are costly to store and summarization is tied entirely to percentiles.

Samplesets, on the other hand, are great for detail metrics. They provide an easily digestible summary of behavior with the mean and standard deviation, and consume very little system resources. However, they are not always accurate for every set of data, and don't provide the granularity of the dataset.

As you work your way through the UI, notice when and where each data type is used. When working with the APIs, whether pyhop or AI triggers, keep in mind when each data type should be used.

As always, questions and comments welcome below!