In this tutorial, Ethan will explain how to implement the Median Absolute Deviation equation in Tableau. This robust method is used to detect outliers when your data does not follow a normal distribution. The tactics he shows you can be used to not only detect outliers but, more importantly, he will show you how to display them visually using Tableau.
Hi, this is Ethan with Playfair+, and in today’s video I’m going to be walking you through how to calculate the median absolute deviation and utilizing that statistic and technique to detect outliers within a non-normalized distribution of data.
Now, if you recall back to, you know, looking at histograms and kind of analyzing your data, it’s really important to understand the distribution of your data because if you’re applying the wrong technique, the results you’re getting might not be significant, so you might be getting misled.
Utilizing this technique, it doesn’t have the assumption of a normalized distribution of data, so we can apply this technique to non-normalized data.
Before we jump into Tableau and I really start going into kind of the tactics on how to implement it, I do want to just open up the documentation that I wrote in our tutorial on our blog to kind of talk you through the equation itself. I think it’s very important to understand that equation because then you’ll see how those functions and how it’s being calculated in Tableau kind of aligned with that equation.
So let’s jump over to the blog, and I’m going to show you now, let’s jump to this part here where we’re going over the equation. So this is the actual equation to calculate out what I’m going to refer to the rest of this video as MAD, our Median Absolute Deviation, our MAD statistic.
And the way we calculate this is we find the median and then we take the absolute value of our median minus the observation itself, take the absolute value of that and then multiply it by our median.
So let’s break that down into three simple steps.
So let’s consider this as our data set. It’s five observations here. We have 5, 10, 12, 15 and 18.
So the first step that we need to do to calculate MAD utilizing this sample data set is we need to find the median.
Just as a refresher, the median is going to be kind of the middle value within a data set. So since this is an odd number of observations in our data, it’s very easy to calculate. The value is 12. So our median of this data set is 12. That’s the first step. Very easy.
The second step is we need to take that median and subtract it from each observation. So if we look here, we have 5 minus 12. That’s going to get us negative 7.
Then we take the absolute value of that and that returns positive 7. So the absolute value 7.
Then we do that for each observation within our data set. 10 minus 12, negative 2 takes the absolute value.
It’s a 2. 12 minus 12, that’s 0. Take the absolute value, that’s 0. 15 minus 12, that’s 3. Absolute value is still 3. And so on. So we do that for each observation.
And what that’s going to lead to is it’s going to give us kind of a new data set if you think about it this way. We’re looking at now the absolute deviations from the median.
So we have this data set. I’m going to rearrange it in ascending order and we’re going to find the median within those values.
So again, the median, this is an odd number of observations.
Where the median of this is 3. So that is our median of the absolute deviation. Our MAD statistic is 3 in this example.
Now to take that the next step further, we take, just recalling back to our equation here, we’ve now calculated out the MAD. Now we need to multiply that by the median. And that’s going to give us kind of our upper and lower bounds.
We also have this statistic in here that we’ll use, which is just a constant of 3.
And I’ll explain why we need that once we jump into Tableau. But for now, we’ve calculated out by hand our median absolute deviation, the MAD statistic, it’s 3.
And from that value, we could then calculate out the lower and upper bounds. Anything that would fall outside of those lower and upper bounds, that would be considered an outlier.
So now that we have a firm understanding of what the median absolute deviation is and how we can calculate it, I’m going to jump over to Tableau. One thing that you should understand, yes, we did calculate this by hand. It is extremely easy to do and easy to understand. But when we’re working with a large data set, calculating out millions of observations and then trying to do that by hand is just not feasible. That’s where we can utilize a tool like Tableau. And more importantly, once we calculate it out, we can then visualize it utilizing the software. So let’s jump over to Tableau.
Okay, now that we have a firm understanding of that equation, we’re going to jump over to the supplemental workbook. This is the workbook that we stood up for our median absolute deviation, the tutorial that we just looked at. You can download this on our Tableau public. It’s available to download and you can re-engineer what I’ve done here. But I’ll be walking you through each step in this tutorial or this video tutorial so you have an understanding of how to implement it as well.
One thing I wanted to show you is just how robust this calculation is. So you can see here I can filter between these different segments and that median absolute deviation is actually going to recalculate based on those, that new filter data and show me the outliers, again, looking at those different segments.
So it’s extremely robust the way it’s implemented and I’ll be going into the details on why that is as we jump in.
So let’s drill into the actual sheet and what I’m going to do is I’m actually going to duplicate the sheet and I’m going to remove all this formatting. So we kind of have a blank slate here.
Okay, so let’s drill into the sheet itself and if we were to start by scratch, we would have this line chart. You can see I’ve implemented a dual axis line chart because I want to implement some conditional formatting on these circles here later.
But first let’s jump in and try to understand the median absolute deviation. So if you recall back to the equation in the tutorial, our first step is to calculate out the median and that’s extremely easy to do within Tableau. If you jump to the analytics pane, you can drag in our median with quartiles. If you drag that to table, it’s going to add this and then we can just remove the median
of the quartiles and we are left with the median.
So this is the median of our data set. It calculates it out for us on the fly. So as this data is filtered, again, this median will recalculate and that’s kind of foundational for this median absolute deviation technique to work is this kind of dynamic statistical analysis where it’s going to recalculate and show you those new outliers with whatever data you filtered.
So that’s our first step, calculate out the median. The next step, if you recall, is to calculate out the median absolute deviation.
So I have that created already. If we edit this calculation, which I’ve labeled MAD, and let me zoom in here on the calc itself, what I’m doing is I’m using table calculations. So again, this remains very robust where you can kind of update it on the fly and it will update dynamically.
And what I’ve done is I’ve taken our sum of sales minus the median, the window median of sum of sales. So this is going to take each observation, if you recall back to the equation, that’s our X1, X2, X3, etc. And it’s going to subtract the window median of each observation.
Then it’s going to take the absolute value of each of those, and that’s going to produce, if you recall back, that kind of new data set. So this would be the absolute values or the median absolute deviations of each observation.
And then it’s going to find the window median of those calculated, excuse me, of the product of that equation.
So we have our absolute value, it’s going to find the median absolute deviation from the median within the data on the fly because we’re using these table calculations.
So again, if we break that down step by step, we would have something similar to this.
We’re taking and we’re finding the absolute value of each observation minus the window median, and then we’re getting the window median of those absolute values from the median.
It’s a tongue twister for sure, but this is how we would calculate out the median absolute deviation within Tableau in a very dynamic and robust way so that utilizing these table calculations, as the data changes or we add filters, it’s going to just recalculate utilizing the table calculations.
So that was step two and three, bucketed into one calculation.
Now once we have our median absolute deviation calculated, the next step is to find those upper and lower bounds. So let’s take a look at our upper bound here. I’m just going to edit this calculation and let me zoom in.
And again, if you recall back to our documentation, what we’re doing is we’re taking the median absolute deviation, we’re finding the median of each observation or the total dataset,
adding three, this is our K constant that I mentioned earlier, and we’re doing that because it does have something to do with the standard deviations of the data. And because we want it to be at a higher level, if you recall back to trying to find outliers utilizing the standard deviation, we want to just utilize standard deviation of one. If we wanted it to be very robust and identify outliers that are pretty far outside of our standard deviation, we need to improve that and jump it up to standard deviation of two or three standard deviations from the average.
Here we’re utilizing the median and this K constant is essentially that multiplier.
This is going to show us the outliers that are pretty far away from the median.
So what we’re doing is we’re finding the median of our dataset and you can see here we’ve already put that out in here utilizing a reference line, adding three to it, and then we’re multiplying that by our median, or excuse me, by our MAD calculation.
And what that’s going to give us is it’s going to give us an upper and lower bound. Excuse me, let me explain the difference of the upper and lower bound. The upper bound is going to add in our K constant here, which is three.
Our lower bound equation is going to subtract that three. So if I zoom in here, our K constant of three is being subtracted in the other calculation just to be clear. That K constant is being added. So let me just double jump back over to our upper bounds and you can see I’m adding the K constant here. So those are the difference between the upper and lower bound.
And our next step is I just need to put these calculations onto the detail property of our marks card.
So because I’m utilizing a dual axis here, I’m going to make sure I’m on the all and then add these calculations or calculated fields to our detail properties here. That’s going to add it to both the line chart and the circle graph, or the dot plot, excuse me.
Now that those are in detail, we can start utilizing them and implementing them on our view.
So if I right click on the axis here, I can jump to add reference line.
And I want to add that reference line across the table.
So I’m going to select entire table here.
And then from our selection, what I actually want to the line to calculate on, I can choose a value. And now that I’ve added those two calculations here, you can see I can select them as values.
Another thing I just want to call out, you know, they have this delta symbol next to each calculation that represents that those are being calculated and driven off of a table calculation.
It’s important to understand that because it is, again, one of the mechanisms that makes this so robust. But there’s also limitations to that as well that you’ll have to consider even within the order of operations of filtering, you’ll have to consider that.
So I’ve selected that MAD upper bound calculation and currently it’s selected to the average.
I would recommend just choosing median. You can see here from the visualization. It did update, but it really didn’t move that line. But just so we’re calculating everything on the same level of detail, I would recommend that. I’m also going to change the label here. So I don’t want it to just be this computation.
I’m going to put a custom and I’m going to say MAD upper bound.
And I’m going to turn our tool tip. I’ll just leave that to automatic. Let me see. I think I’m done. We could do some other formatting here if you wanted to. You could bold these lines up a bit so you can see them a little better.
Maybe we’re going to make the line darker.
You can do all those things directly from this interface. But for now, I’ll call that good and I’ll just click OK. And now you can see that upper bound absolute deviation is now showing up.
So anything that would fall above the upper bound would be considered an outlier within this dataset.
Now let’s add the lower bound. So essentially we’re going to do the exact same thing here. I’ll select entire table. I’m going to select our MAD lower, change this to median, and I’m going to put a custom label on here and just call it our MAD lower bound.
Click OK. Actually, one more step.
I’m going to change that label to MAD lower bound. And just so we’re on the same page, I did bump the formatting up on the line. I’m just going to hop back, do that real quick here as well. So now I’ve implemented our MAD upper bound and our MAD lower bound calculated fields.
You can see anything that falls inside this we would consider not an outlier. Anything that falls outside of either of those bounds, we would consider an outlier. So we can see we have two observations here that fall above that upper bound value and zero observations that fall below this lower bound value.
Now as you saw when I first shared the screen, you can also implement some conditional formatting.
I will preview that with you here.
It’s very similar to what we would always do for conditional formatting. It’s just an if else statement where we’re looking at the SUM of sales and we’re seeing if it’s greater than that MAD upper, then we’re going to categorize that as above. And we would consider that an outlier.
Else we’re going to look if it’s lower than that MAD lower and if it does or if that value falls below the MAD lower bound, we would categorize it as below. Anything else is expected. So this is just regular data. It’s not an outlier.
Utilizing this technique at least.
So I’m going to select OK and I’m going to add that to this dot plot here, the circle chart. I’m just going to put that on the color property and you can see here we have above, we have expected values and it’s coloring properly. These that fall above that upper bound are now this gold color and then everything else is just our play for navy, navy blue.
There’s some other techniques you can implement where you can add in some banding that would show you kind of, you know, that would gray this out.
I’ll definitely, you know, allow you free thinking to implement however you want as far as the formatting.
The end of the day, this is the technique.
Like I mentioned earlier because we’re using these table calculations here, once we start filtering or adding filters, all of those techniques are going to automatically recalculate. So it’s very dynamic, very robust technique that you can implement.
That was calculating out the median absolute deviation utilizing Tableau to detect outliers and non-normalized data. This has been Ethan with Playfair+. Thanks for watching.