Do you ever experience unnecessary stakeholder panic or jubilation when there is volatility in your reporting trends? Learn how to use reference distributions and table calculations to determine when you should pay attention to the roller coaster of spikes and dips.
Calculated field referenced that classifies data points as Good, Bad, or Expected:
IF SUM([Sales]) < (WINDOW_AVG(SUM([Sales])) – WINDOW_STDEV(SUM([Sales]))) THEN “Bad Anomaly”
ELSEIF SUM([Sales]) > (WINDOW_AVG(SUM([Sales])) + WINDOW_STDEV(SUM([Sales]))) THEN “Good Anomaly”
Hi, this is Ryan with Playfair Data TV, and in this video, I’m going to show you a couple of different ways to do automatic anomaly detection in Tableau.
To help illustrate why this is going to be valuable for you, I’ve got this simple line graph over in Tableau Desktop that looks at just Sales versus continuous Month of Order Date.
Part of my business is I’m a consultant. So I work with a lot of different stakeholders across industries. And we typically do some type of weekly or monthly reporting. You’re very likely in a similar situation.
You’ve probably been in a meeting somewhere on a Monday showing a weekly report, and you’ll show them a graph like this, and the end user is immediately drawn to the huge spikes or the huge declines. It’s kind of human nature. That’s kind of what we get focused on. And we’re distracted from the rest of the story.
What I’m going to show you is a couple of methods to tell ourselves, as well as our end users, whether those large peaks and large valleys should be important to us.
The first way I’m going to show you is how to set up some reference distributions that show what data points are outside of two standard deviations and/or outside of one standard deviation from the mean.
This one’s very easy to set up. You just right-click on the y-axis, and click Add Reference Line.
This one is a distribution, so I’m going I’m going to click the Distribution tab. And instead of 60% and 80% of average, I’m going to flip that to Standard Deviation.
And you can see this updating in real-time in the background. We’ve now got by default here gray shading that represents plus or minus one standard deviation from the mean.
For this first reference distribution, though, I’m going to change that to be plus or minus two standard deviations. And click OK.
And because I started with the wider distribution, I can now add a secondary distribution within that first one. So if the first one’s two standard deviations, I can add a second one that is plus or minus one standard deviation.
I’m also going to change– so we can see it’s working there, but I want to change the shading for the two standard deviations, because right now those grays are laying right on top of each other. If I didn’t have those labels, I wouldn’t even be able to tell where those breaks were between one and two standard deviations.
So I’m going to edit my first reference distribution and probably change this to a lighter gray. It’s much better.
Of course, I could take a little bit more time and clean up this formatting even more. But we’re going to call that good for now just for illustration purposes.
Now, in addition to seeing the line, the gray shading represents plus or minus one standard deviation from the mean, and the lighter gray shading represents plus or minus two standard deviations from the mean.
So this already provided some statistical context to tell me which data points were most important to focus on. But now I’m going to show you a calculated field that will do some of the work for you to highlight which data points are outside of a standard deviation. It’s a technique that I really like to use. It involves a calculated field. And I’ll just call this Important Data Points.
And you’re not meant to watch me type here at Playfair Data TV. So I’m going to copy and paste this formula, but I will walk you through what this is doing. And you can also find this in the Related Content tab below this video if you want to copy and paste this yourself and use it.
But this uses a table calculation that looks at each data point. So SUM of Sales for each month. And it compares it to the Window average.
It didn’t like my minus signs either when I copied and pasted. There we go.
So it’s looking at my sum of Sales for each month comparing it to the Window average. So the mean for the entire line graph.
And then it says, if this value is less than the mean minus one standard deviation, then it’s a ‘Bad Anomaly’. If that sum of Sales number is greater than the Window average plus one standard deviation, it’s a ‘Good Anomaly’. Otherwise, it was an ‘Expected Result’. It’s within our expected range. We don’t care to focus on that as much.
So that is the entire formula. I’m going to click OK.
And now I’m going to show you kind of a tricky way to color the line graph markers based on whether it’s a positive anomaly, a negative anomaly, or an expected result.
This involves a dual-axis combination chart. To start the chart, I need to duplicate my measure on the view. So I’m going to hold down the Control key and drag Sales right next to itself.
At this point, we’ve got the same line graph with the same reference distributions on two different rows. But now that we’ve got two measures on the rows shelf, they each get their own set of Marks Cards, which can be edited independently of each other.
So for the second row, I can change the mark type to Circle, and color those marks by our newly created Important Data Points calculated field. I’m going to drag that to the Color Marks Card.
And we should see three different colors appear; one for a Good Anomaly, one for a Bad Anomaly, and one for everything else. I might remap these colors just a little bit. Maybe Good we’ll have colored blue, Bad colored red, and Expected colored gray.
I’m going to click OK.
To combine these into a dual-axis combination chart so that they’re laying right on top of each other, there’s several ways to do it. The way that most people learn is to click on the second pill on the Rows Shelf, and click Dual Axis, the third option from the bottom.
When I do that, the circles are now laying on top of the line. I’m not loving these labels. So I might go ahead and clean that up by editing my reference distribution and changing the labels to None. I’ll do the same thing for one standard deviation.
You could represent this information in a different way, either as a how to read this chart, or as a different color legend that shows what plus and minus one or two standard deviations means. For me, though, this is just kind of muddying up the view. So I’m just cleaning up these labels a little bit.
So you can see this coming together.
One last thing with dual-axis combination charts that are on the same scale, so both of my axes should be using the same y-axis. I like to ensure that those axes are synchronized, which you can do by right-clicking on either axis, and clicking Synchronize Axis.
You might have seen just a tiny shift, but that just ensures my axis on the left is perfectly lined up with my axis on the right. And now we no longer actually need the axis on the right. So I’ll right-click and deselect Show Header to get rid of that axis.
Lastly, to clean this up, I might make those circles just a little bit larger by clicking on the Size Marks Card and dragging this over to the right.
But now how I would read this and how this provided value to my end users. Here’s a good example. When we jumped from here to here, if we were showing that in a weekly meeting, we might be throwing a big party. Maybe we’re going to go home early today.
But when we provide this statistical context, we can see that that jump from there to there wasn’t actually as important as it might have seemed. Yeah, it’s great. It’s certainly better than being a negative. But it’s not even outside of one standard deviation from our expectation. So maybe let’s just kind of temper that a little bit, not get too excited.
Or the other way– and in fact, this is more the real life scenario that you probably come across. From this circle to this circle, we have this huge decline. Well, before we panic and work 80 hours this week, let’s put it into this statistical context and see that, well, yeah, we had a week over week drop, or month over month drop in this case, but it’s not the end of the world. It’s still within our expected range. So we’ll keep an eye on it, but let’s not really panic quite yet.
And now we’ve just got better context on which circles or months we should focus on, both in a positive or a negative direction. We’ve just automatically created some anomaly detection within Tableau.
This has been Ryan with Playfair Data TV – thanks for watching!