Ethan Lang
In this video, Ethan will show you how to create histograms in Tableau using bins. He will also cover key considerations when looking at distribution and preparing your data for modeling.
Hi, this is Ethan with Playfair+, and in today’s video, I’m going to be showing you how to create a histogram in Tableau so you can begin to understand the distribution of your data.
Understanding the distribution of your data is an extremely important step, especially once we begin to apply statistical modeling to our datasets. Tableau makes it extremely easy for us to build a histogram. Let’s dive into our supplemental workbook here so I can show you that. This is a histogram of profit by order. What this is doing is sorting our profit values into these bins, and that’s what creates this histogram. To explain that in a little more detail, if I were to focus on this bin, it would represent any order in our data that had a profit between $20 and $40.
So we can see here that there were 1,164 orders that had a profit value within that bin. Same thing here. This would be 565 orders with profit values ranging from $40 to $60. Now you can see that the distribution of this data set actually forms a nice normal distribution for us. A normal distribution will look very similar to this, and this is what we would call a bell curve, which describes the shape of this data. Now let’s dive in, and I’m going to show you how to create some histograms from scratch utilizing that bin feature. So let’s say instead of profit, we wanted to create a histogram that analyzed the distribution of our sales measure.
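As a rough illustration of what binning does to the profit values described above, here is a minimal Python sketch. The profit numbers and the `bin_start` helper are hypothetical and not taken from the workbook, and the sketch assumes each bin includes its lower edge and excludes its upper edge.

```python
import math
from collections import Counter

def bin_start(value, bin_size):
    """Return the lower edge of the bin that contains value (hypothetical helper)."""
    return math.floor(value / bin_size) * bin_size

# Hypothetical per-order profit values, not taken from the workbook.
profits = [22.5, 35.0, 39.99, 41.0, 58.2, -3.0, 20.0]

# Count how many orders fall into each $20-wide bin.
counts = Counter(bin_start(p, 20) for p in profits)
for start in sorted(counts):
    print(f"${start} to ${start + 20}: {counts[start]} orders")
```

Each bar in the histogram corresponds to one of these counts, so the $20 to $40 bar is simply the number of orders whose profit rounds down to the 20 bin.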
To do that, it’s very simple. I’m going to right click on our sales measure, hover over create, and then select bins. You can see that this opens this menu here. It’s going to assign a field name, which is just sales bin. I’ll leave it as the default.
And then Tableau is automatically going to suggest what the bin size should be. So, it’s saying we should bin our data from $0 to $507, $507 to $1,014, and so on.
Personally, I feel like that’s probably a little too high, so I’m going to adjust this down to $100. So, this is saying I want all of our sales values to bin from $0 to $100, $100 to $200, and so on. If for whatever reason I wanted to go back to the suggested bin size, I can simply click on this, and it will reset that value back to the size that Tableau suggests by default. Let me go ahead and change that back to 100. There’s also the option to hook up the bin size to a parameter. So, you can either create a new parameter or point it to one of the other parameters that you’ve created already. What that will do is allow the users themselves to dictate how big that bin should be. So, they can change the bin size on the fly from Tableau Server or Tableau Cloud and visualize the distribution of the data with those new bin values. Beneath where we would set our bin size, we have this menu here that shows us our range of values, and there are some helpful data points within this menu.
So, it’s going to show us our min value within our data. It’s also going to show us our max value within the data set. So, this would represent the min sales and the max sales within our data.
It’s going to show us the difference between those two values. So here, because we’re going from 0 to 22,638, that’s what our difference is. It’s also going to show us the distinct count of unique sales values within our data. So, you can see here there are 3,481 different sales values within the data set. Now, I’ve set this at 100 and I’m going to call this good. When I click OK, what’s going to happen is it’s going to create that sales bin. So, we’ll see this new dimension called sales bin appear here in my data pane. What I’m going to do is simply replace the profit bin by dropping sales bin right on top of it. You can see that’s going to change the data, and it’s now visualizing our sales measure in bins with a size of 100.
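For anyone curious where a suggested size like $507 could come from, here is a minimal Python sketch. Tableau’s help documentation gives the rule as Number of Bins = 3 + log2(n) * log(n), with the bin size being the range divided by that number. Treating the second log as base 10 and using the 3,481 distinct sales values as n are assumptions on my part, and the function name is hypothetical.

```python
import math

def suggested_bin_size(min_value, max_value, n):
    """Range divided by (3 + log2(n) * log10(n)); the log base and the
    choice of n are assumptions, not confirmed by the video."""
    number_of_bins = 3 + math.log2(n) * math.log10(n)
    return (max_value - min_value) / number_of_bins

# Min, max, and distinct count as read from the bin dialog in the video.
size = suggested_bin_size(0, 22_638, 3_481)
print(round(size))
```

Under those assumptions the result rounds to 507, which matches the suggested size shown in the dialog.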
I can also see that there are some extreme values here. As a matter of fact, I’m going to clear this profit bin off of our filters so we can see even more extreme values.
I would consider these outliers within this particular analysis. If I were to look at this, I can see that there’s one order that had a sales value of $22,600. And those extreme values, while they are important to consider, I’m going to go ahead and exclude from this tutorial so we can focus more on the visualization of the data itself. So, to exclude those values, I’m just going to select them and click exclude. And now you can see here we have this distribution of our data.
Now one thing that you can immediately see is this sales measure does not follow a normal distribution. Before, with our profit data, the bins kind of ramped up. They had this bell-shaped curve where they peaked in the middle and then started ramping down. You can see here with our sales data, however, that the majority of our orders had sales between $0 and $100, and then it just slowly declines from there. It’s the number of orders that declines, I should say, as the sales values increase.
Now it’s really important to recognize the distribution of our sales data. Once we start applying statistical models, like I mentioned earlier, some of those models within Tableau, even within the analytics pane, assume that your data is normally distributed. We can see here with our sales measure, however, that this data is not normally distributed. So, we would want to apply different statistical models that don’t have that assumption, so that we’re getting accurate results and we’re not being misled.
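One way to put a number on what the histogram shows is a skewness coefficient. This is not something the video does in Tableau. It is a minimal Python sketch of the standard moment-based skewness, with hypothetical sales values and a hypothetical function name, where a value near zero suggests a symmetric shape and a clearly positive value suggests a long right tail.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: third central moment over
    the second central moment raised to the power 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical order-level sales values with a long right tail.
sales = [12, 25, 40, 48, 60, 75, 90, 150, 320, 900]
print(skewness(sales))            # positive for this right-tailed sample
print(skewness([1, 2, 3, 4, 5]))  # zero for a symmetric sample
```

A check like this is only a rough guide, and it does not replace looking at the histogram itself.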
Jumping back over to our workbook, I want to show you a few more things on skewness. This distribution of data is what we would consider right skewed. Now there are several different types of skewness, so let me hop back to the tutorial and we’ll look at those in a little more detail. So, I’ve talked a lot about normal distribution, and this is the shape the data would typically take if it was normally distributed. When data is normally distributed, we call this having no skewness, and the mode would equal the mean, which would equal the median. Now in real data this is not always going to be exactly the case. As a matter of fact, if all of these statistics equaled each other perfectly, there’s probably something wrong with the data itself. It’d be very, very strange to see all of those equal one another perfectly.
However, they are going to be extremely close, and we would still consider that normally distributed even if they weren’t necessarily equal to one another. Now when your data is left skewed, what’s going to happen is your mean is less than the median and your median is going to be less than your mode. So, you can see that visualized here within this chart. With those statistics and the way that that would fall and be visualized, you can see that the majority of your data would kind of lean to the right side.
Now this is called left skewed, and I know that’s confusing when the majority of our data is on the right. We consider that left skewed because the name is really talking about this section over here, the tail. We’re seeing some extreme outliers that are causing this data to be skewed, so that’s why they would call it left skewed. The majority of the data is distributed to the right, and those outliers pull the distribution to the left. Then we can see our right skewed data. This is basically just the reverse of left skewed, where our mode is less than our median, which is less than the mean, and you can see that visualized here in this chart. Again, with right skewed data, the majority of our data is on the left here and it’s skewed to the right, so it’s more about the outliers over here that are causing that skewness and less about where the majority of your data sits. That’s a little bit about normal distribution and skewness to go along with this tutorial, and you’ll also be equipped now to build histograms and start analyzing the distribution of your data utilizing Tableau’s bin feature.
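The mean, median, and mode orderings described above can be checked with a few lines of Python. This is a minimal sketch with hypothetical binned values and a hypothetical helper name. The ordering is a rule of thumb that holds for simple one-peaked shapes like these, not a guarantee for every data set.

```python
import statistics

def describe_skew(values):
    """Classify skew by comparing mode, median, and mean (rule of thumb)."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)
    if mode < median < mean:
        shape = "right skewed"
    elif mean < median < mode:
        shape = "left skewed"
    else:
        shape = "no clear skew by this rule"
    return mean, median, mode, shape

# Hypothetical bin values: most orders in the low bins, a few far to the right.
right = [0, 0, 0, 0, 100, 100, 100, 200, 200, 300, 400, 900]
# Mirror image of the same data: most values high, a few far to the left.
left = [900 - x for x in right]

print(describe_skew(right)[3])
print(describe_skew(left)[3])
```

In the first list the mode is 0, the median is 100, and the mean is about 192, so the few large values pull the mean to the right of the bulk of the data, which is the right skewed pattern from the sales histogram.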
Thank you for watching. This is Ethan with Playfair+.