Ryan Sleeper
Combining data sources with a join or union is a powerful way to add context to your analyses but can lead to duplicate rows and incorrect results. Learn how to overcome the pitfalls of combining data with a simple calculated field.
Hi. This is Ryan with Playfair Data TV. In this video, I’m going to show you how to deduplicate joined rows in Tableau. Combining data sources in Tableau by creating either a join or a union is a powerful way to add context to your analyses. But unless you’re careful, it can lead to some unexpected and, as you’re about to see, very inflated results.
So I’ve got this tip for you to help you dedupe joined rows with a very simple calculated field in hopes that you don’t get caught in a meeting reporting an overinflated number. So over here in Tableau Desktop, let me first show you why this is needed. On another video, I show you some of the first things I like to do when working with a new data source, and one of the very first things I like to do is view the number of records.
We’re going to do that now just to take note of how many rows are in the Sample Superstore data set. So I’m going to start by dragging the Number of Records measure to the Text Marks Card. And you should see that there are 9,994 rows in this data source that we all know and love.
Take note of that number. And let’s also look at one of the sums of one of the metrics. I’m just going to double-click on Sales. And the default aggregation is SUM so that we see that there is approximately $2.3 million worth of sales in the Sample Superstore data set. The most common way that I like to join or combine data sources when I’m using them in Tableau is to start with a primary data source. And what normally happens in a real world situation is I realize that I need additional fields that live in another database table. So I will do a left join to that primary table.
I think of it as bolting on additional fields or additional context. So we’re just appending extra columns to our primary data source. That requires a left join in order to do that. I’m going to clear our slate here and show you an example of doing a left join in Tableau. I’m going to go to the Data Source tab and left-click and drag the Returns table right next to the Orders table.
That Venn diagram appears, showing me it’s going to do a join on those two tables. The default type of join is an inner join. That’s going to keep only the records that are in common across both of the tables. But in the scenario that I just described, I want to click on that Venn diagram and choose a left join instead of the default, inner join.
As you can see, the field that is in common across both of these tables is the field called Order ID. So this type of join is going to keep every single row from my primary data source that I’ve been working with from the Orders table, and it’s going to bolt on or append a new column that tells me whether or not the order was returned from the other table. I’m now going to jump back into the Authoring interface. And as long as you quality check this, you can’t really get caught reporting an unexpected number. So I highly encourage you to quality check your numbers.
So I once again will drag Number of Records to the Text Marks Card. And now we no longer see that 9,994 number. We see 12,420. This happens any time there are multiple matches on the right side. It can inflate the numbers. It’s going to duplicate certain rows in order to create the join in your table.
If I were to look at this underlying data, I’m sure we could find an example of this. So I’m going to look across this to the Returns column. And let me find one here. So the Newell 341– because that had an order ID three times in my Returns table, it tripled it on that primary table on the left. The reason that is problematic is, if we look at the sales number of $9, that just got inflated times 3. So we’re up to $27. We’re no longer reporting the correct answer.
We quality checked the SUM of Sales number as well. Let’s do that and see if we’re still around that $2.3 million number. We are not. See, it has been inflated to approximately $2.9 million. So we’ve just added $600,000, not due to our performance, but simply because of this join.
So that’s why this is an issue. But I’m going to show you an easy way to fix it. You can aggregate this next calculated field I’m about to show you at any level you would like. But it requires a level of detail calculation to deduplicate those cases where there are multiple matches on the right side.
You can do this, like I said, at any level you would like. But it’s easiest to accomplish in this scenario if you have a unique row identifier in your data set. So on the primary table with 9,994 rows, if we had a unique identifier, we could dedupe the sales values per unique row. And that would return us to the correct sales amount.
Well, this is a little-known fact, but the Sample Superstore data set actually does have unique row identifiers as a dimension. However, by default, they are hidden. So most people have never seen this row ID. But if you click on this down arrow in the top-right corner of the Dimensions area, you can show which fields are being hidden.
If I click on that, we will see a new one appear called Row ID. The fields that are grayed out are hidden at the moment, and that includes Row ID. If I right-click on any of those hidden fields and choose Unhide, that dimension will now be available to me to use in a calculated field.
To add this to a calculated field, if I know I want to include it as an element within that calculated field, I can right-click directly on it, hover over Create, and click Calculated Field. And I will call this one Deduplicated Sales.
And as I mentioned, this is a level of detail expression. It starts with an open curly bracket. It’s FIXED because we want to do this across the entire file per Row ID. So I’m choosing FIXED instead of the EXCLUDE or INCLUDE.
What comes next is a colon punctuation mark. What comes next is the aggregation. You can use minimum, maximum, or average here. But let’s go back to that Newell product example where we had three rows. If we took– so there’s $9 of sales, $9, $9– three times. If we took the MIN of that, it would be 9. The MAX of that would be 9. Average of that would be 9. So you can choose whichever one you would like.
My brain thinks about this as minimum, so I’ll do minimum as the aggregation, and then my measure. Close parentheses, close curly bracket. And that’s it for now. If you’re new to level of detail expressions, we do have several videos here on the Advanced track covering how to write this syntax. But this is just one of infinite applications of the level of detail syntax.
I’m going to click OK and replace SUM of Sales with my newly created version, Deduplicated Sales. And now we are back to that number we were expecting, $2.3 million. Now that I trust the data source and I trust this newly created calculated field that basically gets me back to the same Sales measure from the left, I’m comfortable with rolling this out and using it in my business.
So a quick trick, just to show you how to deduplicate combined rows in a Tableau data source, and the main reason for this is it’s going to help you avoid over inflated numbers.
This has been Ryan with Playfair Data TV – thanks for watching!