One of the most common data cleaning issues after joining or unioning data is duplicate rows. Without deduplicating rows, you run the risk of unintentionally showing inflated numbers. Whether they are exact duplicates or similar rows that need to be condensed, Tableau Prep’s relatively new ‘Identify Duplicate Rows’ feature can remove and consolidate the duplicates with just a few clicks.

 

Using the Identify Duplicate Rows feature

Identify Duplicate Rows was added to Tableau Prep in version 2024.1. Before this release, searching for duplicate rows included complicated partition calculations, or extra aggregation steps to count rows. For this tutorial, I’ll use a modified copy of the Sample Superstore data set where some duplicate rows have been introduced. Let’s use Identify Duplicate Rows to find those copied records.

New to Playfair Data?

Watch our 90-second intro video and receive tips on making your data usable.
 
Watch Video

After connecting the data, drag the table onto the canvas. Add a Clean step by clicking the plus (+) icon next to it in the Flow pane. You’ll see a data preview and you can find the Identify Duplicate Rows button in the toolbar above the table preview.

Clean Step Header

Clicking Identify Duplicate Rows automatically generates a calculated field called ‘Is Duplicate Row’ that identifies the copies. It will show up in the Changes pane where you can track your actions. If not visible, expand the Changes pane by clicking the arrow to the left of the table preview.

Is Duplicate Row Calculation

 

Removing duplicate rows from the data

The ‘Is Duplicate Row’ calculation returns ‘Unique’ for just one copy of the row. 

One Unique Record

This means that filtering out the records flagged as duplicates will keep a single copy of each record, resolving the duplication issue. To create this filter, simply select the Duplicate value from the Is Duplicate Row field and then select Exclude.

 

Customizing the “Is Duplicate Row?” calculation

So, what if your duplicate rows aren’t exact duplicates? Let’s add the automatically generated Source Row to this data source. To add it, click the table in your Flow pane and, in the Field Preview, double-click on Source Row and choose Include Field.

This time, clicking Identify Duplicate Rows doesn’t find any copied records because each record has a distinct Source Row value.

Where have all the good dupes gone

Instead, we can select Order ID, Product ID, and Quantity before clicking Identify Duplicate Rows. This will generate a version of the Is Duplicate Rows calculation that ignores the unselected rows. This version will ignore the Source Row column and find the duplicate rows.

Receive More Free Data Tutorials Like This Every Week

Let’s stay in touch:

Playfair+Envelope

Alternatively, we can edit the calculated field directly by double-clicking it in the Changes pane. Deleting Source Row from the Partition clause makes the calculation ignore it and find duplicate rows as if that field did not exist. 

Delete Fields From Calculation to remove duplicates

If you need to keep the earliest or most recent record, you can control which row is flagged as unique by modifying the ORDERBY clause. To do this, change the field after ORDERBY and choose DESC or ASC to specify the sort order.

While it was possible to perform these tasks manually before the Identify Duplicate Rows feature, the new feature makes it a lot easier! I hope this tutorial has helped you understand how it works and the possibilities it unlocks in Tableau Prep.

Thanks for reading!
Felicia

Related Content

Ryan Sleeper

Avoid inflated numbers with this simple level of detail calculation Combining data sources with a join or union is a…