In this tutorial we will cover why it is important to remove duplicate values and then walk through a step-by-step example using SQL.
A common task in data preparation is handling duplicate observations. These can be rows that are exact duplicates, or rows that duplicate only a subset of the columns. In either case, it is often necessary to deduplicate the data and keep a single row.
It is easy to deduplicate data in Excel. Often, however, the data lives in a data warehouse and must first be extracted into an Excel spreadsheet before analysis can begin. At best, this is an added step that wastes time downloading the data. At worst, there is too much data for Excel to handle. When that happens, the usual workaround is to filter the data before downloading it, but filtering will miss duplicates that remain in the data warehouse. In either case, it is easier to perform these operations directly in the data warehouse and return only the result.
Before deleting rows from a table, we often want to identify those rows that contain duplicates. Consider a table SAMPLE_TABLE with five columns, column_a, column_b, column_c, column_d, column_e. We can identify those rows that are complete duplicates by running
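One way to do this is to group on every column and keep only the groups that occur more than once. A minimal sketch (the `duplicate_count` alias is our own naming):

```sql
-- Group by every column; any group with more than one row is a complete duplicate.
SELECT column_a, column_b, column_c, column_d, column_e,
       COUNT(*) AS duplicate_count
FROM SAMPLE_TABLE
GROUP BY column_a, column_b, column_c, column_d, column_e
HAVING COUNT(*) > 1;
```

Each row of the result is one duplicated combination, with `duplicate_count` showing how many copies exist.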
If instead, we wanted any duplicates of just column_a and column_b regardless of the values in the rest of the columns, we can run
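The only change is to group on the two key columns alone:

```sql
-- Group on just column_a and column_b; any group with more than one row
-- contains duplicates of that combination, whatever the other columns hold.
SELECT column_a, column_b, COUNT(*) AS duplicate_count
FROM SAMPLE_TABLE
GROUP BY column_a, column_b
HAVING COUNT(*) > 1;
```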
Now that we have identified which combinations of columns have duplicates, we are ready to delete all but one of the duplicate rows. We use a common table expression (CTE) to calculate a row number within each group of duplicate rows, using the SQL function ROW_NUMBER over a WINDOW that defines the group as
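A sketch of that CTE, assuming SQL Server syntax (the CTE name `numbered_rows` and the alias `rownumber` are our own; SQL Server requires an ORDER BY inside the window, and `ORDER BY (SELECT NULL)` leaves the ordering within each group arbitrary):

```sql
WITH numbered_rows AS (
    SELECT column_a, column_b, column_c, column_d, column_e,
           -- Number the rows within each column_a/column_b group, starting at 1.
           ROW_NUMBER() OVER (
               PARTITION BY column_a, column_b
               ORDER BY (SELECT NULL)
           ) AS rownumber
    FROM SAMPLE_TABLE
)
SELECT * FROM numbered_rows;
```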
Then we can delete all records where rownumber is not equal to 1.
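Putting the two steps together (this DELETE through an updatable CTE works in SQL Server; engines that do not support it may need a subquery instead, and the `numbered_rows` / `rownumber` names are our own):

```sql
WITH numbered_rows AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY column_a, column_b
               ORDER BY (SELECT NULL)
           ) AS rownumber
    FROM SAMPLE_TABLE
)
-- Deleting through the CTE removes the extra copies from SAMPLE_TABLE itself,
-- keeping exactly one row per column_a/column_b combination.
DELETE FROM numbered_rows
WHERE rownumber <> 1;
```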
Note that because the CTE contains references to the relevant rows in SAMPLE_TABLE, deleting a row through the CTE also deletes it from the underlying table.
Using ROW_NUMBER over a WINDOW allows even more control over which row to keep. Assume that we want to keep only a single row for each combination of column_a and column_b and we have a sixth column, transaction_date. To keep the first transaction in each set of duplicates, we can simply add ORDER BY transaction_date to the WINDOW as
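A sketch of this, again assuming SQL Server syntax and our own `numbered_rows` naming:

```sql
WITH numbered_rows AS (
    SELECT transaction_date,
           ROW_NUMBER() OVER (
               PARTITION BY column_a, column_b
               -- Ascending sort: the earliest transaction gets rownumber 1.
               ORDER BY transaction_date
           ) AS rownumber
    FROM SAMPLE_TABLE
)
-- Keep the first transaction in each group; delete the rest.
DELETE FROM numbered_rows
WHERE rownumber <> 1;
```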
Alternatively, if we wanted to keep the last transaction, we would run
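Only the sort direction changes (same SQL Server assumptions and hypothetical `numbered_rows` naming as before):

```sql
WITH numbered_rows AS (
    SELECT transaction_date,
           ROW_NUMBER() OVER (
               PARTITION BY column_a, column_b
               -- Descending sort: the latest transaction gets rownumber 1.
               ORDER BY transaction_date DESC
           ) AS rownumber
    FROM SAMPLE_TABLE
)
-- Keep the last transaction in each group; delete the rest.
DELETE FROM numbered_rows
WHERE rownumber <> 1;
```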
SQL makes it easy to find duplicate rows using GROUP BY and HAVING in SELECT statements. By combining CTEs with the ROW_NUMBER function over a WINDOW, we can just as easily keep only a single row out of each set of duplicates.
By following the steps above, you have used SQL to remove duplicate values and transform data directly in your cloud data warehouse. Please feel free to reach out to a member of our team if you have any additional questions.