In this tutorial we will cover why it is important to remove duplicate rows, why it often makes sense to do so in SQL rather than Excel, and a step-by-step example using SQL.

Why is it important to remove duplicate rows?

A common task in data preparation is handling duplicate observations. These can be rows that are exact duplicates, or rows that duplicate only a subset of the columns. In either case, it is often necessary to deduplicate the data and keep just a single row from each group of duplicates.

SQL over Excel:

It is easy to deduplicate data in Excel. Often, however, the data lives in a data warehouse and must be extracted into a spreadsheet before analysis can begin. At best, this is an added step that wastes time downloading the data. At worst, there is too much data for Excel to handle. When that happens, the usual workaround is to filter the data before exporting it, but filtering can miss duplicates that remain in the data warehouse. In either case, it is easier to deduplicate directly in the data warehouse and return only the result.

In SQL:

Before deleting rows from a table, we often want to identify which rows are duplicated. Consider a table SAMPLE_TABLE with five columns: column_a, column_b, column_c, column_d, and column_e. We can identify the rows that are complete duplicates by running


SELECT column_a, column_b, column_c, column_d, column_e, COUNT(*) AS group_count
FROM SAMPLE_TABLE
GROUP BY column_a, column_b, column_c, column_d, column_e
HAVING COUNT(*)>1
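
To make this concrete, here is a minimal, hypothetical version of SAMPLE_TABLE (the column types and sample values below are assumptions for illustration only; a sixth column, transaction_date, used later in the tutorial, is omitted here)


CREATE TABLE SAMPLE_TABLE (
    column_a VARCHAR(10),
    column_b VARCHAR(10),
    column_c INT,
    column_d INT,
    column_e INT
);

INSERT INTO SAMPLE_TABLE (column_a, column_b, column_c, column_d, column_e) VALUES
    ('A', 'X', 1, 10, 100),  -- exact duplicate of the next row
    ('A', 'X', 1, 10, 100),
    ('B', 'Y', 2, 20, 200),  -- matches the next row on column_a and column_b only
    ('B', 'Y', 3, 30, 300);

Against this sample data, the query above returns a single group, ('A', 'X', 1, 10, 100), with group_count equal to 2. The two ('B', 'Y') rows are not flagged because they differ in the remaining columns.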

If instead we wanted to find duplicates on just column_a and column_b, regardless of the values in the remaining columns, we can run


SELECT column_a, column_b, COUNT(*) AS group_count
FROM SAMPLE_TABLE
GROUP BY column_a, column_b
HAVING COUNT(*)>1

Now that we have identified which combinations of columns have duplicates, we are ready to delete all but one row from each set. We use a common table expression (CTE) to calculate a row number within each group of duplicate rows, using the window function ROW_NUMBER() with a PARTITION BY clause that defines the group. (Some databases also require an ORDER BY inside the OVER clause; because the rows within a group are exact duplicates, any column can be used there.) The numbering itself looks like


SELECT column_a, column_b, column_c, column_d, column_e, 
ROW_NUMBER() OVER (PARTITION BY column_a, column_b, column_c, column_d, column_e) AS rownumber
FROM SAMPLE_TABLE

Then we can delete all records where rownumber is not equal to 1.


WITH ROW_DELETE AS 
(
SELECT column_a, column_b, column_c, column_d, column_e, 
ROW_NUMBER() OVER (PARTITION BY column_a, column_b, column_c, column_d, column_e) AS rownumber
FROM SAMPLE_TABLE
)
DELETE FROM ROW_DELETE WHERE rownumber <> 1

Note that because the CTE simply references the relevant rows in SAMPLE_TABLE, deleting a row through the CTE deletes it from the underlying table. (Support for deleting through a CTE varies by database; SQL Server, for example, allows it.)
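
Before running the DELETE, it can be helpful to preview exactly which rows will be removed. One simple way is to keep the same CTE and replace the DELETE with a SELECT


WITH ROW_DELETE AS 
(
SELECT column_a, column_b, column_c, column_d, column_e, 
ROW_NUMBER() OVER (PARTITION BY column_a, column_b, column_c, column_d, column_e) AS rownumber
FROM SAMPLE_TABLE
)
SELECT * FROM ROW_DELETE WHERE rownumber <> 1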

Using ROW_NUMBER() over a window gives even more control over which row to keep. Assume we want to keep only a single row for each combination of column_a and column_b, and that the table has a sixth column, transaction_date. To keep the earliest transaction in each set of duplicates, we can simply add ORDER BY transaction_date to the window as


WITH ROW_DELETE AS 
(
SELECT column_a, column_b, column_c, column_d, column_e, transaction_date,
ROW_NUMBER() OVER (PARTITION BY column_a, column_b ORDER BY transaction_date) AS rownumber
FROM SAMPLE_TABLE
)
DELETE FROM ROW_DELETE WHERE rownumber <> 1

Alternatively, if we wanted to keep the last transaction, we would run


WITH ROW_DELETE AS 
(
SELECT column_a, column_b, column_c, column_d, column_e, transaction_date,
ROW_NUMBER() OVER (PARTITION BY column_a, column_b ORDER BY transaction_date DESC) AS rownumber
FROM SAMPLE_TABLE
)
DELETE FROM ROW_DELETE WHERE rownumber <> 1
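
Whichever version we run, we can confirm that the deduplication worked by rerunning the duplicate check from earlier; if all duplicates have been removed, it returns no rows


SELECT column_a, column_b, COUNT(*) AS group_count
FROM SAMPLE_TABLE
GROUP BY column_a, column_b
HAVING COUNT(*)>1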

Conclusion:

SQL makes it easy to find duplicate rows using GROUP BY and HAVING in SELECT statements. By combining CTEs with the ROW_NUMBER window function, we can just as easily keep only a single row from each set of duplicates.

By following the steps above, you can remove duplicate rows with SQL and transform data directly in your cloud data warehouse. Please feel free to reach out to a member of our team if you have any additional questions.
