Identify Duplicate Data in Pandas

This tutorial explains how to identify and handle duplicate data with pandas.

Packages

This tutorial uses:

pandas

Open up a Jupyter Notebook and import the following:


import pandas as pd

Creating the data

For this example, we will create a dataframe that contains several duplicated rows, along with rows that differ only by letter case (for example, 'A' and 'a').


df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']
                  })
df

Identify duplicates

Duplicate in all columns

The duplicated method returns a Boolean series indicating whether each row is a duplicate of another row. By default, all columns are compared. The keep parameter accepts 'first' (the default), which marks the first occurrence False and every later occurrence True; 'last', which marks the last occurrence False and every earlier occurrence True; or False, which marks every occurrence True.


dups = df.duplicated()
dups
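To make the three keep options concrete, here is a minimal sketch on a small, hypothetical dataframe (the name small and its values are illustrative assumptions, not part of the tutorial's data). Rows 0, 1, and 3 are identical.

```python
import pandas as pd

# Hypothetical four-row frame: rows 0, 1, and 3 are identical.
small = pd.DataFrame({'A': ['x', 'x', 'y', 'x'],
                      'B': [1, 1, 2, 1]})

first_flags = small.duplicated()             # keep='first' is the default
last_flags = small.duplicated(keep='last')   # last occurrence is not flagged
all_flags = small.duplicated(keep=False)     # every occurrence is flagged

print(first_flags.tolist())  # [False, True, False, True]
print(last_flags.tolist())   # [True, True, False, False]
print(all_flags.tolist())    # [True, True, False, True]
```

Row 2 is unique, so it is False under every option.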

To see the duplicate rows, use the Boolean series dups to select rows from the original dataframe.


df[dups]
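Two follow-up checks are often useful here: counting how many rows are flagged, and viewing every copy of a duplicated row side by side. The sketch below uses a small, hypothetical dataframe (small is an illustrative name, not the tutorial's df).

```python
import pandas as pd

# Hypothetical frame: rows 0, 1, and 3 are identical.
small = pd.DataFrame({'A': ['x', 'x', 'y', 'x'],
                      'B': [1, 1, 2, 1]})

# True counts as 1, so summing the Boolean series counts flagged rows.
n_dups = int(small.duplicated().sum())

# keep=False flags every copy, so the first occurrence is shown too.
# Sorting places identical rows next to each other.
all_copies = small[small.duplicated(keep=False)].sort_values(['A', 'B'])

print(n_dups)           # 2
print(len(all_copies))  # 3
```

With the default keep='first', the first occurrence is not counted, which is why the count is 2 while three rows share the same values.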

Duplicate in selected columns

When the subset parameter is passed a list of columns (in this case, A and B), duplicated compares only those columns and returns a Boolean series indicating whether each row is a duplicate with respect to them. The remaining columns are ignored.


dups = df.duplicated(subset=['A', 'B'])
dups

Next, take a look at the duplicates:


df[dups]
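The difference between comparing all columns and comparing a subset can be seen in a minimal sketch. The dataframe below is hypothetical (small and its values are assumptions for illustration): the first two rows match in A and B but differ in C.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 agree on A and B but not on C.
small = pd.DataFrame({'A': ['x', 'x', 'y'],
                      'B': ['p', 'p', 'q'],
                      'C': [1, 2, 3]})

full_flags = small.duplicated()                      # compares A, B, and C
subset_flags = small.duplicated(subset=['A', 'B'])   # compares only A and B

print(full_flags.tolist())    # [False, False, False]
print(subset_flags.tolist())  # [False, True, False]
```

No row is a full duplicate, yet row 1 is flagged once column C is left out of the comparison.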

Delete duplicates

Delete only if all columns are duplicated

The drop_duplicates method returns a new dataframe with duplicate rows removed; by default, the original dataframe is left unchanged. The keep parameter accepts 'first' (the default) to keep the first occurrence and drop the rest, 'last' to keep the last occurrence and drop the rest, or False to drop every occurrence.


dedup_df = df.drop_duplicates()
dedup_df
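The sketch below shows how each keep option affects which rows survive, using a small, hypothetical dataframe (small is an illustrative name). It also shows that the surviving rows keep their original index labels; resetting the index with reset_index(drop=True) is one way to renumber them, shown here as an optional step rather than something the tutorial requires.

```python
import pandas as pd

# Hypothetical frame: rows 0, 1, and 3 are identical.
small = pd.DataFrame({'A': ['x', 'x', 'y', 'x'],
                      'B': [1, 1, 2, 1]})

kept_first = small.drop_duplicates()              # keeps rows 0 and 2
kept_last = small.drop_duplicates(keep='last')    # keeps rows 2 and 3
dropped_all = small.drop_duplicates(keep=False)   # keeps only row 2

# Original index labels are preserved; renumber them if a clean 0..n-1 index is needed.
renumbered = kept_last.reset_index(drop=True)

print(kept_first.index.tolist())   # [0, 2]
print(kept_last.index.tolist())    # [2, 3]
print(dropped_all.index.tolist())  # [2]
print(renumbered.index.tolist())   # [0, 1]
```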

Delete only if specified columns are duplicated

When the subset parameter is passed a list of columns (in this case, A and B), drop_duplicates compares only those columns and returns a dataframe with the duplicates removed. Because the other columns are ignored in the comparison, their values in the result come from whichever row is kept.


dedup_df = df.drop_duplicates(subset=['A', 'B'])
dedup_df
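When deduplicating on a subset, the choice of keep decides which values of the other columns are retained. A minimal sketch, assuming a small hypothetical dataframe (small and its values are illustrative) in which the first two rows match in A and B but differ in C:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 agree on A and B but not on C.
small = pd.DataFrame({'A': ['x', 'x', 'y'],
                      'B': ['p', 'p', 'q'],
                      'C': [1, 2, 3]})

keep_first = small.drop_duplicates(subset=['A', 'B'])               # keeps row 0
keep_last = small.drop_duplicates(subset=['A', 'B'], keep='last')   # keeps row 1

print(keep_first['C'].tolist())  # [1, 3]
print(keep_last['C'].tolist())   # [2, 3]
```

If the value in C matters, it is worth sorting the dataframe first so that the preferred row is the one kept.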
