This tutorial explains how to identify and handle duplicate data with pandas.
This tutorial uses:
Open up a Jupyter Notebook and import the following:
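The import cell is not shown in the text. A minimal sketch, assuming pandas is the only library needed and that it is imported under the conventional alias pd:

```python
# Assumption: pandas is the only import this tutorial needs.
import pandas as pd

# Print the installed version so results can be compared across environments.
print(pd.__version__)
```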
For this example, we will create a dataframe that contains several duplicated rows.
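The dataframe itself is not shown in the text, so the one below is a stand-in. Only the column names A and B come from the tutorial. Column C and all of the values are assumptions chosen so that some rows are exact duplicates and others match only on A and B.

```python
import pandas as pd

# Illustrative data (values and column C are assumptions, not the original data).
# Rows 0 and 1 are identical across every column.
# Rows 0, 1 and 4 share the same A and B; so do rows 2 and 3.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo", "baz"],
    "B": [1, 1, 2, 2, 1, 3],
    "C": ["x", "x", "y", "z", "w", "v"],
})
print(df)
```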
The function duplicated returns a Boolean series indicating whether each row is a duplicate of another row. The parameter keep controls which occurrence is left unmarked: 'first' (default) marks every occurrence except the first as True, 'last' marks every occurrence except the last as True, and False marks all occurrences as True.
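As a sketch of the three keep options, the snippet below rebuilds a small illustrative dataframe (the values and column C are assumptions) so that it runs on its own. The name dups follows the text. The names dups_last and dups_all are hypothetical.

```python
import pandas as pd

# Illustrative data (an assumption): rows 0 and 1 are identical.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo", "baz"],
    "B": [1, 1, 2, 2, 1, 3],
    "C": ["x", "x", "y", "z", "w", "v"],
})

dups = df.duplicated()                  # keep='first': only row 1 is True
dups_last = df.duplicated(keep="last")  # only row 0 is True
dups_all = df.duplicated(keep=False)    # rows 0 and 1 are both True
print(dups)
```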
To see the duplicate rows, use the Boolean series dups to select rows from the original dataframe.
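A self-contained sketch of that selection, again on illustrative data (the values and column C are assumptions). The name dup_rows is hypothetical.

```python
import pandas as pd

# Illustrative data (an assumption): rows 0 and 1 are identical.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo", "baz"],
    "B": [1, 1, 2, 2, 1, 3],
    "C": ["x", "x", "y", "z", "w", "v"],
})

dups = df.duplicated()

# Boolean indexing keeps only the rows where dups is True.
dup_rows = df[dups]
print(dup_rows)
```

With the default keep='first', only the repeated copy (row 1 here) is selected. Passing keep=False to duplicated would select every member of each duplicate group instead.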
When the parameter subset is passed a list of columns (in this case, A and B), the function duplicated compares rows on just those columns and returns a Boolean series indicating whether each row is a duplicate.
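A sketch of the subset parameter on illustrative data (the values and column C are assumptions). Column C differs between some rows, so they are duplicates only when the comparison is restricted to A and B.

```python
import pandas as pd

# Illustrative data (an assumption).
# Rows 0, 1 and 4 share A='foo', B=1; rows 2 and 3 share A='bar', B=2.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo", "baz"],
    "B": [1, 1, 2, 2, 1, 3],
    "C": ["x", "x", "y", "z", "w", "v"],
})

# Compare rows on columns A and B only; column C is ignored.
dups = df.duplicated(subset=["A", "B"])
print(dups)
```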
Next, take a look at the duplicates.
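A sketch of viewing the rows flagged on columns A and B, using illustrative data (the values and column C are assumptions). The names dup_rows and all_members are hypothetical.

```python
import pandas as pd

# Illustrative data (an assumption).
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo", "baz"],
    "B": [1, 1, 2, 2, 1, 3],
    "C": ["x", "x", "y", "z", "w", "v"],
})

# Rows that repeat an earlier (A, B) pair.
dups = df.duplicated(subset=["A", "B"])
dup_rows = df[dups]
print(dup_rows)

# Every row that belongs to a duplicated (A, B) group, first occurrences included.
all_members = df[df.duplicated(subset=["A", "B"], keep=False)]
print(all_members)
```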
The function drop_duplicates returns a dataframe with duplicate rows removed. The parameter keep can be 'first' (default) to keep the first occurrence and drop the rest, 'last' to keep the last occurrence and drop the rest, or False to drop every row that has a duplicate.
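A sketch of the three keep options for drop_duplicates on illustrative data (the values and column C are assumptions). The result names are hypothetical.

```python
import pandas as pd

# Illustrative data (an assumption): rows 0 and 1 are identical.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo", "baz"],
    "B": [1, 1, 2, 2, 1, 3],
    "C": ["x", "x", "y", "z", "w", "v"],
})

keep_first = df.drop_duplicates()             # drops row 1, keeps row 0
keep_last = df.drop_duplicates(keep="last")   # drops row 0, keeps row 1
keep_none = df.drop_duplicates(keep=False)    # drops both row 0 and row 1
print(keep_first)
```

Each call returns a new dataframe and leaves df unchanged. The surviving rows keep their original index labels.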
When the parameter subset is passed a list of columns (in this case, A and B), the function drop_duplicates identifies duplicates using just those columns and returns a dataframe with them dropped.
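A sketch of drop_duplicates with subset on illustrative data (the values and column C are assumptions). The result names are hypothetical.

```python
import pandas as pd

# Illustrative data (an assumption).
# Rows 0, 1 and 4 share A='foo', B=1; rows 2 and 3 share A='bar', B=2.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo", "baz"],
    "B": [1, 1, 2, 2, 1, 3],
    "C": ["x", "x", "y", "z", "w", "v"],
})

# Keep the first row of each (A, B) pair.
unique_ab = df.drop_duplicates(subset=["A", "B"])
print(unique_ab)

# Keep only rows whose (A, B) pair appears exactly once.
singletons = df.drop_duplicates(subset=["A", "B"], keep=False)
print(singletons)
```

Column C is still present in the result. It is only excluded from the comparison, so the value kept for C is whichever one belongs to the surviving row.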