Duplicates API

This is the API reference for all functions designed to find duplicate records in a DataFrame. You can find usage examples here.

hh.find_duplicates

find_duplicates(
    df: DataFrame,
    name_col: str | list[str],
    date_of_birth_col: str,
    postcode_col: str,
    id_col: str = None,
    threshold: int = 80,
    fuzzy_type: str = "permissive",
    twin_protection: bool = True,
    twin_protection_threshold: int = 70,
) -> pd.DataFrame

Attempts to find duplicate records within a single DataFrame. The function first looks for exact matches on the columns passed to name_col, date_of_birth_col and postcode_col. It then attempts to fuzzy match names within blocks of potential matches, where each block is built from either date_of_birth_col alone or date_of_birth_col and postcode_col together.

The strictness of duplicate matching is controlled by three settings: threshold (the percentage match required for fuzzy name matching), fuzzy_type ('permissive' or 'strict'), which determines whether potential duplicates are pooled by date of birth alone or by date of birth and postcode, and twin_protection (True or False). Twin protection compares the first names in each potential match and filters out pairs whose first names are too dissimilar, scored against twin_protection_threshold. It is not failsafe and may still return some twins as potential duplicates.
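
The blocking and fuzzy matching steps described above can be sketched with the standard library alone. This is an illustration of the general technique, not the library's implementation: the function name sketch_find_duplicates, the record keys ("id", "name", "dob", "postcode") and the use of difflib as the similarity scorer are all assumptions made for this example, and the real function operates on a DataFrame and may score names differently.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib is a stand-in scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sketch_find_duplicates(records, threshold=80, fuzzy_type="permissive"):
    """Map each record ID to the IDs of its potential duplicates."""
    # Build blocks: only records that share a block key are compared.
    blocks = defaultdict(list)
    for rec in records:
        if fuzzy_type == "strict":
            key = (rec["dob"], rec["postcode"])  # date of birth and postcode
        else:
            key = (rec["dob"],)  # date of birth only
        blocks[key].append(rec)

    # Compare every pair of names inside each block against the threshold.
    # Exact name matches score 100, so they are picked up here as well.
    matches = defaultdict(list)
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if name_similarity(a["name"], b["name"]) >= threshold:
                matches[a["id"]].append(b["id"])
                matches[b["id"]].append(a["id"])
    return dict(matches)


# Hypothetical records: 1 and 2 share a date of birth but not a postcode.
records = [
    {"id": 1, "name": "Jonathan Smith", "dob": "1990-01-01", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Jonathon Smith", "dob": "1990-01-01", "postcode": "ZZ9 9ZZ"},
    {"id": 3, "name": "Mary Jones", "dob": "1990-01-01", "postcode": "AB1 2CD"},
]

permissive = sketch_find_duplicates(records, fuzzy_type="permissive")
strict = sketch_find_duplicates(records, fuzzy_type="strict")
```

In this sketch the permissive run pairs records 1 and 2 because they sit in the same date-of-birth block, while the strict run returns nothing because their postcodes place them in different blocks.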

Parameters:

Name Type Description Default
df DataFrame

The DataFrame containing the records to check for duplicates.

required
name_col str | list[str]

The column or list of columns containing names. Pass a list in the order the columns should be joined to create a full name, e.g. ['First Name', 'Middle Name', 'Last Name'].

required
date_of_birth_col str

The column containing date of birth.

required
postcode_col str

The column containing postcode.

required
id_col str

If there is already a column in your DataFrame which contains some kind of ID number, set it here. Otherwise, one will be created. Defaults to None.

None
threshold int

The threshold for fuzzy name matching, expressed as a percentage match between 0 and 100. Defaults to 80.

80
fuzzy_type str

Controls whether date_of_birth_col or date_of_birth_col and postcode_col are used to create blocks for fuzzy matching. 'permissive' uses only date_of_birth_col, so will find duplicates with different postcodes. 'strict' uses both columns, so will only return potential duplicates where both date of birth and postcode match. Defaults to "permissive".

'permissive'
twin_protection bool

If True, filters suspected twins out of the returned potential duplicates when their first names are not similar enough (a match below twin_protection_threshold, 70% by default). Defaults to True.

True
twin_protection_threshold int

The threshold for first name matching when twin_protection is True. Defaults to 70.

70
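
As a rough illustration of how twin_protection and twin_protection_threshold interact, the sketch below filters candidate pairs by first-name similarity. It is not the library's code: the helper names, the assumption that the first whitespace-separated token is the first name, and the difflib scorer are all hypothetical choices for this example.

```python
from difflib import SequenceMatcher


def first_name(full_name: str) -> str:
    """Take the first whitespace-separated token as the first name (assumption)."""
    parts = full_name.split()
    return parts[0].lower() if parts else ""


def first_name_score(name_a: str, name_b: str) -> float:
    """Return a 0-100 similarity score for the two first names."""
    return 100 * SequenceMatcher(None, first_name(name_a), first_name(name_b)).ratio()


def apply_twin_protection(pairs, twin_protection_threshold=70):
    """Keep only pairs whose first names meet the threshold."""
    return [
        (a, b)
        for a, b in pairs
        if first_name_score(a, b) >= twin_protection_threshold
    ]


# Hypothetical candidate pairs that already share a date of birth and
# have similar full names because the surname dominates the comparison.
candidate_pairs = [
    ("Amy Richardson", "Ian Richardson"),              # likely twins
    ("Katherine Richardson", "Katharine Richardson"),  # likely a spelling variant
]

kept = apply_twin_protection(candidate_pairs)
```

Here the Amy/Ian pair is dropped and the Katherine/Katharine pair is kept. Twins with similar first names would still pass this kind of filter, which is why some twins can still be returned as potential duplicates.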

Raises:

Type Description
TypeError

Raised if df is not a DataFrame.

ValueError

Raised if threshold is not a value between 0 and 100 or if fuzzy_type is not 'strict' or 'permissive'.

ColumnDoesNotExistError

Raised if any of the columns passed as args are not in df.
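
The documented error conditions can be mirrored in a caller-side pre-check, sketched below. Everything here is an assumption for illustration: check_arguments and outcome are hypothetical helpers, the ColumnDoesNotExistError class is a local stand-in rather than the library's own exception, the 0 to 100 range is treated as inclusive, and the order of the checks is arbitrary. The TypeError check on df is omitted so that the sketch needs no third-party imports.

```python
class ColumnDoesNotExistError(Exception):
    """Local stand-in; the library's own exception is not imported here."""


def check_arguments(columns, name_col, date_of_birth_col, postcode_col,
                    id_col=None, threshold=80, fuzzy_type="permissive"):
    """Raise the documented error types for invalid arguments."""
    if not 0 <= threshold <= 100:  # inclusive bounds are an assumption
        raise ValueError("threshold must be between 0 and 100")
    if fuzzy_type not in ("strict", "permissive"):
        raise ValueError("fuzzy_type must be 'strict' or 'permissive'")

    # name_col may be a single column name or a list of column names.
    name_cols = [name_col] if isinstance(name_col, str) else list(name_col)
    requested = name_cols + [date_of_birth_col, postcode_col]
    if id_col is not None:
        requested.append(id_col)

    missing = [col for col in requested if col not in columns]
    if missing:
        raise ColumnDoesNotExistError(f"columns not found: {missing}")


def outcome(**overrides):
    """Run the check with hypothetical defaults and report what happened."""
    args = {
        "columns": ["First Name", "Last Name", "DOB", "Postcode"],
        "name_col": ["First Name", "Last Name"],
        "date_of_birth_col": "DOB",
        "postcode_col": "Postcode",
    }
    args.update(overrides)
    try:
        check_arguments(**args)
    except (ValueError, ColumnDoesNotExistError) as exc:
        return type(exc).__name__
    return "ok"
```

For example, outcome(threshold=120) and outcome(fuzzy_type="loose") both report a ValueError, while outcome(postcode_col="Zip") reports the stand-in ColumnDoesNotExistError.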

Returns:

Type Description
DataFrame

A DataFrame with a column called 'Potential Duplicates' which contains a list of IDs for any potential duplicates found by the function.