Duplicates API
This is the API reference for all functions designed to find duplicate records in a DataFrame. You can find usage examples here.
hh.find_duplicates
find_duplicates(
    df: DataFrame,
    name_col: str | list[str],
    date_of_birth_col: str,
    postcode_col: str,
    id_col: str = None,
    threshold: int = 80,
    fuzzy_type: str = "permissive",
    twin_protection: bool = True,
    twin_protection_threshold: int = 70,
) -> pd.DataFrame
Attempts to find duplicate records within a single DataFrame. The function first looks for exact matches across the columns passed to name_col, date_of_birth_col and postcode_col. It then fuzzy matches names within blocks of potential matches, where a block groups records that share either date_of_birth_col alone or both date_of_birth_col and postcode_col.
The strictness of duplicate matching is controlled by three settings: threshold (the minimum percentage match for fuzzy name matching), fuzzy_type ('permissive' or 'strict'), which determines whether blocks are built from date of birth alone or from date of birth and postcode, and twin_protection (True or False). Twin protection compares first names in isolation within each potential match and filters out pairs whose first names are too dissimilar (below twin_protection_threshold). It is not failsafe and may still return some twins as potential duplicates.
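The blocking and fuzzy matching described above can be illustrated with a small standard-library sketch. This is not the library's implementation: the record layout, the candidate_pairs helper and the use of difflib.SequenceMatcher as the name scorer are all assumptions made for illustration, and the real function operates on a DataFrame and may score names differently.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "Jonathan Smith", "dob": "1990-01-01", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Jonathon Smith", "dob": "1990-01-01", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Jonathan Smyth", "dob": "1990-01-01", "postcode": "ZZ9 9ZZ"},
    {"id": 4, "name": "Priya Patel", "dob": "1985-06-15", "postcode": "AB1 2CD"},
]


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (a stand-in for the library's scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(records, fuzzy_type="permissive", threshold=80):
    """Group records into blocks, then fuzzy match names within each block."""
    if fuzzy_type not in ("permissive", "strict"):
        raise ValueError("fuzzy_type must be 'permissive' or 'strict'")

    # 'permissive' blocks on date of birth only;
    # 'strict' blocks on date of birth and postcode together.
    blocks = defaultdict(list)
    for rec in records:
        if fuzzy_type == "permissive":
            key = (rec["dob"],)
        else:
            key = (rec["dob"], rec["postcode"])
        blocks[key].append(rec)

    # Only records in the same block are ever compared with each other.
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a["name"], b["name"]) >= threshold:
                pairs.append((a["id"], b["id"]))
    return sorted(pairs)


permissive_pairs = candidate_pairs(records, fuzzy_type="permissive")
strict_pairs = candidate_pairs(records, fuzzy_type="strict")
```

In this sketch the 'permissive' setting pairs record 3 with records 1 and 2 despite the different postcode, while 'strict' keeps only the pair that shares both date of birth and postcode.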
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame containing the records to check for duplicates. | required |
| name_col | str \| list[str] | The column or list of columns containing names. Pass a list in the order the columns should be joined to create a full name, e.g. ['First Name', 'Middle Name', 'Last Name']. | required |
| date_of_birth_col | str | The column containing date of birth. | required |
| postcode_col | str | The column containing postcode. | required |
| id_col | str | If your DataFrame already has a column containing some kind of ID number, set it here. Otherwise, one will be created. Defaults to None. | None |
| threshold | int | The threshold for fuzzy matching, expressed as the percentage match of the name. Defaults to 80. | 80 |
| fuzzy_type | str | Controls whether date_of_birth_col alone or date_of_birth_col and postcode_col together are used to create blocks for fuzzy matching. 'permissive' uses only date_of_birth_col, so will find duplicates with different postcodes. 'strict' uses both columns, so will only return potential duplicates where both date of birth and postcode match. Defaults to "permissive". | 'permissive' |
| twin_protection | bool | If True, filters suspected twins out of the returned potential duplicates when their first names are not similar enough (below twin_protection_threshold, 70% by default). Defaults to True. | True |
| twin_protection_threshold | int | The threshold for first name matching when twin_protection is True. Defaults to 70. | 70 |
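To make the twin_protection and twin_protection_threshold parameters more concrete, here is a minimal sketch of a first-name check. It assumes the first name is the first whitespace-separated token of the full name and uses difflib.SequenceMatcher as the scorer. Both are assumptions for illustration, and first_name_score and passes_twin_protection are hypothetical helpers, not part of the library.

```python
from difflib import SequenceMatcher


def first_name_score(name_a: str, name_b: str) -> float:
    """Score (0-100) how similar the first names of two full names are.

    Assumes the first whitespace-separated token is the first name.
    """
    parts_a = name_a.split()
    parts_b = name_b.split()
    first_a = parts_a[0].lower() if parts_a else ""
    first_b = parts_b[0].lower() if parts_b else ""
    return 100 * SequenceMatcher(None, first_a, first_b).ratio()


def passes_twin_protection(name_a, name_b, twin_protection_threshold=70):
    """Keep a pair only if the first names alone are similar enough."""
    return first_name_score(name_a, name_b) >= twin_protection_threshold


# Same surname, date of birth and postcode, but different first names:
# a likely pair of twins, so the pair is dropped.
twins_kept = passes_twin_protection("Emma Jones", "Ella Jones")

# A spelling variant of the same first name: the pair is kept.
variant_kept = passes_twin_protection("Jonathan Smith", "Jonathon Smith")
```

With this scorer, the full names "Emma Jones" and "Ella Jones" are similar enough to meet an 80% name threshold, which is why a separate first-name check is needed to separate them. First names that are themselves close, such as "Anna" and "Hanna", would still pass, which matches the caveat that twin protection is not failsafe.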
Raises:
| Type | Description |
|---|---|
| TypeError | Raised if df is not a DataFrame. |
| ValueError | Raised if threshold is not a value between 0 and 100 or if fuzzy_type is not 'strict' or 'permissive'. |
| ColumnDoesNotExistError | Raised if any of the columns passed as args are not in df. |
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame with a column called 'Potential Duplicates' which contains a list of IDs for any potential duplicates found by the function. |