Matching API

This is the API reference for all functions designed to help you match student data to HEAT records. You can find usage examples here.

hh.perform_exact_match

perform_exact_match(
    unmatched_df: DataFrame,
    heat_df: DataFrame,
    left_join_cols: list[str],
    right_join_cols: list[str],
    match_desc: str,
    verify: bool = False,
    heat_id_col: str = STUDENT_HEAT_ID,
) -> tuple[pd.DataFrame, pd.DataFrame]

Performs an exact match on the specified columns between new data and your HEAT Student Export and returns the HEAT Student ID where a match is found. The function returns two DataFrames: one containing the matches and one containing the unmatched students, ready to pass to another matching function. This lets you build a matching waterfall that moves through progressively looser levels of strictness.

Parameters:

Name Type Description Default
unmatched_df DataFrame

The DataFrame containing the students you want to search for.

required
heat_df DataFrame

The DataFrame containing your HEAT Student Export.

required
left_join_cols list[str]

Columns in unmatched_df you want to match on.

required
right_join_cols list[str]

Columns in heat_df you want to match on.

required
match_desc str

A description of the match, added to a 'Match Type' column in the returned matched DataFrame. It should be descriptive enough to help you verify matches later, especially if you join the results of several calls to this function and export them to a .csv or Excel file.

required
verify bool

Controls whether all columns from heat_df are returned in the matched DataFrame so that you can verify matches. Useful if you are performing a less exact match and want to check the returned students. Also useful if you want to join the results of this function with those of perform_fuzzy_match (the column structure will be the same). Defaults to False.

False
heat_id_col str

The column in your HEAT Export that contains the Student ID. Set this if that column is not called 'Student HEAT ID'. Defaults to 'Student HEAT ID'.

STUDENT_HEAT_ID

Raises:

Type Description
TypeError

Raised if unmatched_df or heat_df is not a pandas DataFrame.

ColumnDoesNotExistError

Raised if a column you are trying to use for matching does not exist in either unmatched_df or heat_df.

Returns:

Type Description
tuple[DataFrame, DataFrame]

Two DataFrames: first DataFrame is matched data, second is remaining data for onward matching.
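
The matching waterfall described for perform_exact_match can be sketched in plain Python. This is an illustration of the idea only, not the package's implementation: the helper exact_match, the column names and the records are all hypothetical, and lists of dictionaries stand in for the pandas DataFrames the real function takes.

```python
from typing import Any

Record = dict[str, Any]


def exact_match(
    unmatched: list[Record],
    heat: list[Record],
    left_cols: list[str],
    right_cols: list[str],
    match_desc: str,
    heat_id_col: str = "Student HEAT ID",
) -> tuple[list[Record], list[Record]]:
    """Hypothetical helper: return (matched, remaining) for one exact pass."""
    # Index HEAT records by their join-column values. If several HEAT records
    # share the same values, the last one wins in this simplified sketch.
    heat_index = {tuple(h[c] for c in right_cols): h[heat_id_col] for h in heat}

    matched, remaining = [], []
    for row in unmatched:
        key = tuple(row[c] for c in left_cols)
        if key in heat_index:
            matched.append({**row, heat_id_col: heat_index[key], "Match Type": match_desc})
        else:
            remaining.append(row)
    return matched, remaining


# Made-up records for illustration only.
heat = [
    {"Student HEAT ID": "H1", "Name": "Joanne Smith", "Date of Birth": "2008-01-05", "Postcode": "AB1 2CD"},
    {"Student HEAT ID": "H2", "Name": "Sam Patel", "Date of Birth": "2007-11-30", "Postcode": "EF3 4GH"},
]
new_students = [
    {"Full Name": "Joanne Smith", "DOB": "2008-01-05", "Postcode": "AB1 2CD"},
    {"Full Name": "Sam Patel", "DOB": "2007-11-30", "Postcode": "ZZ9 9ZZ"},  # postcode differs
    {"Full Name": "Alex Jones", "DOB": "2008-03-03", "Postcode": "AB1 2CD"},
]

# Pass 1 (strictest): name, date of birth and postcode must all agree.
strict_matches, remaining = exact_match(
    new_students, heat,
    ["Full Name", "DOB", "Postcode"], ["Name", "Date of Birth", "Postcode"],
    "Exact: name, DOB, postcode",
)
# Pass 2 (looser): only the students left over from pass 1, ignoring postcode.
loose_matches, still_unmatched = exact_match(
    remaining, heat,
    ["Full Name", "DOB"], ["Name", "Date of Birth"],
    "Exact: name, DOB",
)
```

Each pass receives only the records the previous pass could not match, and the match description records which level of strictness produced each match.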

hh.perform_fuzzy_match

perform_fuzzy_match(
    unmatched_df: DataFrame,
    heat_df: DataFrame,
    left_filter_cols: list[str],
    right_filter_cols: list[str],
    left_name_col: str,
    right_name_col: str,
    match_desc: str,
    threshold: int = 80,
) -> tuple[pd.DataFrame, pd.DataFrame]

This function fuzzy matches the names of students in an external dataset to your HEAT Student Export to retrieve HEAT Student IDs. You can control the pool of potential fuzzy matches by specifying filter columns in both DataFrames, e.g. only look for fuzzy matches where Date of Birth and Postcode match. Note: there may be performance issues with very large datasets. If you have a large dataset, it is recommended to first use perform_exact_match to pull out exact matches and reduce the dataset before using this function for fuzzy matching.

Parameters:

Name Type Description Default
unmatched_df DataFrame

The DataFrame of students you want to fuzzy match.

required
heat_df DataFrame

The DataFrame containing your HEAT Student Export.

required
left_filter_cols list[str]

Filter columns in unmatched_df. Columns specified here control the pool of possible fuzzy matches. For example, if you set date of birth and postcode here, 'Jo Smith' will only be fuzzy matched to 'Joanne Smith' if both records have the same date of birth and postcode.

required
right_filter_cols list[str]

Corresponding filter columns in heat_df. Must match those set in left_filter_cols.

required
left_name_col str

Column which contains the name information (to be matched) in unmatched_df.

required
right_name_col str

Column which contains the name information in heat_df.

required
match_desc str

A description of the match, added to a 'Match Type' column in the returned matched DataFrame. It should be descriptive enough to help you verify matches later, especially if you join the results of several calls to this function and export them to a .csv or Excel file.

required
threshold int

The minimum percentage match accepted for fuzzy matching. Higher values are stricter, so matches will be more similar. Defaults to 80.

80

Raises:

Type Description
TypeError

Raised if unmatched_df or heat_df are not pandas DataFrames.

ColumnDoesNotExistError

Raised if columns specified as filters or name columns do not exist in their DataFrames.

FilterColumnMismatchError

Raised if an unequal number of columns is specified in the left and right filters.

FuzzyMatchIndexError

Raised when unmatched_df does not have a unique index and cannot be used for matching.

Returns:

Type Description
tuple[DataFrame, DataFrame]

Two DataFrames: first DataFrame is matched data, second is remaining data for onward matching.
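
The way filter columns narrow the pool of fuzzy matches can be sketched with the standard library. This is a rough illustration under stated assumptions, not the package's code: the helper names, column names and records are hypothetical, and difflib.SequenceMatcher is swapped in as the similarity measure because the scorer used by perform_fuzzy_match is not documented here.

```python
from difflib import SequenceMatcher


def name_score(a: str, b: str) -> float:
    """Similarity of two names as a percentage (0-100), ignoring case."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_match(unmatched, heat, left_filter_cols, right_filter_cols,
                left_name_col, right_name_col, threshold=80):
    """Hypothetical helper: return (matched, remaining) for one fuzzy pass."""
    if len(left_filter_cols) != len(right_filter_cols):
        raise ValueError("left and right filter column lists must be the same length")

    matched, remaining = [], []
    for row in unmatched:
        key = tuple(row[c] for c in left_filter_cols)
        # Only HEAT records whose filter values equal this row's are candidates.
        pool = [h for h in heat if tuple(h[c] for c in right_filter_cols) == key]
        # Pick the most similar name in the pool (None if the pool is empty).
        best = max(pool, key=lambda h: name_score(row[left_name_col], h[right_name_col]),
                   default=None)
        if best is not None and name_score(row[left_name_col], best[right_name_col]) >= threshold:
            matched.append({**row, "Student HEAT ID": best["Student HEAT ID"]})
        else:
            remaining.append(row)
    return matched, remaining


# Made-up records for illustration only.
heat = [
    {"Student HEAT ID": "H1", "Name": "Joanne Smith", "Date of Birth": "2008-01-05", "Postcode": "AB1 2CD"},
]
new_students = [
    # Similar name, same date of birth and postcode: a candidate for H1.
    {"Full Name": "Joanna Smith", "DOB": "2008-01-05", "Postcode": "AB1 2CD"},
    # Same filter values but a very different name: falls below the threshold.
    {"Full Name": "Peter Jones", "DOB": "2008-01-05", "Postcode": "AB1 2CD"},
    # Identical name but a different date of birth: never enters the pool.
    {"Full Name": "Joanne Smith", "DOB": "2010-06-01", "Postcode": "AB1 2CD"},
]

matched, remaining = fuzzy_match(
    new_students, heat,
    ["DOB", "Postcode"], ["Date of Birth", "Postcode"],
    "Full Name", "Name",
    threshold=80,
)
```

The third record shows why the filters matter: even an identical name is not matched when the filter columns disagree.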

hh.perform_school_age_range_fuzzy_match

perform_school_age_range_fuzzy_match(
    unmatched_df: DataFrame,
    heat_df: DataFrame,
    unmatched_school_col: str,
    heat_school_col: str,
    unmatched_name_col: str,
    heat_name_col: str,
    unmatched_year_group_col: str,
    heat_dob_col: str,
    match_desc: str,
    heat_id_col: str = STUDENT_HEAT_ID,
    academic_year_start: int = CURRENT_ACADEMIC_YEAR_START,
    threshold: int = 80,
) -> tuple[pd.DataFrame, pd.DataFrame]

This function attempts to fuzzy match the names of students to your HEAT data. To control the pool of fuzzy matches, records are first matched on school name, and the year group is then used to keep only students whose date of birth is in range for that year group. This is useful if you do not know a student's date of birth but do know which school they attend and their year group. Returns one DataFrame of matches and one DataFrame of remaining unmatched data. Note: there may be performance issues with very large datasets. If you have a large dataset, it is recommended to first use perform_exact_match to pull out exact matches and reduce the dataset before using this function for fuzzy matching.

Parameters:

Name Type Description Default
unmatched_df DataFrame

DataFrame containing student records you wish to fuzzy match to HEAT records.

required
heat_df DataFrame

DataFrame containing HEAT Student Export.

required
unmatched_school_col str

Column which contains School name in unmatched_df.

required
heat_school_col str

Column which contains school name in heat_df.

required
unmatched_name_col str

Column which contains Student name in unmatched_df.

required
heat_name_col str

Column which contains Student name in heat_df.

required
unmatched_year_group_col str

Column in unmatched_df which contains year group for age range calculation.

required
heat_dob_col str

Column in heat_df which contains Student Date of Birth.

required
match_desc str

A description of the match, added to a 'Match Type' column in the returned matched DataFrame. It should be descriptive enough to help you verify matches later, especially if you join the results of several calls to this function and export them to a .csv or Excel file.

required
heat_id_col str

Column in heat_df which contains HEAT Student ID. Defaults to 'Student HEAT ID'.

STUDENT_HEAT_ID
academic_year_start int

The start year of the academic year. Defaults to the start of the current academic year (calculated by the package).

CURRENT_ACADEMIC_YEAR_START
threshold int

The minimum percentage match accepted for fuzzy matching. Higher values are stricter, so matches will be more similar. Defaults to 80.

80

Raises:

Type Description
TypeError

Raised if unmatched_df or heat_df is not a pandas DataFrame, or if heat_dob_col is not in pandas datetime format (the function will try to convert it first).

ColumnDoesNotExistError

Raised if any specified column does not exist in its dataframe.

FuzzyMatchIndexError

Raised if unmatched_df does not have a unique index.

Returns:

Type Description
tuple[DataFrame, DataFrame]

Two DataFrames: first DataFrame is matched data, second is remaining data for onward matching.
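
The year group filtering described for perform_school_age_range_fuzzy_match can be sketched as a date-of-birth range calculation. The rule used here is an assumption, not taken from the package: it follows the English school convention that a pupil in Year N turns N + 5 during an academic year running from 1 September to 31 August. The function name, column names and records are hypothetical.

```python
from datetime import date


def dob_range_for_year_group(year_group: int, academic_year_start: int) -> tuple[date, date]:
    """Earliest and latest date of birth for a year group.

    Assumption (English convention): pupils in Year N turn N + 5 between
    1 September of academic_year_start and 31 August of the following year.
    """
    earliest = date(academic_year_start - year_group - 5, 9, 1)
    latest = date(academic_year_start - year_group - 4, 8, 31)
    return earliest, latest


# Made-up HEAT records for illustration only.
heat = [
    {"Student HEAT ID": "H1", "School": "Hill School", "Date of Birth": date(2011, 10, 12)},
    {"Student HEAT ID": "H2", "School": "Hill School", "Date of Birth": date(2012, 8, 31)},
    {"Student HEAT ID": "H3", "School": "Hill School", "Date of Birth": date(2012, 9, 1)},
    {"Student HEAT ID": "H4", "School": "Vale Academy", "Date of Birth": date(2011, 10, 12)},
]

# A Year 7 student at Hill School in the academic year starting in 2023.
earliest, latest = dob_range_for_year_group(7, 2023)

# Candidate pool: same school, and date of birth within the year group's range.
candidates = [
    h["Student HEAT ID"]
    for h in heat
    if h["School"] == "Hill School" and earliest <= h["Date of Birth"] <= latest
]
```

Only the records left in the candidate pool would then be compared by name against the threshold.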