Matching API

This is the API reference for all functions designed to help you match student data to HEAT records. You can find usage examples here.

hh.perform_exact_match

perform_exact_match(
    unmatched_df: DataFrame,
    heat_df: DataFrame,
    left_join_cols: list[str],
    right_join_cols: list[str],
    match_desc: str,
    verify: bool = False,
    heat_id_col: str = STUDENT_HEAT_ID,
) -> tuple[pd.DataFrame, pd.DataFrame]

Performs an exact match on the specified columns between new data and your HEAT Student Export, and returns the HEAT Student ID where a match is found. The function returns two DataFrames: one containing the matched students and one containing the students still unmatched, ready to pass to another matching function. This makes it useful for building a matching waterfall, where you move through progressively less strict levels of matching.

Parameters:

Name	Type	Description	Default
`unmatched_df`	`DataFrame`	The DataFrame containing the students you want to search for.	required
`heat_df`	`DataFrame`	The DataFrame containing your HEAT Student Export.	required
`left_join_cols`	`list[str]`	Columns in unmatched_df you want to match on.	required
`right_join_cols`	`list[str]`	Columns in heat_df you want to match on.	required
`match_desc`	`str`	A description of the match; added to a 'Match Type' col in the returned matched DataFrame. Should be descriptive to help you verify matches later, especially if joining multiple returns of this function and exporting to a .csv or Excel file.	required
`verify`	`bool`	Controls whether all columns from heat_df are returned in the matched DataFrame so that you can verify matches. Useful if you are performing a less exact match and want to check the returned students. Also useful if you want to join the results of this function with those of perform_fuzzy_match (the column structure will be the same). Defaults to False.	`False`
`heat_id_col`	`str`	The column in your HEAT Student Export that contains the Student ID. Set this if that column is not called 'Student HEAT ID'. Defaults to 'Student HEAT ID'.	`STUDENT_HEAT_ID`

Raises:

Type	Description
`TypeError`	Raised if unmatched_df or heat_df are not pandas DataFrames.
`ColumnDoesNotExistError`	Raised if a column you are trying to use for matching does not exist in either unmatched_df or heat_df.

Returns:

Type	Description
`tuple[DataFrame, DataFrame]`	Two DataFrames: first DataFrame is matched data, second is remaining data for onward matching.
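
The matching waterfall described above can be pictured with plain pandas. The sketch below is not the package's implementation: `exact_match_sketch`, the sample column names, and the sample IDs are all hypothetical. It only mimics the documented behaviour (a matched DataFrame with a 'Match Type' column, plus a remaining DataFrame for onward matching) using `DataFrame.merge` with `indicator=True`, and it assumes the HEAT export has no duplicate rows on the join columns.

```python
import pandas as pd

HEAT_ID = "Student HEAT ID"  # documented default name of the ID column


def exact_match_sketch(unmatched_df, heat_df, left_cols, right_cols, match_desc):
    """Illustrative stand-in for an exact match; NOT the package's code."""
    # Rename the HEAT join columns so both sides share the same names.
    heat_subset = heat_df[right_cols + [HEAT_ID]].rename(
        columns=dict(zip(right_cols, left_cols))
    )
    merged = unmatched_df.merge(heat_subset, how="left", on=left_cols, indicator=True)

    # Rows found on both sides are matches; label them with the description.
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge").copy()
    matched["Match Type"] = match_desc

    # Rows with no HEAT record go forward to the next, looser pass.
    remaining = merged[merged["_merge"] == "left_only"].drop(
        columns=["_merge", HEAT_ID]
    )
    return matched, remaining


# Hypothetical sample data.
new_students = pd.DataFrame({
    "First Name": ["Jo", "Sam"],
    "Last Name": ["Smith", "Jones"],
    "DOB": ["2008-01-01", "2009-05-05"],
})
heat = pd.DataFrame({
    HEAT_ID: ["H001", "H002", "H003"],
    "Forename": ["Jo", "Samuel", "Alex"],
    "Surname": ["Smith", "Jones", "Brown"],
    "Date of Birth": ["2008-01-01", "2009-05-05", "2007-03-03"],
})

# Pass 1: strictest match on full name and date of birth.
matched_1, remaining_1 = exact_match_sketch(
    new_students, heat,
    ["First Name", "Last Name", "DOB"],
    ["Forename", "Surname", "Date of Birth"],
    "Exact: full name + DOB",
)

# Pass 2: looser match on surname and date of birth, leftovers only.
matched_2, remaining_2 = exact_match_sketch(
    remaining_1, heat,
    ["Last Name", "DOB"],
    ["Surname", "Date of Birth"],
    "Exact: surname + DOB",
)

all_matches = pd.concat([matched_1, matched_2], ignore_index=True)
print(all_matches[["First Name", "Last Name", HEAT_ID, "Match Type"]])
```

Because each pass only receives the rows the previous pass could not place, the 'Match Type' value records the strictest level at which each student was matched.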

hh.perform_fuzzy_match

perform_fuzzy_match(
    unmatched_df: DataFrame,
    heat_df: DataFrame,
    left_filter_cols: list[str],
    right_filter_cols: list[str],
    left_name_col: str,
    right_name_col: str,
    match_desc: str,
    threshold: int = 80,
) -> tuple[pd.DataFrame, pd.DataFrame]

This function fuzzy matches the names of students in an external dataset to your HEAT Student Export in order to retrieve HEAT Student IDs. You can control the pool of potential fuzzy matches by specifying filter columns in both DataFrames, e.g. only look for fuzzy matches where Date of Birth and Postcode match. Note: there may be performance issues with very large datasets. If you have a large dataset, it is recommended to first use perform_exact_match to pull out exact matches and reduce the dataset before using this function for fuzzy matching.

Parameters:

Name	Type	Description	Default
`unmatched_df`	`DataFrame`	The DataFrame of students you want to fuzzy match.	required
`heat_df`	`DataFrame`	The DataFrame containing your HEAT Student Export.	required
`left_filter_cols`	`list[str]`	Filter columns in unmatched_df. By specifying a column here it will be used to control the pool of possible fuzzy matches. For example, by setting Date of birth and postcode here, it will only fuzzy match 'Jo Smith' to 'Joanne Smith' if both records have the same date of birth and postcode.	required
`right_filter_cols`	`list[str]`	Corresponding filter columns in heat_df. Must match those set in left_filter_cols.	required
`left_name_col`	`str`	Column which contains the name information (to be matched) in unmatched_df.	required
`right_name_col`	`str`	Column which contains the name information in heat_df.	required
`match_desc`	`str`	A description of the match; added to a 'Match Type' col in the returned matched DataFrame. Should be descriptive to help you verify matches later, especially if joining multiple returns of this function and exporting to a .csv or Excel file.	required
`threshold`	`int`	The minimum acceptable percentage match for fuzzy matching. Higher is stricter, so matched names will be more similar. Defaults to 80.	`80`

Raises:

Type	Description
`TypeError`	Raised if unmatched_df or heat_df are not pandas DataFrames.
`ColumnDoesNotExistError`	Raised if columns specified as filters or name columns do not exist in their DataFrames.
`FilterColumnMismatchError`	Raised if the left and right filters specify different numbers of columns.
`FuzzyMatchIndexError`	Raised if unmatched_df does not have a unique index and so cannot be used for matching.

Returns:

Type	Description
`tuple[DataFrame, DataFrame]`	Two DataFrames: first DataFrame is matched data, second is remaining data for onward matching.
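
To make the roles of the filter columns and `threshold` concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the package's code: the records are hypothetical, and `difflib.SequenceMatcher` stands in for whichever similarity scorer the package actually uses, so real scores may differ.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib ratio as a stand-in scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical records to match: (name, date of birth, postcode).
unmatched = [("Jon Smith", "2008-01-01", "AB1 2CD")]

# Hypothetical HEAT records: (HEAT ID, name, date of birth, postcode).
heat = [
    ("H001", "John Smith", "2008-01-01", "AB1 2CD"),
    ("H002", "Priya Patel", "2008-01-01", "AB1 2CD"),  # same filters, other name
    ("H003", "Jon Smith", "2007-06-30", "ZZ9 9ZZ"),    # same name, other filters
]

threshold = 80  # the documented default

matches = []
for name, dob, postcode in unmatched:
    # Filter step: only HEAT records agreeing on every filter value are candidates.
    pool = [rec for rec in heat if rec[2] == dob and rec[3] == postcode]

    # Scoring step: keep the best-scoring candidate if it reaches the threshold.
    scored = [(similarity(name, rec[1]), rec[0]) for rec in pool]
    best = max(scored, default=None)
    if best is not None and best[0] >= threshold:
        matches.append((name, best[1], round(best[0], 1)))

print(matches)
```

In this sketch the record with the identical name is never considered because it fails the filter step, which is the point of the filter columns: they stop a high name score from pairing two different people.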

hh.perform_school_age_range_fuzzy_match

perform_school_age_range_fuzzy_match(
    unmatched_df: DataFrame,
    heat_df: DataFrame,
    unmatched_school_col: str,
    heat_school_col: str,
    unmatched_name_col: str,
    heat_name_col: str,
    unmatched_year_group_col: str,
    heat_dob_col: str,
    match_desc: str,
    heat_id_col: str = STUDENT_HEAT_ID,
    academic_year_start: int = CURRENT_ACADEMIC_YEAR_START,
    threshold: int = 80,
) -> tuple[pd.DataFrame, pd.DataFrame]

This function attempts to fuzzy match the names of students to your HEAT data. To control the pool of fuzzy matches, records are first matched on school name, and the year group is then used to keep only students whose date of birth falls in the range for that year group. Useful if you do not know a student's date of birth, but you do know which school they attend and their year group. Returns one DataFrame of matches and one DataFrame of remaining unmatched data. Note: there may be performance issues with very large datasets. If you have a large dataset, it is recommended to first use perform_exact_match to pull out exact matches and reduce the dataset before using this function for fuzzy matching.

Parameters:

Name	Type	Description	Default
`unmatched_df`	`DataFrame`	DataFrame containing student records you wish to fuzzy match to HEAT records.	required
`heat_df`	`DataFrame`	DataFrame containing HEAT Student Export.	required
`unmatched_school_col`	`str`	Column which contains School name in unmatched_df.	required
`heat_school_col`	`str`	Column which contains school name in heat_df.	required
`unmatched_name_col`	`str`	Column which contains Student name in unmatched_df.	required
`heat_name_col`	`str`	Column which contains Student name in heat_df.	required
`unmatched_year_group_col`	`str`	Column in unmatched_df which contains year group for age range calculation.	required
`heat_dob_col`	`str`	Column in heat_df which contains Student Date of Birth.	required
`match_desc`	`str`	A description of the match; added to a 'Match Type' col in the returned matched DataFrame. Should be descriptive to help you verify matches later, especially if joining multiple returns of this function and exporting to a .csv or Excel file.	required
`heat_id_col`	`str`	Column in heat_df which contains HEAT Student ID. Defaults to 'Student HEAT ID'.	`STUDENT_HEAT_ID`
`academic_year_start`	`int`	The start of the academic year used for the age range calculation. Defaults to start of current academic year (calculated by package).	`CURRENT_ACADEMIC_YEAR_START`
`threshold`	`int`	The minimum acceptable percentage match for fuzzy matching. Higher is stricter, so matched names will be more similar. Defaults to 80.	`80`

Raises:

Type	Description
`TypeError`	Raised if unmatched_df or heat_df are not pandas DataFrames, or if heat_dob_col is not in pandas datetime format (the function tries to convert it first).
`ColumnDoesNotExistError`	Raised if any specified column does not exist in its DataFrame.
`FuzzyMatchIndexError`	Raised if unmatched_df does not have a unique index.

Returns:

Type	Description
`tuple[DataFrame, DataFrame]`	Two DataFrames: first DataFrame is matched data, second is remaining data for onward matching.
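
The year group filter can be sketched as a date of birth range calculation. The example below is an assumption-laden illustration, not the package's own calculation: `dob_range_for_year_group` is a hypothetical helper, it treats `academic_year_start` as the calendar year in which the academic year begins, and it assumes the English convention of a 1 September cut-off with no students held back or moved ahead a year.

```python
from datetime import date


def dob_range_for_year_group(year_group: int, academic_year_start: int) -> tuple[date, date]:
    """Return the (earliest, latest) expected date of birth for a year group.

    Assumes the English school system: a Year 7 student in the academic
    year starting September 2023 was born between 1 September 2011 and
    31 August 2012.
    """
    earliest = date(academic_year_start - year_group - 5, 9, 1)
    latest = date(academic_year_start - year_group - 4, 8, 31)
    return earliest, latest


# Hypothetical HEAT records at one school: (HEAT ID, name, date of birth).
heat_at_school = [
    ("H001", "Joanne Smith", date(2009, 11, 3)),
    ("H002", "Joanne Smyth", date(2012, 2, 14)),
]

# A student known only to be in Year 10 during the 2024/25 academic year.
earliest, latest = dob_range_for_year_group(10, 2024)

# Only records whose date of birth falls in range stay in the fuzzy match pool.
pool = [rec for rec in heat_at_school if earliest <= rec[2] <= latest]
print(earliest, latest, [rec[0] for rec in pool])
```

Narrowing the pool this way removes the similarly named but younger student before any name scoring takes place.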