I have two DataFrames that I want to merge on a column. However, because of alternate spellings, differing numbers of spaces, and the presence or absence of diacritical marks, I would like to merge rows as long as the key values are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib's).
Say one DataFrame has the following data:
    df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

           number
    one         1
    two         2
    three       3
    four        4
    five        5

    df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

          letter
    one        a
    too        b
    three      c
    fours      d
    five       e
Then I want to get the resulting DataFrame
           number letter
    one         1      a
    two         2      b
    three       3      c
    four        4      d
    five        5      e
    In [1]: import difflib

    In [2]: difflib.get_close_matches
    Out[2]: <function difflib.get_close_matches>

    # get_close_matches returns a list ordered best-first, so take element [0]
    In [3]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

    In [4]: df2
    Out[4]:
          letter
    one        a
    two        b
    three      c
    four       d
    five       e

    In [5]: df1.join(df2)
    Out[5]:
           number letter
    one         1      a
    two         2      b
    three       3      c
    four        4      d
    five        5      e
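One caveat with taking the first element: `difflib.get_close_matches` returns an empty list when nothing clears its similarity cutoff (0.6 by default), so indexing the result raises `IndexError` for keys with no close match. Here is a minimal sketch of a safer lookup that also normalizes whitespace, case, and diacritics before comparing, since the question mentions those. The helper names `normalize` and `best_match` are made up for illustration and are not part of `difflib` or pandas.

```python
import difflib
import unicodedata


def normalize(text):
    """Strip diacritics, collapse runs of whitespace, and lowercase."""
    # NFKD splits accented characters into a base character plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.split()).lower()


def best_match(key, candidates, cutoff=0.6):
    """Return the candidate closest to key, or None if none clears the cutoff."""
    # Map normalized spellings back to the original candidate strings
    lookup = {normalize(c): c for c in candidates}
    matches = difflib.get_close_matches(normalize(key), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


keys = ["one", "two", "three", "four", "five"]
print(best_match("too", keys))
print(best_match("  Thrée ", keys))
print(best_match("zzz", keys))
```

Returning `None` instead of raising lets you decide afterwards whether to drop unmatched rows or keep them with missing values.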
If these were columns rather than indexes, you could, in the same vein, apply the function to the column and then merge:
    df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
    df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

    # Replace each name in df2 with its closest match in df1 (first element of the returned list)
    df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
    df1.merge(df2)
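The column version overwrites `df2['name']` and assumes every value has a close match. As a rough sketch of a variant that keeps the original spelling and tolerates unmatched rows, you could write the match into a separate column and merge on that. The `closest` helper, the `match` column, the extra `'zzz'` row, and the `_df2` suffix are all assumptions added for illustration.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({"number": [1, 2, 3, 4, 5],
                    "name": ["one", "two", "three", "four", "five"]})
# 'zzz' is an extra row with no counterpart in df1
df2 = pd.DataFrame({"letter": ["a", "b", "c", "d", "e", "f"],
                    "name": ["one", "too", "three", "fours", "five", "zzz"]})


def closest(name, candidates, cutoff=0.6):
    """Best match for name among candidates, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


candidates = df1["name"].tolist()
# Keep df2's original spelling and store the matched key alongside it
df2["match"] = df2["name"].map(lambda x: closest(x, candidates))

# Left join keeps every df1 row; df2 rows without a match simply do not join
merged = df1.merge(df2, left_on="name", right_on="match",
                   how="left", suffixes=("", "_df2"))
print(merged[["number", "name", "letter", "name_df2"]])
```

Keeping `name_df2` in the result makes it easy to review which spellings were paired, which helps when choosing a cutoff.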
I have written a Python package which aims to solve this problem:
    pip install fuzzymatcher
Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:
    import fuzzymatcher

    # Columns to match on from df_left
    left_on = ["fname", "mname", "lname", "dob"]

    # Columns to match on from df_right
    right_on = ["name", "middlename", "surname", "date"]

    # The link table potentially contains several matches for each record
    fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Or if you just want to link on the closest match:
    fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)