is it possible to do fuzzy match merge with python pandas?


Question

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.

Any similarity algorithm will do (soundex, Levenshtein, difflib's).

Say one DataFrame has the following data:

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

Then I want to get the resulting DataFrame

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
1
48
12/3/2012 10:08:10 AM

Accepted Answer

Similar to @locojay suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

.

If these were columns, in the same vein you could apply to the column then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
61
11/5/2018 2:31:54 PM

I have written a Python package which aims to solve this problem:

pip install fuzzymatcher

You can find the repo here and docs here.

Basic usage:

Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:

from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

Or if you just want to link on the closest match:

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon