Python: find closest string (from a list) to another string


Question

Let's say I have a string "Hello" and a list

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']

How can I find the n words that are the closest to "Hello" and present in the list words ?

In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]

So the strategy is to sort the list words from the closest word to the furthest.

I thought about something like this

word = 'Hello'
for i, item in enumerate(words):
    if lower(item) > lower(word):
      ...

but it's very slow in large lists.

UPDATE difflib works but it's very slow also. (words list has 630000+ words inside (sorted and one per line)). So checking the list takes 5 to 7 seconds for every search for closest word!

1
49
4/4/2012 9:32:24 PM

Accepted Answer

Use difflib.get_close_matches.

>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']

Please look at the documentation, because the function returns 3 or less closest matches by default.

77
4/4/2012 8:27:09 PM

There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.

http://norvig.com/spell-correct.html

The idea is to build all possible edits of your word,

hello - helo   - deletes    
hello - helol  - transpose    
hello - hallo  - replaces    
hello - heallo - inserts    


def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

Now, look up each of these edits in your list.

Peter's article is a great read and worth reading.


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon