Let's say I have a
"Hello" and a list
words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']
How can I find the
n words that are the closest to
"Hello" and present in the list
In this case, we would have
['hello', 'hallo', 'Hallo', 'hi', 'format'...]
So the strategy is to sort the list words from the closest word to the furthest.
I thought about something like this
word = 'Hello' for i, item in enumerate(words): if lower(item) > lower(word): ...
but it's very slow in large lists.
difflib works but it's very slow also. (
words list has 630000+ words inside (sorted and one per line)). So checking the list takes 5 to 7 seconds for every search for closest word!
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format'] >>> difflib.get_close_matches('Hello', words) ['hello', 'Hallo', 'hallo']
Please look at the documentation, because the function returns 3 or less closest matches by default.
There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.
The idea is to build all possible edits of your word,
hello - helo - deletes hello - helol - transpose hello - hallo - replaces hello - heallo - inserts def edits1(word): splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes = [a + b[1:] for a, b in splits if b] transposes = [a + b + b + b[2:] for a, b in splits if len(b)>1] replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b] inserts = [a + c + b for a, b in splits for c in alphabet] return set(deletes + transposes + replaces + inserts)
Now, look up each of these edits in your list.
Peter's article is a great read and worth reading.