I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? How can I detect the encoding/codepage of a text file deals with C#.
Correctly detecting the encoding all times is impossible.
(From chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.
You can also use UnicodeDammit. It will try the following methods:
import magic blob = open('unknown-file').read() m = magic.open(magic.MAGIC_MIME_ENCODING) m.load() encoding = m.buffer(blob) # "utf-8" "us-ascii" etc
There is an identically named, but incompatible, python-magic pip package on pypi that also uses
libmagic. It can also get the encoding, by doing:
import magic blob = open('unknown-file').read() m = magic.Magic(mime_encoding=True) encoding = m.from_buffer(blob)