Generating an MD5 checksum of a file


Question

Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files).

1
290
4/13/2016 2:52:55 AM

Accepted Answer

You can use hashlib.md5()

Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the Md5 function:

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Note: hash_md5.hexdigest() will return the hex string representation for the digest, if you just need the packed bytes use return hash_md5.digest(), so you don't have to convert back.

385
10/26/2018 2:07:20 AM

There is a way that's pretty memory inefficient.

single file:

import hashlib
def file_as_bytes(file):
    with file:
        return file.read()

print hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest()

list of files:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

Recall though, that MD5 is known broken and should not be used for any purpose since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

If you only want 128 bits worth of digest you can do .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.

Again I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.

Here is a way that is more complex, but memory efficient:

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon