I have a string that I want to use as a filename, so I want to remove all characters that wouldn't be allowed in filenames, using Python.
I'd rather be strict than otherwise, so let's say I want to retain only letters, digits, and a small set of other characters like
"_-.() ". What's the most elegant solution?
The filename needs to be valid on multiple operating systems (Windows, Linux and Mac OS) - it's an MP3 file in my library with the song title as the filename, and is shared and backed up between 3 machines.
This is the solution I ultimately used:
import unicodedata validFilenameChars = "-_.() %s%s" % (string.ascii_letters, string.digits) def removeDisallowedFilenameChars(filename): cleanedFilename = unicodedata.normalize('NFKD', filename).encode('ASCII', 'ignore') return ''.join(c for c in cleanedFilename if c in validFilenameChars)
The unicodedata.normalize call replaces accented characters with the unaccented equivalent, which is better than simply stripping them out. After that all disallowed characters are removed.
My solution doesn't prepend a known string to avoid possible disallowed filenames, because I know they can't occur given my particular filename format. A more general solution would need to do so.
You can look at the Django framework for how they create a "slug" from arbitrary text. A slug is URL- and filename- friendly.
The Django text utils define a function,
slugify(), that's probably the gold standard for this kind of thing. Essentially, their code is the following.
def slugify(value): """ Normalizes string, converts to lowercase, removes non-alpha characters, and converts spaces to hyphens. """ import unicodedata value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore') value = unicode(re.sub('[^\w\s-]', '', value).strip().lower()) value = unicode(re.sub('[-\s]+', '-', value))
There's more, but I left it out, since it doesn't address slugification, but escaping.