How do I remove hex values in a python string with regular expressions?


Question

I have a cell array in MATLAB:

columns = {'MagX', 'MagY', 'MagZ', ...
           'AccelerationX',  'AccelerationY',  'AccelerationZ', ...
           'AngularRateX', 'AngularRateY', 'AngularRateZ', ...
           'Temperature'}

I use these scripts, which rely on MATLAB's hdf5write function, to save the array in the HDF5 format.

I then read the HDF5 file into Python using PyTables. The cell array comes in as a numpy array of strings. I convert it to a list and this is the output:

>>>columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
 'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
 'MagZ\x00\x00\x00\x00\x001',
 'AccelerationX',
 'AccelerationY',
 'AccelerationZ',
 'AngularRateX',
 'AngularRateY',
 'AngularRateZ',
 'Temperature']

These hex values pop into the strings from somewhere and I'd like to remove them. They don't always appear in the first three items of the list, so I need a reliable way to deal with them, or to find out why they are there in the first place.

>>>print columns[0]
Mag8�
>>>columns[0]
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>>print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

I've tried using a regular expression to remove the hex values but have had little luck.

>>>re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>>re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'

Any suggestions on how to deal with this?

3/4/2011 6:53:30 PM

Accepted Answer

You can remove all non-word characters in the following way:

>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

The regex [^\w] matches any character that is not a letter, digit, or underscore (it is equivalent to \W). Passing it to re.sub with an empty string as the replacement deletes every such character and leaves only the word characters.
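One caveat if this is run on Python 3 rather than the Python 2 used in this thread: for str patterns \w is Unicode-aware, so a byte such as \xe6 is read as the letter U+00E6 and survives. The sketch below assumes Python 3 and uses the re.ASCII flag to get the same result as the transcript above.

```python
import re

raw = 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

# On Python 3, \w is Unicode-aware for str patterns, so '\xe6' counts
# as a letter and is kept.
unicode_aware = re.sub(r'\W', '', raw)

# re.ASCII restricts \w to [a-zA-Z0-9_], which reproduces the Python 2
# byte-string behaviour shown in the answer.
ascii_only = re.sub(r'\W', '', raw, flags=re.ASCII)

print(repr(unicode_aware))
print(repr(ascii_only))
```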

Since you may want to keep other characters, such as spaces or punctuation, a better solution might be to keep the whole printable ASCII range and strip everything outside it, which removes control characters and high bytes. For example:

>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems clearer to you.
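Applied to the whole list, the printable-range pattern could be wrapped up as below. This is a minimal sketch assuming Python 3; the names NON_PRINTABLE and strip_non_printable are made up for illustration, and the sample list is a shortened copy of the one in the question.

```python
import re

# Anything outside printable ASCII (space through tilde).
NON_PRINTABLE = re.compile(r'[^\x20-\x7e]')

def strip_non_printable(names):
    """Return a new list with non-printable characters removed from each name."""
    return [NON_PRINTABLE.sub('', name) for name in names]

columns = [
    'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
    'MagZ\x00\x00\x00\x00\x001',
    'Temperature',
]

cleaned = strip_non_printable(columns)
print(cleaned)
```

Note that stray printable characters that follow the control characters, such as the trailing 8 and 1 here, are kept by this approach.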

To also drop everything that follows the first non-printable character, including stray printable characters like the 8 above, just append .* to the pattern, like this:

>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'
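If the junk always starts right after a NUL byte, as it does in every sample in the question, the same truncation can be done without a regular expression. The sketch below rests on that assumption (that the names are NUL-terminated and whatever follows the terminator is leftover buffer content); truncate_at_nul is a hypothetical helper name.

```python
def truncate_at_nul(name):
    """Keep only the text before the first NUL character.

    Names that contain no NUL are returned unchanged.
    """
    return name.split('\x00', 1)[0]

columns = [
    'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
    'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
    'MagZ\x00\x00\x00\x00\x001',
    'Temperature',
]

cleaned = [truncate_at_nul(name) for name in columns]
print(cleaned)
```

Unlike the character-class approach, this would leave a control character in place if one appeared before the first NUL, so it only fits data that matches the assumption.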
3/4/2011 7:24:28 PM

The backslash sequences are not literally in the string: it contains raw control characters, which Python displays using hexadecimal escape notation. That's why you see an unusual symbol when you print the value.
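A quick way to confirm this, sketched here for Python 3, is to check the length of the string and search it for a literal backslash. The value is the first entry of the list from the question.

```python
import re

name = 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

# Each \xNN escape in the display stands for a single character.
length = len(name)

# There is no literal backslash in the data, so a pattern that looks
# for a backslash followed by "x" can never match.
has_backslash = '\\' in name
literal_match = re.search(r'\\x', name)

# Matching the NUL character itself does work.
first_nul = re.search(r'\x00', name).start()

print(length, has_backslash, literal_match, first_nul)
```

This is why the attempts in the question that search for \\x leave the string unchanged.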

You should be able to simply remove the extra levels of quoting in your regular expression. Be aware, though, that the re module's generic whitespace class \s will not help here: it matches only whitespace (space, tab, newline, carriage return, form feed, and vertical tab), so NUL and the other control characters are left untouched:

>>> import re
>>> re.sub(r'\s', '?', "foo\x00bar")
'foo\x00bar'
>>> print re.sub(r'\s', '?', "foo\x00bar")
foobar
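To make the difference concrete, the following sketch (assuming Python 3) compares \s with an explicit class for ASCII control characters. The sample string and variable names are made up for illustration.

```python
import re

sample = 'a\tb\nc\x00d\x08e'

# \s covers whitespace only: the tab and newline are replaced,
# while NUL and backspace are left alone.
whitespace_only = re.sub(r'\s', '?', sample)

# An explicit class for the ASCII control range (0x00-0x1f plus DEL)
# replaces all four.
all_controls = re.sub(r'[\x00-\x1f\x7f]', '?', sample)

print(repr(whitespace_only))
print(repr(all_controls))
```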

I use this one a bit to replace all input whitespace runs, including non-breaking space characters, with a single space:

>>> re.sub(r'[\xa0\s]+', ' ', input_str)
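As a runnable illustration of that one-liner, here is a small sketch assuming Python 3; collapse_whitespace and the sample input are hypothetical. On Python 3 str patterns \s already matches the non-breaking space, so listing \xa0 explicitly is harmless but only strictly needed for Python 2 byte strings.

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace or non-breaking spaces with one space."""
    return re.sub(r'[\xa0\s]+', ' ', text)

input_str = 'foo\xa0\xa0 bar\t\nbaz'
normalized = collapse_whitespace(input_str)
print(repr(normalized))
```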

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow