I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)
I realize that I could .match() substrings with a regex, but that doesn't solve nesting /*, or having a // inside a /* */.
Ideally, I would prefer a non-naive implementation that properly handles awkward cases.
I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be used with the following code:
```python
import subprocess

# source_code is a string with the source code.
process = subprocess.Popen(
    # -f makes sed read its commands from the script file; add '-n' as well
    # if the script's first line asks for it (e.g. "#! /bin/sed -nf").
    ['sed', '-f', '/path/to/remccoms3.sed'],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,  # exchange str rather than bytes with sed
)
# Feed the source to sed and collect whatever it writes to stdout.
stripped_code, _ = process.communicate(source_code)
return_code = process.returncode
```

In this program, source_code is the variable holding the C/C++ source code, and eventually stripped_code will hold the C/C++ code with the comments removed. Of course, if you have the file on disk, you could pass file handles pointing to those files as stdin and stdout instead (the input file opened in read mode, the output file in write mode).
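For the files-on-disk case, a minimal sketch might look like the following. The function name and its parameters are hypothetical, chosen only for illustration, and the sketch assumes that sed is on the PATH and that the script should be run with -f:

```python
import subprocess

def strip_comments_on_disk(script_path, src_path, dst_path, sed='sed'):
    """Run a sed script over src_path and write the result to dst_path.

    The function and parameter names are hypothetical, not part of any library.
    """
    # Open the input in read mode and the output in write mode, then hand
    # both handles to sed as its stdin and stdout.
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        completed = subprocess.run([sed, '-f', script_path], stdin=src, stdout=dst)
    return completed.returncode
```

Called as strip_comments_on_disk('/path/to/remccoms3.sed', 'in.c', 'out.c'), it returns sed's exit status, and out.c holds whatever sed printed.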
remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk.
sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.
This will probably be better than a pure Python solution; no need to reinvent the wheel.
This handles C++-style comments, C-style comments, strings and simple nesting thereof.
```python
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    # Alternatives, in order: // comment, /* */ comment,
    # character literal, string literal.
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)
```
Strings need to be included, because comment markers inside them do not start a comment.
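To see why the string alternative matters, here is a small self-contained sketch that compares a comments-only pattern with the string-aware one. The names NAIVE, STRING_AWARE and strip are made up for this illustration:

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

# Comments only: no knowledge of string or character literals.
NAIVE = re.compile(r'//.*?$|/\*.*?\*/', FLAGS)

# Comments plus character and string literals, as in the answer's pattern.
STRING_AWARE = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"', FLAGS
)

def strip(pattern, text):
    # Comments become a single space; literals are put back unchanged.
    return pattern.sub(
        lambda m: " " if m.group(0).startswith('/') else m.group(0), text
    )

code = 'char *url = "http://example.com"; // trailing comment'
naive_result = strip(NAIVE, code)
aware_result = strip(STRING_AWARE, code)

print(repr(naive_result))
print(repr(aware_result))
```

The comments-only pattern cuts the line at the // inside the URL, while the string-aware pattern keeps the literal intact and removes only the trailing comment.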
Edit: re.sub didn't take any flags, so had to compile the pattern first.
Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.
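The character-literal case can be checked in isolation. The following is only a sketch: WITHOUT_CHARS, WITH_CHARS and strip are illustrative names, and the first pattern is simply the answer's pattern with the character-literal alternative taken out:

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

# Comments and string literals only (no character-literal alternative).
WITHOUT_CHARS = re.compile(r'//.*?$|/\*.*?\*/|"(?:\\.|[^\\"])*"', FLAGS)

# The same pattern with the character-literal alternative included.
WITH_CHARS = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"', FLAGS
)

def strip(pattern, text):
    # Comments become a single space; literals are put back unchanged.
    return pattern.sub(
        lambda m: " " if m.group(0).startswith('/') else m.group(0), text
    )

code = 'char q = \'"\'; /* comment */ char *s = "x";'
without_chars = strip(WITHOUT_CHARS, code)
with_chars = strip(WITH_CHARS, code)

print(without_chars)
print(with_chars)
```

Without the character-literal alternative, the quote inside '"' is taken as the start of a string that swallows the comment, so the comment survives. With it, the comment is removed.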
Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5; which would not compile, by replacing the comment with a space rather than an empty string.
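The difference is easy to reproduce with just the C-style comment part of the pattern. This is a minimal sketch; the variable names are illustrative:

```python
import re

block_comment = re.compile(r'/\*.*?\*/', re.DOTALL)
code = 'int/**/x=5;'

with_empty = block_comment.sub('', code)   # comment replaced by nothing
with_space = block_comment.sub(' ', code)  # comment replaced by a space

print(with_empty)  # intx=5;
print(with_space)  # int x=5;
```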