Using regex to remove comments from source files


Question

I'm writing a program to automate the writing of some C code (it generates code that parses strings into enumerations with the same name). C's handling of strings is not that great, so some people have been nagging me to try Python.

I wrote a function that is supposed to remove C-style /* COMMENT */ and // COMMENT comments from a string. Here is the code:

def removeComments(string):
    re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,string) # remove all occurrences of block comments (/* COMMENT */) from string
    re.sub(re.compile("//.*?\n" ) ,"" ,string) # remove all occurrences of single-line comments (// COMMENT\n) from string

So I tried this code out.

str="/* spam * spam */ eggs"
removeComments(str)
print str

And it apparently did nothing.

Any suggestions as to what I've done wrong?

There's a saying I've heard a couple of times:

If you have a problem and you try to solve it with Regex you end up with two problems.


EDIT: Looking back at this years later, after a fair bit more parsing experience:

I think regex may have been the right solution, and the simple regex used here was "good enough". I may not have emphasized this enough in the question: this was for a single specific file that had no tricky situations. I think it would be a lot less maintenance to keep the file being parsed simple enough for the regex than to complicate the regex into an unreadable symbol soup.

6/8/2017 6:53:54 PM

Accepted Answer

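re.sub returns a new string; it does not modify the string passed to it (Python strings are immutable). Your function discards both results and returns nothing. Assign each result back and return the final string, and have the caller use the return value, e.g. str = removeComments(str):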

import re

def removeComments(string):
    # Remove all block comments (/* COMMENT */); DOTALL lets "." match newlines too.
    string = re.sub(r"/\*.*?\*/", "", string, flags=re.DOTALL)
    # Remove all single-line comments (// COMMENT\n), including the trailing newline.
    string = re.sub(r"//.*?\n", "", string)
    return string
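
To make the difference concrete, here is a minimal sketch that reuses the test string from the question. The name remove_comments_simple and the variables unchanged and cleaned are made up for this illustration; the patterns are the ones from the answer, written as raw strings.

```python
import re

def remove_comments_simple(source):
    # Hypothetical helper: same two substitutions as in the answer above.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//.*?\n", "", source)
    return source

text = "/* spam * spam */ eggs"

# Calling re.sub without using its return value leaves `text` untouched,
# because Python strings are immutable.
re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
unchanged = text

# Binding the returned string to a name is what actually gives the result.
cleaned = remove_comments_simple(text)
print(repr(unchanged))
print(repr(cleaned))
```

Only the second call has a visible effect, since its return value is kept.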
4/7/2019 3:51:50 PM

Many answers have already been given, but what about "//comment-like strings inside quotes"?

The OP is asking how to do it using regular expressions, so:

import re

def remove_comments(string):
    pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
    # first group captures quoted strings (double or single)
    # second group captures comments (//single-line or /* multi-line */)
    regex = re.compile(pattern, re.MULTILINE|re.DOTALL)
    def _replacer(match):
        # if the 2nd group (capturing comments) is not None,
        # it means we have captured a non-quoted (real) comment string.
        if match.group(2) is not None:
            return "" # so we will return empty to remove the comment
        else: # otherwise, we will return the 1st group
            return match.group(1) # captured quoted-string
    return regex.sub(_replacer, string)

This WILL remove:

  • /* multi-line comments */
  • // single-line comments

Will NOT remove:

  • String var1 = "this is /* not a comment. */";
  • char *var2 = "this is // not a comment, either.";
  • url = 'http://not.comment.com';
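
The two lists above can be checked with a short script. This is only a sketch: the function is repeated so the snippet runs on its own, and the sample input is made up for illustration.

```python
import re

def remove_comments(string):
    # Same pattern as in the answer above, repeated so this snippet is self-contained.
    pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
    regex = re.compile(pattern, re.MULTILINE | re.DOTALL)

    def _replacer(match):
        # Drop real comments (group 2); keep quoted strings (group 1) untouched.
        return "" if match.group(2) is not None else match.group(1)

    return regex.sub(_replacer, string)

# Hypothetical sample input, invented for this check.
sample = (
    'int x = 1; // single-line comment\n'
    '/* multi-line\n'
    '   comment */\n'
    'char *var2 = "this is // not a comment, either.";\n'
    "url = 'http://not.comment.com';\n"
)

cleaned = remove_comments(sample)
print(cleaned)
```

The comments disappear, while the quoted strings that merely contain // survive. Note that text around a removed comment, such as the space before it and the newline after it, is left in place.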

Note: This will also work for JavaScript source.
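
One case the pattern above does not cover is a quoted string containing an escaped quote, such as "a \" // b": the lazy .*? stops at the escaped quote, and the rest of the line is then treated as a comment. A possible variant, which is an assumption on my part and not part of the original answer, is to let a backslash consume the character that follows it inside a string. The name remove_comments_escaped is hypothetical.

```python
import re

# Variant pattern (an assumption, not from the original answer):
# inside quotes, "\\." consumes a backslash plus the next character as a unit,
# so \" or \' no longer terminates the string early.
_PATTERN = re.compile(
    r'''("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')|(/\*.*?\*/|//[^\r\n]*$)''',
    re.MULTILINE | re.DOTALL,
)

def remove_comments_escaped(string):
    def _replacer(match):
        # Group 2 is a real comment: remove it. Group 1 is a quoted string: keep it.
        return "" if match.group(2) is not None else match.group(1)
    return _PATTERN.sub(_replacer, string)

# The C line:  char *s = "a \" // b"; // real comment
source = 'char *s = "a \\" // b"; // real comment\n'
result = remove_comments_escaped(source)
print(result)
```

Here the string literal is kept whole and only the trailing comment is removed. This is still a regex approximation rather than a real lexer, so other constructs may need their own handling.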


Licensed under: CC-BY-SA with attribution