Python: How to match nested parentheses with regex?


Question

I'm trying to match a string that looks like a mathematical expression and has nested parentheses.

import re

p = re.compile(r'\(.+\)')
s = '(((1+0)+1)+1)'
print(p.findall(s))

['(((1+0)+1)+1)']

I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don't even care if it matches unwanted ones like (((1+0), I can take care of those.

Why isn't it doing that already, and how can I do it?

3/28/2011 3:22:47 AM

Accepted Answer

The .+ in your pattern is greedy: it matches as much of the text as possible, so the first match consumes your whole string. findall only returns non-overlapping matches, so it doesn't look for additional matches inside text that has already been consumed. That's why you only get one answer.
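To make that behaviour concrete, here is a small sketch using only the standard re module. The lazy and innermost-only patterns are added for comparison and are not a fix: neither of them returns the nested groups either.

```python
import re

s = '(((1+0)+1)+1)'

# Greedy: .+ runs to the end of the string, then backtracks to the last ')'.
greedy = re.findall(r'\(.+\)', s)

# Lazy: .+? stops at the first ')', but the match still starts at the first '('.
lazy = re.findall(r'\(.+?\)', s)

# Innermost only: no parentheses are allowed between the delimiters.
innermost = re.findall(r'\([^()]*\)', s)

print(greedy)     # ['(((1+0)+1)+1)']
print(lazy)       # ['(((1+0)']
print(innermost)  # ['(1+0)']
```

In every case the scan resumes after the end of the previous match, so the enclosing groups are never reported.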

The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parser. If you really just want to capture the pieces within parentheses, loop over the characters, incrementing a counter when you see ( and decrementing it when you see ).
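A minimal sketch of that loop follows. It swaps the plain counter for a stack of opening positions (the stack's length is the counter), which makes it possible to slice out each group. The function name balanced_groups is made up for illustration.

```python
def balanced_groups(text):
    """Return every balanced parenthesized substring of text,
    in the order in which the closing parentheses appear."""
    stack = []   # indices of '(' characters that are still open
    groups = []
    for i, ch in enumerate(text):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if not stack:
                raise ValueError('unmatched ")" at index %d' % i)
            start = stack.pop()
            groups.append(text[start:i + 1])
    if stack:
        raise ValueError('unmatched "(" at index %d' % stack[-1])
    return groups


print(balanced_groups('(((1+0)+1)+1)'))
# ['(1+0)', '((1+0)+1)', '(((1+0)+1)+1)']
```

Nested groups come out innermost first, because an inner group always closes before the group that encloses it.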

3/28/2011 3:44:04 AM

As others have mentioned, regular expressions are not the way to go for nested constructs. I'll give a basic example using pyparsing:

import pyparsing # make sure you have this installed

thecontent = pyparsing.Word(pyparsing.alphanums) | '+' | '-'
parens     = pyparsing.nestedExpr( '(', ')', content=thecontent)

Here's a usage example:

>>> parens.parseString("((a + b) + c)")

Output:

(                          # all of str
 [
  (                        # ((a + b) + c)
   [
    (                      #  (a + b)
     ['a', '+', 'b'], {}   
    ),                     #  (a + b)      [closed]
    '+',
    'c'
   ], {}
  )                        # ((a + b) + c) [closed]
 ], {}  
)                          # all of str    [closed]

(Line breaks, indentation, and comments were added manually.)

Edit: Modified to eliminate unnecessary Forward, as per Paul McGuire's suggestions.

To get the output in nested list format:

res = parens.parseString("((12 + 2) + 3)")
res.asList()

Output:

[[['12', '+', '2'], '+', '3']]
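If the goal is the text of every enclosed expression, as in the question, the nested list can be walked recursively. This is a rough sketch that works on the plain list above rather than on pyparsing objects. The helpers render and subexpressions are hypothetical names, and the original spacing is not preserved (tokens are rejoined with single spaces).

```python
def render(node):
    """Rebuild the text of a nested list, adding the parentheses back."""
    if isinstance(node, list):
        return '(' + ' '.join(render(child) for child in node) + ')'
    return node


def subexpressions(node):
    """Yield the text of each parenthesized group, innermost first."""
    if isinstance(node, list):
        for child in node:
            for text in subexpressions(child):
                yield text
        yield render(node)


nested = [[['12', '+', '2'], '+', '3']]   # the asList() result shown above
# The outermost list is only a wrapper around the top-level groups.
found = [text for top in nested for text in subexpressions(top)]
print(found)
# ['(12 + 2)', '((12 + 2) + 3)']
```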

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow