I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n \n DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n [more of the above, ending with a newline]\n [yep, there is a variable number of lines here]\n \n (repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that comes two lines below it in one capture (i can strip out the newline characters later). I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text. I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, its supposed to be a sequence of aminoacids that make up a protein.
I think your biggest problem is that you're expecting the
$ anchors to match linefeeds, but they don't. In multiline mode,
^ matches the position immediately following a newline and
$ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
This will work:
>>> import re >>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE) >>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines >>> text="""Some varying text1 ... ... AAABBBBBBCCCCCCDDDDDDD ... EEEEEEEFFFFFFFFGGGGGGG ... HHHHHHIIIIIJJJJJJJKKKK ... ... Some varying text 2 ... ... LLLLLMMMMMMNNNNNNNOOOO ... PPPPPPPQQQQQQRRRRRRSSS ... TTTTTUUUUUVVVVVVWWWWWW ... """ >>> for match in rx_sequence.finditer(text): ... title, sequence = match.groups() ... title = title.strip() ... sequence = rx_blanks.sub("",sequence) ... print "Title:",title ... print "Sequence:",sequence ... print ... Title: Some varying text1 Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK Title: Some varying text 2 Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful:
^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
(.+?)\n\nmeans "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\nmeans "match as many upper case letters as possible until you reach a newline. This defines what I will call a textline.
)+)means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
\nin the regular expression if you want to enforce a double newline at the end.
\r\n) then just fix the regular expression by replacing every occurrence of