Solved python regex raising exception “unmatched group”

If you're a regex guru and you know why you came here, you can go straight to the brief explanation. If not, just keep reading.

I found a workaround for Python bug 1519638. It will certainly not solve every puzzle out there, but it keeps the sub method from breaking when the replacement string uses backreferences.

The problem

If you would like to replace this:

<label for="author"><small>Name

With this:

<label for="author"><small>Naam

And you’re not sure whether the <small> tag is there, you would group the characters “<small>” and add a question mark to make the group optional. BTW, running a replace on just “Name” is not an option because it would mess up other parts of the file in question.

Example updated. Thanx dbr!

The solution

Using a compiled regex pattern for the replacement, a solution might look like this:

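import re

data = '<label for="author"><small>Name'
# The space is escaped because re.VERBOSE ignores unescaped whitespace.
reg = re.compile(r'(<label\ for="author">)(<small>)?(Name)',
    re.VERBOSE | re.MULTILINE | re.DOTALL)
replace = r'\g<1>\g<2>Naam'
search = reg.sub(replace, data)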

In this case the replacement string uses backreferences to the groups, which are the subexpressions within the parentheses in the search pattern.

The oops

However, if the “<small>” tag is not there, the sub call raises an exception.

$ python regex.py
Traceback (most recent call last):
  File "regex.py", line 14, in <module>
    search = reg.sub(replace, data)
  File "/usr/lib/python2.5/re.py", line 274, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/lib/python2.5/sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

This happens because the second group, referenced as “\g<2>” in the replacement string, did not take part in the match and therefore yields “None” instead of an empty string. That is (or seems to be) the bug.
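The “None” can be observed directly on the match object, without going through sub. The following is a minimal sketch, assuming the two sample strings from the article as input; the variable names are only illustrative. It leaves out re.VERBOSE so the literal space in the pattern is kept. Note that from Python 3.5 onwards, sub substitutes an empty string for an unmatched group, so the traceback above only shows up on older interpreters.

```python
import re

# Pattern from the article, compiled without re.VERBOSE.
optional = re.compile(r'(<label for="author">)(<small>)?(Name)')

with_tag = optional.search('<label for="author"><small>Name')
without_tag = optional.search('<label for="author">Name')

# Group 2 took part in the first match, but not in the second one.
print(repr(with_tag.group(2)))
print(repr(without_tag.group(2)))
```

The second print shows None, which is the value the old template expansion refused to insert.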

Solving the oops

This can be resolved by replacing the optional notation “(<small>)?” with an alternation “(|<small>)”. When the “<small>” tag is absent, the group matches on the empty subexpression, so it actually returns an empty string and the sub call won’t raise the exception.
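As a quick check of the workaround, here is a small sketch that runs the alternation form against both variants of the input. The sample strings and variable names are assumptions for illustration, and re.VERBOSE is left out so the space in the pattern stays literal.

```python
import re

# Alternation with an empty first branch instead of an optional group.
alternation = re.compile(r'(<label for="author">)(|<small>)(Name)')
replacement = r'\g<1>\g<2>Naam'

samples = [
    '<label for="author"><small>Name',
    '<label for="author">Name',
]
results = [alternation.sub(replacement, data) for data in samples]

for result in results:
    print(result)

# Group 2 always takes part in the match, so it is '' rather than None.
print(repr(alternation.search(samples[1]).group(2)))
```

Both inputs are rewritten, and the group is an empty string when the tag is missing.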

In other words …

Brief explanation

When doing a search and replace with sub, replace the group written as optional with a group written as an alternation that has one empty subexpression. So instead of “(.+?)?” use “(|.+?)” (without the double quotes).

If there’s nothing matched by this group the empty subexpression matches. Then an empty string is returned instead of a None and the sub method is executed normally instead of raising the “unmatched group” error.
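The general “(|.+?)” form can be sketched the same way. This example assumes three made-up input strings and escapes the space in the pattern, since re.VERBOSE would otherwise ignore it.

```python
import re

# re.VERBOSE ignores unescaped whitespace, hence the "\ " in the pattern.
pattern = re.compile(r'(<label\ for="author">)(|.+?)Name', re.VERBOSE)
replacement = r'\g<1>\g<2>Naam'

samples = [
    '<label for="author">Name',
    '<label for="author"><small>Name',
    '<label for="author"><b><i>Name',
]

for data in samples:
    match = pattern.search(data)
    # Show what group 2 captured, followed by the rewritten string.
    print(repr(match.group(2)), '->', pattern.sub(replacement, data))
```

The empty branch is tried first, and the lazy “.+?” branch only kicks in when something sits between the label tag and “Name”.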

That’s all folks …


10 Responses to Solved python regex raising exception “unmatched group”

  1. nneonneo wrote:

    Hi:

    I’m the original poster of the bug, and I’m writing to thank you for the fix.

    It’s been almost two years since I posted that bug, and I’d lost hope that it would be fixed. Your workaround will allow me to fix my scripts to finally avoid the silly hacks I’ve been using in the meantime.

    Again, thank you! :)

  2. dbr wrote:

    Not sure this is really a bug.. You are trying to use a referenced group that might not exist (the “()?” one)..

    Not exactly sure of what you are trying to achieve in the end (The example could be done by data.replace(“Blue”,”Red”) ), but the way I’d do it is..

    import re
    data = “Blue”

    reg = re.compile(r’((?:)?)(Blue)’, re.VERBOSE | re.MULTILINE | re.DOTALL)
    print reg.sub(“\gBlue”, data)

  3. dbr wrote:

    Err, the comment system messed up the quotes and angle-brackets.. I posted the same comment on reddit: http://www.reddit.com/info/6rbg9/comments/#c04nqd4

  4. Gerard wrote:

    To “nneonneo”: Very welcome! It gave me a headache for about a week .. :-)

  5. Gerard wrote:

    To “dbr”: It is indeed debatable whether it is a bug or not. And the example, now that I read your response, is not that accurate.

    The replace I was going for originated from this one:

    (<label\ for="author">)(|.+?)Name
    \g<1>\g<2>Naam
    

    Where “\g<2>” could be any font-like attribute in HTML tag form. And I could not just do your trick on the “Name” to “Naam” replace, because that would most definitely mess up other parts of the files I was going through.

    Anyway thanx for the repost on the brackets/quotes issue. I was past the point of no return on that .. ;-)

    Regards,

    Gerard.

  6. Jon wrote:

    This is NOT a bug.
    In most regex libraries the ()? match will return null if the sub expression does not match. Look at Java, C#, etc; They all do this.

  7. Gerard wrote:

    To “Jon”: I merely mentioned the term bug because it is on the python bug list. Nevertheless, one could debate on whether it is or not.

    Thanx for the heads-up!

    Gerard.

  8. Gerard,

    remind me, are you my ex-colleague from Energis/Enertel?

    If so, it’s quite bizarre that I see you meddle with Python at about the same time I have a Python programming job. :)

  9. Gerard wrote:

    Most likely,

    Sent you an email … :-)

  10. Daggering wrote:

    Awesome site man. It is easy to see that you like blogging.
