Regex oddity
-
When trying to use the re module in Pythonista, I get some weird behavior. Specifically, the re.sub method doesn't work as documented. Here's my code with sample text. The expression has been tested in multiple Python regex "testers" (e.g. http://regex101.com/r/yP7bA9/1 ):

    import re

    scores = [[u'Orlando 81 Washington 90 (3:55 IN 4TH)'],
              [u'Atlanta 59 Cleveland 87 (3:51 IN 3RD)'],
              [u'Utah 62 Toronto 69 (3:59 IN 3RD)'],
              [u'Indiana 46 Chicago 42 (0:03 IN 2ND)'],
              [u'Detroit 50 Memphis 51 (0:18 IN 2ND)'],
              [u'Minnesota 22 Dallas 28 (0:00 IN 1ST)'],
              [u'Brooklyn at Portland (10:00 PM ET)'],
              [u'San Antonio at Sacramento (10:00 PM ET)'],
              [u'Charlotte at Golden State (10:30 PM ET)'],
              [u'Phoenix at LA Clippers (10:30 PM ET)']]

    for score in scores:
        print score
        print re.sub('([a-zA-Z^ ]+?)(\\d+|at)\\s+?([a-zA-Z^ ]+?)(\\d+)?\\s+?(\\(.+\\))\\s+?', 'whatever replacment', score[0])
The sample text is below (there are extra spaces at the end of some lines); it's an array of arrays:

    Orlando 38 Washington 46 (1:36 IN 2ND)
    Atlanta 25 Cleveland 37 (0:28 IN 1ST)
    Utah 25 Toronto 23 (0:00 IN 1ST)
    Indiana at Chicago (8:00 PM ET)
    Detroit at Memphis (8:00 PM ET)
    Minnesota at Dallas (8:30 PM ET)
    Brooklyn at Portland (10:00 PM ET)
    San Antonio at Sacramento (10:00 PM ET)
    Charlotte at Golden State (10:30 PM ET)
    Phoenix at LA Clippers (10:30 PM ET)
The weird thing is that this seems to work when not in a for loop.
-
Part of the problem is that your regex101 does not match your gist expression... You are missing a few ?'s.
In general, it is easier to use raw strings for your expressions, that is, prefixed by an r, since you can paste the expression directly from other tools without needing to escape them.
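To make the raw-string point concrete, here is a small sketch. The two variable names and the sample line are only illustrative, and the pattern is a shortened fragment of the one in the question, not the full expression.

```python
import re

# The same pattern text written two ways: escaped by hand, and as a raw string.
escaped = '(\\d+|at)\\s+(\\(.+\\))'
raw = r'(\d+|at)\s+(\(.+\))'

# Both literals produce the identical pattern, so the raw form is purely
# a readability win (and can be pasted straight from a regex tester).
same = (escaped == raw)
print(same)

# Illustrative sample input, not taken from the real score feed.
line = u'at (10:00 PM ET)'
found = re.findall(raw, line)
print(found)
```

The raw form only changes how the string literal is written in source code; the regex engine sees the same pattern either way.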
Also, personally I find it easier to debug regular expressions by first using one of the match or findall methods, building up the expression as I go using implicit string concatenation on multiple lines with a comment on each group, e.g.

    re.findall(
        (r'([a-zA-Z^ ]+?)'  # first team name: letters, spaces and carets
         r'(\d+?|at)'       # either the score, or the word "at"
         .....
        ), score[0])
then you can comment out entire lines to make sure each group works before enabling the next.
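As a fuller sketch of that approach, here is the whole expression from the regex101 link split into commented pieces. The names PIECES, PATTERN and parse are made up for this example, and the per-group comments are my reading of what each group is meant to capture.

```python
import re

# Each group lives on its own line, so a line can be commented out while
# the earlier groups are being checked.
PIECES = (
    r'([a-zA-Z^ ]+?)'   # first team name: letters, spaces and carets
    r'(\d+?|at)'        # first score, or the word "at" for games not started
    r'\s+'
    r'([a-zA-Z^ ]+?)'   # second team name
    r'(\d+)?'           # second score, absent for games not started
    r'\s+?'
    r'(\(.+\))'         # game clock or start time, in parentheses
    r'\s*?'
)
PATTERN = re.compile(PIECES)


def parse(line):
    """Return the list of group tuples that findall extracts from one line."""
    return PATTERN.findall(line)


in_progress = parse(u'Orlando 81 Washington 90 (3:55 IN 4TH)')
not_started = parse(u'Brooklyn at Portland (10:00 PM ET)')
print(in_progress)
print(not_started)
```

Note that findall reports the missing second score as an empty string, which is why it shows five groups even for the "at" lines.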
Anyway, here's your code. All I did was copy the expression from regex101 and paste it as a raw string. Well, I also added a findall printout and showed how you can use your groups in a sub call.
Guessing at what you are doing, I suspect sub might not be what you want. I'm thinking findall might be what you really want, since it breaks each line up into a table.

    for score in scores:
        print re.findall(
            r'([a-zA-Z^ ]+?)(\d+?|at)\s+([a-zA-Z^ ]+?)(\d+)?\s+?(\(.+\))\s*?', score[0])
        print re.sub(
            r'([a-zA-Z^ ]+?)(\d+?|at)\s+([a-zA-Z^ ]+?)(\d+)?\s+?(\(.+\))\s*?', r'\1 --- \3', score[0])
-
@JonB
Yes, thank you!! - the raw string tip really helps - I was getting lost in "escape character hell" -
Turns out I was having an issue backreferencing an empty group (see http://bugs.python.org/issue1519638 ), so even though re.findall was returning 5 groups, I wasn't able to use re.sub to match/replace all the groups.
EDIT: I ended up using this workaround, adding an empty sub-group:
http://bugs.python.org/msg69541
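For anyone landing here later, a minimal sketch of one common form of that workaround. It may not be character-for-character what the linked message shows: the idea is to replace the optional group (\d+)? with (\d+|), so the group always takes part in the match (possibly as an empty string) and the backreference is never "unmatched". The names PATTERN and summarize and the '|' separators are just for illustration. As far as I know, Python 3.5 and later substitute an empty string for an unmatched group anyway, so this mainly matters on older interpreters such as Python 2.

```python
import re

PATTERN = re.compile(
    r'([a-zA-Z^ ]+?)'   # first team name
    r'(\d+?|at)'        # first score, or the word "at"
    r'\s+'
    r'([a-zA-Z^ ]+?)'   # second team name
    r'(\d+|)'           # second score, or empty: was (\d+)? in the original
    r'\s+?'
    r'(\(.+\))'         # clock or start time
    r'\s*?'
)


def summarize(line):
    """Rewrite one score line using all five groups in the replacement."""
    return PATTERN.sub(r'\1|\2|\3|\4|\5', line)


started = summarize(u'Orlando 81 Washington 90 (3:55 IN 4TH)')
upcoming = summarize(u'Brooklyn at Portland (10:00 PM ET)')
print(started)
print(upcoming)
```

Because group 4 now matches an empty string instead of being skipped, \4 is safe to use in the replacement for the "at" lines as well.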