ru in re.sub unsupported?

peiriannydd

I get a syntax error when I try:
replaced = re.sub(ru'anā\B', u'nā', s)
It is complaining about the combined ru prefix. This syntax does work on my Mac. Is there a way to make this work in Pythonista?

ccc

This is about Python 2 vs. Python 3... https://github.com/ymcui/Chinese-ELECTRA/pull/57

The way to fix this is to check whether r"string" == ur"string" or u"string" == ur"string", and then use whichever form matches on both Py2 and Py3.

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print(ur'anā\B' == r'anā\B')  # True
print(ur'anā\B' == u'anā\B')  # True
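For comparison, here is a rough Python 3 sketch of how the prefixes behave there. It is not taken from the linked PR; the helper name prefix_is_accepted is made up for illustration. It uses compile() so the parser's reaction can be observed without the SyntaxError stopping the script.

```python
# Python 3 only. prefix_is_accepted is a hypothetical helper, not a library function.
def prefix_is_accepted(prefix):
    """Return True if prefix + 'abc' parses as a string literal."""
    try:
        compile(prefix + "'abc'", "<prefix-check>", "eval")
        return True
    except SyntaxError:
        return False

# u and r are fine on their own; the combined forms are rejected.
results = {p: prefix_is_accepted(p) for p in ("u", "r", "ur", "ru")}
print(results)

# The u prefix is accepted but changes nothing: every str is already Unicode.
u_is_noop = (u'abc' == 'abc')
print(u_is_noop)
```

So on Py3 a plain r prefix should be all that is needed for the pattern.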

peiriannydd

@ccc I'm afraid I don't follow. I'm using Py3, and neither u'anā\B' nor r'anā\B' catches the expression I need to replace. Is there anything I can do?

Thank you very much for your time.

ccc

Perhaps provide 3 strings that you want the expression to match and three similar strings that you don’t want it to match. With that set of test cases, we will see what we can do.

JonB

re.sub(re.escape('anā\B'),'nā', 'anā\B')
# seems to work. Or:
re.sub('anā\\\\B','nā', 'anā\B')
# or
re.sub('anā\\\B','nā', 'anā\B')
# or
re.sub(r'anā\\B', 'nā', 'anā\B')

Not sure I fully understand why, but I don't understand Unicode at a practical level. I guess \B is not a valid string escape code, so it must be represented by \\B. In fact, if you just type '\B' at the console, you will see it gets converted for you. But I guess re doesn't like that: with flags=re.DEBUG it complains about a NON BOUNDARY CHARACTER.
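A small sketch of what seems to be going on, assuming Python 3 and assuming the text really does contain a backslash followed by B. The name A_MACRON is only there so the example does not depend on how ā is encoded in the file.

```python
import re

A_MACRON = '\u0101'  # the letter ā, written as an escape (illustrative name)

escaped = '\\B'  # backslash written explicitly in a normal literal
raw = r'\B'      # raw literal: no escape processing at all
same = (escaped == raw)
print(same, len(raw))  # both spell the same two characters

# Text that contains a real backslash followed by B.
text = 'an' + A_MACRON + raw

# re.escape() backslash-escapes the regex-special characters, so the
# backslash in the text is matched literally instead of starting \B.
pattern = re.escape(text)
result = re.sub(pattern, 'n' + A_MACRON, text)
print(result)
```

That would explain why the re.escape version and the doubled-backslash versions behave the same.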

mikael

@peiriannydd

This:

re.sub(r'anā\\B', 'nā', 'anā\\B')

... is to me the "stereotypical" approach: use r to avoid too many backslashes in the regexp, and keep three kinds of backslash apart: those that mean something to the regexp engine, those that denote special characters in string literals, and those that are already in a string and have no special meaning whatsoever.
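To make the difference concrete, here is a minimal sketch, assuming Py3. The sample strings are invented for illustration and are not from the original question. In a regexp, r'\\B' matches a literal backslash followed by B, while r'\B' is the "not a word boundary" assertion and consumes no characters.

```python
import re

A_MACRON = '\u0101'  # ā as an escape, to sidestep file-encoding issues

# Hypothetical sample data.
with_backslash = 'an' + A_MACRON + '\\B'  # ends in a real backslash and B
inside_word = 'an' + A_MACRON + 'ka'      # ā is followed by another letter
before_space = 'an' + A_MACRON + ' ka'    # ā is followed by a space

literal_pat = 'an' + A_MACRON + r'\\B'    # regexp: literal backslash, then B
assert_pat = 'an' + A_MACRON + r'\B'      # regexp: non-word-boundary assertion
repl = 'n' + A_MACRON

# The literal pattern consumes the backslash and the B.
r1 = re.sub(literal_pat, repl, with_backslash)
# The assertion matches only where ā is followed by a word character.
r2 = re.sub(assert_pat, repl, inside_word)
r3 = re.sub(assert_pat, repl, before_space)
# A backslash is not a word character, so the assertion fails here.
r4 = re.sub(assert_pat, repl, with_backslash)

print(r1, r2, r3 == before_space, r4 == with_backslash)
```

Which of the two patterns is right depends on whether the backslash is part of the data or part of the regexp.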

Unicode has no impact here, as long as we are all happily using str in Py3.

To me, the first paragraphs of the official Python docs for re seem to cover all of this nicely.