Regex

Cook92

First off, hi! I'm pretty new to Python- having a lot of fun figuring things out. I check this forum often so thank you all for your help.

Here's my question:

Has anyone successfully installed and been able to use the regex module? USE is the key word...!

I was able to do the different steps: download from PyPI, unpack, move.
I actually had to install it on my PC in order to get a file that is generated by the setup.

Anyway after all that, lots of errors. Mainly import errors.
It seems as though some modules don't play nicely in pythonista.

I wanted to get this working because I'm trying to work with Japanese characters and the re module apparently doesn't cut it (which is true- I got some wacky matches using findall)

So- any success? If so, what did you do?

ccc

Only PyPI modules that are pure Python are possible candidates to work in Pythonista. The regex module contains C code, so it will not work in Pythonista. You should try to do what you need with the built-in re module instead.

dgelessus

The re module is probably fine, but most likely you're storing the Japanese text in a problematic way. This is not something you're doing wrong, but something that Python doesn't do right by default.

Python initially had no support for Unicode. If you don't know what exactly Unicode is, it's a standard that assigns numbers to basically all characters of all writing systems of the world, not just English letters like ASCII does. Although almost all versions of Python 2 have some Unicode support, the default str type and "strings" don't support Unicode for compatibility reasons.

A single character in a str is one byte, a number from 0 to 255. This is enough if you're just working with ASCII letters and maybe some other Latin letters, but not enough to fully support Unicode. This means that if you type e. g. a Japanese character in a string, it is stored in multiple bytes. Python thinks that 1 byte is 1 character, so it reads one Japanese character as multiple characters. The re module gets this string and treats the bytes of your Japanese character as multiple characters, and this is where things go wrong.
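
To make the byte-versus-character point concrete, here is a small sketch. It assumes the text is stored as UTF-8 (the character 中 is just an example), and it is written so that it should run on Python 2 as well as Python 3:

```python
# -*- coding: utf-8 -*-

char = u"中"                    # one Unicode character
encoded = char.encode("utf-8")  # the same character as raw UTF-8 bytes

char_count = len(char)          # 1: a Unicode string counts characters
byte_count = len(encoded)       # 3: a byte string counts bytes

# bytearray yields integers on both Python 2 and Python 3
byte_values = [hex(b) for b in bytearray(encoded)]

print(char_count)
print(byte_count)
print(byte_values)  # ['0xe4', '0xb8', '0xad']
```

Anything that works on the byte string, including re, sees three separate "characters" here instead of one.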

If you want to properly use non-ASCII characters in your strings, there are a few things you need to change in your code:

Add the comment # -*- coding: utf-8 -*- as the first line of your program. This line tells Python what encoding the file uses, i. e. how non-ASCII characters are stored. UTF-8 is an encoding that is compatible with ASCII, but can also encode any Unicode character. It's also what the Pythonista editor uses by default.
Add the prefix u before every string literal, e. g. "this is a string" becomes u"this is a string". This makes the string a Unicode string, which properly supports all Unicode characters.
Instead of converting objects to str if you want to convert them to text, convert them to unicode. For example str(mylist) becomes unicode(mylist). The output might look the same, but it is now a Unicode string.
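
Put together, the three changes might look roughly like this. This is only a sketch: the word list and the pattern are made up for illustration, and the text_type fallback is my own addition so the snippet also runs on Python 3, where unicode() no longer exists because str is already Unicode.

```python
# -*- coding: utf-8 -*-
import re

try:
    text_type = unicode  # Python 2: the separate Unicode string type
except NameError:
    text_type = str      # Python 3: str is already a Unicode string

# Hypothetical word list, written with u"" literals
words = [u"中国", u"丸い", u"日本中"]

# Unicode pattern, so the character class holds one character, not three bytes
pattern = re.compile(u"[中]+")

matches = [w for w in words if pattern.search(w)]

# Convert to text with unicode() (via text_type) instead of str()
label = text_type(len(matches))

print(label)
```

Only the two words that really contain 中 end up in matches.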

(Python 3 has much better Unicode support, but Pythonista still uses Python 2, and switching to Python 3 would be a lot of work and would break user code.)

Cook92

Okay thanks...great responses!

I didn't know that regex also had C code! :)

Some of those things I knew about Unicode in Python, some I didn't. I'll try later and see what happens!

Cook92

Okay it seems to work now after using unicode().

Basically I was doing a (?<=)[dog]+(?=) search on a list of words- but with Japanese characters and different lookahead/behind. The aim is to find words that use those characters.

Without making Unicode strings I got (false positive?) results like this: 中=丸. Very weird! The list I have has 140,000+ entries so there were a lot of false matches.
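
For anyone who finds this later, here is a minimal sketch of why 中 can "match" 丸 at the byte level. It assumes UTF-8 data and uses a simplified pattern without the lookahead/lookbehind. In UTF-8, 中 is the bytes E4 B8 AD and 丸 is E4 B8 B8, so a byte-level character class built from 中 accepts every byte of 丸.

```python
# -*- coding: utf-8 -*-
import re

target = u"中"  # character we are searching for
word = u"丸"    # a different character that should NOT match

# Byte-level search: the class becomes [\xe4\xb8\xad], i.e. three single bytes
byte_pattern = re.compile(b"[" + target.encode("utf-8") + b"]+")
byte_match = byte_pattern.search(word.encode("utf-8"))

# Unicode-level search: the class contains exactly one character
text_pattern = re.compile(u"[" + target + u"]+")
text_match = text_pattern.search(word)

print(byte_match is not None)  # True: false positive on the raw bytes
print(text_match is not None)  # False: no match on Unicode strings
```

The byte-level match covers all three bytes of 丸, which is the kind of false positive I was seeing.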

Thanks again for your help!