One more time, I've a problem with accents
-
I use Pythonista 3, in Python 3
I compare a file name (obtained with ftplib's nlst) to a file name built from an alert.dialog text field.
The file name is Xxxé.
When I print to the console or display in a ui.Label, each variable shows Xxxé, but when I compare the two variables, they are different.
When I loop on each character to print it, I get
Xxxe ́ for the ftp file name
Xxxé for the other
I really need help to understand and solve my problem.
Thanks in advance
-
You could try normalizing both strings before comparing them, using unicodedata.normalize, e.g.
    import unicodedata
    # ...
    filename = unicodedata.normalize('NFC', filename)
    dlg_text = unicodedata.normalize('NFC', dlg_text)
    if filename == dlg_text:
        # ...
(Source)
-
Thanks a lot, that solves my problem, but I don't understand what kind of encoding a string has when it contains/prints/displays é but prints e followed by ´ when I loop over each character!
-
@cvp Unicode has multiple ways of representing accented characters. Most accented characters have their own code point, for example é is U+00E9 (LATIN SMALL LETTER E WITH ACUTE). But almost all accents also exist as separate "combining" characters, which you can place after another character to add an accent to it. This means that you can also write é as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT).
Both variants of é look the same when you display them, and most systems even treat "split up" characters as one character in text fields and such, so if you delete a "split up" character it removes the entire character and not just the accent. But if you look at the string character by character, you'll notice that they are actually different.
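To make the difference visible, here is a small sketch that builds both variants of the file name from the question and lists their code points. The helper name describe is made up for this example; unicodedata.name is the standard library call that returns the official character name.

```python
import unicodedata

composed = "Xxx\u00e9"      # é stored as the single code point U+00E9
decomposed = "Xxxe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text (hypothetical helper)."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]

# Both strings display the same, but they are not equal and differ in length.
print(composed == decomposed)
print(len(composed), len(decomposed))

for line in describe(decomposed):
    print(line)
```

Looping over the decomposed string shows the plain e and the combining accent as two separate entries, which matches the output described in the question.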
That's why Unicode defines four forms of "normalization" for strings. Form "NFC" combines all letters and their accents into a single character if possible ("composition"), and form "NFD" splits them into separate letter and combining accents if possible ("decomposition"). There are also the "compatibility" forms "NFKC" and "NFKD", which do a few additional conversions. (Look up "Unicode equivalence" on Wikipedia if you want more details.)
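As a rough illustration of the four forms, the sketch below normalizes one short sample string with each of them. The sample is an assumption chosen for the demo: a decomposed é followed by the "fi" ligature (U+FB01), which only the compatibility forms split into plain f and i.

```python
import unicodedata

# Demo input (an assumption for this sketch): decomposed é plus the "fi" ligature.
sample = "e\u0301\ufb01"

forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# Print the code points of each result so the differences are visible.
for form, result in forms.items():
    print(form, [hex(ord(ch)) for ch in result])
```

NFC and NFD only differ in whether the accent is merged into the letter, while NFKC and NFKD additionally replace the ligature with its two ordinary letters.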
In most cases NFC is all you need, sometimes NFKC can be useful, and NFD and NFKD are almost never useful. But Apple's HFS+ file system (also called Mac OS Extended) uses a variant of the NFD form for file names, which means that if your FTP server is a Mac, it will give you decomposed characters instead of the composed characters that most other programs and services produce.
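For the original problem, one possible approach is to normalize both sides only for the comparison and keep the name exactly as the server reported it, so it can be sent back unchanged. This is just a sketch: nfc and find_matching_name are hypothetical helpers, and the listing is a hard-coded stand-in for what nlst might return, so no server is contacted.

```python
import unicodedata

def nfc(text):
    """Normalize text to NFC so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", text)

def find_matching_name(wanted, names):
    """Return the first entry of names that equals wanted after NFC normalization.

    The entry is returned in its original form; None means nothing matched.
    """
    target = nfc(wanted)
    for name in names:
        if nfc(name) == target:
            return name
    return None

# Stand-in for a directory listing from a server that stores decomposed names.
listing = ["Readme.txt", "Xxxe\u0301"]

# The dialog text is assumed to arrive in composed form here.
match = find_matching_name("Xxx\u00e9", listing)
print(match is not None)
```

Returning the server's own spelling rather than the normalized one is a deliberate choice in this sketch: a server that distinguishes the two forms might not find the file under the composed name.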
-
Thanks for your clear explanation.
Coming from the IBM world, I had always used the EBCDIC code, where all machines "speak" the same language.
So I'm still afraid that I could use one encoding to send a file to my Mac or NAS, and that the file or folder name would then be unreadable on another system.
-
Thanks, I'll have a look
-
@cvp That's probably just spam...
-
You're right. Just checked and seems strange. Thanks