Welcome!
This is the community forum for my apps Pythonista and Editorial.
Dropbox file names containing accented characters
-
I have Dropbox file names containing accented characters, like é or è.
When I ask Dropbox to generate a URL, it copies a link as ...../%C3%A9, which is normally é in UTF-8.
If I want to get the file name in my script, using
url = urllib.unquote(url.decode('utf-8'))
gives é.
To get "my" é, I need to encode('latin1').
Is that normal?
We say in French 'perdre son latin', for 'lost in translation' 🤕 -
I think what you need to do is
urllib.unquote(url).decode("utf-8")
The %C3%A9 is é encoded in UTF-8, so you first need to convert the escaped UTF-8 bytes to normal bytes, then decode that to a Unicode string. Under Python 3, urllib.parse.unquote does the decoding automatically, so there you can just write urllib.parse.unquote(url) and you get a proper Unicode é.
-
I think this would be the correct way to decode the URL:
url = unicode(urllib.unquote(url), 'utf-8')
or alternatively (but more confusing):
url = urllib.unquote(url).decode('utf-8')
Edit: Looks like @dgelessus was faster than me...
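For reference, the two steps suggested above (turn the percent-escapes back into raw bytes, then decode those bytes as UTF-8) can be written out explicitly. This is only a sketch in Python 3 syntax, not the Python 2 code from the thread. The helper name decode_percent_utf8 is made up for illustration, and it relies on urllib.parse.unquote_to_bytes from the standard library.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_utf8(escaped):
    """Hypothetical helper: undo percent-escaping, then decode as UTF-8."""
    # Step 1: '%C3%A9' becomes the two raw bytes b'\xc3\xa9'.
    raw = unquote_to_bytes(escaped)
    # Step 2: those two bytes together are the UTF-8 encoding of 'é'.
    return raw.decode("utf-8")


name = decode_percent_utf8("tr%C3%A8s%20priv%C3%A9e")
print(name)
```

Keeping the steps separate makes it easier to see which of them went wrong when the output looks garbled.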
-
Thanks champions! My code had a misplaced right parenthesis which thus gave a bad answer.
One more time, shame on me. -
Sorry, but I still have problems with that.
Try this short code:
if the URL is passed "by appex", it is NOT OK;
if the URL is set as text for testing, it's OK.
# coding: utf-8
import urllib
import appex
#url = 'https://www.dropbox.com/s/5mmxh7h7vu2lwnp/La%20vie%20tr%C3%A8s%20priv%C3%A9e%20de%20Monsieur%20Sim.png?dl=0'
url = appex.get_url()
print url
print urllib.unquote(url).decode('utf-8')
-
appex.get_url() returns a unicode string, so you need an extra encode there...
import urllib
# This is a unicode string literal (note the 'u' before the quotes), to simulate the behavior of appex.get_url():
url = u'https://www.dropbox.com/s/5mmxh7h7vu2lwnp/La%20vie%20tr%C3%A8s%20priv%C3%A9e%20de%20Monsieur%20Sim.png'
print urllib.unquote(url.encode('utf-8')).decode('utf-8')
And no, you're not the only one who finds this very confusing. ;)
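To show why the extra encode matters, here is a small Python 3 sketch of what seems to happen when the escaped bytes are turned into characters one by one instead of being decoded together as UTF-8. Passing encoding="latin-1" to urllib.parse.unquote is only my way of simulating that effect, and it assumes Python 2's urllib.unquote treats unicode input this way. The last step is the same latin1 trick mentioned in the first post.

```python
from urllib.parse import unquote

escaped = "La%20vie%20tr%C3%A8s%20priv%C3%A9e%20de%20Monsieur%20Sim.png"

# Simulated mistake: every escaped byte becomes its own character,
# so the two bytes of 'è' show up as two separate characters.
garbled = unquote(escaped, encoding="latin-1")

# Repair: go back to the original bytes, then decode them as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

The repair only works because Latin-1 maps every byte to exactly one character, so no information is lost on the way back.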
-
My god! (Not you, but almost)
-
The good news is, this kind of stuff is generally a bit easier in Python 3 because pretty much every string is unicode there, and urllib.parse.unquote (the Python 3 equivalent of urllib.unquote) can handle unicode, so it would be just urllib.parse.unquote(url) in Python 3, regardless of whether url was defined as a normal string literal or returned by appex.get_url.
-
Well this is confusing. Though here the issue looks like it's with urllib.unquote - it seems to be designed for str strings and gets confused with unicode strings. In Python 3 it's a lot better (as always) - there the string is decoded as UTF-8 by default, and you can set a different encoding if necessary.
-
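A short Python 3 sketch of the behaviour described in the last two posts: urllib.parse.unquote decodes as UTF-8 by default and accepts an encoding argument. The sample strings and variable names are made up for illustration.

```python
from urllib.parse import unquote

# Default: the escaped bytes are decoded as UTF-8.
utf8_name = unquote("priv%C3%A9e")

# A different encoding can be requested if the URL was escaped from Latin-1 bytes.
latin1_name = unquote("priv%E9e", encoding="latin-1")

# Both spell the same word once decoded with the matching encoding.
print(utf8_name, latin1_name, utf8_name == latin1_name)
```

Since the input is already a text string in Python 3, the same call should work whether the URL comes from a literal or from a function that returns text.
-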
I'm really still a beginner in Python and, of course, I'll buy the next version, but I hope you'll give some explanation of how to convert my scripts for that version when it becomes available.
-
@cvp There is the 2to3 tool which can do most of the dumb work for you (e.g. putting parentheses around your print calls). I'm not sure how well it corrects the bytes/str/unicode mess that you need in Python 2. Probably not very much, as it's hard to guess whether an encode or decode is actually necessary or just a compatibility hack.
-
OK, I'll try to remember that when I use Python 3, thanks.