Time to release a new version to App Store
-
@Matteo Try cchardet (https://github.com/PyYoshi/cChardet/blob/master/README.rst). Its benchmark reports 0.37 call/s for chardet vs 1467.77 call/s for cchardet, a ratio of ≈ 1/3600.
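A minimal sketch of treating cchardet as an optional speed-up, assuming both libraries expose a module-level detect() that takes bytes. The helper name load_detector is hypothetical, and the fallback matters wherever the C extension cannot be installed:

```python
import importlib


def load_detector(candidates=("cchardet", "chardet")):
    """Return detect() from the first importable candidate module, or None."""
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed here; try the next candidate
        detect = getattr(module, "detect", None)
        if callable(detect):
            return detect
    return None


detect = load_detector()
if detect is not None:
    # Both libraries return a dict with 'encoding' and 'confidence' keys.
    print(detect(b"caf\xe9 au lait, s'il vous pla\xeet"))
else:
    print("neither cchardet nor chardet is installed")
```

The rest of the script only ever calls detect(), so it does not need to know which library was picked.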
-
@ccc How about Pandas?
-
I am afraid that I do not understand. Pyto has Pandas but not lxml. What is the use case?
-
For chardet, if the use case is in requests, one can set r.encoding manually (for instance, for repeated messages where you know the encoding won't change) and just bypass chardet altogether. For lxml, as has been discussed, Apple won't allow it unless the dev completely refactors lxml and its dependencies to use non-private API names.
Pandas has been discussed a lot, and you ought to know not to count on it inside Pythonista anytime soon. (I wonder: does Pyto support URL schemes/shortcuts yet? Pythonista as the IDE plus Pyto as an "engine" might support the workflow of people who prefer the Pythonista user interface but want pandas, or who want to compile their own app with modules of their choosing.)
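A rough sketch of the r.encoding approach, assuming the caller already knows the encoding. The helper name fetch_text and the URL in the usage comment are made up for illustration; session can be a requests.Session or anything with a compatible get():

```python
def fetch_text(session, url, encoding="utf-8"):
    """GET url and decode the body with a known encoding, skipping detection."""
    r = session.get(url)
    r.raise_for_status()
    # requests only falls back to r.apparent_encoding (the chardet-style
    # detection step) when r.encoding is None, so setting it before the
    # first access to r.text avoids that cost entirely.
    r.encoding = encoding
    return r.text


# Usage (needs the requests package and network access; URL is a placeholder):
#
#     import requests
#     with requests.Session() as s:
#         body = fetch_text(s, "https://example.com/feed.xml", encoding="utf-8")
```

This only helps when the encoding really is stable; a wrong guess will silently produce mojibake instead of raising an error.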
-
Let’s see if https://github.com/lxml/lxml/pull/281 sparks any solutions...
-
@ccc Pandas is the most popular tool to analyse data. https://pandas.pydata.org/
-
@JonB Sorry, but I'm using aiohttp. And most of us don't set r.encoding manually unless we are sure that chardet can't detect the encoding correctly.
-
I know what Pandas is but my question was about lxml, not Pandas.
-
No one is questioning that lxml is superior. But Apple's App Store forbids it. Email Apple with your complaint.
-
I gotta say @lpl if you do not have a use case for a capability then please stop complaining about its absence. Nice to have does not help us. We are in search of needs that justify the investment of time, effort, and the addition of risk.
-
@lpl Hi, interesting, you are right: the author (PyYoshi) used a Japanese-language txt file (about 330 kbytes) for the benchmark. I tried other kinds of files of the same size, and the ratio was sometimes about 1:500 (obviously not always 1:3600). I think the ratio can be smaller than 1:3600 (the average estimated by the author) for some kinds of files (maybe ones that are critical/bad for chardet.detect?).
I also tried his bench script with some modifications:
- variable do_times set to 10 instead of 100 (to speed up the calculation)
- same txt file as input for c(c)hardet.detect, but I ran the benchmark on 5 files of size x1, x2, x4, x8 and x32. I created them by copying the original 330 kbyte file and repeating its content 2 times, 4 times, and so on, to get a small set of inputs whose sizes follow a geometric progression, as an example.
I noticed that, running the test on my computer (with whatever RAM, CPU, etc. were available):
- with chardet: the time cost follows a nearly linear law, with an estimated time complexity of about O(k*n), where k is approximately 5e-4
- with cchardet: same as chardet, but with k approximately 9e-8
This surprised me quite a bit. I didn't think there was such a big difference between pure Python and C for chardet.detect: C is a few thousand times faster than pure Python. (My test with the author's bench script is, however, very simplified and not good enough to estimate the time complexity precisely or independently of the computer in use.)
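The scaling test described above can be sketched roughly as follows. This is not the author's bench script: the function names are mine, the detector is a stand-in so the code runs without chardet or cchardet, and k here is in seconds per byte, which may not match the units of the figures quoted above.

```python
import time


def time_detect(detect, data, repeats=10):
    """Average wall-clock seconds per call of detect(data)."""
    start = time.perf_counter()
    for _ in range(repeats):
        detect(data)
    return (time.perf_counter() - start) / repeats


def fit_linear_through_origin(sizes, times):
    """Least-squares k for the model t = k * n (no intercept term)."""
    return sum(n * t for n, t in zip(sizes, times)) / sum(n * n for n in sizes)


def benchmark(detect, base, multipliers=(1, 2, 4, 8, 32), repeats=10):
    """Time detect() on base repeated m times; return (sizes, times, k)."""
    sizes, times = [], []
    for m in multipliers:
        data = base * m
        sizes.append(len(data))
        times.append(time_detect(detect, data, repeats))
    return sizes, times, fit_linear_through_origin(sizes, times)


def stand_in_detect(data):
    """Hypothetical O(n) detector; swap in chardet.detect or cchardet.detect."""
    return {"encoding": "ascii" if all(b < 128 for b in data) else None}


base = b"sample text " * 2000  # about 24 kbytes; the original file was ~330 kbytes
sizes, times, k = benchmark(stand_in_detect, base, repeats=3)
for n, t in zip(sizes, times):
    print(f"n={n:>8} bytes  t={t:.6f} s")
print(f"k ~ {k:.3e} s/byte")
```

If the cost really is linear, the per-size ratio t/n should stay close to the fitted k for every file; a drift with size would suggest the O(k*n) model is too simple.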
But I still believe that in real life a Pythonista user can easily write scripts and test them anywhere (just with the iDevice in hand), and then export the script to a computer for large input files. Pythonista helps a lot in preparing the calculation script (I mean, you can verify it on small input data using Pythonista and pure-Python libraries installed via pip), and then you can use all the power of a computer when working with large input data.
In conclusion, for me the real limit of Pythonista with pure-Python libraries is not performance (in most cases the user should use Pythonista as a platform for writing good code, without worrying at first about running the script on large inputs), but the impossibility of installing non-pure-Python libraries that have no pure-Python alternative available.
Bye
-
@lpl Ok, fair enough. Let's close the topic until someone completely refactors lxml, then you can ask again, k?
-
@JonB lxml is just an example I used to prove my words.