Regarding NLTK on Pythonista
Hi, I'm new to both Pythonista and this forum, and fairly new to Python, having spent the last 15 years or so in the enterprise Java world. I am very interested in AI in general, and in computational linguistics/natural language analysis at a hobby, after-work level.
I have seen the discussions about NLTK in this forum, and I too am very interested in having support for it. Since it is a pure Python library, I thought I'd try it out in Pythonista. I put together a little test harness to exercise it. You can see it or grab it here - NLTK Test script
The reason I post this (my first post to this forum) here in New Discussion rather than in the Share Code section is that the general discussion about NLTK is here in this section of the forum.
What I found from my little test script of interest was:
1 - You can run the NLTK data set downloader in its non-graphical, command-line/interactive mode right from Pythonista, and that's how I downloaded my data.
2 - You don't need to download all of the hundreds of corpora/data sets, only the ones you are interested in, and most are only a few megabytes. The ENTIRE set when unarchived is about 1.5 GB. I have a 120 gig iPad 4, so this was not really an issue.
3 - You can put the data sets anywhere you like, provided you set the NLTK_DATA environment variable to the location of your nltk_data directory. That means even on a non-jailbroken device there should be somewhere you can put them. For my test, I placed the data inside the Pythonista app itself, since my device is jailbroken.
4 - I noticed that I only needed to run the script section that sets the NLTK_DATA environment variable once. On subsequent runs I could comment out that section of my test case and NLTK was still able to find the data. I even shut down the Pythonista process, started it again, and ran the script without explicitly setting the variable, and it still worked. This leads me to believe that NLTK persists the data path somewhere, so it seems feasible to keep a separate script in your library just for setting the data path whenever you change or add to nltk_data in a new location.
5 - I tried some other, more involved sample scripts that used the same (Brown) data set, and I found the load and execution times very reasonable. I did not see any of the 30s load times described elsewhere in the forum.
6 - Although support for numpy and scipy -- and a number of other packages -- clearly extends the utility of NLTK, and we can currently only run pure Python libs, NLTK by itself gives me the tools I need to build some serious applications in Pythonista rather than mere toy apps. NumPy and SciPy will be welcome additions, however.
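Points 3 and 4 above boil down to pointing NLTK at the data directory before importing it. Here's a minimal sketch; the path below is an assumption -- use whatever writable location actually holds your nltk_data directory on your device:

```python
import os

# Assumed location for this sketch; adjust to wherever you unpacked
# nltk_data on your device (e.g. inside the Pythonista sandbox).
nltk_data_dir = os.path.expanduser('~/Documents/nltk_data')

# NLTK consults the NLTK_DATA environment variable when building its
# data search path, so set it before importing nltk.
os.environ['NLTK_DATA'] = nltk_data_dir

try:
    import nltk
    # The same location can also be added to the runtime search path:
    if nltk_data_dir not in nltk.data.path:
        nltk.data.path.append(nltk_data_dir)
except ImportError:
    pass  # nltk isn't importable here; the variable is still set

print(os.environ['NLTK_DATA'])
```

After this runs once, subsequent scripts in the same session see the path, which matches the persistence behavior described in point 4.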
I think the iPad 4 wasn't out yet when I experimented with this, so performance on recent devices is probably much better than what I saw.
I guess it might be interesting to build some sort of NLTK installer script for Pythonista (that perhaps downloads common corpora etc. as well and configures the data path correctly)...
Aside: While NumPy will be part of the next update, it's very unlikely that I'll be able to get SciPy to work. It contains a lot of Fortran code, and I frankly have no idea how to cross-compile that for iOS...
NLTK + NumPy would be a great combination for a lot of general AI work beyond just natural language.
As for your comment about "some sort of NLTK installer script for Pythonista", I wonder to what degree the NLTK downloader itself is extensible. As I said, it has a non-graphical version and is, in fact, how I got the data, specifically the "brown" corpus. See my wrapper for the downloader at https://gist.github.com/swosnick/10702869
It's very simple to use, but I wonder if it is programmable itself. That way such an installer wouldn't have to start from scratch or duplicate available, open-sourced code. I am investigating that, and if I find anything I will report back.
Great work ltddev! And omz, a tested and documented NLTK installer script sounds like a great solution! I think many of us would like NLTK accessible in Pythonista, but understand that it doesn't make sense as a part of the standard install. Please add my vote to a solution like this.
So... Who is willing to volunteer to create the github repository (not a gist!) and merge in pull requests so this community can collaborate to build "a tested and documented NLTK installer script"?
Now I just need a NoSQL DB and I could play with data mining on the iPad. Has anyone tried to get CodernityDB running?
As I said, I think a good place to start for pulling the data sets or corpora is the nltk.downloader module. Once you have downloaded the NLTK module itself, pulled it into Pythonista, and can start to use it, the nltk.downloader module has a fairly rich API to search, list, and download individual corpora or logical groupings of corpora. For more information about what I'm getting at, see the API doc for the scriptable downloader here:
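As a rough sketch of that scriptable use, an installer script could wrap the `Downloader` class like this. This assumes NLTK is importable in Pythonista; `ensure_corpus` is a hypothetical helper name, and the download requires network access:

```python
def ensure_corpus(corpus_id, download_dir):
    """Download an NLTK corpus into download_dir if it isn't there yet.

    Returns a short status string. Sketch only: assumes the nltk
    package is importable and the device has network access.
    """
    try:
        from nltk.downloader import Downloader
    except ImportError:
        return 'nltk not installed'

    d = Downloader(download_dir=download_dir)
    if d.status(corpus_id) == d.INSTALLED:
        return '%s already installed' % corpus_id
    # download() fetches and unpacks the package non-interactively
    d.download(corpus_id, quiet=True)
    return '%s downloaded to %s' % (corpus_id, download_dir)

# Example (fetches the Brown corpus into a writable location):
# print(ensure_corpus('brown', 'nltk_data'))
```

The same `Downloader` object also exposes `list()` for browsing the available packages and collections, which covers the "search, list and download" use cases mentioned above.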
@Avisual, I have played around with CodernityDB, as it happens together with NLTK. You have piqued my interest to demonstrate NLTK + corpora + CodernityDB, all from Pythonista. I will report back :)
It appears straightforward to run CodernityDB on Pythonista because, like NLTK, it is pure Python, and in the case of CodernityDB there are absolutely no third-party dependencies. See my test code here, based on one of their examples meant to highlight easy insert/save/store support. It stores 15 objects in a database: https://gist.github.com/swosnick/11065623
Before I attempt to get NLTK working, has anyone already done what @omz mentioned above ("I guess it might be interesting to build some sort of NLTK installer script for Pythonista (that perhaps downloads common corpora etc. as well and configures the data path correctly)...")?
On a non-jailbroken device, has anyone figured out how to correctly set the data paths so you can download and run the Brown corpus?