omz:forum

Rothrock42

So I've got to pull some text out of some html pages. The parts are marked like so:

'<!- TERM1_[ ->Html data I want in here<!- ]_TERM1 ->'

And there are 4 or 5 different terms in different areas of the page. I'm very new to all of this so I'm not quite seeing the best way to pull that out. Since the terms aren't valid html tags it looks like BS4 won't help? Use a regular expression and group the part in between?

Any guidance or suggestions would be very helpful. Thanks.

Rothrock42

Thanks JonB. I'm still very new to python and wasn't sure if there was some "obvious" approach that I was missing.

Rothrock42

I've been playing with the sample scene file and the touches. I wanted to give each touch a random custom color. I notice that each touch gets a TouchID and was hoping to add the color to that, but I can't seem to do it.

Do I need to make my own dictionary or something like that to track the touch IDs and associate the color?

Also any details on how the scene is tracking a touch blob from frame to frame? Just curious there...
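
For what it's worth, here is a rough sketch of the "own dictionary" idea. It is plain Python with no scene import, and `TouchColors`, `began`, `color_for` and `ended` are made-up names, not part of the scene module. The assumption is that you call them from the scene's touch handlers with whatever ID each touch carries.

```python
import random

class TouchColors(object):
    """Maps a touch ID to a random RGB color for as long as the touch lives."""

    def __init__(self):
        self.colors = {}

    def began(self, touch_id):
        # Pick a color once, when the touch first appears.
        color = (random.random(), random.random(), random.random())
        self.colors[touch_id] = color
        return color

    def color_for(self, touch_id):
        # Returns None if the touch is unknown (for example, already ended).
        return self.colors.get(touch_id)

    def ended(self, touch_id):
        # Forget the color so the dictionary does not keep growing.
        self.colors.pop(touch_id, None)

# Hypothetical usage with a made-up touch ID of 7
tracker = TouchColors()
first = tracker.began(7)
same = tracker.color_for(7)   # same tuple on every later frame
tracker.ended(7)
```

Keeping the colors in a separate dictionary means nothing has to be attached to the touch object itself.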

Rothrock42

@ihf This was typing at the Python prompt, right? Happened to me too on the iPhone. It works fine when typing into a py document.

Rothrock42

Thanks, everyone. Let's see if this works.

    <!--- OBJECTIVE_[ --->Very variable amount of html and stuff<!--- ]_OBJECTIVE --->

Ah the secret is to have four spaces before the backticks too.

Yes they are comments, but there are four different pairs of them (objective, description, duration, course_num) and I need to get the content between matching pairs. There are also other comments that I won't want. So I'm filing away bs4.Comment for later, but I don't think that will help me here.

I was able to figure out the regex, here it is if anyone cares. I'm not seeing the python markdown working, but I hope it formats correctly.

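    import re

    terms = ['OVERVIEW', 'OBJECTIVE', 'DURATION', 'COURSE_NUM']
    for t in terms:
        # \[ and \] match the literal brackets in the markers; DOTALL lets .+? span newlines
        pattern = re.compile(r'<!-+\s*' + t + r'_\[\s*-+>(.+?)<!-+\s*\]_' + t, re.DOTALL)
        # continue on and deal with the grouped data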
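
In case it helps anyone, here is a fuller sketch of how the grouped data might be collected. The sample HTML is made up, `extract_sections` is just a name I picked, and it assumes each marker pair appears at most once per page.

```python
import re

# Marker names from the post; the sample HTML below is invented for illustration.
TERMS = ['OVERVIEW', 'OBJECTIVE', 'DURATION', 'COURSE_NUM']

def extract_sections(html, terms=TERMS):
    """Return {term: text found between <!-- TERM_[ --> and <!-- ]_TERM -->}."""
    found = {}
    for t in terms:
        name = re.escape(t)
        pattern = re.compile(
            r'<!-+\s*' + name + r'_\[\s*-+>(.+?)<!-+\s*\]_' + name + r'\s*-+>',
            re.DOTALL)  # DOTALL so the captured text may span several lines
        match = pattern.search(html)
        if match:
            found[t] = match.group(1).strip()
    return found

sample = """
<!-- an unrelated comment -->
<!--- OBJECTIVE_[ ---><p>Learn
regex</p><!--- ]_OBJECTIVE --->
<!--- DURATION_[ --->2 hours<!--- ]_DURATION --->
"""
sections = extract_sections(sample)
print(sections)
```

Terms that are missing from the page are simply left out of the result, and unrelated comments are skipped because they never match a marker name.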

Thanks everyone for the help. Learning more and more each day.

Edit: Evidently it isn't an apostrophe but actual backticks.

Rothrock42

@CCC thanks. I'll check out partition() and see if it will do what I want before I move on to regex groups. (I'm totally baffled by regex. The more I use it the less I understand.)
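
A minimal sketch of what the partition() route could look like, assuming the start and end markers are known exactly (same number of dashes and spaces every time). `between` is a made-up helper name and the sample string is invented.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker not present
    middle, found_end, _ = rest.partition(end)
    return middle if found_end else None

html = 'x<!--- OBJECTIVE_[ --->Learn partition<!--- ]_OBJECTIVE --->y'
result = between(html, '<!--- OBJECTIVE_[ --->', '<!--- ]_OBJECTIVE --->')
print(result)
```

If the markers vary in spacing or dash count from page to page, this breaks down and a regex is probably the better fit.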

Also I can't figure out how to get the actual code I have to show. I thought three ticks would make it stay the way it was typed, but it displays all messed up.

Rothrock42

Spoke too soon. I changed line 456 in nltk.text.py to

    from nltk.draw.dispersion import dispersion_plot

And now the dispersion_plot works. Not sure what in the draw package requires Tkinter, but that wasn't it.

Rothrock42

Playing with it some more I've found that it also wants Tkinter. Depending on which packages you load you might get a warning:

    Documents/site-packages/nltk/draw/__init__.py:14: UserWarning: nltk.draw package not loaded (please install Tkinter library).
    warnings.warn("nltk.draw package not loaded "

So text#.dispersion_plot() won't work, but oddly FreqDist(text#).plot(#) does. Evidently plot uses Matplotlib, but dispersion_plot() uses Tkinter.

Rothrock42

In the console I ran

    import os
    print(os.path.expanduser('~/'))

to get the path to the top level of my Pythonista install. You can then add on whatever directories you have.
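
A small sketch of the "add on whatever directories" step. The `Documents/site-packages` part is borrowed from the warning path in my other post and is only an assumption about your layout.

```python
import os

home = os.path.expanduser('~')

# Assumed layout: adjust the directory names to whatever you actually have.
site_packages = os.path.join(home, 'Documents', 'site-packages')

print(site_packages)
print(os.path.isdir(site_packages))  # True only if the folder really exists
```

Using os.path.join avoids having to worry about doubled or missing slashes when gluing the pieces together.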

Rothrock42

I think I used Pythonista to open the data.py file and set it by hand. I think I added it as a path in the list on line 66. Don't know if that is the proper way to do it, but I figured I might have to get a bit hacky to make it work on an iPad.
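
A possibly less hacky alternative, as a sketch: `nltk.data.path` is an ordinary list, so a directory can be appended at runtime instead of editing data.py. The `nltk_data` location below is hypothetical, and the try/except is only there so the snippet still runs where NLTK is not installed.

```python
import os

# Hypothetical location; point this at wherever your nltk_data folder lives.
data_dir = os.path.join(os.path.expanduser('~'), 'Documents', 'nltk_data')

try:
    import nltk
except ImportError:
    nltk = None  # NLTK not installed here; nothing to configure

if nltk is not None and data_dir not in nltk.data.path:
    # Searched in addition to the default locations listed in data.py.
    nltk.data.path.append(data_dir)
```

This has to run before anything tries to load a corpus, so the top of the script is the natural place for it.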
