BeautifulSoup or Regex

Rothrock42

So I've got to pull some text out of some html pages. The parts are marked like so:

'<!- TERM1_[ ->Html data I want in here<!- ]_TERM1 ->'

And there are 4 or 5 different terms in different areas of the page. I'm very new to all of this so I'm not quite seeing the best way to pull that out. Since the terms aren't valid html tags it looks like BS4 won't help? Use a regular expression and group the part in between?

Any guidance or suggestions would be very helpful. Thanks.

ccc

Off the top of my head (without the benefit of looking at the actual HTML) I would recommend that you use BS4 to grab the strings that you believe will contain your content and then use regex or partition() as your scalpel to grab the exact target text. I am not a regex guru so I always try to use partition() first and then fall back to regex if needed.

BS4 is great for rapidly finding the needle in the HTML haystack. Regex shines at parsing strings.
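
As a rough sketch of the partition() idea (not tested against the real pages), something like the following might work. The helper name `between`, the sample string, and the exact marker spelling are assumptions for illustration; the real markers should be copied from the page source.

```python
def between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker.

    Returns None if either marker is missing.
    """
    # partition() splits on the first occurrence and returns (before, sep, after);
    # sep is an empty string when the marker is not found.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    content, found_end, _ = rest.partition(end)
    return content if found_end else None


# Hypothetical sample; the marker spelling here is an assumption.
sample = "<p>intro</p><!-- TERM1_[ -->Html data I want in here<!-- ]_TERM1 --><p>outro</p>"
result = between(sample, "<!-- TERM1_[ -->", "<!-- ]_TERM1 -->")
print(result)
```

Note that partition() needs the exact marker text, so it only fits if the dashes and spacing in the markers never vary. If they do, a regex is probably the better tool.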

Rothrock42

@CCC thanks. I'll check out partition() and see if it will do what I want before I move on to regex groups. (I'm totally baffled by regex. The more I use it the less I understand.)

Also I can't figure out how to get the actual code I have to show. I thought three ticks would make it stay the way it was typed, but it displays all messed up.

JonB

Try copying the following:

``` your code here ```

The backticks should each be on their own line, I think. Alternatively, four spaces at the start of a line starts code mode.

If you post the html you are trying to parse, someone can provide the regexp.

ccc

To get syntax highlighting of python code, start with a blank line followed by three backticks immediately followed by the word 'python' then your code terminated by three backticks. Like this:

```python
def add_one(n):
    return n + 1
```

If your content is long then put it in GitHub and provide the URL to it in your post here.

JonB

By the way, the following

 <!--This is a comment. -->

Is an html comment. You can get all html comments using:

comments = soup.find_all(string=lambda text: isinstance(text, bs4.Comment))

(Or, if you know the comment is within a specific tag, you can traverse to that tag rather than using soup)

Rothrock42

Thanks, everyone. Let's see if this works.

    <!--- OBJECTIVE_[ --->Very variable amount of html and stuff<!--- ]_OBJECTIVE --->

Ah the secret is to have four spaces before the backticks too.

Yes they are comments, but there are four different pairs of them (objective, description, duration, course_num) and I need to get the content between matching pairs. There are also other comments that I won't want. So I'm filing away bs4.Comment for later, but I don't think that will help me here.

I was able to figure out the regex, here it is if anyone cares. I'm not seeing the python markdown working, but I hope it formats correctly.

    import re

    terms = ['OVERVIEW', 'OBJECTIVE', 'DURATION', 'COURSE_NUM']
    for t in terms:
        # \[ and \] are escaped so they match the literal brackets in the markers
        pattern = re.compile(r'<!-+\s*' + t + r'_\[\s*-+>(.+?)<!-+\s*\]_' + t, re.DOTALL)
        # continue on and deal with the grouped data
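
For anyone following along, here is a minimal sketch of how the "deal with the grouped data" step might look. The function name `extract_sections` and the sample string are made up for illustration, and the pattern assumes the literal brackets in the markers are escaped and that each term appears at most once per page.

```python
import re

TERMS = ['OVERVIEW', 'OBJECTIVE', 'DURATION', 'COURSE_NUM']


def extract_sections(html, terms=TERMS):
    """Map each term to the text found between its opening and closing marker comments."""
    sections = {}
    for t in terms:
        name = re.escape(t)
        # <!-+ tolerates two or three dashes; \[ and \] match the literal brackets.
        pattern = re.compile(
            r'<!-+\s*' + name + r'_\[\s*-+>(.+?)<!-+\s*\]_' + name + r'\s*-+>',
            re.DOTALL,  # let . match newlines, since the content spans lines
        )
        match = pattern.search(html)
        if match:
            sections[t] = match.group(1).strip()
    return sections


# Hypothetical sample page fragment, not taken from the real site.
sample = (
    "<!--- OBJECTIVE_[ --->Learn <b>regex</b>\nbasics<!--- ]_OBJECTIVE --->"
    "<!-- some other comment -->"
    "<!--- DURATION_[ --->2 hours<!--- ]_DURATION --->"
)
sections = extract_sections(sample)
print(sections)
```

Terms that are missing from the page are simply left out of the dictionary, and unrelated comments are ignored because they never match a marker name.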

Thanks everyone for the help. Learning more and more each day.

Edit: Evidently it isn't apostrophes but actual backticks.

ccc

The secret is actually a blank line before the backticks.