omz:forum


    Welcome!

    This is the community forum for my apps Pythonista and Editorial.

    For individual support questions, you can also send an email. If you have a very short question or just want to say hello — I'm @olemoritz on Twitter.


    BeautifulSoup or Regex

    Pythonista
    • Rothrock42

      So I've got to pull some text out of some html pages. The parts are marked like so:

      '<!- TERM1_[ ->Html data I want in here<!- ]_TERM1 ->'

      And there are 4 or 5 different terms in different areas of the page. I'm very new to all of this so I'm not quite seeing the best way to pull that out. Since the terms aren't valid html tags it looks like BS4 won't help? Use a regular expression and group the part in between?

      Any guidance or suggestions would be very helpful. Thanks.

      • ccc

        Off the top of my head (without the benefit of looking at the actual HTML) I would recommend that you use BS4 to grab the strings that you believe will contain your content and then use regex or partition() as your scalpel to grab the exact target text. I am not a regex guru so I always try to use partition() first and then fall back to regex if needed.

        BS4 is great for rapidly finding the needle in the HTML haystack. Regex shines at parsing strings.
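
As a rough illustration of the partition() approach mentioned above, here is a minimal sketch. The helper name `between` and the sample string are made up for this example, and the markers are assumed to be the intended `<!-- ... -->` comments; the real page may differ.

```python
def between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker.

    Returns None if either marker is missing.
    """
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    content, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return content


# Hypothetical sample; the real page's markers may differ.
sample = "<p>intro</p><!-- TERM1_[ -->Html data I want in here<!-- ]_TERM1 --><p>rest</p>"
result = between(sample, "<!-- TERM1_[ -->", "<!-- ]_TERM1 -->")
print(result)
```

This only works when the markers are written exactly the same way every time. If the number of dashes or the spacing varies, a regex is probably the better tool.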

        • Rothrock42

          @CCC thanks. I'll check out partition() and see if it will do what I want before I move on to regex groups. (I'm totally baffled by regex. The more I use it the less I understand.)

          Also I can't figure out how to get the actual code I have to show. I thought three ticks would make it stay the way it was typed, but it displays all messed up.

          • JonB

            Try copying the following:

            ```
            your code here
            ```
            

            The backticks should each be on their own line, I think. Alternatively,

            Four spaces at the start of a line start code mode.
            

            If you post the html you are trying to parse, someone can provide the regexp.

            • ccc

              To get syntax highlighting of python code, start with a blank line followed by three backticks immediately followed by the word 'python' then your code terminated by three backticks. Like this:

              ```python
              def add_one(n):
                  return n + 1
              ```

              If your content is long then put it in GitHub and provide the URL to it in your post here.

              • JonB

                By the way, the following

                 <!--This is a comment. -->
                

                is an HTML comment. You can get all HTML comments using:

                comments = soup.findAll(text=lambda text:isinstance(text, bs4.Comment))
                

                (Or, if you know the comment is within a specific tag, you can traverse to that tag rather than using soup.)
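
For anyone without bs4 at hand, a similar comment listing can be sketched with the standard library's `html.parser` instead. This is a swapped-in technique, not the bs4 call above, and the class name `CommentCollector` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # `data` is the comment text without the <!-- and --> delimiters.
        self.comments.append(data)


# Hypothetical sample markup.
sample = "<div><!--This is a comment. --><p>text</p><!-- another --></div>"
collector = CommentCollector()
collector.feed(sample)
collector.close()
print(collector.comments)
```

Like the bs4 version, this returns every comment on the page, so the unwanted ones still have to be filtered out afterwards.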

                • Rothrock42

                  Thanks, everyone. Let's see if this works.

                      <!--- OBJECTIVE_[ --->Very variable amount of html and stuff<!--- ]_OBJECTIVE --->
                  

                  Ah the secret is to have four spaces before the backticks too.

                  Yes they are comments, but there are four different pairs of them (objective, description, duration, course_num) and I need to get the content between matching pairs. There are also other comments that I won't want. So I'm filing away bs4.Comment for later, but I don't think that will help me here.

                  I was able to figure out the regex, here it is if anyone cares. I'm not seeing the python markdown working, but I hope it formats correctly.

                      import re

                      terms = ['OVERVIEW', 'OBJECTIVE', 'DURATION', 'COURSE_NUM']
                      for t in terms:
                          # \[ and \] must be escaped, otherwise [...] is read as a character class
                          pattern = re.compile(r'<!-+\s*' + t + r'_\[\s*-+>(.+?)<!-+\s*\]_' + t, re.DOTALL)
                          # continue on and deal with the grouped data
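
To show what "deal with the grouped data" might look like, here is a small sketch that runs this kind of pattern over a made-up page and collects each term's content into a dict. The sample `page` and the `found` dict are assumptions for illustration, not the original poster's code, and the literal brackets are escaped as `\[` and `\]`.

```python
import re

# Hypothetical sample page; the real markup may differ.
page = """
<!--- OVERVIEW_[ --->An overview<!--- ]_OVERVIEW --->
<!-- unrelated comment -->
<!--- OBJECTIVE_[ --->Line one
<b>line two</b><!--- ]_OBJECTIVE --->
"""

terms = ['OVERVIEW', 'OBJECTIVE', 'DURATION', 'COURSE_NUM']
found = {}
for t in terms:
    name = re.escape(t)
    # \[ and \] match the literal brackets; (.+?) lazily captures what lies between.
    pattern = re.compile(
        r'<!-+\s*' + name + r'_\[\s*-+>(.+?)<!-+\s*\]_' + name,
        re.DOTALL,
    )
    match = pattern.search(page)
    if match:
        found[t] = match.group(1)

print(found)
```

Terms that do not appear on the page are simply left out of the dict, and `re.DOTALL` lets the captured content span several lines.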
                  

                  Thanks everyone for the help. Learning more and more each day.

                  Edit: Evidently it isn't apostrophes but actual backticks.

                  • ccc

                    The secret is actually a blank line before the backticks.
