omz:forum


    Welcome!

    This is the community forum for my apps Pythonista and Editorial.

    For individual support questions, you can also send an email. If you have a very short question or just want to say hello — I'm @olemoritz on Twitter.


    Capture Specific Webpage text using regex/searchHTML and save as new text file

    Editorial
    • rsayeed
      rsayeed last edited by

      I wanted to know if there are any good Python scripts that demonstrate capturing specific webpage text using regex/search HTML and saving it as a new text file in Editorial, before I embark on making one myself.

      I couldn't find one using search.

      • ccc
        ccc last edited by

        I avoid regex like the plague that it is. You might check out the get_module_version_from_pypi() function in https://github.com/cclauss/pythonista-module-versions/blob/master/pythonista_module_versions.py

        It uses BeautifulSoup 4 and requests, which are both built into Pythonista (and, I hope, Editorial too).

        import bs4, requests
        
        def get_soup(url):
            return bs4.BeautifulSoup(requests.get(url).text, 'html5lib')
        
        • cook
          cook last edited by

          I agree with @ccc
          I used to use regex to do that sort of thing. You can get it to work, but it takes a lot of effort.
          So I set out to learn how Beautiful Soup works, and now that I've learned it a bit, it's way easier and much more effective for many things (not just search).

          Here's a little example from cheese.com. (I like cheese...not that I check this website, but cheese is good.)

          # coding: utf-8
          import requests
          from bs4 import BeautifulSoup
          
          url = 'http://www.cheese.com'
          
          soup = BeautifulSoup(requests.get(url).text, 'html.parser')
          
          print(soup.find('div', id='abstract'))  # find the div with id 'abstract' and print it
          
          • Webmaster4o
            Webmaster4o last edited by

            BeautifulSoup is great. @rsayeed: Read this for more information.

            • Phuket2
              Phuket2 last edited by

              Just as a side note: I think it's also worth checking whether a webpage/site has a JSON interface. Of course, many sites are protective of their data, but it seems like more and more of them are offering JSON. It can save a lot of work and fragile scripts.
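
              One quick way to check is simply to try parsing a response body as JSON before falling back to an HTML parser. A minimal sketch; the `parse_body` helper is my own name, not from any library:

```python
import json

def parse_body(body):
    # Hypothetical helper: try to treat a response body as JSON.
    # In practice you would also look at the response's Content-Type
    # header (e.g. 'application/json') before resorting to scraping.
    try:
        return json.loads(body), True    # parsed data, is_json flag
    except ValueError:
        return body, False               # hand it to an HTML parser instead

data, is_json = parse_body('{"drivers": ["Hamilton", "Rosberg"]}')
print(is_json)  # True
```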

              • cook
                cook last edited by

                @Phuket2 is right... much easier!
                And APIs too!

                • Phuket2
                  Phuket2 last edited by

                  Actually, I was just looking around and found this site, but there are many sites like it:
                  https://www.publicapis.com/

                  The code below gets info on Formula 1 drivers, but if you look at the API, you can get everything about F1. It's so nice; you could easily write an F1 app with this API. I chopped this up; in the full code I'm caching to disk, etc.

                  There's also another interesting line that's commented out: the URL of a GitHub repo with a Pantone color list. I just left it there as an example; I didn't know before how easy it was to do that. The sample code doesn't deal with that site, but it's really interesting that we could publish data in our own repos that we could all use.

                  Anyway, I hope I haven't convoluted the conversation, but the possibilities are exciting.

                  import requests
                  
                  def get_json_data(url):
                      r = requests.get(url)
                      print('r == ', r)
                      
                      if r.status_code == 200:
                          return r.json()
                  
                  if __name__ == '__main__':
                      #url = 'https://raw.githubusercontent.com/teelaunch/pms-pantone-color-chart/master/params.json'
                      url = 'http://ergast.com/api/f1/2016/drivers.json'
                      r = get_json_data(url)
                      drivers = r['MRData']['DriverTable']['Drivers']
                      print(drivers)
                  
                  • rsayeed
                    rsayeed last edited by

                    Wow, this is great. You've got a pretty active community here!

                    I'm basically trying to scrape some text off webpages, and no, there's no JSON there. I'll surely look at this Beautiful Soup thing; this is the first time I've heard of it.

                    So my plan is:

                    I'm probably going to use Pythonista's action extension.

                    1. Browse the page in Safari (on iOS).
                    2. Execute the Pythonista extension (assuming I make it).
                    3. Get the resulting text.
                    4. Save/append it to a file in Dropbox.
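
                    Step 4 of that plan can be sketched in a few lines. `append_clip` is a name I made up, and I'm assuming the Dropbox-synced folder behaves like an ordinary directory (as Pythonista's synced documents folder does):

```python
import os, tempfile

def append_clip(path, text):
    # Step 4 of the plan: save/append the scraped text to a file.
    # In the real extension, 'path' would point inside a
    # Dropbox-synced folder.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text.strip() + '\n\n')

# quick demo with a temporary file
demo = os.path.join(tempfile.mkdtemp(), 'clippings.txt')
append_clip(demo, '  First clip  ')
append_clip(demo, 'Second clip')
print(open(demo, encoding='utf-8').read())  # both clips, blank-line separated
```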
                    • cook
                      cook last edited by

                      @rsayeed
                      Do you want to scrape text off of any old website, or a particular one?

                      If it's for random sites, then even using Beautiful Soup will be a little tough. Well, anything would be tough for that matter :)

                      Usually site developers use some constants throughout their code, but going from site to site those aren't constant, and that's where you need flexibility in your implementation of Beautiful Soup.

                      But often headings are in heading tags, text content is in p tags, and so on. So a generic scraper is possible, but it may not get everything; more likely, you'll get more than what you want.

                      Example:

                      # coding: utf-8
                      import requests
                      from bs4 import BeautifulSoup
                      
                      url = 'http://www.cheese.com/'
                      soup = BeautifulSoup(requests.get(url).text, 'html.parser')
                      
                      # print the text of every direct child of <body>
                      for i in soup.find_all(lambda tag: tag.parent.name == 'body'):
                          print(i.text.strip())
                      
                      # gives a lot of junk...
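
                      To cut the junk down, one option is to keep only heading and p tags, as mentioned above. Since I'm not sure bs4 ships with Editorial, here is the same idea sketched with the stdlib html.parser instead; the `HeadingAndParagraphText` class and `page_text` function are my own names, not from any library:

```python
from html.parser import HTMLParser

class HeadingAndParagraphText(HTMLParser):
    """Collect only the text found inside h1-h3 and p tags."""
    KEEP = {'h1', 'h2', 'h3', 'p'}

    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside a tag we keep
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    parser = HeadingAndParagraphText()
    parser.feed(html)
    return '\n'.join(parser.chunks)

sample = '<body><h1>Cheese</h1><nav>menu junk</nav><p>All about cheese.</p></body>'
print(page_text(sample))  # the <nav> junk is dropped
```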
                      
                      Powered by NodeBB Forums | Contributors