omz:forum




    Screen Scraping

    Pythonista
    • reefboy1

      What is screen scraping? From what I know, it's like getting info from some database (I think). Also, how can I screen scrape? Thanks for your answers.

      • ccc

        Screen Scraping is the art and science of:

        1) getting all the text from a computer display (terminal, webpage, etc.) and then
        2) selecting out only those data fields of interest for storage or further processing.
        

        It used to be about getting data from terminal displays, but these days it is mostly about scraping data off of web pages. The Pythonista tools that I prefer for web scraping are requests (for getting all the HTML of a webpage) and Beautiful Soup 4, imported as bs4 (for selecting out only those data fields of interest). bs4 is complicated, but it is supercool once you get the hang of it.

        Here are two recent examples of web scraping. They follow the model:

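        import bs4
        import requests
        
        def get_beautiful_soup(url):
            """Download the page at url and parse its HTML into a soup."""
            response = requests.get(url)
            response.raise_for_status()  # fail loudly on HTTP errors
            # Name the parser explicitly so bs4 does not have to guess one
            return bs4.BeautifulSoup(response.text, 'html.parser')
        
        soup = get_beautiful_soup('http://omz-forums.appspot.com/pythonista')
        print(soup.prettify())
        # See: http://www.crummy.com/software/BeautifulSoup/bs4/doc for all the things you can do with the soup.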
        

        As you can see by looking at the output, the harder part is selecting out only those data fields of interest. ;-)
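
        To make that selection step concrete, here is a minimal sketch. It assumes a made-up page (SAMPLE_HTML, the "topics" class, and the extract_topic_links helper are all hypothetical and not taken from the real forum) and pulls only the link titles and URLs out of one list, ignoring everything else on the page:

```python
import bs4

# Hypothetical HTML standing in for a downloaded page (not the real forum markup)
SAMPLE_HTML = """
<html><body>
  <ul class="topics">
    <li><a href="/topic/1">Screen Scraping</a></li>
    <li><a href="/topic/2">Drawing with the ui module</a></li>
  </ul>
  <a href="/about">About</a>
</body></html>
"""

def extract_topic_links(html):
    """Return (title, href) pairs for the links inside the 'topics' list."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    topics = soup.find('ul', class_='topics')  # narrow down to the list we care about
    if topics is None:
        return []
    return [(a.get_text(strip=True), a['href']) for a in topics.find_all('a')]

topic_links = extract_topic_links(SAMPLE_HTML)
for title, href in topic_links:
    print(title, href)
```

        On a real page, the tag names and classes to search for will differ, so the usual first step is to read the prettify() output and look for the markup that wraps the data you want.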

        If bs4 is too complicated for your purposes, you can do html = requests.get(url).text and then try using str.find(), str.partition(), or Python's regular expressions module, re, as a poor man's soup. Happy scraping.
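
        A rough sketch of that poor man's approach, using only the standard library. SAMPLE_HTML and the text_between helper are hypothetical stand-ins for a downloaded page and your own code, and this kind of string matching only holds up on simple, predictable markup:

```python
import re

# Hypothetical HTML standing in for requests.get(url).text
SAMPLE_HTML = (
    '<html><head><title>Screen Scraping</title></head>'
    '<body><a href="/topic/1">First</a> <a href="/topic/2">Second</a></body></html>'
)

def text_between(html, start, end):
    """Return the text between the first start marker and the next end marker."""
    _, found_start, rest = html.partition(start)
    if not found_start:
        return None  # start marker is not in the page
    value, found_end, _ = rest.partition(end)
    return value if found_end else None

# str.partition() for a single field
page_title = text_between(SAMPLE_HTML, '<title>', '</title>')

# re.findall() for a repeated field: every double-quoted href value
hrefs = re.findall(r'href="([^"]*)"', SAMPLE_HTML)

print(page_title)
print(hrefs)
```

        If the markup gets more irregular than this (nested tags, single-quoted attributes, inconsistent spacing), that is usually the point where bs4 starts to pay for itself.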

        • reefboy1

          Cool! Thanks for the response
