Welcome!
This is the community forum for my apps Pythonista and Editorial.
For individual support questions, you can also send an email. If you have a very short question or just want to say hello — I'm @olemoritz on Twitter.
Scraping Test
-
I am new to scraping and I made this:
```python
import bs4, requests

def get_beautiful_soup(url):
    return bs4.BeautifulSoup(requests.get(url).text)

soup = get_beautiful_soup('http://www.python.org')
print(soup.prettify())
```
Anything else that I can do with this?
-
Print all http and https links on the page sorted alphabetically with no duplicates...
```python
links = []
for anchor in soup.find_all('a'):  # tags like: <a href='http...'>
    try:
        link = anchor['href']
        if link.startswith('http'):
            links.append(link)
    except KeyError:  # <a> tags that have no href attribute
        pass
print('\n'.join(sorted(set(links))))
```
The bs4 documentation examples are quite fun to read and run in Pythonista.
-
Another example... We just want the text of the webpage without all the HTML tags, etc.
```python
print('=' * 40)
# print the text of the body of the webpage without all the html junk
print(soup.body.get_text())  # get the body of the soup, and then get only the text of that
print('=' * 40)
# contains lots of blank lines... let's get rid of the blank lines
for line in soup.body.get_text().splitlines():
    if line.strip():
        print(line)
print('=' * 40)
# another way to write the three previous lines uses a list comprehension and str.join()
print('\n'.join([x for x in soup.body.get_text().splitlines() if x.strip()]))
print('=' * 40)
# contains lots of lines that have indentation... left justify all lines
for line in soup.body.get_text().splitlines():
    if line.strip():
        print(line.lstrip())
print('=' * 40)
# rewritten using a list comprehension and str.join()
print('\n'.join([x.lstrip() for x in soup.body.get_text().splitlines() if x.strip()]))
```
-
Does someone else want to do the next one?
Print out the URLs for all the images that appear on the page.
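A minimal sketch of one way to do it (not necessarily the solution that was posted in the thread), shown here on a tiny hand-made snippet rather than the live python.org soup:

```python
import bs4

# Tiny stand-in page; on the real page you would reuse the soup from above
html = '<img src="logo.png"> <img alt="no src here"> <img src="banner.gif">'
soup = bs4.BeautifulSoup(html, 'html.parser')
for img in soup.find_all('img'):  # image URLs live in <img src='...'>
    src = img.get('src')          # .get() avoids a KeyError when src is missing
    if src:
        print(src)
```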
-
-
Really nice, @brumm...
It works even better if you set the URL to http://amazon.com or http://imdb.com. The python.org page handles images differently, but I like your solution best.
-
I have made something new! I will post it soon
-
-
```python
import bs4, requests
import webbrowser
import console

def get_beautiful_soup(url):
    return bs4.BeautifulSoup(requests.get(url).text)

a = raw_input('url to check. url structure (http://www.url.com or net or gov or org) ')
console.clear()
soup = get_beautiful_soup(a)
webbrowser.open('http://google.com/gmail')
print(soup.prettify())
```
-
@brumm
```python
filename = posix.basename(urlparse.urlsplit(url)[2])
```
Might be a slightly more robust way to get the filename. Technically the URL could contain a query, fragment, etc., which gets stripped out by urlsplit.
-
No, bad! Don't use the `posix` module directly. It's an undocumented internal module that provides implementations for some of the `os` module's functions on Unix-based systems. In almost all cases you'll want to use `os`, which provides all the functions that `posix` does, but is guaranteed to be available on all platforms.
-
What else can I do with bs4? I only know how to get the HTML.
-
So far we have demonstrated how to:
- Get the body text only, with no markup
- Get the URLs that are linked to
- Download all the images
What else do web pages contain? Sounds, Movies, Forms, Lists, Files, others?
What else are you interested to get from web pages? Music lists? Tour schedules for your favorite band? Local weather forecast? Snow depth at local ski slopes? Wave heights at various beaches? What info does reefboy1 want to scrape off of webpages?
Have you gone through the bs4 documentation examples yet?
-
Yes, I have read them, but I'm not sure I fully understand them. A weather forecast would be nice.
-
PS: I don't know HTML.
-
The current examples have been great for picking out generic things on web pages, but the typical use case needs to understand the page format and then get something specific off the page, like a table of data. You typically need to use the pretty-print facility to view the page contents and then issue a series of tedious calls to march down through the structure and pull out the text. The existing bs4 docs and examples are fine for that.
You can get really far with a very specific problem and easily code up something that grabs what you need off a page using these hand-coded methods, but the code is probably going to be throwaway and will break as soon as the target URL changes the format of its pages.
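As a sketch of that kind of hand-coded drilling down, here is what pulling a simple table apart might look like. The HTML fragment below is made up for illustration; a real page would first need inspecting with soup.prettify() to find the right tags:

```python
import bs4

# Made-up page fragment standing in for a real URL
html = """
<table>
  <tr><th>City</th><th>Temp</th></tr>
  <tr><td>Berlin</td><td>18</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')
for row in soup.find('table').find_all('tr'):
    # grab the text of each cell, header or data, and tab-separate it
    cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
    print('\t'.join(cells))
```

This is exactly the fragile, page-specific code described above: rename one tag on the server side and it stops working.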
However... this is NOT what a true "ScreenScraper" app does. True ScreenScrapers build on bs4 and provide templates for various kinds of web pages, automatically stripping out all the crap like inline ads, sidebars, and other "visual cruft" while returning only the "content". The best ones can identify the main images that the page is referring to and the body of text that is the main subject of the page. There is a project/product called "readability" that did this and was ported to Python, but they stopped updating it when it became a commercial product. You can still find the early Python code, which uses a version of bs, at: https://github.com/gfxmonk/python-readability
The big players in this area are obviously the search engine companies like Google, Bing, and Facebook, which have extremely sophisticated methods for dissecting web pages and getting at the real "content". The rest of the universe seems to have moved on to using a web API instead that just hands you the required data using a special URL syntax.
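To illustrate the web-API point: instead of parsing HTML, the service hands you structured data. The response body below is made up for a hypothetical weather API; with requests you would get the same dict from resp.json():

```python
import json

# Made-up response body from a hypothetical weather API
body = '{"city": "Berlin", "forecast": "sunny", "high_c": 21}'
data = json.loads(body)  # no HTML parsing, just structured fields
print(data['forecast'], data['high_c'])
```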
-
@JonB: Great, thank you.
-
Dgelessus,
My bad... that should be os.path.basename instead. I wasn't sure how Windows machines would handle forward slashes in URLs, but a quick test shows it works properly.
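A sketch of the corrected approach. The thread's Python 2 urlparse module became urllib.parse in Python 3; the URL below is just an example:

```python
import os.path
from urllib.parse import urlsplit  # 'import urlparse' in Python 2

def filename_from_url(url):
    # urlsplit separates the path from the query and fragment,
    # so neither one leaks into the filename
    return os.path.basename(urlsplit(url).path)

print(filename_from_url('http://example.com/images/logo.png?size=large#top'))
# -> logo.png
```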