Scraping

aiden

Hello everybody! My name is Aiden, and I'm new to Python, though not to coding itself. I have been learning for only a couple of days now, with the goal of learning all about web scraping. To that end, here is a small script I wrote that scrapes your IP from a website's logs (in this case, Disneyland's website) and then both prints it and speaks it aloud.

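import urllib  # Python 2 API; on Python 3 this function is urllib.request.urlopen
import speech  # text-to-speech module (not in the standard library)

# Download the raw page source
getsrc = urllib.urlopen('https://disneyland.disney.go.com/calendars/day/')
read = getsrc.read()
# The IP is read from a fixed character range of the page source
ip = read[1751:1766]
print(ip)
rate = '1'
speech.say('Your IP is ' + ip, rate)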

Thank you for reading through this whole thing, and please comment any suggestions!

JonB

You should take a look at bs4. The danger with what you have shown is that tomorrow the website might change in a small way and your script won't work anymore. It is better to try to use named elements to anchor where you are. Rarely would you pick a range from a string -- better to use a regexp or other pattern matching where named elements don't drill down enough. There have been a few threads on using bs4 for specific purposes.
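As a rough illustration of the pattern-matching idea, here is a minimal sketch that anchors on the `"ip"` key instead of a character offset. The sample string, the `IP_PATTERN` regex, and the `find_ip` helper are all hypothetical; the sketch assumes the page embeds the address as a quoted value after an `"ip"` key, and the real markup may differ.

```python
import re

# Hypothetical stand-in for the downloaded page source; the real page may differ.
SAMPLE = '<script>var config = {"country": "US", "ip": "203.0.113.7", "lang": "en"};</script>'

# Anchor on the "ip" key and capture the quoted value that follows it.
IP_PATTERN = re.compile(r'"ip"\s*:\s*"([^"]+)"')


def find_ip(page_source):
    """Return the quoted value after the "ip" key, or None if it is absent."""
    match = IP_PATTERN.search(page_source)
    return match.group(1) if match else None


print(find_ip(SAMPLE))
```

Because the match follows the key rather than a position, text added or removed elsewhere on the page should not shift the result.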

aiden

Ok, thank you! I will test out BS4 and then update this thread. Thanks for the recommendation :)

ccc

@JonB I really like using BS4 but perhaps it is not the right tool for this particular job.

Given that the text in question is on about the 20th line of the second JavaScript block in the webpage, is BS4 the best tool? BS4 will be great for finding the second SCRIPT block, but is it helpful for parsing into that script? Perhaps using str.partition('"ip"') is more direct. Perhaps I am missing some capabilities that are in BS4.

import requests

url = 'https://disneyland.disney.go.com/calendars/day/'

for line in requests.get(url).text.splitlines():
    _, ip_in_double_quotes, rest = line.partition('"ip"')
    if rest:
        print('Your IP is {}.'.format(rest.split('"')[1]))
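The two ideas can also be combined: let an HTML parser isolate the SCRIPT blocks, then use str.partition inside them. The sketch below swaps in the standard library's `html.parser` for BS4 so that it runs without extra installs. The sample page, the `ScriptCollector` class, and the `ip_from_scripts` helper are hypothetical names for illustration, and the sketch assumes the address appears as a quoted value after an `"ip"` key.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE = (
    '<html><head><script>var a = 1;</script>'
    '<script>\nvar cfg = {\n  "ip": "203.0.113.7"\n};\n</script></head>'
    '<body><p>"ip" mentioned in visible text</p></body></html>'
)


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> block."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def ip_from_scripts(page_source):
    """Return the quoted value after "ip" in the first script that has one."""
    collector = ScriptCollector()
    collector.feed(page_source)
    collector.close()
    for script in collector.scripts:
        _, found, rest = script.partition('"ip"')
        parts = rest.split('"')
        if found and len(parts) > 1:
            return parts[1]
    return None


print('Your IP is {}.'.format(ip_from_scripts(SAMPLE)))
```

Restricting the search to script blocks means a stray `"ip"` in the visible page text is ignored, at the cost of a little more code than the line-by-line loop.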