omz:forum

    Scraping

    Pythonista
    • aiden

      Hello everybody! My name is Aiden, and I'm new to Python, though not to coding itself. I have been learning for only a couple of days now, with the goal of learning all about web scraping. With that in mind, here is a small script I wrote that scrapes your IP from a website's (in this case Disneyland's website) logs and then both prints it and says it out loud to you.

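      import urllib.request  # Python 3; on Python 2 this was urllib.urlopen
      import speech  # Pythonista-only module

      # Fetch the page source and decode the bytes to text
      getsrc = urllib.request.urlopen('https://disneyland.disney.go.com/calendars/day/')
      read = getsrc.read().decode('utf-8', errors='replace')
      # Fragile: assumes the IP always sits at this fixed character offset
      ip = read[1751:1766]
      print(ip)
      # Pythonista documents speech.say(text, language, rate), so the rate goes third and is a float
      rate = 1.0
      speech.say('Your IP is ' + ip, 'en-US', rate)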
      

      Thank you for reading through this whole thing, and please comment with any suggestions!

      • JonB

        You should take a look at bs4. The danger with what you have shown is that tomorrow the website might change in a small way and your script won't work anymore. It is better to try to use named elements to anchor where you are. Rarely would you pick a range from a string -- better to use a regexp or other pattern matching where named elements don't drill down enough. There have been a few threads on using bs4 for specific purposes.
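
        As a rough sketch of the pattern-matching idea, here is a regexp that anchors on an "ip" key rather than on a character offset. It assumes the page embeds the address as a JSON-like `"ip": "..."` pair; the sample markup, `IP_PATTERN`, and `find_ip` are made up for illustration and are not taken from the real page.

```python
import re

# Hypothetical sample of the page source; the real markup may differ.
SAMPLE_HTML = '''
<html><head>
<script>var config = {"lang": "en"};</script>
<script>
var visitor = {
    "country": "US",
    "ip": "203.0.113.42",
    "region": "CA"
};
</script>
</head><body></body></html>
'''

# Match the "ip" key followed by a quoted dotted-quad value.
IP_PATTERN = re.compile(r'"ip"\s*:\s*"(\d{1,3}(?:\.\d{1,3}){3})"')


def find_ip(page_source):
    """Return the first IP found after an "ip" key, or None if absent."""
    match = IP_PATTERN.search(page_source)
    return match.group(1) if match else None


print(find_ip(SAMPLE_HTML))
```

        Unlike a slice such as `read[1751:1766]`, this keeps working if the surrounding markup grows or shrinks, and it returns None instead of unrelated text when the key is missing.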

        • aiden

          Ok, thank you! I will test out BS4 and then update this thread. Thanks for the recommendation :)

          • ccc

            @JonB I really like using BS4 but perhaps it is not the right tool for this particular job.

            Given that the text in question is on about the 20th line of the second JavaScript block in the web page, is BS4 the best tool? BS4 will be great for finding the second SCRIPT block, but is it helpful for parsing into that script? Perhaps using str.partition('"ip"') is more direct. Perhaps I am missing some capabilities that are in BS4.

            import requests
            
            url = 'https://disneyland.disney.go.com/calendars/day/'
            
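            for line in requests.get(url).text.splitlines():
                # partition() returns (before, separator, after); rest is empty when '"ip"' is absent
                _, _, rest = line.partition('"ip"')
                if rest:
                    # The value is presumably the first quoted string after the key
                    print('Your IP is {}.'.format(rest.split('"')[1]))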
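
            The two ideas can also be combined: first narrow the search to the SCRIPT blocks, then use str.partition inside them. The sketch below uses the standard library's `html.parser` as a stand-in for BS4's script lookup, so it is not BS4 itself. `ScriptCollector`, `ip_from_scripts`, and the sample markup are hypothetical names and data for illustration only.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> block in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the current block.
        if self._in_script:
            self.scripts[-1] += data


def ip_from_scripts(page_source):
    """Return the quoted value after the first "ip" key in any script block, or None."""
    collector = ScriptCollector()
    collector.feed(page_source)
    for script in collector.scripts:
        _, found, rest = script.partition('"ip"')
        if found:
            parts = rest.split('"')
            if len(parts) > 1:
                return parts[1]
    return None


# Made-up markup: the "ip" text in the paragraph is ignored, the one in the script is used.
SAMPLE = ('<html><head><script>var a = 1;</script>'
          '<script>var visitor = {"ip": "203.0.113.42"};</script></head>'
          '<body><p>"ip" is mentioned here too</p></body></html>')
print(ip_from_scripts(SAMPLE))
```

            Restricting the search to script blocks avoids false matches on the same text elsewhere in the page, at the cost of a little more code than a plain line-by-line partition.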
            