omz:forum


    Welcome!

    This is the community forum for my apps Pythonista and Editorial.

    For individual support questions, you can also send an email. If you have a very short question or just want to say hello — I'm @olemoritz on Twitter.


    Web scraping

    Pythonista
    bs4 url beautifulsoup requests web scraping
    • sulcud
      sulcud last edited by sulcud

      I made this Python script for web scraping; it can map a web site into a dictionary, for example:

      {url:[url_type,{url1_in_url:[url1_in_url_type,{...}],url2_inurl:[...],...}]}
      

      If you also want to download all the files on that web site, just set the variable 'descargar' to True:

       url = 'https://en.wikipedia.org/wiki/Roy_Clark'
       descargar = True        # "download": also download files found while crawling
       profundidad = 2         # "depth": how many link levels to follow
       archivo = 'clark.json'  # output file for the site map
       s = Scraper()
       s.lineal(url, profundidad, descargar, archivo)
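
      A minimal sketch of how such a recursive site map could be built (hypothetical code, not the actual Scraper implementation; `get_links` stands in for the real link extraction done with requests and BeautifulSoup, so this runs without the network):

      ```python
      from urllib.parse import urlparse

      def url_type(url):
          # Crude classification by file extension (hypothetical helper).
          last = urlparse(url).path.rsplit('/', 1)[-1]
          if '.' in last:
              return last.rsplit('.', 1)[-1]  # e.g. 'pdf', 'png'
          return 'page'

      def map_site(url, depth, get_links):
          # Builds {url: [url_type, {child_url: [child_type, {...}], ...}]}
          # as described in the post, recursing `depth` levels deep.
          children = {}
          if depth > 0:
              for link in get_links(url):
                  children.update(map_site(link, depth - 1, get_links))
          return {url: [url_type(url), children]}

      # Example with a fake link extractor instead of a real HTTP fetch:
      fake_site = {
          'https://example.com/': ['https://example.com/a.pdf',
                                   'https://example.com/b'],
          'https://example.com/b': [],
      }
      tree = map_site('https://example.com/', 1,
                      lambda u: fake_site.get(u, []))
      ```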
      

      GitHub

      I want to implement some threading to speed up the analysis.
      Note:

      it sometimes throws an error when downloading files.
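
      The threading idea could be sketched with the stdlib's ThreadPoolExecutor (a hedged sketch, not the script's actual code; `fetch` is a stand-in for the real per-URL download with requests):

      ```python
      from concurrent.futures import ThreadPoolExecutor, as_completed

      def fetch(url):
          # Stand-in for the real download; in practice: requests.get(url).content
          return f"contents of {url}"

      def fetch_all(urls, workers=8):
          # Download several URLs concurrently. Errors are captured per URL,
          # so one failed download does not abort the whole crawl.
          results = {}
          with ThreadPoolExecutor(max_workers=workers) as pool:
              futures = {pool.submit(fetch, u): u for u in urls}
              for fut in as_completed(futures):
                  url = futures[fut]
                  try:
                      results[url] = fut.result()
                  except Exception as exc:
                      results[url] = exc
          return results

      pages = fetch_all(['https://example.com/a', 'https://example.com/b'])
      ```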

      • mikael
        mikael @sulcud last edited by

        @sulcud, would be interesting to know if you have considered the pros and cons of threading vs. asyncio, and why you would pick one or the other.

        • sulcud
          sulcud @mikael last edited by

          @mikael I tried to implement the async version; I'm not sure I did it correctly, but now it downloads everything much faster than before. I also fixed the link-extraction function, because sometimes (most of the time ☹️) it only output 20-50 URLs; now it extracts all, or nearly all, of the links on the page. I also made a setup.py file, but I honestly don't know if it works.

          Now the way to use it is:

          from scrapthor import scrap
          url = "some url"
          scrap(url)
          

          Please, can you check it?

          • mikael
            mikael @sulcud last edited by

            @sulcud, your code still looks serial to me. I think you need aiohttp for this - check e.g. this tutorial.
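
            The async pattern being pointed at looks roughly like this (a stdlib-only sketch; the sleep stands in for an aiohttp request such as `session.get`, so it runs without the network):

            ```python
            import asyncio

            async def fetch(url):
                # Stand-in for: async with session.get(url) as resp: return await resp.text()
                await asyncio.sleep(0.01)  # simulated network latency
                return f"contents of {url}"

            async def fetch_all(urls):
                # All requests run concurrently; total time is roughly one
                # request's latency, not the sum of all of them.
                return await asyncio.gather(*(fetch(u) for u in urls))

            urls = [f'https://example.com/{i}' for i in range(10)]
            pages = asyncio.run(fetch_all(urls))
            ```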

            • sulcud
              sulcud @mikael last edited by

              @mikael WOW, with that package the speed increased a lot, thanks! Now I know the real power of async programming.
