omz:forum

    Finding CSS files inside HTML source

    Pythonista
    regex python 2
    • low
      low last edited by low

      I am trying to make a script that caches websites so I can easily scrape data without having to use proxies (I scrape enough that my traffic would otherwise get blocked). My first step is using urllib2 to get the page source:

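      import urllib2  # Python 2 standard library

      cc = urllib2.urlopen(args.site)  # args.site: URL of the site to cache
      source = cc.read()  # raw page source as a byte string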
      

      Once I have the page source, I need to look through it to find any CSS files that are linked relative to the site. The three things I would be looking for are <link, href=, and stylesheet. The rel attribute describes how the linked file relates to the page, its value stylesheet proves that the file is CSS, and directly after the hypertext reference (href=) is the name of the CSS file, in quotes, that I need.
      An example:

      <link rel="stylesheet" type="text/css" href="main.css" />
      

      How could I get this information?
      Edit: I need to pick this CSS file out of all the HTML code that the webpage has.

      • kristof_be
        kristof_be last edited by

        I would personally use Requests and BeautifulSoup (both modules are included by default with Pythonista).

        Furthermore, I would set up the HTTP request to use a modified User-Agent value, in order to hide that you're running this from a script, and pose as a regular web browser (note that this is not foolproof, but it helps).
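A minimal sketch of the User-Agent idea, kept close to the original urllib2 call by using `urllib.request`, its Python 3 standard-library successor. The names `BROWSER_UA`, `build_request`, and `fetch_source` are made up for illustration, and the User-Agent string is only an example of a browser-like value. With Requests, the same header dict would be passed as `requests.get(url, headers=...)`.

```python
import urllib.request

# Example browser-like value (an assumption; any current browser string works).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_source(url):
    """Download the page and return its raw bytes."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

Calling `fetch_source(args.site)` would then take the place of the `urlopen(...).read()` pair in the question. As noted above, a server can still detect scripted traffic by other means.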

        If you only need the links to the CSS, you can instruct BeautifulSoup to parse only the link tags, thus speeding up the process, by using the SoupStrainer class (more info in the BeautifulSoup documentation).
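The SoupStrainer approach could look roughly like this. It is a sketch written for Python 3 with the `bs4` package and the built-in `html.parser` backend. The function name `find_stylesheets` is hypothetical, and it assumes `html` already holds the page source.

```python
from bs4 import BeautifulSoup, SoupStrainer


def find_stylesheets(html):
    """Return the href of every <link> tag whose rel includes "stylesheet"."""
    # Only <link> tags are turned into objects; the rest of the page is skipped.
    only_links = SoupStrainer("link")
    soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

    hrefs = []
    for tag in soup.find_all("link"):
        rel = tag.get("rel") or []
        # bs4 normally returns rel as a list; handle a plain string as well.
        if isinstance(rel, str):
            rel = rel.split()
        if "stylesheet" in [value.lower() for value in rel] and tag.get("href"):
            hrefs.append(tag["href"])
    return hrefs
```

For the example tag in the question, `find_stylesheets` would return a list containing `main.css`. Links with another `rel` value, such as an icon, are ignored.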

        Hope this helps.
