Finding CSS files inside HTML source
I am trying to make a script that caches websites so I can easily scrape data without having to use proxies (I scrape enough that my traffic would get blocked). My first step is using urllib2 to get the page source:
cc = urllib2.urlopen(args.site)
source = cc.read()
Once I have the page source, I need to look through it to find any CSS files that are linked relatively to the site. The three things I would be looking for are <link, href=, and stylesheet. The rel attribute gives the link's relationship to the page, its value stylesheet shows that the file is CSS, and directly after the hypertext reference (href) is the name of the CSS file in quotes, which is what I need.
<link rel="stylesheet" type="text/css" href="main.css" />
How could I get this information?
Edit: I need to pick this CSS file out of all the HTML code that the webpage contains.
I would personally use BeautifulSoup (both modules are included by default with Pythonista).
Furthermore, I would set up the HTTP request to send a modified User-Agent value, so that the request does not announce that it comes from a script and instead poses as a regular web browser (note that this is not foolproof, but it helps).
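As a rough sketch of the User-Agent idea, the snippet below builds a request with a browser-like header before fetching the page. It assumes Python 3, where urllib.request replaces the urllib2 module used in the question; the names BROWSER_UA, build_request and fetch_source, and the header string itself, are made up for illustration.

```python
import urllib.request

# Hypothetical browser-like User-Agent string (an assumption for this sketch);
# substitute the value your own browser actually sends.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_source(url):
    """Download the page at `url` and return its raw bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Calling fetch_source(args.site) would then stand in for the urlopen/read pair from the question.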
If you only need the links to the CSS, you can instruct BeautifulSoup to parse only the link tags, which speeds up the process, by using the SoupStrainer class (described in the BeautifulSoup documentation).
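Here is a minimal sketch of that approach, assuming the bs4 package (BeautifulSoup 4) and Python 3. The function name find_stylesheets and its base_url parameter are illustrative choices, not something from the question; base_url is only used to turn a relative href such as main.css into a full URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, SoupStrainer


def find_stylesheets(source, base_url):
    """Return absolute URLs of all <link rel="stylesheet"> tags in `source`."""
    # Parse only <link> tags; everything else in the page is skipped.
    only_links = SoupStrainer("link")
    soup = BeautifulSoup(source, "html.parser", parse_only=only_links)

    stylesheets = []
    for tag in soup.find_all("link", href=True):
        rel = tag.get("rel") or []
        # rel is normally a list of values, but handle a plain string too.
        if isinstance(rel, str):
            rel = rel.split()
        if "stylesheet" in [value.lower() for value in rel]:
            # Resolve relative hrefs against the page URL.
            stylesheets.append(urljoin(base_url, tag["href"]))
    return stylesheets
```

With the source read in the question, find_stylesheets(source, args.site) should give the list of CSS URLs to download and cache.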
Hope this helps.