Quick hackish html-book documentation scraper for offline reading
-
You know what's annoying? When people post stuff online as html-books, like with a table of contents page and then a bunch of linked sub-pages with all the content. Lots of documentation, in particular, is organized that way (example: http://tldp.org/LDP/abs/html/ ) and it drives me nuts because I like to read stuff offline, like on my iPad on airplanes.
Solution: a script to crawl such pages starting from the ToC and scrape unique URLs linked therein, one level deep, and append them to one big file that can then be read offline.
Gist: https://gist.github.com/paultopia/460acfda07f9ca7314e5
Takes the URL of the ToC page from raw_input and deposits an HTML file in Pythonista's internal file system. From there, do with it what you will: pass it to a Dropbox upload script, pass it to docverter or something to make it into a PDF, whatever. You can also use Pythonista's export function to get it into Dropbox or another app (like a PDF converter) the easy way, but for some odd reason the export function only works when the filename ends with .py (why is this, anyway?), so you'll have to rename the file, export it, then rename it back.
Caveats: assumes all links are relative URLs on the same server, and has exactly zero validation to check that (easy to add, I just haven't bothered). Will probably crash if that assumption is violated. Also produces invalid HTML, but not in any way that will bother any browser or converter. Finally, assumes that the documents to scrape are structured with content in vanilla HTML, no Ajax calls or the like.
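For readers who want the gist of the approach without opening the gist, here is a minimal sketch of a one-level-deep ToC scrape. It is not the code from the gist: it is written for Python 3 with only the standard library, and the names `LinkCollector`, `unique_links`, `fetch`, and `scrape_book` are made up for illustration. It also assumes pages are UTF-8 and that the ToC page itself belongs at the top of the output.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def unique_links(toc_url, toc_html):
    """Return absolute, de-duplicated URLs linked from the ToC page."""
    collector = LinkCollector()
    collector.feed(toc_html)
    seen, ordered = set(), []
    for href in collector.hrefs:
        # Resolve against the ToC URL and drop #fragments so that
        # page.html#a and page.html#b count as one page.
        absolute, _fragment = urldefrag(urljoin(toc_url, href))
        if absolute not in seen:
            seen.add(absolute)
            ordered.append(absolute)
    return ordered


def fetch(url):
    """Download a page as text (assumes UTF-8; real pages may differ)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_book(toc_url, fetch_page=fetch):
    """Fetch the ToC plus every page it links to, joined into one string."""
    toc_html = fetch_page(toc_url)
    pages = [toc_html]
    for url in unique_links(toc_url, toc_html):
        if url != toc_url:  # don't append the ToC a second time
            pages.append(fetch_page(url))
    return "\n".join(pages)
```

Passing `fetch_page` as a parameter is just a convenience so the crawl can be tried against a dict of canned pages before pointing it at a real server. Like the original, this simply concatenates whole pages, so the result is not valid HTML.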
-
The part where it doesn't check that URLs are relative does seem like an issue, you made an entire web scraper ;) This sounds really cool though.
-
Heh yeah I'm about to add a little validation just for sanity-preserving purposes.
I also just updated so it can handle ToC pages other than index.html or equivalent.
-
Improved! There's a new and much more effective version of the script that:
- Confirms links come from the same domain.
- Better handles URLs relative to the root rather than to the ToC folder.
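Both improvements can be sketched with the standard library's `urllib.parse`. This is an illustration of the idea, not the code from the gist; `resolve_same_domain` is a hypothetical helper name, and comparing hosts by exact `netloc` match (so `www.example.com` and `example.com` count as different) is an assumption.

```python
from urllib.parse import urljoin, urlparse


def resolve_same_domain(toc_url, href):
    """Return href as an absolute URL if it stays on the ToC's host, else None."""
    # urljoin handles both cases: "part1.html" resolves against the ToC
    # folder, while "/LDP/other.html" resolves against the server root.
    absolute = urljoin(toc_url, href)
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # skips mailto:, javascript:, and similar links
    if parts.netloc.lower() != urlparse(toc_url).netloc.lower():
        return None  # link leaves the ToC's domain
    return absolute
```

For example, with the ToC at `http://tldp.org/LDP/abs/html/index.html`, the link `part1.html` resolves inside the `html/` folder, `/LDP/other.html` resolves from the server root, and a link to any other host comes back as `None`.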
Gist: https://gist.github.com/paultopia/02ca124a111a70faf174
-
@paultopia would you be willing to make this a repo instead of a gist? I have some pull requests in mind.
-
Absolutely @ccc -- here's a full-fledged repo: https://github.com/paultopia/spideyscrape
PRs welcome!
Also, I've refactored a little to make the code a bit more modular, and also to produce technically valid html.
-
FYI, I've tossed up a quick Python 3-compatible version in a gist for the latest version of Pythonista. (Next steps: a proper repo with a version that can handle 2 or 3, plus hopefully/maybe/one day a way to grab images and include them in the resulting HTML.)
https://gist.github.com/paultopia/39cb21e080b4abe24de8056e92a40ed2
-
Shouldn't it recurse to the boundary of the domain?
I.e., relative links are in, as are absolute ones on the same domain name, but links that leave the domain are out.
I haven't examined the code, so I also don't know whether you break cycles.
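To make the suggestion concrete, here is a rough sketch of a crawl that follows links to the boundary of the domain and breaks cycles with a visited set. It makes no claim about what spideyscrape actually does. The names `crawl_domain` and `_Links`, the `fetch_page` callable, and the `max_pages` safety cap are all hypothetical, chosen for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class _Links(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def crawl_domain(start_url, fetch_page, max_pages=500):
    """Breadth-first crawl of every same-host page reachable from start_url.

    fetch_page is any callable mapping a URL to its HTML text.
    Returns a list of (url, html) pairs in the order visited.
    """
    host = urlparse(start_url).netloc.lower()
    start = urldefrag(start_url)[0]
    visited = {start}  # every URL ever queued; this is what breaks cycles
    queue = deque([start])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        pages.append((url, html))
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            # Relative links resolve against the current page; fragments
            # are dropped so page.html#a and page.html#b are one page.
            target = urldefrag(urljoin(url, href))[0]
            if urlparse(target).netloc.lower() != host:
                continue  # outside the domain (or mailto: etc.): skip
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return pages
```

Marking a URL as visited when it is queued, rather than when it is fetched, means a page that links back to the ToC (or to itself) is never fetched twice. The cap is there because "the whole domain" can be much bigger than one book.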