Quick hackish html-book documentation scraper for offline reading

paultopia

You know what's annoying? When people post stuff online as html-books, like with a table of contents page and then a bunch of linked sub-pages with all the content. Lots of documentation, in particular, is organized that way (example: http://tldp.org/LDP/abs/html/ ) and it drives me nuts because I like to read stuff offline, like on my iPad on airplanes.

Solution: a script to crawl such pages starting from the ToC and scrape unique URLs linked therein, one level deep, and append them to one big file that can then be read offline.

Gist: https://gist.github.com/paultopia/460acfda07f9ca7314e5

Takes the URL of the ToC page from raw_input and deposits an html file in Pythonista's internal file system. From there, do with it what you will: pass it to a Dropbox upload script, pass it to docverter or something to make it into a PDF, whatever. You can also use Pythonista's export function to get it into Dropbox or another app (like a PDF converter) the easy way, but, for some odd reason, the export function only works when the filename ends with .py (why is this, anyway?), so you'll have to edit the filename and then edit it back.

Caveats: assumes all links are relative URLs on the same server, and has exactly zero validation to check that (easy to add, I just haven't bothered). Will probably crash if that assumption is violated. Also produces invalid html, but not in any way that will bother any browser or converter. Finally, assumes that the documents to scrape are structured with content in vanilla html, no Ajax calls or the like.
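
For readers who want the shape of the approach without opening the gist, here is a minimal sketch of "scrape unique links from the ToC, one level deep, and append everything to one file". It is not the gist's code: it is written for Python 3 with only the standard library, and the names LinkCollector, unique_links, build_book and the fetch parameter are all made up for illustration. fetch is assumed to be any function that takes a URL and returns that page's html as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def unique_links(toc_url, toc_html):
    """Return absolute, de-duplicated URLs linked from the ToC, in order."""
    parser = LinkCollector()
    parser.feed(toc_html)
    seen, ordered = set(), []
    for href in parser.hrefs:
        # Resolve against the ToC URL and drop "#fragment" so that
        # ch1.html and ch1.html#s2 count as the same page.
        absolute = urldefrag(urljoin(toc_url, href)).url
        if absolute not in seen:
            seen.add(absolute)
            ordered.append(absolute)
    return ordered


def build_book(toc_url, fetch):
    """Fetch the ToC, then each linked page once, and join them into one string.

    `fetch` is a hypothetical callable: URL in, html string out.
    """
    toc_html = fetch(toc_url)
    pages = [toc_html]
    for url in unique_links(toc_url, toc_html):
        if url != toc_url:  # don't append the ToC a second time
            pages.append(fetch(url))
    return "\n".join(pages)
```

Like the script described above, this just glues whole pages together, so the result is not strictly valid html. For real use, fetch could wrap urllib.request.urlopen, and the returned string could be written out with an ordinary open(..., "w").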

Webmaster4o

The part where it doesn't check that URLs are relative does seem like an issue, you made an entire web scraper ;) This sounds really cool though.

paultopia

Heh yeah I'm about to add a little validation just for sanity-preserving purposes.

I also just updated it so it can handle ToC pages other than index.html or equivalent.

paultopia

Improved! There's a new and much more effective version of the script that:

Confirms links come from same domain
Better handles URLs relative to root rather than to the ToC folder.
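
The two improvements listed above could look roughly like this with Python 3's standard library. This is a sketch, not the code in the gist; resolve and same_domain are hypothetical helper names, and "part1.html" is just a stand-in filename.

```python
from urllib.parse import urljoin, urlparse


def resolve(toc_url, href):
    """Turn an href into an absolute URL.

    urljoin handles both cases: "part1.html" is resolved against the
    ToC's folder, while "/LDP/other.html" is resolved against the
    server root.
    """
    return urljoin(toc_url, href)


def same_domain(toc_url, href):
    """True if href, once resolved, points at the same host as the ToC."""
    return urlparse(resolve(toc_url, href)).netloc == urlparse(toc_url).netloc


toc = "http://tldp.org/LDP/abs/html/"
print(resolve(toc, "part1.html"))       # relative to the ToC folder
print(resolve(toc, "/LDP/other.html"))  # relative to the server root
print(same_domain(toc, "http://example.com/elsewhere.html"))
```

Links such as mailto: addresses have an empty host after parsing, so same_domain rejects them too.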

Gist: https://gist.github.com/paultopia/02ca124a111a70faf174

ccc

@paultopia would you be willing to make this a repo instead of a gist? I have some pull requests in mind.

paultopia

Absolutely @ccc -- here's a full-fledged repo: https://github.com/paultopia/spideyscrape

PRs welcome!

Also, I've refactored a little to make the code a bit more modular, and also to produce technically valid html.

paultopia

FYI, I've tossed up a quick Python 3-compatible version in a gist for the latest version of Pythonista. (Next steps: a proper repo with a version that can handle 2 or 3, plus hopefully/maybe/one day a way to grab images and include them in the resulting html.)

https://gist.github.com/paultopia/39cb21e080b4abe24de8056e92a40ed2

MartinPacker

Shouldn't it recurse to the boundary of the domain?

I.e., relative links are in, as are ones for the same domain name, but ones outside of the domain aren't.

Not having examined the code, I also don't know whether you break cycles.
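
To make the suggestion concrete, here is one way "recurse to the boundary of the domain, and break cycles" might be sketched in Python 3. It is not taken from spideyscrape; crawl_domain, the fetch callable (URL in, html string out) and the max_pages cap are all assumptions for illustration. A visited set is what breaks cycles: a page that links back to one already queued is simply skipped.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class _Hrefs(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def crawl_domain(start_url, fetch, max_pages=500):
    """Breadth-first crawl that never leaves start_url's domain.

    Returns the URLs in the order they were fetched. `fetch` and
    `max_pages` are hypothetical; the cap is only a safety net.
    """
    domain = urlparse(start_url).netloc
    visited = {start_url}          # every URL ever queued: breaks cycles
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        parser = _Hrefs()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            link = urldefrag(urljoin(url, href)).url
            # Relative links resolve to the same domain; external ones are dropped.
            if urlparse(link).netloc == domain and link not in visited:
                visited.add(link)
                queue.append(link)
    return order
```

Breadth-first order keeps pages roughly in ToC order, which matters if they are all going into one file. The catch with crawling the whole domain is that a docs site often shares its domain with a lot of unrelated pages, so restricting to the ToC's path prefix might be a more practical boundary.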