Appex Safari content

userista

I specifically want to use Safari (or Safari VC) to download the page - it's a JavaScript heavy page, and I also would like to make use of content blockers.

ccc

appex_dump.py shows that Firefox for iOS is only slightly more willing to share than Safari... Firefox gives you the page title.

JonB

Do you mean just the full text of a single HTML page? Or all the images, linked stylesheets, and JavaScript? Are you trying to save the final DOM after JavaScript has constructed the page?

My favorite method for page scraping is requests+bs4. You choose what extra stuff to retrieve. If you need to run the JavaScript on a page (you often don't, as long as you "fake it"), you might consider mechanize or a webview.

userista

I need the final HTML because it's a page that has multiple iframes embedded inside each other - and I won't know the iframe URLs until runtime. It really would be a job for JavaScript - but the iframes are cross origin and WebKit doesn't allow JS to run cross origin. So I would like to let Safari load all the JavaScript/iframes (and block all the ads ;) ) - then use BeautifulSoup to parse the HTML.

EDIT: According to the documentation ('Accessing a Webpage') it seems like @omz would have to add something to Pythonista to allow this to happen. Though since it seems Javascript based, I'm not sure if it would help in my scenario (because of cross origin requests).

JonB

Couldn't you just use bs4 to parse the top page, then load the individual iframes?
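
A minimal sketch of that idea, written for Python 3 with only the standard library so it runs without extra installs; with bs4 the lookup itself would be `soup.find_all('iframe')`. The names `IframeCollector` and `iframe_urls` and the sample HTML are made up for illustration. Note that this only sees iframes present in the raw HTML, not ones that JavaScript adds at runtime.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collects the src attribute of every <iframe> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'iframe':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def iframe_urls(html, base_url):
    """Return absolute URLs of all iframes in html, resolved against base_url."""
    collector = IframeCollector()
    collector.feed(html)
    return [urljoin(base_url, src) for src in collector.sources]


# Hypothetical page content, standing in for a downloaded top-level page
sample = ('<html><body><iframe src="/inner.html"></iframe>'
          '<iframe src="http://other.example/ad"></iframe></body></html>')
print(iframe_urls(sample, 'http://example.com/page'))
```

Each returned URL could then be fetched and parsed the same way, repeating for nested iframes.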

Alternatively, use a webview, with a custom delegate to catch iframes, then load those individually. As an example of some useful js logging functions, and how to get source from a loaded page: You can modify the delegate to also look for urls and handle the

# coding: utf-8

jslogging='''
console = new Object();
console.log = function(log) {
  // create then remove an iframe, to communicate with webview delegate
  var iframe = document.createElement("IFRAME");
  iframe.setAttribute("src", "ios-log:" + log);
  document.documentElement.appendChild(iframe);
  iframe.parentNode.removeChild(iframe);
  iframe = null;    
};
// TODO: give each log level an identifier in the log
console.debug = console.log;
console.info = console.log;
console.warn = console.log;
console.error = console.log;

// 2) custom onerror, which logs info about the error
window.onerror = function(error, url, line, col, errorobj) {
  console.log("error: " + error + "%0Aurl: " + url + " line: " + line + " col: " + col + " stack: " + errorobj);
};

console.log("logging activated");
'''

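import ui
try:
    from urllib import unquote        # Python 2
except ImportError:
    from urllib.parse import unquote  # Python 3

class debugDelegate (object):
    def webview_should_start_load(self, webview, url, nav_type):
        if url.startswith('ios-log'):
            # message sent by the console.log shim above: decode and print it
            print(unquote(url))
        else:
            # any other load, including iframes: just show the url
            print(url)
        return True

w = ui.WebView()
# install the logging shim
w.eval_js(jslogging)

w.delegate = debugDelegate()

w.load_url('http://lab.pipwerks.com/javascript/cross-domain-iframes/')
# print source
# note: load_url returns before the page has loaded, so this may need to be run again later
w.eval_js('console.log(document.documentElement.outerHTML)')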

Webmaster4o

There's no way to do this in Pythonista. First of all, as I understand it, content blockers don't pull chunks out of the HTML, they just hide them. So downloading the HTML after the content blockers have run won't look any different from downloading it at the start. With my solution, you could parse it with bs4 to find your iframe URLs, as @JonB mentioned.

JonB

It depends on how complex the JavaScript is, obviously, and whether you can reproduce that by simply munging raw HTML. For anything that requires authentication, cookies, etc., requests will be easier than urllib2. But for pages that are heavily dynamically generated, you will want to use something that lets JS run.

A webview does run JavaScript, which is good, but you cannot walk iframes from the main window. If the iframes are independent, you could simply load each iframe separately. If there is communication between iframes, I am not sure this is possible. As for content blocking, which makes sense if you have a bunch of images that you don't want to download, you can roll your own using the delegate (return False from webview_should_start_load for blocked content).
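
A rough sketch of such a home-made blocker, in Python 3. The class name `BlockingDelegate` and the hosts in `BLOCKED_HOSTS` are placeholders, not anything from Pythonista itself; only the `webview_should_start_load` method name and its arguments come from the delegate shown earlier in the thread. It can only block loads that the webview actually passes to the delegate.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; replace with the hosts you actually want to skip.
BLOCKED_HOSTS = {'ads.example.com', 'tracker.example.net'}


class BlockingDelegate(object):
    def __init__(self, blocked_hosts=BLOCKED_HOSTS):
        self.blocked_hosts = set(blocked_hosts)
        self.blocked = []  # urls that were refused, kept for inspection

    def is_blocked(self, url):
        # Match the host itself or any subdomain of a blocked host
        host = urlparse(url).hostname or ''
        return any(host == b or host.endswith('.' + b)
                   for b in self.blocked_hosts)

    def webview_should_start_load(self, webview, url, nav_type):
        if self.is_blocked(url):
            self.blocked.append(url)
            return False  # tell the webview not to load this url
        return True
```

In Pythonista it would be attached with `w.delegate = BlockingDelegate()` on a `ui.WebView`, the same way as the debug delegate.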

Hyashi, you seemed to think this was possible in a Safari VC? If so, there may be an objc solution... if you can find references on how to do what you want in Objective-C, it may be translatable to Pythonista, as long as it does not require new permissions.

userista

@JonB Pythonista could open Safari VC (using objc) but I still don't think it would be able to get the contents of the webpage (after it was loaded) without @omz adding the functionality to Pythonista (like I mentioned in a prior comment). It seems like Pythonista would have to register a JavaScript file that would be injected into the webpage.

JonB

If you run your own delegate, you can load whatever you want, since this happens outside of the browser. You would return False, then open the iframe in its own webview instance. Although I am not sure if the delegate gets access to cookies/headers/etc., which might be needed to open the same content...
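
Roughly, the delegate side of that could look like the sketch below. `FrameCaptureDelegate` is a made-up name, and the sketch assumes the first load the delegate sees is the top-level page and every later one is an iframe; redirects and link clicks would be captured as well, so real use would need some filtering.

```python
class FrameCaptureDelegate(object):
    """Lets the first load through and diverts every later load into a list."""

    def __init__(self):
        self.top_url = None   # the page the webview was asked to open
        self.captured = []    # urls that were refused and saved for later

    def webview_should_start_load(self, webview, url, nav_type):
        if self.top_url is None:
            # Assumption: the first request is the top-level document
            self.top_url = url
            return True
        if url not in self.captured:
            self.captured.append(url)
        return False  # do not load it here; handle it separately
```

Each url in `captured` could then be opened in its own webview instance, or fetched with requests if the content does not depend on cookies or headers from the parent page.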

what site are you trying to scrape?

JonB

Btw, it is always possible to inject JavaScript into the top page of a document using eval_js; this does not violate any cross-site scripting restrictions. But that JavaScript runs at the level of the domain of the main document, and cannot access iframes outside of that domain. You can still easily get the source of the currently loaded top-level document.