Is advanced web scraping possible?

mikael

@ihf, I just posted an example on this thread. Funnily enough, the code was developed for the purpose of library book renewal, just like @JonB did. I suspect they are using the same library software everywhere.

mikael

Those JS functions have turned out to be pretty convenient, although in need of more documentation, but there is a lot to be said for the ease of development on a PC.

As there seems to be some interest in scraping, I think I will add a feature where you can click on an element and get its XPath.

ihf

As to the <form> property, I am assuming that, at least for the first login page that accepts the username, it is as shown below. But I've only used requests for really simple things, so this may be beyond my reach at this point. In any case, thanks for your replies.

<form class="form-horizontal" action="/login" id="login-form" method="post" onsubmit="return next();">
        <noscript>
        <input type="hidden" id="nojs" value="43c57a0c97bb1973571ac5bfacdad8e1" name="nojs">
    </noscript>
    <input type="hidden" id="login_ref" value="" name="login_ref">
...
    
</form>

P.S. I am trying to scrape the name and address from a callsign query on qrz.com.
There is a Python module that uses an API for basic data, but more data is available by logging in.

mikael

@ihf, this looks potentially undoable with requests due to the Javascript involved – but trivial with the WebView-based code I shared in the other thread.

ihf

@mikael where do I find the module inheritable? Actually, it is imported but I don't see that it is used.

mikael

@ihf, just remove the import. I removed that dependency for the express purpose of making this dependency-free, and then forgot to remove the import. (Fix committed to GitHub as well.)

ihf

@mikael Thanks. Jswrapper looks as if it could be quite useful though I remain unsure how to use it for qrz.com or whether requests could submit the username, password, and search term without it.

mikael

@ihf, yes, I have in many cases been able to submit login credentials with requests. Often you need to find some special magic string included on the form, and send it along with your login post request. Again, with the WebView approach you do not have to do that, or worry about cookies etc. Also, many sites these days have lots of dynamic JS content, which requests cannot help you with at all.
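
As a rough sketch of the "magic string" idea: the hidden inputs of a login form can be collected and sent along with the credentials. This uses only the standard library; the class name `HiddenInputCollector`, the helper `hidden_fields`, and the `csrf_token` field in the sample markup are made up for illustration and are not taken from any particular site.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Gather name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden' and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''

def hidden_fields(html):
    """Return a dict of all hidden form fields found in the given HTML."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return collector.fields

# Sample markup; "csrf_token" is a hypothetical stand-in for a site's magic string.
sample = '''
<form action="/login" method="post">
    <input type="hidden" name="csrf_token" value="abc123">
    <input type="hidden" name="login_ref" value="">
    <input type="text" name="username">
</form>
'''

payload = hidden_fields(sample)
payload.update({'username': 'YOURUSERNAME', 'password': 'YOURPASSWORD'})
print(payload)
```

With requests, a dict like this would typically be passed as `data=payload` in the login POST, using the same session that fetched the form.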

mikael

@ihf, after thinking long and hard about it, I have decided to accept your challenge. :-) What is it that you want to get out of that site?

ihf

@mikael I merely want to do a search by callsign, using the search box on the upper left side of the main web page. However, unless you can log in, the result is very limited. If you click on login, you are taken to qrz.com/login, where you enter a username and press the Next button. That takes you to a similar page where you enter your password and click a button to log in. Only then can the search be done with a more detailed response. If you could help get me through login (even without proper credentials), I could probably take it from there.

ihf

If it would help, I can send logs from the browser inspector tool (not sure how but surely there is a way).

ihf

Here are the inputs and buttons that will need to be programmed:

A page comes up briefly which says wait … and then it goes back to the main page (not sure if this complicates matters)

The search is then done:

it is the resulting page that I need to capture.

JonB

So, I used Microsoft Edge dev tools (F12) to look at the page source and request headers for each step.
This actually looks fairly simple to do using standard techniques.

In the initial qrz.com/login response, there is some dynamic code. If you parse that, you will find "loginTicket", which you'll need to store:

   if (step == 1) {
            jQuery.ajax({
                dataType: "json",
                url: '/login-handshake',
                data: {'loginTicket': TOKENSTRING, 'username': jQuery('.login-container #username').val(), 'step' : 1},
                method: 'post',

(Here I've replaced the actual string.)
Typing the username and pressing submit results in a POST request to /login-handshake, which contains these fields:
loginTicket: TOKENSTRING
step: 1
username: YOURUSERNAME

Next, typing the password results in another POST to /login-handshake, containing:
loginTicket: TOKENSTRING
password: YOURPASSWORD
step: 2
username: YOURUSERNAME

Finally, if the handshake works, there is a final post to /login:

	'2fcode': ''  (empty string)
	'flush': 1
	'login_ref': 'https%3A%2F%2Fwww.qrz.com%2F'
	'password': YOURPASSWORD
	'target': '%2F'
	'username': YOURUSERNAME

After that, you get a cookie, which you keep using.
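
The three POSTs listed above could be replayed roughly as follows. This is only a sketch: the function names are made up, the base URL is an assumption, `sess` is assumed to be a `requests.Session` (or anything with a compatible `post` method), and the handshake responses are not inspected because their format is not shown here.

```python
LOGIN_REF = 'https%3A%2F%2Fwww.qrz.com%2F'  # value copied as listed above

def handshake_payload(ticket, username, step, password=None):
    """Build the form fields for one /login-handshake POST."""
    payload = {'loginTicket': ticket, 'step': step, 'username': username}
    if step == 2:
        payload['password'] = password
    return payload

def login_payload(username, password):
    """Build the form fields for the final /login POST."""
    return {'2fcode': '', 'flush': 1, 'login_ref': LOGIN_REF,
            'password': password, 'target': '%2F', 'username': username}

def login(sess, ticket, username, password, base='https://www.qrz.com'):
    """Replay the three POSTs in order and return the last response.

    `base` is an assumed base URL; `ticket` is the loginTicket string
    parsed from the login page.
    """
    sess.post(base + '/login-handshake',
              data=handshake_payload(ticket, username, 1))
    sess.post(base + '/login-handshake',
              data=handshake_payload(ticket, username, 2, password))
    return sess.post(base + '/login', data=login_payload(username, password))
```

With requests, this would be called as `login(requests.Session(), ticket, 'YOURUSERNAME', 'YOURPASSWORD')`, after fetching the login page with the same session and extracting the ticket from it.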

Turns out, the whole handshake seems unnecessary:

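import requests

# Placeholders: replace with your own credentials.
YOURUSERNAME = 'your-username'
YOURPASSWORD = 'your-password'

data = {'2fcode': '', 'flush': 1, 'login_ref': 'https%3A%2F%2Fwww.qrz.com%2F',
        'password': YOURPASSWORD, 'target': '%2F', 'username': YOURUSERNAME}
sess = requests.Session()
sess.get('http://qrz.com/login')              # load the login page first
sess.post('http://qrz.com/login', data=data)  # log in; the session keeps the cookie

querydata = {'tquery': 'VA6BH', 'mode': 'callsign'}
r = sess.post('https://qrz.com/lookup', data=querydata)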

You can then pass r.content into BeautifulSoup (from the bs4 package):

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
csdata = soup.find_all('td', id='csdata')

You can then parse the resulting table.

ihf

@jonb You’ve done it again! Thank you. I ran the script and viewed the result using a webview (just to see it), and I am indeed logged in; however, the search does not appear to be executing. There is no error message; it just sits at the advanced search screen after login. If I repeat the search post, it does the search and I get the desired results. Thank you again. You always make it look easy, but I know it would have taken me a long time to get it right.

ihf

After searching for the callsign w2aee, the result set from:

csdata=soup.find_all('td',id='csdata')

gives

[<td id="csdata" valign="top">
<span class="csignmg hamcall">W2AEE</span> <span class="ml4 cland">
<span class="ptr" onclick="window.location='https://www.qrz.com/atlas?dxcc=291'"><img alt="USA flag" id="flg" src="https://s3.amazonaws.com/files.qrz.com/static/flags-iso/flat/32/US.png" title="DX Atlas for: USA"/> <span style="position:relative;top:-8px;">USA</span></span></span><br/>

<p class="m0"> COLUMBIA UNIVERSITY AMAT RAD CLUB<span class="csgnl none">, W2AEE</span><br/>144 Washburn Rd<br/>Briarcliff Manor, NY 10510<br/>USA</p>
<p class="mt05 f9" style="font-weight:normal"></p>

<p class="mt05 f9 fi"> Email: <span id="qem" onclick="showqem();" onmouseover="showqem();">Use mouse to view..</span>
</p>
<p class="m0 f8">
<span class="green">Page managed by <a href="https://www.qrz.com/db/KD2DDT">KD2DDT</a></span> <span class="ml1">Lookups: 5677</span>
<input class="ml1" onclick="shlabel()" type="button" value="Label"/>
</p>
</td>]

What is the best way to parse this for the name and address? HTMLParser, BeautifulSoup, or just string functions? (I tried feeding this as a string to an HTMLParser instance and got no output).
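
One likely reason for the empty output: the handler methods of a plain `HTMLParser` instance do nothing by default, so a subclass has to override them to collect anything. Below is a minimal sketch of that approach, using only the standard library. The class name `AddressParser` is made up, the sample markup is abbreviated from the output above, and it assumes (from this single example) that the name and address always sit in the first `<p class="m0">` block, separated by `<br/>` tags.

```python
from html.parser import HTMLParser

class AddressParser(HTMLParser):
    """Collect the text lines of the first <p class="m0"> block."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished lines of text
        self._current = []     # pieces of the line being built
        self._in_block = False
        self._done = False
        self._span_depth = 0   # >0 while inside a <span> within the block

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and dict(attrs).get('class') == 'm0' and not self._done:
            self._in_block = True
        elif self._in_block and tag == 'br':
            self._flush()      # each <br/> ends one address line
        elif self._in_block and tag == 'span':
            self._span_depth += 1

    def handle_endtag(self, tag):
        if not self._in_block:
            return
        if tag == 'span' and self._span_depth:
            self._span_depth -= 1
        elif tag == 'p':
            self._flush()
            self._in_block = False
            self._done = True

    def handle_data(self, data):
        # Skip text inside spans (the repeated ", W2AEE" callsign suffix).
        if self._in_block and not self._span_depth:
            self._current.append(data)

    def _flush(self):
        text = ''.join(self._current).strip()
        if text:
            self.lines.append(text)
        self._current = []

html = ('<td id="csdata" valign="top">'
        '<p class="m0"> COLUMBIA UNIVERSITY AMAT RAD CLUB'
        '<span class="csgnl none">, W2AEE</span><br/>144 Washburn Rd<br/>'
        'Briarcliff Manor, NY 10510<br/>USA</p>'
        '<p class="m0 f8"><span class="ml1">Lookups: 5677</span></p></td>')

parser = AddressParser()
parser.feed(html)
parser.close()
name, *address = parser.lines
print(name)
print(address)
```

The same extraction is shorter with BeautifulSoup, which is already in use here; the standard-library version is shown mainly to illustrate why a bare `HTMLParser` instance prints nothing.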