Capture Specific Webpage text using regex/search HTML and save as new text file

rsayeed

I wanted to know if there are any good Python scripts that demonstrate how to capture specific webpage text using regex/search HTML and save it as a new text file in Editorial, before I embark on making one.

I couldn't find one using search.

ccc

I avoid regex like the plague that it is. You might check out the get_module_version_from_pypi() function in https://github.com/cclauss/pythonista-module-versions/blob/master/pythonista_module_versions.py

It uses BeautifulSoup 4 and requests that are both built into Pythonista (and I hope Editorial too).

import bs4, requests

def get_soup(url):
    return bs4.BeautifulSoup(requests.get(url).text, 'html5lib')

cook

I agree with @ccc
I used to use regex to do that sort of thing. You can get it to work, but it takes a lot of effort.
So I embarked to learn how Beautiful Soup works, and now that I've learned it a bit it's way easier and much more effective for many things (not just search).

Here's a little example from cheese.com. (I like cheese...not that I check this website, but cheese is good.)

# coding: utf-8
import requests
from bs4 import BeautifulSoup

url = 'http://www.cheese.com'

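# Name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

print(soup.find('div', id='abstract'))  # find the first div with id 'abstract' and print it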

Webmaster4o

BeautifulSoup is great. @rsayeed: Read this for more information.

Phuket2

Just as a side note: I think it's also worth checking whether a webpage/site has a JSON interface. Of course, many are protective of their data, but it seems like more and more sites are offering JSON data. It can save a lot of work and fragile scripts.

cook

@Phuket2 is right....much easier!
And also APIs....!

Phuket2

Actually, I was just looking around and found this site, but there are many sites like it:
https://www.publicapis.com/

The below is to get info on Formula 1 Drivers. But if you look at the API, you can get everything about F1. It's so nice. Could easily write an F1 App with this API. I chopped this up. In the full code I am caching to disk etc.

But there is another interesting line that's commented out: the URL to a GitHub repo for a Pantone color list. I just left it there as an example. I didn't know before how easy it was to do that. The sample code does not deal with that site, but it's really interesting that we could publish data in our own repos that we could all use.

Anyway, I hope I haven't convoluted the conversation. But the possibilities are exciting.

import requests

def get_json_data(url):
	r = requests.get(url)
	print('r == ', r)

	# Return the parsed JSON on success, otherwise None
	if r.status_code == 200:
		return r.json()
	return None

if __name__ == '__main__':
	#url = 'https://raw.githubusercontent.com/teelaunch/pms-pantone-color-chart/master/params.json'
	url = 'http://ergast.com/api/f1/2016/drivers.json'
	r = get_json_data(url)
	# Only dig into the data if the request actually succeeded
	if r is not None:
		drivers = r['MRData']['DriverTable']['Drivers']
		print(drivers)
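
To pull individual fields out of a response shaped like the one the code above navigates, something along these lines might work. It is only a sketch: `driver_names` is a made-up helper, the `givenName` and `familyName` keys are assumptions about what each driver entry contains, and the sample payload is hand-written filler rather than real API output.

```python
def driver_names(payload):
    """Return 'given family' names from a drivers payload, tolerating gaps."""
    # .get() with a default avoids a KeyError if a level is missing.
    table = payload.get('MRData', {}).get('DriverTable', {})
    names = []
    for driver in table.get('Drivers', []):
        # givenName / familyName are assumed field names; adjust as needed.
        given = driver.get('givenName', '')
        family = driver.get('familyName', '')
        names.append((given + ' ' + family).strip())
    return names


if __name__ == '__main__':
    # Hand-written stand-in for a real response, not actual API output.
    sample = {'MRData': {'DriverTable': {'Drivers': [
        {'givenName': 'Ada', 'familyName': 'Example'},
        {'familyName': 'Placeholder'},
    ]}}}
    print(driver_names(sample))
```

Using `.get()` at every level means a partial or unexpected response yields an empty list instead of a traceback.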

rsayeed

Wow, this is great. You got a pretty active community here!

I'm basically trying to scrape some text off webpages and no, there's no JSON there. I'll surely look at this Beautiful Soup thing. This is the first time I've heard of it.

So my plan is:

I'm probably gonna use Pythonista's action extension.

Browse the page in Safari (on iOS).
Execute the Pythonista extension (assuming I make it).
Get the resulting text.
Save/append it to a file in Dropbox.
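
The last step of that plan, saving or appending the captured text to a file, needs nothing beyond the standard library. Here is a minimal sketch, assuming the text has already been extracted from the page. `append_capture` and the file name `captures.txt` are made-up names for illustration, and getting the URL from the share sheet and syncing the file to Dropbox are left out.

```python
import datetime
import io


def append_capture(path, text, source_url):
    """Append one captured snippet to a text file, with a small header."""
    stamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M')
    entry = u'--- {} ({})\n{}\n\n'.format(source_url, stamp, text.strip())
    # Mode 'a' creates the file if it is missing and never overwrites it.
    with io.open(path, 'a', encoding='utf-8') as f:
        f.write(entry)
    return entry


if __name__ == '__main__':
    append_capture('captures.txt', u'Some text scraped from the page.',
                   'http://www.cheese.com/')
```

Each call adds one entry to the end of the file, so running the script on several pages builds up a single running log.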

cook

@rsayeed
Do you want to scrape text off of any old website? Or a particular one?

If it's for random sites, then even using Beautiful Soup will be a little tough. Well, anything would be tough for that matter :)

Site developers usually stick to the same tag names, ids and classes throughout their own pages, but those conventions change from site to site, and that's where you would need flexibility in your implementation of Beautiful Soup.

But often headings are in heading tags, text content is in p tags, and so on. So a generic scraper is possible, but it may not get everything, or most likely you'll get more than what you want.

Example:

# coding: utf-8
import requests
from bs4 import BeautifulSoup

url = 'http://www.cheese.com/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Every tag that is a direct child of <body>
for i in soup.find_all(lambda tag: tag.parent.name == 'body'):
	print(i.text.strip())

# gives a lot of junk...
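
A somewhat narrower take on the generic scraper idea is to keep only heading and p tags and ignore everything else. This is a rough sketch, not a tested recipe for any particular site: `extract_readable_text` and the tag list are my own choices, and the inline HTML is a made-up sample so the snippet runs without a network connection.

```python
from bs4 import BeautifulSoup

# Tags assumed to carry the readable content; extend the list as needed.
HEADING_AND_TEXT_TAGS = ['h1', 'h2', 'h3', 'p']


def extract_readable_text(html):
    """Return the text of headings and paragraphs, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = []
    for tag in soup.find_all(HEADING_AND_TEXT_TAGS):
        text = tag.get_text().strip()
        if text:  # skip tags that hold only whitespace
            lines.append(text)
    return lines


if __name__ == '__main__':
    # Made-up sample page, not taken from a real site.
    sample = ('<html><body><div id="nav"><a href="/">Home</a></div>'
              '<h1>Cheese</h1><p>Cheese is <b>good</b>.</p>'
              '<p>  </p></body></html>')
    print(extract_readable_text(sample))
```

Navigation links and other markup outside those tags are dropped, which cuts down the junk, though sites that put their text in div or span tags would need a different tag list.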