XML / XPath Explorer For Editorial?

MartinPacker

I had a recent need to extract some information from some XML - and didn't really want to write a program for it.

I could've done with an XML (XPath-based) ad hoc query program.

I'm thinking - on a long pair of flights coming up - of writing one in Editorial. Editorial because I can pull the file in and operate on it with some Python and then some.

Could such a thing be of general interest?

Anyone got experience with XML parsing and XPath in Editorial? (For example, advice such as "start with Beautiful Soup" or "do it in Pythonista" would be helpful.)

ccc

Hi Martin, +1 for "start with Beautiful Soup". My bet is that you will crank out a solution so fast that you will look no further. My approach would be to get the basic parsing working and debugged in Pythonista and then backport it into Editorial.

By the way, did you get one of those new LinuxONE mainframes that run hundreds of thousands of Docker containers at one time? I was totally impressed with the YouTube video of this week's demo: https://www.youtube.com/watch?v=VWBNoIwGEjo

MartinPacker

Not GOT one, @ccc. I work for the company that makes 'em. :-)

Could get very interested in them, though.

Not sure if Beautiful Soup supports XPath or XQuery.

But I think both Pythonista and Editorial have bs4.

dgelessus

Pythonista-only guy here. Just so you know, BeautifulSoup is a library meant for HTML parsing, not XML in general, and makes certain assumptions that are specific to HTML. I recall trying to use BS4 to quickly pretty-print an XML file, and had issues with tags named link, because <link/> tags in HTML are only found within the <head> and never have any content. In the XML file I was dealing with, they had content, which got dropped. (I may be remembering some things wrong, but BS4 does not always work well with XML files that are not (X)HTML.)

Python comes with some XML parsers in the standard library; you should probably start with the ElementTree module. Pythonista also comes with the third-party xmltodict module, though I have never used it and I don't know if it's also included in Editorial.

ccc

The very first line of the BS docs says: "Beautiful Soup is a Python library for pulling data out of HTML and XML files."

Also: "To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor."

import xmltodict ; print(xmltodict.__version__) # 0.8.7 in both Editorial and Pythonista

Also see: https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support and https://docs.python.org/2/library/markup.html in general.
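As a rough illustration of the XPath subset linked above, here is a minimal sketch using only the standard library's xml.etree.ElementTree. The sample document and its element and attribute names (library, book, lang and so on) are invented for the example; only findall() and the path syntax come from the module itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element and attribute names are made up.
SAMPLE = (
    '<library>'
    '<book id="b1" lang="en"><title>First</title><year>2001</year></book>'
    '<book id="b2" lang="de"><title>Zweite</title><year>2010</year></book>'
    '<magazine id="m1"><title>Monthly</title></magazine>'
    '</library>'
)

root = ET.fromstring(SAMPLE)

# Every <title> element anywhere below the root, in document order.
all_titles = [t.text for t in root.findall(".//title")]

# Direct <book> children whose lang attribute equals 'de'.
german_ids = [b.get("id") for b in root.findall("book[@lang='de']")]

# Direct <book> children that have a <year> child with text '2001'.
books_2001 = [b.get("id") for b in root.findall("book[year='2001']")]

print(all_titles)
print(german_ids)
print(books_2001)
```

Descendant search, attribute predicates and child-text predicates cover a fair number of ad hoc queries, but XPath functions and axes are not part of this subset.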

MartinPacker

Thanks @dgelessus and @ccc . I'm going to confer with @ccc - who works for the same company as me (but we've not yet met).

I'm veering towards the standard xml module right now and seeing whether its query capabilities are up to it.

The issue with Beautiful Soup for me is the limited querying capability.

I'm thinking of presenting selected nodes one after the other, pretty-printed, in the first iteration. More sophisticated output later on. (And no, I'm not going to recreate XSLT / Saxon. And I don't think we already have an XSLT parser to play with here.)
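A minimal sketch of that first iteration, assuming the standard library is all that's available: select nodes with ElementTree, then round-trip each one through xml.dom.minidom to indent it. The pretty() helper and the sample XML are hypothetical names and data for illustration, and the sample has no whitespace between elements, which keeps minidom's output tidy.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample; compact on purpose (no whitespace between elements).
SAMPLE = (
    "<orders>"
    "<item><code>FTR1</code><qty>2</qty></item>"
    "<item><code>ABC9</code><qty>5</qty></item>"
    "</orders>"
)

def pretty(element):
    """Return one element, indented, without the XML declaration line."""
    raw = ET.tostring(element)
    text = minidom.parseString(raw).toprettyxml(indent="  ")
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("<?xml")
    ]
    return "\n".join(lines)

root = ET.fromstring(SAMPLE)
selected = root.findall("item")

# Present the selected nodes one after the other.
for number, node in enumerate(selected, 1):
    print("--- node %d ---" % number)
    print(pretty(node))
```

In Editorial the printed text could presumably be passed on as the workflow action's output instead, but that part is untested here.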

One problem I have to solve is how to wildcard actual strings in whichever query language. For example, I want to find all nodes where the text starts with "FTR" - in my initial use case. (The one that invoked The "Principle" Of Sufficient Disgust.) :-)
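One possible way around this with the standard xml module: full XPath 1.0 has a starts-with() function, but ElementTree's subset does not, so the prefix test can be done in Python after a broad path query. A minimal sketch, where text_starts_with() and the sample data are hypothetical names made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the "FTR" prefix comes from the use case.
SAMPLE = (
    "<records>"
    "<rec><code>FTR001</code></rec>"
    "<rec><code>ABC123</code></rec>"
    "<rec><code>  FTR777</code></rec>"
    "<note>FTR in a note</note>"
    "</records>"
)

def text_starts_with(root, prefix, path=".//*"):
    """Return elements matched by path whose own text starts with prefix."""
    matches = []
    for elem in root.findall(path):
        # elem.text is None for elements that only contain child elements.
        text = (elem.text or "").strip()
        if text.startswith(prefix):
            matches.append(elem)
    return matches

root = ET.fromstring(SAMPLE)

# Any element whose text starts with "FTR".
found = [(e.tag, e.text.strip()) for e in text_starts_with(root, "FTR")]

# The same test, narrowed to <code> elements by the path.
codes = [e.text.strip() for e in text_starts_with(root, "FTR", ".//code")]

print(found)
print(codes)
```

Leading whitespace is stripped before the comparison, which is a design choice for hand-formatted XML; drop the strip() if exact text matters.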

Prototyping on Editorial I think. Maybe transferring to Pythonista later.

I have two 11-hour flights to/from South Africa to play with this on... :-)

MartinPacker

And note that the xml module claims support for only a subset of XPath.