XML parsing with Groovy on the command line

October 8th, 2009

To continue with the command line programming theme, I will move over to Groovy. One of the strengths of Groovy, and something I have been doing a lot of lately, is XML processing. Groovy ships with two standard classes for parsing XML, XmlSlurper and XmlParser. In this post I will look at using XmlSlurper as a pseudo-grep tool for looking up values in XML.

Now grep is great for finding instances of a string in files. The problem with XML is that every string is wrapped in ugly XML tags, and a lot of XML documents are not neatly split over new lines. The requirement may also be to find two tags that relate to each other in a given XML structure but which may be many lines apart. And there could be tags with the same name at different levels of the XML, so parsing the structure is the only reliable way to go.

For an example I will use the following XML:

<listings>
  <listing>
    <agency>
      <name>Obera Associates</name>
      <suburb>MELBOURNE</suburb>
      <post-code>3000</post-code>
    </agency>
    <price>$4,500</price>
    <address>
      <suburb>RICHMOND</suburb>
      <street>Victoria Street</street>
      <state>Victoria</state>
      <post-code>3121</post-code>
    </address>
  </listing>
</listings>

Each file can have many listings, and there could be many files. To find out whether all the listings for RICHMOND 3121 have been uploaded to the listings ingest-a-nator, a crude test is to count how many such suburb/postcode instances there were in the original files. As already mentioned, grep would be a pain to use, especially as there are two suburb/post-code pairs in each listing node and you are only interested in one. So here is the Groovy solution.

$ groovy -e 'dir = new File("."); dir.eachFileMatch(~".*xml") { file -> \
    doc = new XmlSlurper().parse(file); doc.listing.address.each { address -> \
         println address.suburb.toString() + " " + address."post-code" \
    } \
}'
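To see why the path matters, here is a minimal sketch of the same lookup as a standalone script, run against a trimmed inline copy of the sample rather than files on disk. The variable names are my own, and the import assumes Groovy 3 or later, where XmlSlurper lives in groovy.xml.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on older versions drop this line (groovy.util is imported by default)

// Inline copy of the sample listing, trimmed to the suburb and post-code fields.
def xml = '''
<listings>
  <listing>
    <agency>
      <suburb>MELBOURNE</suburb>
      <post-code>3000</post-code>
    </agency>
    <address>
      <suburb>RICHMOND</suburb>
      <post-code>3121</post-code>
    </address>
  </listing>
</listings>
'''.trim()

def doc = new XmlSlurper().parseText(xml)

// Full paths pick up one level only.
def addressSuburbs = doc.listing.address.suburb.collect { it.text() }
def agencySuburbs  = doc.listing.agency.suburb.collect { it.text() }

// A depth-first search ignores the level, much like grep would.
def allSuburbs = doc.depthFirst().findAll { it.name() == 'suburb' }.collect { it.text() }

println "address: ${addressSuburbs}"
println "agency:  ${agencySuburbs}"
println "any:     ${allSuburbs}"
```

Only the full path through address separates the listing's own suburb from the agency's, which is the distinction a plain grep cannot make.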

One gotcha that I found was that to print two nodes in a row, a + sign on its own is not enough. As they are of type groovy.util.slurpersupport.NodeChildren, the + is not treated as string concatenation, so at least the first needs to be changed to a string with the toString() call.
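As a small sketch of ways around this gotcha, the script below builds the combined string with text() and with a GString instead of toString(). The one-element sample document and the variable names are made up for illustration, and the import again assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on older versions drop this line

def address = new XmlSlurper().parseText(
    '<address><suburb>RICHMOND</suburb><post-code>3121</post-code></address>')

// text() returns a plain String, so + is ordinary string concatenation.
def viaText = address.suburb.text() + ' ' + address.'post-code'.text()

// A GString converts each node to its text as well.
def viaGString = "${address.suburb} ${address.'post-code'}"

println viaText
println viaGString
```

Either form avoids having to remember which operand needs the explicit conversion.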

To run this against larger files, more memory may have to be given to the executing process, because XmlSlurper builds each document in memory. To give it 1GB, set JAVA_OPTS="-Xmx1024M" like so:

$ JAVA_OPTS="-Xmx1024M" \
groovy -e 'dir = new File("."); dir.eachFileMatch(~".*xml") { file -> \
    doc = new XmlSlurper().parse(file); doc.listing.address.each { address -> \
        println address.suburb.toString() + " " + address."post-code" \
    } \
}' | sort | uniq -c

This also includes a pipe to the Unix favorites sort and uniq -c to give a histogram-like count of each suburb/postcode combination.
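If sort and uniq are not at hand, the count can also be done inside Groovy. The following is only a sketch: countSuburbPostcodes is a hypothetical helper of mine, the sample data is made up, and it takes XML strings rather than files so that it runs on its own. It relies on collectMany and countBy, which need Groovy 1.8.1 or later.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on older versions drop this line

// Hypothetical helper: counts "SUBURB POSTCODE" keys across the given XML documents.
def countSuburbPostcodes(List<String> xmlDocs) {
    xmlDocs.collectMany { xml ->
        new XmlSlurper().parseText(xml).listing.address.collect { address ->
            address.suburb.text() + ' ' + address.'post-code'.text()
        }
    }.countBy { it }
}

// Made-up sample with a repeated suburb/postcode pair.
def sample = '''<listings>
  <listing><address><suburb>RICHMOND</suburb><post-code>3121</post-code></address></listing>
  <listing><address><suburb>RICHMOND</suburb><post-code>3121</post-code></address></listing>
  <listing><address><suburb>CARLTON</suburb><post-code>3053</post-code></address></listing>
</listings>'''

def counts = countSuburbPostcodes([sample])

// Print in the same "count value" layout as uniq -c, sorted by key.
counts.sort().each { key, count -> println "${count} ${key}" }
```

For real files, the list of strings could be replaced by file.text for each file matched by eachFileMatch, at the cost of holding every file's content in memory at once.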

I was going to suggest that if the XML is not perfect (for example if it is really HTML), or if it contains characters that are not allowed in XML and which would return an error similar to:

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1c)
was found in the element content of the document.

a more lenient parser like TagSoup could be used. On my real data this worked, but unfortunately in the small test below the lookup silently fails, like so:

$ groovy -e 'println new XmlSlurper().parseText("""
<ls><l><a><n>micsar</n></a><adr><suburb>RICHMOND</suburb></adr></l></ls>
""").l.adr.suburb'
RICHMOND
$ CLASSPATH=tagsoup-1.2.jar \
groovy -e 'println new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText("""
<ls><l><a><n>micsar</n></a><adr><suburb>RICHMOND</suburb></adr></l></ls>
""").l.adr.suburb'
 
$ CLASSPATH=tagsoup-1.2.jar \
groovy -e 'println new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText("""
<ls><l><a><n>micsar</n></a><adr><suburb>RICHMOND</suburb></adr></l></ls>
""")'
micsarRICHMOND

It is the second invocation above, the path lookup on the TagSoup-parsed document, that prints nothing. Maybe it is some combination of TagSoup and Groovy? I am not sure for the moment.
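One possible explanation, and it is only an assumption I have not verified against TagSoup itself: TagSoup is an HTML parser, so it may wrap the fragment in html and body elements, in which case the root is no longer ls and the path .l.adr.suburb starts at the wrong level. The sketch below mimics that by writing the wrapper out by hand and parsing it with the plain XmlSlurper, so no TagSoup jar is needed. It then shows a depth-first lookup that does not depend on where the element sits.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on older versions drop this line

// The html/body wrapper is hand-written here to imitate what an HTML parser
// might add (assumption); this is not actual TagSoup output.
def wrapped = new XmlSlurper().parseText(
    '<html><body><ls><l><a><n>micsar</n></a><adr><suburb>RICHMOND</suburb></adr></l></ls></body></html>')

// The original path starts at the wrong level and silently matches nothing.
def direct = wrapped.l.adr.suburb.text()

// A depth-first search finds the element wherever it sits in the tree.
def found = wrapped.depthFirst().find { it.name() == 'suburb' }

println "direct: '${direct}'"
println "found:  ${found.text()}"
```

If the assumption holds, the same depthFirst() lookup on the TagSoup-parsed document should find the suburb as well.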