Archive for the ‘groovy’ Category

XML parsing with groovy on the command line

October 8th, 2009

To continue with the command line programming theme, I will move over to Groovy. One of the strengths of Groovy, and something I have been doing a lot of lately, is XML processing. Groovy ships with two standard XML parsers, XmlSlurper and XmlParser. In this post I will look at using XmlSlurper as a pseudo-grep tool for looking up values in XML.

Now grep is great for finding instances of a string in files. The problem with XML is that every string is wrapped in ugly XML tags, and a lot of XML documents are not neatly split over new lines. The requirement may also be to find 2 related tags in a given XML structure that are many lines apart. There could also be tags with the same name at different levels of the XML, so intelligent parsing is the only way to go.

For an example I will use the following XML:

<listings>
  <listing>
    <agency>
      <name>Obera Associates</name>
      <suburb>MELBOURNE</suburb>
      <post-code>3000</post-code>
    </agency>
    <price>$4,500</price>
    <address>
      <suburb>RICHMOND</suburb>
      <street>Victoria Street</street>
      <state>Victoria</state>
      <post-code>3121</post-code>
    </address>
  </listing>
</listings>

Each file can have many listings, and there could be many files. To find out if all the “listings” for RICHMOND 3121 have been uploaded to the listings ingest-a-nator, a crude test may be to double check how many such suburb/postcode instances there were in the original files. As already mentioned, grep would be a pain to use, especially as there are 2 suburb/postcode pairs in each listing node and you are only interested in one. So here is the Groovy solution.

$ groovy -e 'dir = new File("."); dir.eachFileMatch(~".*xml") { file -> \
    doc = new XmlSlurper().parse(file); doc.listing.address.each { address -> \
         println address.suburb.toString() + " " + address."post-code" \
    } \
}'

One gotcha that I found was that to print 2 nodes in a row, a + sign is not enough. As they are of type groovy.util.slurpersupport.NodeChildren, at least the first needs to be changed to a string with the toString() call.
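As a rough sketch of that gotcha, the snippet below parses a cut-down copy of the sample listing with parseText and shows two ways of getting plain strings out of the nodes before joining them. The explicit import assumes Groovy 3 or later, where XmlSlurper lives in groovy.xml; on older versions the class is in groovy.util and the import can be dropped. The lines list is only there for illustration.

```groovy
import groovy.xml.XmlSlurper

// Cut-down copy of the sample listing used in this post
def xml = '''<listings>
  <listing>
    <address>
      <suburb>RICHMOND</suburb>
      <post-code>3121</post-code>
    </address>
  </listing>
</listings>'''

def doc = new XmlSlurper().parseText(xml)
def lines = []
doc.listing.address.each { address ->
    // text() (like toString()) yields a plain String, so from here on
    // + is ordinary string concatenation
    lines << (address.suburb.text() + ' ' + address.'post-code'.text())
    // a GString converts each embedded value to its string form for you
    lines << "${address.suburb} ${address.'post-code'}".toString()
}
lines.each { println it }
```

Either form avoids applying + to two node results directly.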

To expand this to larger files, more memory may have to be given to the executing process. To give it 1GB, this would mean adding JAVA_OPTS="-Xmx1024M" like so:

$ JAVA_OPTS="-Xmx1024M" \
groovy -e 'dir = new File("."); dir.eachFileMatch(~".*xml") { file -> \
    doc = new XmlSlurper().parse(file); doc.listing.address.each { address -> \
        println address.suburb.toString() + " " + address."post-code" \
    } \
}' | sort | uniq -c

This also includes a pipe to the unix favorites sort and uniq -c to give a histogram-like count of each suburb/postcode combination.
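If sort and uniq are not to hand, the same kind of count can be approximated inside Groovy itself. This is only a sketch, not what I ran: the three listings are made-up data, the import assumes Groovy 3 or later (drop it on older versions), and it relies on the GDK countBy method on collections.

```groovy
import groovy.xml.XmlSlurper

// Made-up data: two RICHMOND listings and one CARLTON listing
def xml = '''<listings>
  <listing><address><suburb>RICHMOND</suburb><post-code>3121</post-code></address></listing>
  <listing><address><suburb>RICHMOND</suburb><post-code>3121</post-code></address></listing>
  <listing><address><suburb>CARLTON</suburb><post-code>3053</post-code></address></listing>
</listings>'''

def doc = new XmlSlurper().parseText(xml)

// One "SUBURB postcode" string per address node
def keys = doc.listing.address.collect { a ->
    a.suburb.text() + ' ' + a.'post-code'.text()
}

// countBy groups equal keys and counts them, much like sort | uniq -c
def histogram = keys.countBy { it }
histogram.each { key, count -> println "${count} ${key}" }
```

For many large files the shell pipeline is probably still the simpler choice, as it does not need to hold every key in memory.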

I was going to suggest that if the XML is not perfect (for example, if it is really HTML) or contains characters that are not allowed in XML, which produce an error similar to:

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1c)
was found in the element content of the document.

a more lenient parser like TagSoup could be used. In my real case this worked, but unfortunately in the cut-down example below it silently returns nothing:

$ groovy -e 'println new XmlSlurper().parseText("""
<ls><l><a><n>micsar</n></a><adr><suburb>RICHMOND</suburb></adr></l></ls>
""").l.adr.suburb'
RICHMOND
$ CLASSPATH=tagsoup-1.2.jar \
groovy -e 'println new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText("""
<ls><l><a><n>micsar</n></a><adr><suburb>RICHMOND</suburb></adr></l></ls>
""").l.adr.suburb'
 
$ CLASSPATH=tagsoup-1.2.jar \
groovy -e 'println new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText("""
<ls><l><a><n>micsar</n></a><adr><suburb>RICHMOND</suburb></adr></l></ls>
""")'
micsarRICHMOND

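It is the second invocation above, navigating to .l.adr.suburb through the TagSoup parser, that prints nothing, even though the third invocation shows that the text itself was parsed. Maybe it is some combination of TagSoup with Groovy? I am not sure for the moment.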
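One possible explanation, which is an assumption on my part and not something verified against TagSoup 1.2 here, is that an HTML-oriented parser wraps the fragment in html and body elements, so the root node is no longer ls. The sketch below simulates that with a hand-written wrapper and the standard parser (no TagSoup needed) and shows that a depth-first search does not care where the root is. The import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper

// Hand-written stand-in for what an HTML parser might hand back (assumption)
def wrapped = '<html><body><ls><l><a><n>micsar</n></a>' +
              '<adr><suburb>RICHMOND</suburb></adr></l></ls></body></html>'

def root = new XmlSlurper().parseText(wrapped)

// The root is now <html>, so the original path finds nothing and prints empty
def viaOldPath = root.l.adr.suburb.text()
println "old path:    '${viaOldPath}'"

// Spelling out the extra levels works
def viaFullPath = root.body.ls.l.adr.suburb.text()
println "full path:   '${viaFullPath}'"

// A depth-first search finds the element wherever it sits in the tree
def suburb = root.depthFirst().find { it.name() == 'suburb' }
println "depth first: '${suburb.text()}'"
```

If the assumption holds, the same depthFirst() lookup should also work on the TagSoup result, but I have not confirmed that.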

-e to execute from command line

September 23rd, 2009

One of the useful tools in the unix world is the command line. Often you may have a request like

give me a list of all the unique Agencies specified in all the XML files in a given directory tree where an agency is shown in an element <agency id="IJKXYZ">

This can be easily achieved (for pretty printed XML documents, with each element on its own line) with a variety of unix tools like so

find . -name '*.xml' | xargs grep 'agency.id' | awk -F '"' '{print $2}' | sort | uniq -c

This finds all the *.xml files below the current directory, runs grep over that list via xargs looking for the agency.id pattern (the ‘.’ matches any character, in this case a space), uses awk, delimiting on double quotes, to print out the IDs, then sorts them and runs them through uniq with the -c flag to get the count of occurrences for every agency ID. Yes, there are no doubt other ways of doing this, and different sorts of optimisations, but this is just the way my brain thinks and wires together these commands.

As already mentioned, this will only work if the XML is pretty printed and not if all the white space has been removed. There may also be many cases where the request is more complicated, but it feels like opening up a text editor to write a program should not be necessary. This is where the -e option of many scripting languages comes in: it lets you run code in the language directly from the command line.
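For the case where the white space has been stripped, a parser-based count does not care about line breaks. The following is a minimal sketch using XmlSlurper on a single-line document. The sample XML and the second agency ID are invented for illustration, and the import assumes Groovy 3 or later (on older versions drop it).

```groovy
import groovy.xml.XmlSlurper

// Invented single-line sample: no pretty printing at all
def xml = '<listings>' +
          '<listing><agency id="IJKXYZ"/></listing>' +
          '<listing><agency id="ABCDEF"/></listing>' +
          '<listing><agency id="IJKXYZ"/></listing>' +
          '</listings>'

def doc = new XmlSlurper().parseText(xml)

// Visit every element, keep the <agency> ones, read the id attribute,
// then count how often each id occurs
def counts = doc.depthFirst()
                .findAll { it.name() == 'agency' }
                .collect { it.@id.text() }
                .countBy { it }

counts.each { id, n -> println "${n} ${id}" }
```

Squeezed onto one line, the same expression fits after groovy -e just like the other examples here.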

The basic hello world in a few such languages:

groovy -e "println 'hi'"
perl -e 'print "hi\n"'
ruby -e 'puts "hi\n"'
jruby -e 'puts "hi\n"'

As a more complex example you may want to find out the day of the week for a given date. This can be done like so

$ groovy -e 'println Date.parse("MM/dd/yyyy", "12/13/1974").format("EEEE")'
Friday
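Date.parse(format, input) and Date.format(pattern) are Groovy extension methods rather than plain JDK ones, and as far as I know newer Groovy releases ship them in the separate groovy-dateutil module. As an alternative sketch, assuming Java 8 or later, the same question can be answered with java.time, which needs no Groovy extensions:

```groovy
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.format.TextStyle

// Parse the same date string with an explicit pattern
def date = LocalDate.parse('12/13/1974', DateTimeFormatter.ofPattern('MM/dd/yyyy'))

// Ask for the full English day name, independent of the machine's locale
def day = date.dayOfWeek.getDisplayName(TextStyle.FULL, Locale.ENGLISH)
println day
```

It is longer than the one-liner above, so for throwaway command line use the Date version is still the handier one where it is available.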

Of course, again there may be a unix command out there which is more concise or which the user may be more familiar with, in which case great, use that instead. If on the other hand it is a situation where the code solution comes to you immediately, then why not use it? If this is the way your brain is wired, and you are more competent at using your programming language of choice, then why go reading man pages when you can do this simply with a scripting language and the -e flag?

Counter to that argument, I had a snippet in PowerShell, the Microsoft Windows shell language, for finding the day for a given date, and it goes like this

PS> (get-date "12/13/1974").DayOfWeek

Very concise indeed.

Of course the real power of scripting languages is when you use them in conjunction with pipes and unix tools. In this case I want a histogram of how many files I modify for each day of the week for a given directory.

groovy -e 'new File(".").eachFile{file -> println new Date(file.lastModified()).format("EEEE")}' \
 | sort | uniq -c | sort -rn
 
19 Monday
15 Tuesday
10 Wednesday
5 Friday
4 Thursday

Given that it was a work directory, seeing only work days is understandable. Also, as I am running this on a Wednesday morning, that may skew the results of “last modified” if I have modified most files. Still, it seems that Thursday and Friday are the low parts of the week.

Feel free to add how you use scripting languages from the command line in the comments. Do be careful cutting and pasting any code samples and make sure you know what you are doing prior to doing so.

I will look at a few common examples I use regularly in future posts and I hope to see more people using scripting languages from the command line to solve their problems.

Groovy script to find country origin of IP

February 8th, 2009

After listening to Grails podcast #77, I was interested in some of the code snippets available around the site for the gr8conf “Conference dedicated to Groovy, Grails & Griffon and other Great technologies”. The direct link is http://www.gr8conf.org/blog/2009/02/02/5#comments. I was looking at code that uses some of the webservicex.net web services, and in particular the Geo IP Service, for “Finding the visitors country by IP“. The example given was for a Grails service. I was interested in running it as a script.

Seems simple: cut, paste, run … class could not be run! OK, so I am trying it as a script from the command line. I need to either create a main method or, easier still, just invoke the method from a command line Groovy script

groovy -e "gips = new GeoIPService(); println gips.getCountryByContext()"
 
Caught: groovy.lang.MissingPropertyException: No such property: log for class: GeoIPService
at GeoIPService.getProperty(GeoIPService.groovy)
at GeoIPService.getCountryByContext(GeoIPService.groovy:9)
at script_from_command_line.run(script_from_command_line:1)
at script_from_command_line.main(script_from_command_line)

Ok so I need a logger library. To keep things simple I try java.util.logging.Logger and the method Logger.getAnonymousLogger() . Unfortunately this led to an exception along the lines of

Caught: groovy.lang.MissingMethodException: No signature of method: java.util.logging.Logger.warn() is applicable for argument types: (java.lang.String, groovy.lang.MissingMethodException) ...
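The exception comes down to the method name and signature: java.util.logging.Logger has no warn(String, Throwable). For anyone who does want to stay with the JDK logger, a minimal sketch of the closest equivalent is log(Level, String, Throwable). The failing toURL() call is only there to produce an exception to log.

```groovy
import java.util.logging.Level
import java.util.logging.Logger

def log = Logger.getAnonymousLogger()
def caught = null

try {
    // No protocol in this string, so toURL() throws MalformedURLException
    'GetGeoIPContext'.toURL()
} catch (MalformedURLException e) {
    caught = e
    // JDK logging takes the level as an argument: log(Level, String, Throwable)
    log.log(Level.WARNING, 'Exception in GeoIP service', e)
}
```

With the default JDK logging configuration the warning and its stack trace go to standard error.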

java.util.logging.Logger is not compatible with the log4j logger used in Grails, which allows passing a String and an Exception to warn. Who needs logging anyway, this is a script, right? The next step was for me to comment out the log statements, and the script ran. Yeah! The output was the disappointing statement

null

What? Where did that come from? Sure, there are return nulls in the exception handling code, but the exception is … silent :( … as I have commented out the log statements. Now in desperation I would just change them to println statements and be done with it, but why not stay with the “logging” best of breed methodology? Back to trying to find a version of log4j lying around. In the end I found one in my maven home ~/.m2 . I suppose this raises the question: should I be using maven or another build/dependency tool for building this simple learning script? Is that too much for a script? TDD says no it ain’t. Just to make sure it runs I tried

import org.apache.log4j.*
 
CLASSPATH=$CLASSPATH:~/.m2/repository/log4j/log4j/1.2.14/log4j-1.2.14.jar groovy -e 'gips = new GeoIPService(); println gips.getCountryByContext()'


null

Damn, and this time with logging statements. Maybe it is time to put this into an IDE and get some suggestions as to what is going on. I give up on the TDD and start putting print and log statements through the code. I find the bug

def response = "http://www.webservicex.net/geoipservice.asmx/"+method.toURL().text

The toURL() method is only acting on part of the string, the method name, which is not a proper URL, so it fails. It fails silently because the enclosing method catches the exception and only reports it through log.warn, and by default my logging is set to the highest level, I presume log.fatal? Now how to change the level of the logger? OK, I should have guessed setLevel(org.apache.log4j.Level level). Now where to set it? I do not have a constructor and everything is pretty much static, so I suppose I can just use a static block like so

static def log = Logger.getRootLogger()
static{
    log.setLevel(Level.INFO)
}

now I get the full pleasure of a groovy style exception

java.net.MalformedURLException: no protocol: GetGeoIPContext
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.codehaus.groovy.runtime.DefaultGroovyMethods.toURL(DefaultGroovyMethods.java:2479)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.codehaus.groovy.runtime.metaclass.ReflectionMetaMethod.invoke(ReflectionMetaMethod.java:51)
at org.codehaus.groovy.runtime.metaclass.NewInstanceMetaMethod.invoke(NewInstanceMetaMethod.java:54)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:230)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:912)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:756)
at org.codehaus.groovy.runtime.InvokerHelper.invokePojoMethod(InvokerHelper.java:766)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:754)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodN(ScriptBytecodeAdapter.java:170)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethod0(ScriptBytecodeAdapter.java:198)
at GeoIPService.getCountryByUrl(GeoIPService.groovy:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:86)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:230)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:912)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnCurrentN(ScriptBytecodeAdapter.java:78)
at GeoIPService.getCountryByContext(GeoIPService.groovy:14)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:86)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:230)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:912)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:756)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:778)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:758)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodN(ScriptBytecodeAdapter.java:170)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethod0(ScriptBytecodeAdapter.java:198)
at script_from_command_line.run(script_from_command_line:1)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:86)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:230)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:912)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:756)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:778)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:758)
at org.codehaus.groovy.runtime.InvokerHelper.runScript(InvokerHelper.java:401)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:86)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:230)
at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1105)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:749)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodN(ScriptBytecodeAdapter.java:170)
at script_from_command_line.main(script_from_command_line)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:86)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:230)
at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1105)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:749)
at groovy.lang.GroovyShell.runMainOrTestOrRunnable(GroovyShell.java:244)
at groovy.lang.GroovyShell.run(GroovyShell.java:453)
at groovy.lang.GroovyShell.run(GroovyShell.java:433)
at groovy.lang.GroovyShell.run(GroovyShell.java:160)
at groovy.ui.GroovyMain.processOnce(GroovyMain.java:496)
at groovy.ui.GroovyMain.run(GroovyMain.java:308)
at groovy.ui.GroovyMain.process(GroovyMain.java:294)
at groovy.ui.GroovyMain.processArgs(GroovyMain.java:111)
at groovy.ui.GroovyMain.main(GroovyMain.java:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:101)
at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:130)

as opposed to what java would give me (here demonstrated with a groovy script)

groovy -e "println 'http://'+'GetGeoIPContext'.toURL()"
Caught: java.net.MalformedURLException: no protocol: GetGeoIPContext
at script_from_command_line.run(script_from_command_line:1)
at script_from_command_line.main(script_from_command_line)

which lets me take it one step further and confirm

groovy -e "println new String('http://'+'GetGeoIPContext').toURL()"
http://GetGeoIPContext

and back to the original,

def response = new String("http://www.webservicex.net/geoipservice.asmx/"+method).toURL().text
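The underlying issue is operator precedence: a method call binds more tightly than +, so toURL() is applied to the last string only. Wrapping the concatenation in new String(...) fixes that, and plain parentheses do the same job. The sketch below shows both behaviours without touching the network, since it never reads .text from the URL. The variable names are only for illustration.

```groovy
def base = 'http://www.webservicex.net/geoipservice.asmx/'
def method = 'GetGeoIPContext'

def failure = null
try {
    // Parsed as base + (method.toURL()), so only 'GetGeoIPContext' becomes a URL
    def broken = base + method.toURL()
} catch (MalformedURLException e) {
    failure = e
}
println "without parentheses: ${failure?.message}"

// Parenthesising the concatenation builds the whole string first
def url = (base + method).toURL()
println "with parentheses:    ${url}"
```

Whether new String(...) or parentheses reads better is a matter of taste. Both make sure the full string reaches toURL().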

and as a complete script (heavily borrowing from the original Finding a users country by IP)

import org.apache.log4j.*
 
class GeoIPService {
 
    static def log  = Logger.getRootLogger()
    static{
        log.setLevel(Level.INFO)
    }
 
    def getCountryByContext() {
        try {
            return getCountryByUrl("GetGeoIPContext")
        } catch (Exception e) {
            log.warn "Exception in GeoIP service", e
            return null
        }
    }
 
    def getCountryByIp(String ipAddress) {
        try {
            return getCountryByUrl("GetGeoIP?IPaddress=" + ipAddress)
        } catch (Exception e) {
            log.warn "Exception in GeoIP service", e
            return null
        }
    }
 
    def getCountryByUrl(String method) throws Exception {
        def response = new String("http://www.webservicex.net/geoipservice.asmx/"+method).toURL().text
        def slurp = new XmlSlurper()
        def geoip = slurp.parseText(response)
        if (geoip.ReturnCode == 1) {
            if (geoip?.CountryName?.text() == "RESERVED") {
                return null
            } else {
                def country = geoip.CountryName.text()
                return country != null ? country[0].toUpperCase() + country[1..-1].toLowerCase() : null
            }
        } else {
            throw new Exception(geoip.ReturnCodeDetails.text())
        }
    }
    static void main(args) {
        def gisp = new GeoIPService()
        println args ? gisp.getCountryByIp(args[0]) : gisp.getCountryByContext()
    }
}
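To try the parsing half of getCountryByUrl without calling the web service, a canned response can be fed to XmlSlurper directly. The payload below is hypothetical: only the element names ReturnCode, ReturnCodeDetails and CountryName come from the script above, while the GeoIP root element and the values are assumptions. It also uses the GDK capitalize() method as a possible shorter way of writing the upper/lower casing step. The import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper

// Hypothetical canned payload; root element name and values are assumed
def response = '''<GeoIP>
  <ReturnCode>1</ReturnCode>
  <ReturnCodeDetails>Success</ReturnCodeDetails>
  <CountryName>AUSTRALIA</CountryName>
</GeoIP>'''

def geoip = new XmlSlurper().parseText(response)

// Read the return code as a number before deciding what to do
def ok = geoip.ReturnCode.toInteger() == 1

// Lower-case everything, then capitalise the first letter only
def country = ok ? geoip.CountryName.text().toLowerCase().capitalize() : null
println country
```

Like the original expression, this only capitalises the first word, so a name such as UNITED STATES would come out as "United states".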

Now for TDD and dependency management of log4j libraries and the like … maybe another day ;)