Parser combinators

In our developer meeting this week, we discussed parsing, and particularly parser combinators.

We’ve used the Scala parser combinator library in the past for parsing search query syntax – for example, to support a custom search syntax used by a legacy system and convert it into an XQuery for searching XML. We’ve also used Parboiled, a Java/Scala parser library, for parsing geographic latitude and longitude values from within scientific journal articles about geology. We’ve done simpler parsing with regular expressions in C# to identify citations within text like “(Brown et al, 2012)” and “(Brown and Smith, 2010; Jones, 2009)”.

Parser combinators are typically easier to work with than a traditional parser generator like Lex and YACC or JavaCC, because the parser is written in the host language (e.g. Java or Scala), which makes it much easier to unit test and to update. They’re particularly approachable in Scala, because Scala’s support for domain-specific languages means that you can write code that looks like:

  "{" ~ ( comment | directive ) ~ "}"

where the symbols like ~ and | are Scala method invocations – which means that you can focus on the parsing, rather than the parser library syntax.
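
To show what the combinators are doing underneath, here is a minimal sketch of the same idea in plain Java. It is not the Scala library's API: the names MiniParsers, Parser, then, or, literal and parseAll are all made up for this illustration, and the "comment" and "directive" parsers are just fixed strings.

```java
import java.util.Optional;

/** A tiny parser-combinator sketch. All names here are hypothetical. */
final class MiniParsers {

    /** A successful parse: the text consumed so far and the input left over. */
    static final class Result {
        final String matched;
        final String rest;
        Result(String matched, String rest) { this.matched = matched; this.rest = rest; }
    }

    /** A parser either succeeds with a Result or fails with Optional.empty(). */
    interface Parser {
        Optional<Result> parse(String input);

        /** Sequencing, playing the role of `~`: this parser, then `next` on what is left. */
        default Parser then(Parser next) {
            return input -> parse(input).flatMap(first ->
                next.parse(first.rest).map(second ->
                    new Result(first.matched + second.matched, second.rest)));
        }

        /** Ordered choice, playing the role of `|`: try this parser, else the alternative. */
        default Parser or(Parser alternative) {
            return input -> {
                Optional<Result> first = parse(input);
                return first.isPresent() ? first : alternative.parse(input);
            };
        }
    }

    /** Matches an exact string at the start of the input. */
    static Parser literal(String text) {
        return input -> input.startsWith(text)
            ? Optional.of(new Result(text, input.substring(text.length())))
            : Optional.empty();
    }

    /** Returns the matched text only if the whole input was consumed. */
    static Optional<String> parseAll(Parser parser, String input) {
        return parser.parse(input)
            .filter(result -> result.rest.isEmpty())
            .map(result -> result.matched);
    }

    // Stand-in grammar for: "{" ~ ( comment | directive ) ~ "}"
    static final Parser BLOCK =
        literal("{").then(literal("comment").or(literal("directive"))).then(literal("}"));

    public static void main(String[] args) {
        System.out.println(parseAll(BLOCK, "{comment}")); // Optional[{comment}]
        System.out.println(parseAll(BLOCK, "{other}"));   // Optional.empty
    }
}
```

Because each parser is an ordinary value, a unit test can exercise literal("{") or the comment/directive choice on its own, which is the testing advantage mentioned above.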

We briefly discussed where it makes sense to use regular expressions for parsing, and where it makes sense to use a more powerful parsing approach. We agreed that there was a danger of creating overly complex regular expressions by incremental “boiling a frog” extensions to an initially simple regex, rather than stopping to rewrite using a parser library.

For further processing of the content once it’s been parsed, we discussed using the Visitor pattern. For example, having created an abstract syntax tree from a search query, it’s useful to use a visitor approach to turn that tree into a pretty printed form, or into an HTML form for display, or into a query language form suitable for the underlying datastore.
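
As a rough illustration of the visitor approach, here is a sketch in Java. The node types (Term, And, Or) and both visitors are hypothetical, not taken from our query parser, and the HTML renderer skips escaping to keep the example short.

```java
/** Visitor sketch over a made-up search query syntax tree. */
final class QueryVisitorDemo {

    /** One visit method per node type; R is whatever the visitor produces. */
    interface Visitor<R> {
        R visitTerm(Term term);
        R visitAnd(And and);
        R visitOr(Or or);
    }

    interface Node {
        <R> R accept(Visitor<R> visitor);
    }

    static final class Term implements Node {
        final String word;
        Term(String word) { this.word = word; }
        @Override public <R> R accept(Visitor<R> visitor) { return visitor.visitTerm(this); }
    }

    static final class And implements Node {
        final Node left;
        final Node right;
        And(Node left, Node right) { this.left = left; this.right = right; }
        @Override public <R> R accept(Visitor<R> visitor) { return visitor.visitAnd(this); }
    }

    static final class Or implements Node {
        final Node left;
        final Node right;
        Or(Node left, Node right) { this.left = left; this.right = right; }
        @Override public <R> R accept(Visitor<R> visitor) { return visitor.visitOr(this); }
    }

    /** Renders the tree as a fully parenthesised, readable query. */
    static final class PrettyPrinter implements Visitor<String> {
        @Override public String visitTerm(Term term) { return term.word; }
        @Override public String visitAnd(And and) {
            return "(" + and.left.accept(this) + " AND " + and.right.accept(this) + ")";
        }
        @Override public String visitOr(Or or) {
            return "(" + or.left.accept(this) + " OR " + or.right.accept(this) + ")";
        }
    }

    /** Renders the same tree as HTML. Escaping is omitted in this sketch. */
    static final class HtmlRenderer implements Visitor<String> {
        @Override public String visitTerm(Term term) {
            return "<span class=\"term\">" + term.word + "</span>";
        }
        @Override public String visitAnd(And and) {
            return and.left.accept(this) + " <b>AND</b> " + and.right.accept(this);
        }
        @Override public String visitOr(Or or) {
            return or.left.accept(this) + " <b>OR</b> " + or.right.accept(this);
        }
    }

    public static void main(String[] args) {
        Node query = new And(new Term("cat"), new Or(new Term("dog"), new Term("fish")));
        System.out.println(query.accept(new PrettyPrinter())); // (cat AND (dog OR fish))
        System.out.println(query.accept(new HtmlRenderer()));
    }
}
```

A translation into the datastore's query language would be a third Visitor implementation, with no changes to the node classes.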

Git Flow and removing remote branches

We use Git Flow as our VCS process, which means that we develop on feature branches and merge those branches back into the master branch as part of our code review. It’s useful to delete these branches as they’re merged, because then anyone can see what is being worked on or needs code review by listing remote branches without being distracted by old branches. However, it’s easy for a reviewer to forget to delete the branches when they do the merge. These Git commands delete old branches that have already been merged with the “master” branch:

Delete local branches:

git branch --merged master | grep -v master | xargs -n1 git branch -d

Remove local tracking branches of remote branches that have already gone:

git remote prune origin

Remove remote branches that have been merged:

git branch -r --merged master | sed "s#origin/##" | grep -v master | xargs -n1 git push origin --delete

Power cuts and test driven development

We’ve just had an impromptu developer meeting because a power cut disabled all of our computers.

We discussed Test-Driven Development (TDD). Rhys talked about how mocking with a framework like Mockito makes TDD easier to achieve, because you can use the mocks to check the side-effects of your code. Inigo felt that needing to use mocks was often a sign that your code had overly complex dependencies, and that it was sometimes better to instead make methods and components “pure”, so that they don’t have side-effects. Nikolay mentioned having used a .NET mocking framework, possibly Moq, for testing some C# code that had a dependency on the database. Charlie discussed the problems of using an in-memory database that has slightly different behaviour from your actual database.

Virtuoso Jena Provider Problem

In a project that we’ve just started, we are using OpenLink Virtuoso as a triple store. I encountered a frustrating bug when accessing it via the Jena Provider: submitting a SPARQL query with a top-level LIMIT clause would return one fewer result than expected. In my case, the first query I tried was an existential query with LIMIT 1, so it caused much head scratching as to why I was getting no results.

Luckily OpenLink are responsive to issues raised on GitHub, so once I raised this issue and created an example project, it was quickly found to be solved by using the latest version of their JDBC4 jar. Problem solved.

Automatically identifying (human) languages with code

For a recent project, we needed to automatically tell the difference between text that was written in French, German and English inside Word documents.

The simplest way of doing this is by checking the language attribute that’s been set on the style inside Word; unfortunately, very few Word users use the language value for styles correctly, or even use styles at all.

So, if we couldn’t trust the styles, we needed a mechanism that worked based on the text only. The first thing we tried was to identify some characteristic French, English and German words (like “des”, “und”, “für”, “and”), and check the text to see if it contained those words. The language whose distinctive words occur most often in a text is the one the text is most likely to be written in.
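
A minimal Java sketch of this word-counting approach follows. The class name, the guess method and the very short word lists are all illustrative assumptions, not the lists we actually used.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Guesses a language by counting characteristic words. Names and lists are hypothetical. */
final class MarkerWordGuesser {

    // Illustrative marker words only; a usable list would be much longer.
    static final Map<String, Set<String>> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("French", Set.of("des", "les", "est", "une"));
        MARKERS.put("German", Set.of("und", "für", "der", "ist"));
        MARKERS.put("English", Set.of("and", "the", "is", "of"));
    }

    /** Returns the language with the most marker-word hits, or empty if there are none. */
    static Optional<String> guess(String text) {
        // Split on anything that is not a letter, so punctuation and digits are ignored.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Set<String>> entry : MARKERS.entrySet()) {
            int count = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    count++;
                }
            }
            // Strictly greater: on a tie, the language listed first wins.
            if (count > bestCount) {
                best = entry.getKey();
                bestCount = count;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        System.out.println(guess("Der Hund und die Katze"));
        System.out.println(guess("The cat and the dog"));
    }
}
```

The empty result for a text with no marker words is exactly the weakness described next: short texts may not contain any of the listed words.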

This worked well, but we couldn’t be sure that the words would always appear in the text we were analyzing. So, we switched to an n-gram approach, as described in the thesis Evaluation of Language Identification Methods. This works by creating a “fingerprint” for the text, based on the occurrence of bigrams (“un”, “an”) and trigrams (“und”). It then compares this fingerprint to standard fingerprints for the various languages to find the one that it most resembles.

This gives better results when there is not much text available for analysis.
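
The fingerprint idea can be sketched in Java as below. This is not the released utility's code: the class and method names are invented, the reference fingerprints are built from single made-up sentences rather than a real corpus, and cosine similarity is an assumed comparison measure chosen for brevity.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** N-gram fingerprint sketch. Names, sample sentences and the similarity measure are assumptions. */
final class NGramGuesser {

    /** Counts every bigram and trigram of letters inside each word of the text. */
    static Map<String, Integer> fingerprint(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            for (int n = 2; n <= 3; n++) {
                for (int i = 0; i + n <= word.length(); i++) {
                    counts.merge(word.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Cosine similarity between two fingerprints; 0 when either is empty. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the name of the reference fingerprint that the text most resembles. */
    static String guess(String text, Map<String, Map<String, Integer>> references) {
        Map<String, Integer> target = fingerprint(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> entry : references.entrySet()) {
            double score = similarity(target, entry.getValue());
            if (score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    /** Toy reference fingerprints; real ones would be trained on far more text. */
    static Map<String, Map<String, Integer>> sampleReferences() {
        Map<String, Map<String, Integer>> refs = new LinkedHashMap<>();
        refs.put("English", fingerprint("the quick brown fox jumps over the lazy dog and then runs away"));
        refs.put("German", fingerprint("der schnelle braune fuchs springt und rennt dann mit dem hund weg"));
        refs.put("French", fingerprint("le renard brun rapide saute par dessus le chien paresseux et puis il part"));
        return refs;
    }

    public static void main(String[] args) {
        System.out.println(guess("the dog runs", sampleReferences()));
        System.out.println(guess("der hund springt", sampleReferences()));
    }
}
```

Even a three-word input produces a dozen or more n-grams to compare, which is why this approach copes better with short texts than looking for whole marker words.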

We’ve released the source code for this utility under an Open Source license. It’s written in Scala, an object-functional language that compiles to Java-compatible bytecode.

Increasing accessibility of radio buttons and checkboxes on forms

If you’ve ever tried to use the keyboard to navigate around a form on a webpage, you may have noticed that it’s often very hard to see which form item is currently focused. With most form elements, this isn’t too hard to fix – you can add a border around the currently focused textbox with CSS, for example. But the focus on radio buttons and checkboxes is very hard to make visible.

Following on from some accessibility work that we’ve been doing for a client, we’ve developed a JQuery JavaScript plugin that helps fix this problem, and helps make web forms more accessible. We’ve released this as Open Source, and we’ve called it the JQuery labelFocus plugin.

Effective Java updated

“Effective Java” by Josh Bloch was one of the best books on Java programming when it was released back in 2001. It gave a set of best practices for writing Java well, from the lead architect for the JDK libraries. I found it very useful in improving my Java coding and to help me think about code from the perspective of producing re-usable libraries as well as simple solutions.

So, I was really happy to get a copy of Effective Java 2nd edition at the weekend, brought up to date with Java 1.6. As before, it provides good idioms for writing Java code, now including information on generics and concurrency, as well as updating some of the practices from the first edition. The book is excellent. However much I think I know about Java, Josh Bloch still has things to teach me. Anyone working in the Java ecosystem, whether programming Java directly or Jython, JRuby, Rhino or Groovy, should read the book to learn more about the best ways of working with Java and the Java libraries.

Mostly for my reference, I’ve pulled out below the facts from the book that were new to me.

  • You can enforce the singleton property of a class by using an enum type for it – this gets rid of the need to write a custom readResolve method to preserve the singleton nature when used with serialization.
  • If you have to use a finalizer, explicitly call super.finalize; and consider an anonymous nested finalizer guardian if clients may fail to do so. (I think I’ve used finalizers perhaps once or twice in more than ten years of programming Java, so this isn’t likely to be that useful to me…)
  • In an equals implementation, compare floats and doubles with Float.compare and Double.compare to cope with NaN and -0.0, rather than a simple ==.
  • Consider using static factories to create immutable objects as an alternative to making the whole class final – it allows extension within the package.
  • “PECS” is a useful acronym to remember how to use generics wildcards – producer extends, consumer super.
  • Consider using a nested enum to provide a strategy pattern to share implementation between several instances of the outer enum (although in many cases, this seems like it could create more code than it saves).
  • Use EnumMap for maps keyed by an enum.
  • In a constructor for an immutable object, make defensive copies of passed-in mutable objects before checking for validity to avoid possible synchronization issues if those objects are changed in-between the check and the defensive copy.
  • Use Arrays.toString to replace use of Arrays.asList for printing out an array.
  • Use {@code} and {@literal} in Javadoc in place of <code> tags and escaped HTML entities.
  • Use a CopyOnWriteArrayList inside an observable class to hold observers – to remove the need for synchronization. (note: what about making this use weak references too? is there a library class for this?)
  • Extendable serializable classes with instance fields that should not be initialized to their default values should implement the readObjectNoData method and throw an InvalidObjectException.
  • Consider using a serialization proxy, with readResolve and writeReplace referencing the proxy, to avoid the complexity of writing good serialization code for complex objects.
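
A few of these items are easier to remember with code in front of you. The sketch below is my own illustration rather than code from the book, and the names (Registry, Reading, copyAll, Phase) are hypothetical. It covers the enum singleton, Double.compare in equals, PECS, EnumMap and Arrays.toString.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Illustrations of a few items from the list above. All names are hypothetical. */
final class EffectiveJavaNotes {

    /** Enum singleton: one instance per JVM, with no readResolve needed for serialization. */
    enum Registry {
        INSTANCE;
        private int hits;
        int recordHit() { return ++hits; }
    }

    /** equals() comparing a double field with Double.compare instead of ==. */
    static final class Reading {
        private final double value;
        Reading(double value) { this.value = value; }
        @Override public boolean equals(Object other) {
            return other instanceof Reading
                && Double.compare(((Reading) other).value, value) == 0;
        }
        @Override public int hashCode() { return Double.hashCode(value); }
    }

    /** PECS: `source` produces T values (extends), `sink` consumes them (super). */
    static <T> void copyAll(Collection<? extends T> source, Collection<? super T> sink) {
        sink.addAll(source);
    }

    enum Phase { SOLID, LIQUID, GAS }

    public static void main(String[] args) {
        System.out.println(Registry.INSTANCE.recordHit()); // 1

        // true, although Double.NaN == Double.NaN is false
        System.out.println(new Reading(Double.NaN).equals(new Reading(Double.NaN)));
        // false, although 0.0 == -0.0 is true
        System.out.println(new Reading(0.0).equals(new Reading(-0.0)));

        // A List<Integer> producer copied into a List<Number> consumer.
        List<Number> numbers = new ArrayList<>();
        copyAll(List.of(1, 2, 3), numbers);
        System.out.println(numbers); // [1, 2, 3]

        // EnumMap is backed by an array indexed by the enum's ordinal.
        Map<Phase, String> examples = new EnumMap<>(Phase.class);
        examples.put(Phase.LIQUID, "water");
        System.out.println(examples); // {LIQUID=water}

        // Arrays.toString prints the contents rather than the array's identity.
        System.out.println(Arrays.toString(new int[] {1, 2, 3})); // [1, 2, 3]
    }
}
```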

Downloading the .NET Framework source code

For a long time, one of the huge advantages of using Java over .NET has been that the Java library source code has been freely available – and browsing through it is very helpful for understanding library behaviour and for debugging.

Now, finally, the .NET Framework source code is also (mostly) available. Unfortunately, it’s in a slightly inconvenient form – it’s downloaded on demand as you debug into classes in VS.NET. So, some enterprising developers have written the NetMassDownloader which lets you download all of it at once, for offline reading, and general browsing. This is immensely useful! Thank you Kerem Kusmezer and John Robbins!

OpenHandle code examples in C# and F#

I’ve just contributed some code examples to the OpenHandle project.

OpenHandle exposes data from Handle, which is an interesting way of providing persistent digital identifiers for information, incorporating metadata. It’s used by systems such as DSpace and DOI. In some ways, it’s a competitor to existing DNS-based ways of providing persistent URIs.

The code examples I’ve written demonstrate how to download OpenHandle data in F# and in C#. The C# implementation uses LINQ.