Parser combinators

In our developer meeting this week, we discussed parsing, and particularly parser combinators.

We’ve used the Scala parser combinator library in the past for parsing search query syntax – for example, to support a custom search syntax used by a legacy system and convert it into an XQuery for searching XML. We’ve also used Parboiled, a Java/Scala parser library, for parsing geographic latitude and longitude values from within scientific journal articles about geology. We’ve done simpler parsing with regular expressions in C# to identify citations within text like “(Brown et al, 2012)” and “(Brown and Smith, 2010; Jones, 2009)”.

The parser combinator approaches are typically better than using a traditional parsing method like Lex and YACC or JavaCC, because they’re written in the host language (e.g. Java or Scala), and so it’s much easier to write unit tests for them and to update them easily. They’re particularly approachable in Scala, because Scala’s support for domain-specific languages means that you can write code that looks like:

  “{” ~ ( comment | directive ) ~ “}”

where the symbols like ~ and | are Scala method invocations – which means that you can focus on the parsing, rather than the parser library syntax.

We briefly discussed where it makes sense to use regular expressions for parsing, and where it makes sense to use a more powerful parsing approach. We agreed that there was a danger of creating overly complex regular expressions by incremental “boiling a frog” extensions to an initially simple regex, rather than stopping to rewrite using a parser library.

For further processing of the content once it’s been parsed, we discussed using the Visitor pattern. For example, having created an abstract syntax tree from a search query, it’s useful to use a visitor approach to turn that tree into a pretty printed form, or into an HTML form for display, or into a query language form suitable for the underlying datastore.