67 Bricks blog – Page 10 – Experienced technology and product experts working with data, content and media businesses to grow their value

Reactive Functional Programming

In our dev meeting this week, we discussed the Reactive Functional Programming Scala course from Martin Odersky, that Chris has been doing. Several others of us have signed up for it, but haven’t been very good at actually doing the course. Chris told us what he had been doing, and what he had learned (aside from being impressed by Eric Meijer’s shirts).

Reactive Programming is:

Event-driven – asynchronous with no blocking
Scalable – across multiple cores and machine
Resilient – against failure
Responsive – it is responsive to the user

The RFP course has been about techniques for doing reactive programming that wrap these concepts neatly so the complexity of them is hidden and they can easily be composed.

Values can be separated along two axes – single/multiple and synchronous/asynch. Hence, in terms of Scala types:

a single synchronous value – a Try[T]
a single asynch value – a Future[T]
multiple synchronous values – Iterable[T]
multiple asynch values – Observable[T]

The “multiple asynch” area is where RFP fits. An “Observable” is something to which you can give an Observer, and then it will send callbacks to your Observer. For example, a set of full pint glasses (i.e. a Round) on a table might be an Observable – and then the Round might call back to your Observer to inform it about each pint. You could combine this with another Observable – for example each pint can itself be an Observable that informs you when it is empty. Hence, you can identify when a new round needs to be bought by composing the Round observables with each Pint observable, and receving all the messages from each.

Another part of the course is using Akka. Akka is an actors library for Scala (used in Play). Actors are autonomous pieces of code that each individually run synchronously, and communicate via message passing. The core of an Akka actor is a receive method that takes a message and carries out some activity. It’s also possible to use “become” on an Akka actor to transform the actor to have different behaviour – this is a way of better managing the mutable state within the actor (you can “unbecome” as well…). Error handling for actors is interesting, because actors are responsible for their children – if a child actor dies, then the parent actor’s supervision strategy determines what should happen (e.g. throwing the child away and recreating a new one, or entering a suicide pact to have the parent actor die if the child actor dies, or sending poison pills to destroy other actors, and so on). It all gets quite macabre.

One of the course projects was to build a binary tree in which each node is an actor. If a node is removed, then it instead marks itself as removed, and a separate garbage collection process actually takes nodes out. Akka manages the ordering of messages, so once you get the basic logic right, it works very quickly. But – it’s hard to debug.

A surprising feature of Akka is that you can create messages of any type – this is in contrast to Scala’s normal philosophy of making everything as strongly typed as possible. It’s recommended to use case classes as messages to provide better typing.

Surprising output from Specs2

I noticed yesterday that the output from some integration tests running on our CI server were producing large amounts of output. It turned out that 41000 lines of the 43000 lines were coming from a single test. The reason for this is the way Specs2 handles the contain() matcher with lists of Strings. It means that the following:

listOfIds must not(contain(badId))

is effectively the following, checking each string to see if it contains the given one:

listOfIds(0) must not(contain(badId))
listOfIds(1) must not(contain(badId))
listOfIds(2) must not(contain(badId))
....

So if you are looking at a long list, this yields some very verbose output. With types other than String, this seems to work as expected.

Selenium screenshots

We use Selenium for testing front-end web pages. My colleague Chris Rimmer just added code to take advantage of a Selenium feature I hadn’t previously been aware of – now, the tests will automatically take a screenshot of the browser if they fail, which should help us in tracking down issues.

Scala Stream.cons for creating streams from functions

A feature of Scala that I hadn’t used before was Stream.cons. This allows you to create a stream (essentially, a lazily evaluated list) from a function. So, for doing some work on files based on their position in the directory hierarchy, we can create a list of their parents:

def fileParents(file: File) : Stream[File] = {
  val parent = file.getParentFile
  if (parent == null) Stream.empty else 
    Stream.cons(parent, fileParents(parent))
}

and then use standard Scala functionality for filtering and finding things in lists, rather than having to write code that iterates through the parent files manually.

Parser combinators

In our developer meeting this week, we discussed parsing, and particularly parser combinators.

We’ve used the Scala parser combinator library in the past for parsing search query syntax – for example, to support a custom search syntax used by a legacy system and convert it into an XQuery for searching XML. We’ve also used Parboiled, a Java/Scala parser library, for parsing geographic latitude and longitude values from within scientific journal articles about geology. We’ve done simpler parsing with regular expressions in C# to identify citations within text like “(Brown et al, 2012)” and “(Brown and Smith, 2010; Jones, 2009)”.

The parser combinator approaches are typically better than using a traditional parsing method like Lex and YACC or JavaCC, because they’re written in the host language (e.g. Java or Scala), and so it’s much easier to write unit tests for them and to update them easily. They’re particularly approachable in Scala, because Scala’s support for domain-specific languages means that you can write code that looks like:

“{” ~ ( comment | directive ) ~ “}”

where the symbols like ~ and | are Scala method invocations – which means that you can focus on the parsing, rather than the parser library syntax.

We briefly discussed where it makes sense to use regular expressions for parsing, and where it makes sense to use a more powerful parsing approach. We agreed that there was a danger of creating overly complex regular expressions by incremental “boiling a frog” extensions to an initially simple regex, rather than stopping to rewrite using a parser library.

For further processing of the content once it’s been parsed, we discussed using the Visitor pattern. For example, having created an abstract syntax tree from a search query, it’s useful to use a visitor approach to turn that tree into a pretty printed form, or into an HTML form for display, or into a query language form suitable for the underlying datastore.

Git Flow and removing remote branches

We use Git Flow as our VCS process, which means that we develop on feature branches and merge those branches back into the master branch as part of our code review. It’s useful to delete these branches as they’re merged, because then anyone can see what is being worked on or needs code review by listing remote branches without being distracted by old branches. However, it’s easy for a reviewer to forget to delete the branches when they do the merge. These Git commands delete old branches that have already been merged with the “master” branch:

Delete local branches:

git branch –merged master | grep -v master | xargs -n1 git branch -d

Remove local tracking branches of remote branches that have already gone:

git prune

Remove remote branches that have been merged:

git branch -r –merged master | sed “s#origin/##” | grep -v master | xargs -n1 git push origin –delete

Power cuts and test driven development

We’ve just had an impromptu developer meeting because a power cut disabled all of our computers.

We discussed Test-Driven Development (TDD). Rhys talked about how mocking with a framework like Mockito makes test driven development easier to achieve, because you can use the mocks to check the side-effects of your code. Inigo felt that needing to use mocks was often a sign that your code had overly complex dependencies, and that it was sometimes better to instead make methods and components “pure” – so they didn’t have side-effects. Nikolay mentioned having used a .NET mocking framework, that might have been Moq, for testing some C# code that had a dependency on the database. Charlie discussed the problems of using an in-memory database that had slightly different behaviour from your actual database.

Virtuoso Jena Provider Problem

In a project that we’ve just started, we are using OpenLink Virtuoso as a triple store. I encountered a frustrating bug when accessing it via the Jena Provider where submitting a SPARQL query with a top-level LIMIT clause would return one less result than expected. In my case, the first query I tried was an existential query with LIMIT 1, so it caused much head scratching as to why I was getting no results.

Luckily OpenLink are responsive to issues raised on GitHub, so once I raised this issue and created an example project, it was quickly found to be solved by using the latest version of their JDBC4 jar. Problem solved.

Comparing characters without accents – making é and e the same

For a recent project, we needed to compare words without paying attention to their accents. In Java, you can do it like this:

String normalizedText = Normalizer.normalize(text, Normalizer.Form.NFD)
                                  .replaceAll("p{InCombiningDiacriticalMarks}+", "");

Automatically identifying (human) languages with code

For a recent project, we needed to automatically tell the difference between text that was written in French, German and English inside Word documents.

The simplest way of doing this is by checking the language attribute that’s been set on the style inside Word; unfortunately, very few Word users use the language value for styles correctly, or even use styles at all.

So, if we couldn’t trust the styles, we needed a mechanism that worked based on the text only. The first thing we tried was to identify some characteristic French, English and German words (like “des”,”und”,”für”,”and”), and check the text to see if it contained those words. The highest count of these distinctive words in a text determine which language it is likely to be.

This worked well, but we couldn’t be sure that the words would always appear in the text we were analyzing. So, we switched to an n-gram approach, as described in the thesis Evaluation of Language Identification Methods. This works by creating a “fingerprint” for the text, based on the occurrence of bigrams (“un”,”an”) and trigrams (“und”). It then compares this fingerprint to standard fingerprints for the various languages to find the one that it most resembles.

This gives better results when there is not much text available for analysis.

We’ve released the source code for this utility under an Open Source license. It’s written in Scala, an object-functional language that compiles to Java-compatible bytecode.