Automatically identifying (human) languages with code

For a recent project, we needed to automatically tell the difference between text that was written in French, German and English inside Word documents.

The simplest way of doing this is by checking the language attribute that’s been set on the style inside Word; unfortunately, very few Word users use the language value for styles correctly, or even use styles at all.

So, if we couldn’t trust the styles, we needed a mechanism that worked from the text alone. The first thing we tried was to pick some characteristic French, English and German words (like “des”, “und”, “für” and “and”) and count how often they appeared in the text. Whichever language’s distinctive words score the highest count is the likely language of the text.
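As a rough illustration of the word-counting idea, here is a minimal Java sketch. The class name, the language codes and the word lists are all made up for this example (only “des”, “und”, “für” and “and” come from the text above), and a real list would need to be much longer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Guesses a language by counting hits against small, hand-picked word lists. */
final class StopWordGuesser {
    // Illustrative word lists only (an assumption for this sketch).
    private static final Map<String, List<String>> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("fr", Arrays.asList("des", "le", "la", "et", "est"));
        // "f\u00fcr" is "für", written as an escape so the source stays ASCII.
        MARKERS.put("de", Arrays.asList("und", "f\u00fcr", "der", "die", "ist"));
        MARKERS.put("en", Arrays.asList("and", "the", "of", "is", "to"));
    }

    /** Returns the language code with the most marker-word hits, or "unknown" if there are none. */
    static String guess(String text) {
        // Split on anything that is not a letter, so accented words stay in one piece.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestCount = 0;
        for (Map.Entry<String, List<String>> entry : MARKERS.entrySet()) {
            int count = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Calling StopWordGuesser.guess on a sentence with none of the marker words returns "unknown", which is exactly the weakness of this approach on short texts.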

This worked well, but we couldn’t be sure that the words would always appear in the text we were analyzing. So, we switched to an n-gram approach, as described in the thesis Evaluation of Language Identification Methods. This works by creating a “fingerprint” for the text, based on the occurrence of bigrams (“un”, “an”) and trigrams (“und”). It then compares this fingerprint to standard fingerprints for the various languages to find the one that it most resembles.

This gives better results when there is not much text available for analysis.
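The fingerprint idea can be sketched in a few lines of Java. This is not the released Scala utility: the class and method names are invented for the example, the reference fingerprints are built from whatever sample text you pass to train, and cosine similarity is just one plausible way to compare fingerprints (the thesis discusses others).

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Compares bigram/trigram counts of a text against per-language reference counts. */
final class NGramGuesser {
    private final Map<String, Map<String, Integer>> references = new LinkedHashMap<>();

    /** Counts every bigram and trigram in the text, after collapsing non-letters to single spaces. */
    static Map<String, Integer> fingerprint(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 2; n <= 3; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity between two count maps: 1.0 for identical profiles, 0.0 for no overlap. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Stores the fingerprint of a sample text as the reference for a language. */
    void train(String language, String sample) {
        references.put(language, fingerprint(sample));
    }

    /** Returns the language whose reference fingerprint is most similar, or "unknown". */
    String guess(String text) {
        Map<String, Integer> fp = fingerprint(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : references.entrySet()) {
            double score = cosine(fp, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Because even a short text contains many overlapping n-grams, there is usually something to compare, which is why this copes better with small samples than a fixed word list does.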

We’ve released the source code for this utility under an Open Source license. It’s written in Scala, an object-functional language that compiles to Java-compatible bytecode.

Increasing accessibility of radio buttons and checkboxes on forms

If you’ve ever tried to use the keyboard to navigate around a form on a webpage, you may have noticed that it’s often very hard to see which form item is currently selected. With most form elements, this isn’t too hard to fix – you can add a border around the currently selected textbox with CSS, for example. Radio buttons and checkboxes, however, are much harder to highlight in this way.

Following on from some accessibility work that we’ve been doing for a client, we’ve developed a jQuery plugin that helps fix this problem, and helps make web forms more accessible. We’ve released this as Open Source, and we’ve called it the jQuery labelFocus plugin.

Effective Java updated

“Effective Java” by Josh Bloch was one of the best books on Java programming when it was released back in 2001. It gave a set of best practices for writing Java well, from the lead architect for the JDK libraries. I found it very useful for improving my Java coding, and for helping me think about code from the perspective of producing re-usable libraries as well as simple solutions.

So, I was really happy to get a copy of Effective Java 2nd edition at the weekend, brought up to date with Java 1.6. As before, it provides good idioms for writing Java code, now including information on generics and concurrency, as well as updating some of the practices from the first edition. The book is excellent. However much I think I know about Java, Josh Bloch still has things to teach me. Anyone working in the Java ecosystem, whether programming Java directly or Jython, JRuby, Rhino or Groovy, should read the book to learn more about the best ways of working with Java and the Java libraries.

Mostly for my reference, I’ve pulled out below the facts from the book that were new to me.

  • You can enforce the singleton property of a class by using an enum type for it – this gets rid of the need to write a custom readResolve method to preserve the singleton nature when used with serialization.
  • If you have to use a finalizer, explicitly call super.finalize; and consider an anonymous nested finalizer guardian if clients may fail to do so. (I think I’ve used finalizers perhaps once or twice in more than ten years of programming Java, so this isn’t likely to be that useful to me…)
  • In an equals implementation, compare floats and doubles with Float.compare and Double.compare to cope with NaN and -0.0, rather than a simple ==.
  • Consider using static factories to create immutable objects as an alternative to making the whole class final – it allows extension within the package.
  • “PECS” is a useful acronym to remember how to use generics wildcards – producer extends, consumer super.
  • Consider using a nested enum to provide a strategy pattern to share implementation between several instances of the outer enum (although in many cases, this seems like it could create more code than it saves).
  • Use EnumMap for maps keyed by an enum.
  • In a constructor for an immutable object, make defensive copies of passed-in mutable objects before checking for validity to avoid possible synchronization issues if those objects are changed in-between the check and the defensive copy.
  • Use Arrays.toString to replace use of Arrays.asList for printing out an array.
  • Use {@code} and {@literal} in Javadoc to replace <code> tags and the escaping of entities.
  • Use a CopyOnWriteArrayList inside an observable class to hold observers – to remove the need for synchronization. (note: what about making this use weak references too? is there a library class for this?)
  • Extendable serializable classes with instance fields that should not be initialized to their default values should implement the readObjectNoData method and throw an InvalidObjectException.
  • Consider using a serialization proxy, with readResolve and writeReplace referencing the proxy, to avoid the complexity of writing good serialization code for complex objects.
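A few of these idioms are easier to remember with code in front of you. The sketch below is my own illustration rather than code from the book: all of the type and method names (Registry, Reading, Level, Idioms) are invented, and it assumes Java 8 or later for the diamond operator and the static Double.hashCode method.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumMap;
import java.util.Map;

/** Enum singleton: the JVM guarantees a single instance, including after deserialization. */
enum Registry {
    INSTANCE;

    private int hits;

    int hit() {
        return ++hits;
    }
}

/** equals built on Double.compare: NaN equals NaN, and 0.0 differs from -0.0. */
final class Reading {
    private final double value;

    Reading(double value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Reading && Double.compare(((Reading) o).value, value) == 0;
    }

    @Override
    public int hashCode() {
        return Double.hashCode(value);
    }
}

enum Level { LOW, HIGH }

final class Idioms {
    /** PECS: src only produces values (extends), dst only consumes them (super). */
    static <T> void copyAll(Collection<? extends T> src, Collection<? super T> dst) {
        dst.addAll(src);
    }

    /** EnumMap keeps its keys in declaration order, whatever the insertion order. */
    static Map<Level, String> labels() {
        Map<Level, String> labels = new EnumMap<>(Level.class);
        labels.put(Level.HIGH, "high");
        labels.put(Level.LOW, "low");
        return labels;
    }

    /** Arrays.toString prints the elements, not the array's identity hash. */
    static String show(int[] values) {
        return Arrays.toString(values);
    }
}
```

With a plain == in Reading.equals, two NaN readings would never be equal and 0.0 would equal -0.0, which breaks the contract that hashCode and collections rely on.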

Downloading the .NET Framework source code

For a long time, one of the huge advantages of using Java over .NET has been that the Java library source code has been freely available – and browsing through it is very helpful for understanding library behaviour and for debugging.

Now, finally, the .NET Framework source code is also (mostly) available. Unfortunately, it’s in a slightly inconvenient form – it’s downloaded on demand as you debug into classes in VS.NET. So, some enterprising developers have written the NetMassDownloader which lets you download all of it at once, for offline reading, and general browsing. This is immensely useful! Thank you Kerem Kusmezer and John Robbins!

OpenHandle code examples in C# and F#

I’ve just contributed some code examples to the OpenHandle project.

OpenHandle exposes data from Handle, which is an interesting way of providing persistent digital identifiers for information, incorporating metadata. It’s used by systems such as DSpace and DOI. In some ways, it’s a competitor to existing DNS-based ways of providing persistent URIs.

The code examples I’ve written demonstrate how to download OpenHandle data in F# and in C#. The C# implementation uses LINQ.

Bloom Filter implementation in F#

Further to my previous post on bloom filters for efficient ontology lookup, I’ve made a simple implementation in F#. This is based on a Java implementation by Ian Clarke.

The neat thing about Ian’s implementation is its use of Random to extend the hashcode provided by the object being stored into a hash of arbitrary length, suitable for use by the bloom filter algorithm. This will reduce the quality of the hash, but for an arbitrary passed-in object, it’s hard to do better. (For a specific application, like storing ontology labels, it would be better to use a more specific algorithm such as a Jenkins hash.)

open System
open System.Collections

// Based on the Java Bloom filter implementation by Ian Clarke
type BloomFilter(bitArraySize : int, expectedElementCount : int) =
    let bitSet = BitArray(bitArraySize, false)
    // Optimal number of hash functions: k = ceil((m / n) * ln 2)
    let k = int (Math.Ceiling((float bitArraySize / float expectedElementCount) * Math.Log(2.0)))
    // Extend the object's hash code into a stream of bit positions by seeding Random with it
    let bitSequence o =
        let r = Random(hash o)
        Seq.initInfinite (fun _ -> r.Next(bitArraySize))
    // The first k bit positions for an object; the same object always yields the same positions
    let bitPositions o = bitSequence o |> Seq.take k |> Seq.toList
    // Expected false-positive rate: (1 - e^(-k * n / m)) ^ k
    member b.expectedFalsePositiveProbability =
        Math.Pow(1.0 - Math.Exp(-(float k) * float expectedElementCount / float bitArraySize), float k)
    member b.add o = bitPositions o |> List.iter (fun i -> bitSet.Set(i, true))
    member b.addAll os = Seq.iter b.add os
    member b.clear () = bitSet.SetAll(false)
    member b.contains o = bitPositions o |> List.forall (fun i -> bitSet.Get(i))
    member b.containsAll os = Seq.forall b.contains os

Using F# makes the code slightly neater than Ian’s Java version – I’ve been able to factor out the hash code into a sequence supplied by the bitSequence function, and to collapse some of the for loops into operations over lists instead. But, the basic structure of the code is still very similar.

Oxford Semantic Web Interest Group

Leigh Dodds of Ingenta and I recently spoke at the Oxford SWIG. We were both talking about SPARQL, the standard query language and protocol for the Semantic Web.

Eamonn Neylon, who organized the session, has written up a summary of the talks. I was talking about DBpedia and how to write SPARQL extension functions; Leigh was talking about his SPARQL tool Twinkle, and about the different SPARQL query forms (SELECT, DESCRIBE, CONSTRUCT, and ASK) and what they are useful for.

Bloom Filters for efficient ontology querying and text mining

One of the problems with large ontologies such as SNOMED Clinical Terms is that they’re, well, large. So, it’s not typically possible to hold all of the ontology in memory at once, and queries against it require a database lookup. It’s possible to eliminate a number of database accesses, and thus speed up the query process, by using a Bloom filter.

A Bloom filter is a memory-efficient probabilistic data structure that lets you test whether a particular item is a member of a set. It may return false positives, but not false negatives. So, by adding all of the terms in your ontology to a Bloom filter, you can do a fast, in-memory check to see whether an entered term definitely doesn’t exist in your ontology. If the Bloom filter reports that the term does exist, then you can confirm with a slower file or database query for that term.
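Here is a minimal Java sketch of that check-then-confirm pattern. Everything in it is illustrative: TermFilter and OntologyLookup are made-up names, an in-memory Set stands in for the slow database, the sizing (ten bits per term, seven positions) is an arbitrary choice, and seeding java.util.Random with the term’s hash code is just one simple way to derive several bit positions from one hash.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** A tiny Bloom filter over strings. */
final class TermFilter {
    private final BitSet bits;
    private final int size;
    private final int k;

    TermFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    /** Sets the k bit positions derived from the term's hash code. */
    void add(String term) {
        Random r = new Random(term.hashCode());
        for (int i = 0; i < k; i++) {
            bits.set(r.nextInt(size));
        }
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String term) {
        Random r = new Random(term.hashCode());
        for (int i = 0; i < k; i++) {
            if (!bits.get(r.nextInt(size))) {
                return false;
            }
        }
        return true;
    }
}

/** Consults the filter first and only falls back to the slow store on a possible hit. */
final class OntologyLookup {
    private final TermFilter filter;
    private final Set<String> store; // stand-in for the database (an assumption of this sketch)
    int storeQueries = 0;

    OntologyLookup(Set<String> terms) {
        this.store = new HashSet<>(terms);
        this.filter = new TermFilter(Math.max(64, terms.size() * 10), 7);
        for (String term : terms) {
            filter.add(term);
        }
    }

    boolean contains(String term) {
        if (!filter.mightContain(term)) {
            return false; // definite miss: no database access needed
        }
        storeQueries++; // possible hit: confirm against the slow store
        return store.contains(term);
    }
}
```

The storeQueries counter is only there to make the saving visible: every lookup the filter rejects is a database access that never happens.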

In an application where you expect to encounter many terms that aren’t in the ontology, such as automated metadata extraction from documents or automated document classification, this can potentially lead to large performance improvements.

I think there are also interesting possibilities in using Bloom filters in environments where storing a whole ontology isn’t feasible. For example, a JavaScript implementation of a Bloom filter, initialized with a few hundred kilobytes of data, could test with fairly high accuracy whether a particular term exists in an ontology of half a million terms.
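To put rough numbers on that, the standard Bloom filter estimates can be worked through in a few lines of Java. The class name and the two example sizes (300 KB and 600 KB) are my own choices for illustration; the formulas are the usual ones, k = ceil((m / n) ln 2) for the number of hash functions and (1 - e^(-kn/m))^k for the expected false-positive rate, with m bits and n terms.

```java
/** Back-of-the-envelope sizing for a Bloom filter with m bits and n elements. */
final class BloomSizing {
    /** Number of hash functions: (m / n) * ln 2, rounded up, and at least 1. */
    static int optimalK(long bits, long elements) {
        return Math.max(1, (int) Math.ceil((double) bits / elements * Math.log(2)));
    }

    /** Expected false-positive rate: (1 - e^(-k * n / m)) ^ k. */
    static double falsePositiveRate(long bits, long elements, int k) {
        return Math.pow(1 - Math.exp(-(double) k * elements / bits), k);
    }

    /** Convenience: the rate for a filter of the given size in kilobytes. */
    static double rateForKilobytes(long kilobytes, long elements) {
        long bits = kilobytes * 1024 * 8;
        return falsePositiveRate(bits, elements, optimalK(bits, elements));
    }
}
```

For half a million terms, BloomSizing.rateForKilobytes(300, 500_000) comes out at a little under 10%, and doubling the filter to 600 KB brings it down to a little under 1%, so “a few hundred kilobytes” covers a wide range of accuracy.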

Microsoft takes on GWT

It’s been a problem for a while that developers of web applications need to use a language like JavaScript on the web client, and another language like Java or C# or Python on the server. One popular attempt to fix this is Google’s GWT, and there have been other less mainstream options like ParenScript for Lisp and Links.

Now, Microsoft is launching another contender in the same space: Volta.

The post is somewhat obscure, but it’s essentially a beta version of a GWT competitor for .NET. You use annotations to mark chunks of code to be run on the client-side or server-side, and they’re compiled behind the scenes to JavaScript and deployed. There’s a debugger and profiler for the client-side code too.

An interesting feature is that it works on MSIL (the .NET bytecode) rather than on the language syntax (as GWT does). Therefore, you should be able to use the more functional .NET languages with it – F#, for instance, is an ML implementation for .NET that appears well supported by MS. For that matter, C# 3 is already among the most functional mainstream languages.

The beta version is only available for Visual Studio.NET 2008 – currently available if you have an MSDN subscription, but not yet available for purchase.